For widths of 32 pixels and more, loop first horizontally,
then vertically.
Previously, this function would process a 16 pixel wide slice
of the block, looping vertically. After processing the whole
height, it would backtrack and process the next 16 pixel wide
slice.
When doing 8tap filtering horizontally, the function must load
7 more pixels (in practice, 8) following the actual inputs, and
this was done for each slice.
By iterating first horizontally throughout each line, then
vertically, we access data in a more cache friendly order, and
we don't need to reload data unnecessarily.
Keep the original order in put_hevc_\type\()_h12_8_neon; the
only suboptimal case there is for width=24. But specializing
an optimal variant for that would require more code, which
might not be worth it.
For the h16 case, this implementation would give a slowdown,
as it now loads the first 8 pixels separately from the rest, but
for larger widths, it is a gain. Therefore, keep the h16 case
as it was (but remove the outer loop), and create a new specialized
version for horizontal looping with 16 pixels at a time.
Before: Cortex A53 A72 A73 Graviton 3
put_hevc_qpel_h16_8_neon: 710.5 667.7 692.5 211.0
put_hevc_qpel_h32_8_neon: 2791.5 2643.5 2732.0 883.5
put_hevc_qpel_h64_8_neon: 10954.0 10657.0 10874.2 3241.5
After:
put_hevc_qpel_h16_8_neon: 697.5 663.5 705.7 212.5
put_hevc_qpel_h32_8_neon: 2767.2 2684.5 2791.2 920.5
put_hevc_qpel_h64_8_neon: 10559.2 10471.5 10932.2 3051.7
Signed-off-by: Martin Storsjö <martin@martin.st>
This gets rid of a couple instructions, but the actual performance
is almost identical on Cortex A72/A73. On Cortex A53, it is a
handful of cycles faster.
Signed-off-by: Martin Storsjö <martin@martin.st>
Many of the routines within hevcdsp_epel_neon and hevcdsp_qpel_neon
store temporary buffers on the stack. When consuming it,
many of these functions use the stack pointer as incremental pointer
for reading the data (instead of storing it in another register),
which is rather unusual.
Technically, this is fine as long as the pointer remains properly
aligned.
However in the case of ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm,
after incrementing sp when reading data (within each 16 pixel
wide stripe) it would then reset the stack pointer back to a lower
value, for reading the next 16 pixel wide stripe, expecting the
data to remain untouched.
This can't be assumed; data on the stack below the stack pointer
can be clobbered (e.g. by a signal handler). Some OS ABIs
allow for a little margin that won't be touched, aka a red zone,
but not all do. The ones that do, guarantee 16 or 128 bytes, not
9 KB.
Convert this function to use a separate pointer register to
iterate through the data, retaining the stack pointer to point
at the bottom of the data we require to remain untouched.
Signed-off-by: Martin Storsjö <martin@martin.st>
Building iOS platform with arm64, the compiler has a warning: "instruction movi.2d with immediate #0 may not function correctly on this CPU, converting to movi.16b"
Signed-off-by: xufuji456 <839789740@qq.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
This eases actual development of the assembly functions, by only
allowing extension instructions within the sections that explicitly
enable them, instead of having all extensions enabled everywhere.
Signed-off-by: Martin Storsjö <martin@martin.st>
Some functions have slightly different indentation styles; try
to match the surrounding code.
libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally
uses a layered indentation style to visually show how different
unrolled/interleaved phases fit together.
Signed-off-by: Martin Storsjö <martin@martin.st>