Move the loop counter decrement further from the branch instruction,
this hides the latency of the decrement.
In loops that first load, then store (the horizontal prediction cases),
do the decrement after the load (where the next instruction would
stall a bit anyway, waiting for the result of the load).
In loops that store twice using the same destination register,
also do the decrement between the two stores (as the second store
would need to wait for the updated destination register from the
first instruction).
In loops that store twice to two different destination registers,
do the decrement before both stores, to do it as soon before the
branch as possible.
This gives minor (1-2 cycle) speedups in most cases (modulo measurement
noise), but the horizontal prediction functions get a rather notable
speedup on the Cortex A53.
Before: Cortex A53 A72 A73
pred8x8_dc_8_neon: 60.7 46.2 39.2
pred8x8_dc_128_8_neon: 30.7 18.0 14.0
pred8x8_horizontal_8_neon: 42.2 29.2 18.5
pred8x8_left_dc_8_neon: 52.7 36.2 32.2
pred8x8_mad_cow_dc_0l0_8_neon: 48.2 27.7 25.7
pred8x8_mad_cow_dc_0lt_8_neon: 52.5 33.2 34.7
pred8x8_mad_cow_dc_l0t_8_neon: 52.5 31.7 33.2
pred8x8_mad_cow_dc_l00_8_neon: 43.2 27.0 25.5
pred8x8_plane_8_neon: 112.2 86.2 88.2
pred8x8_top_dc_8_neon: 40.7 23.0 21.2
pred8x8_vertical_8_neon: 27.2 15.5 14.0
pred16x16_dc_8_neon: 91.0 73.2 70.5
pred16x16_dc_128_8_neon: 43.0 34.7 30.7
pred16x16_horizontal_8_neon: 86.0 49.7 44.7
pred16x16_left_dc_8_neon: 87.0 67.2 67.5
pred16x16_plane_8_neon: 236.0 175.7 173.0
pred16x16_top_dc_8_neon: 53.2 39.0 41.7
pred16x16_vertical_8_neon: 41.7 29.7 31.0
After:
pred8x8_dc_8_neon: 59.0 46.7 42.5
pred8x8_dc_128_8_neon: 28.2 18.0 14.0
pred8x8_horizontal_8_neon: 34.2 29.2 18.5
pred8x8_left_dc_8_neon: 51.0 38.2 32.7
pred8x8_mad_cow_dc_0l0_8_neon: 46.7 28.2 26.2
pred8x8_mad_cow_dc_0lt_8_neon: 55.2 33.7 37.5
pred8x8_mad_cow_dc_l0t_8_neon: 51.2 31.7 37.2
pred8x8_mad_cow_dc_l00_8_neon: 41.7 27.5 26.0
pred8x8_plane_8_neon: 111.5 86.5 89.5
pred8x8_top_dc_8_neon: 39.0 23.2 21.0
pred8x8_vertical_8_neon: 27.2 16.0 14.0
pred16x16_dc_8_neon: 85.0 70.2 70.5
pred16x16_dc_128_8_neon: 42.0 30.0 30.7
pred16x16_horizontal_8_neon: 66.5 49.5 42.5
pred16x16_left_dc_8_neon: 81.0 66.5 67.5
pred16x16_plane_8_neon: 235.0 175.7 173.0
pred16x16_top_dc_8_neon: 52.0 39.0 41.7
pred16x16_vertical_8_neon: 40.2 33.2 31.0
Despite this, a number of these functions still are slower than
what e.g. GCC 7 generates - this shows the relative speedup of the
neon codepaths over the compiler generated ones:
Cortex A53 A72 A73
pred8x8_dc_8_neon: 0.86 0.65 1.04
pred8x8_dc_128_8_neon: 0.59 0.44 0.62
pred8x8_horizontal_8_neon: 1.51 0.58 1.30
pred8x8_left_dc_8_neon: 0.72 0.56 0.89
pred8x8_mad_cow_dc_0l0_8_neon: 0.93 0.93 1.37
pred8x8_mad_cow_dc_0lt_8_neon: 1.37 1.41 1.68
pred8x8_mad_cow_dc_l0t_8_neon: 1.21 1.17 1.32
pred8x8_mad_cow_dc_l00_8_neon: 1.24 1.19 1.60
pred8x8_plane_8_neon: 3.36 3.58 3.76
pred8x8_top_dc_8_neon: 0.97 0.99 1.43
pred8x8_vertical_8_neon: 0.86 0.78 1.18
pred16x16_dc_8_neon: 1.20 1.06 1.49
pred16x16_dc_128_8_neon: 0.83 0.95 0.99
pred16x16_horizontal_8_neon: 1.78 0.96 1.59
pred16x16_left_dc_8_neon: 1.06 0.96 1.32
pred16x16_plane_8_neon: 5.78 6.49 7.19
pred16x16_top_dc_8_neon: 1.48 1.53 1.94
pred16x16_vertical_8_neon: 1.39 1.34 1.98
In particular, on Cortex A72, many of these functions are slower
than the compiler generated code, while they're more beneficial on
e.g. the Cortex A73.
Signed-off-by: Martin Storsjö <martin@martin.st>
Makes SIMD-optimized 8x8 and 16x16 idcts for 8 and 10 bit depth
available on aarch64.
For a UHD HDR (10 bit) sample video these were consuming the most time
and this optimization reduced overall decode time from 19.4s to 16.4s,
approximately 15% speedup.
Test sample was the first 300 frames of "LG 4K HDR Demo - New York.ts",
running on Apple M1.
Signed-off-by: Josh Dekker <josh@itanimul.li>
153372 UNITS in postfilter_c, 65536 runs, 0 skips
73164 UNITS in postfilter_neon, 65536 runs, 0 skips -> 2.1x speedup
80591 UNITS in deemphasis_c, 131072 runs, 0 skips
43969 UNITS in deemphasis_neon, 131072 runs, 0 skips -> 1.83x speedup
Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x realtime)
Deemphasis SIMD based on the following unrolling:
const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
float state = coeff;
for (int i = 0; i < len; i += 4) {
y[0] = x[0] + c1*state;
y[1] = x[1] + c2*state + c1*x[0];
y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];
state = y[3];
y += 4;
x += 4;
}
Unlike the x86 version, duplication is used instead of pslldq so
the structure and tables are different.
* commit '49f9c4272c4029b57ff300d908ba03c6332fc9c4':
aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon
Merged-by: James Almer <jamrial@gmail.com>
* commit 'cc7ba00c35faf0478f1f56215e926f70ccb31282':
aarch64: vp8: Port missing epel8 functions from arm version
Merged-by: James Almer <jamrial@gmail.com>
* commit '52c9b0a6c0d02cff6caebcf6989e565e05b55200':
aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version
Merged-by: James Almer <jamrial@gmail.com>
* commit 'f1011ea28a4048ddec97794ca3e9901474fe055f':
aarch64: vp8: Reorder the function pointer inits to match the arm original
Merged-by: James Almer <jamrial@gmail.com>
* commit '85bfaa4949f4afcde19061def3e8a18988964858':
aarch64: vp8: Use the proper aarch64 form for conditional branches
Merged-by: James Almer <jamrial@gmail.com>
* commit '0801853e640624537db386727b36fa97aa6258e7':
libavcodec: vp8 neon optimizations for aarch64
See 833fed5253
Merged-by: James Almer <jamrial@gmail.com>
Even if NEON would be disabled, the init functions should be built
as they are called as long as ARCH_AARCH64 is set.
These functions are part of a generic DSP subsytem, not tied directly
to one decoder. (They should be built if the vp7 decoder is enabled,
even if the vp8 decoder is disabled.)
Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit b4b27dce95)
This also partially fixes assembling with MS armasm64 (via
gas-preprocessor).
The movrel macro invocations need to pass the offset via a separate
parameter. Mach-o and COFF relocations don't allow a negative
offset to a symbol, which is handled properly if the offset is passed
via the parameter. If no offset parameter is given, the macro
evaluates to something like "adrp x17, subpel_filters-16+(0)", which
older clang versions also fail to parse (the older clang versions
only support one single offset term, although it can be a parenthesis.
Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit 26d7af4c38)
The previous version was a pretty exact translation of the arm
version. This version does do some unnecessary arithemetic (it does
more operations on vectors that are only half filled; it does 4
uaddw and 4 sqxtun instead of 2 of each), but it reduces the overhead
of packing data together (which could be done for free in the arm
version).
This gives a decent speedup on Cortex A53, a minor speedup on
A72 and a very minor slowdown on Cortex A73.
Before: Cortex A53 A72 A73
vp8_idct_add_neon: 79.7 67.5 65.0
After:
vp8_idct_add_neon: 67.7 64.8 66.7
Signed-off-by: Martin Storsjö <martin@martin.st>
The original arm version didn't do saturation here. This probably
doesn't make any difference for performance, but reduces the
differences.
Signed-off-by: Martin Storsjö <martin@martin.st>
This makes it similar to put_epel16_v6, and gives a large speedup
on Cortex A53, a minor speedup on A72 and a very minor slowdown on
A73.
Before: Cortex A53 A72 A73
vp8_put_epel16_h6v6_neon: 2211.4 1586.5 1431.7
After:
vp8_put_epel16_h6v6_neon: 1736.9 1522.0 1448.1
Signed-off-by: Martin Storsjö <martin@martin.st>
Even if NEON would be disabled, the init functions should be built
as they are called as long as ARCH_AARCH64 is set.
These functions are part of a generic DSP subsytem, not tied directly
to one decoder. (They should be built if the vp7 decoder is enabled,
even if the vp8 decoder is disabled.)
Signed-off-by: Martin Storsjö <martin@martin.st>
The previous form also does seem to assemble on current tools,
but I think it might fail on some older aarch64 tools.
Signed-off-by: Martin Storsjö <martin@martin.st>