Add Branch Target Identifiers (BTIs) to all functions defined in
AArch64 assembly files. Most of the BTI landing pads are added
automatically by the 'function' macro.
BTI support is turned on or off at compile time based on the presence
of the __ARM_FEATURE_BTI_DEFAULT feature macro.
A binary compiled with BTI support can be executed on an Armv8-A
processor without BTI support because the instructions are defined in
NOP space.
Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Elijah Ahmad <elijah.ahmad@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Change AArch64 assembly code to use:
ret x<n>
instead of:
br x<n>
"ret x<n>" is already used in a lot of places so this patch makes it
consistent across the code base. This does not change behavior or
performance.
In addition, this change reduces the number of landing pads needed in
a subsequent patch to support the Armv8.5-A Branch Target
Identification (BTI) security feature.
Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Don't use the loaded registers directly, avoiding stalls on in
order cores. Use vrhadd.u8 with q registers where easily possible.
Signed-off-by: Martin Storsjö <martin@martin.st>
transpose_4x8H was declared in vp9lpf_16bpp_neon, however this macro is
not unique to vp9 and could be used elsewhere.
Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
These became unnecessary when the stride arguments were changed from
int to ptrdiff_t in bc26fe8927
(0576ef466d) and
d5d699ab6e
(aa844dc46f).
Signed-off-by: Martin Storsjö <martin@martin.st>
This is marginally slower, but correct for all input values.
The previous implementation failed with certain input seeds, e.g.
"checkasm --test=hevc_idct 98".
Signed-off-by: Martin Storsjö <martin@martin.st>
Deprecated in commits 7fc329e2dd
and 31f6a4b4b8.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Signed-off-by: James Almer <jamrial@gmail.com>
Some files currently rely on libavutil/cpu.h to include it for them;
yet said file won't use include it any more after the currently
deprecated functions are removed, so include attributes.h directly.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Move the loop counter decrement further from the branch instruction,
this hides the latency of the decrement.
In loops that first load, then store (the horizontal prediction cases),
do the decrement after the load (where the next instruction would
stall a bit anyway, waiting for the result of the load).
In loops that store twice using the same destination register,
also do the decrement between the two stores (as the second store
would need to wait for the updated destination register from the
first instruction).
In loops that store twice to two different destination registers,
do the decrement before both stores, to do it as soon before the
branch as possible.
This gives minor (1-2 cycle) speedups in most cases (modulo measurement
noise), but the horizontal prediction functions get a rather notable
speedup on the Cortex A53.
Before: Cortex A53 A72 A73
pred8x8_dc_8_neon: 60.7 46.2 39.2
pred8x8_dc_128_8_neon: 30.7 18.0 14.0
pred8x8_horizontal_8_neon: 42.2 29.2 18.5
pred8x8_left_dc_8_neon: 52.7 36.2 32.2
pred8x8_mad_cow_dc_0l0_8_neon: 48.2 27.7 25.7
pred8x8_mad_cow_dc_0lt_8_neon: 52.5 33.2 34.7
pred8x8_mad_cow_dc_l0t_8_neon: 52.5 31.7 33.2
pred8x8_mad_cow_dc_l00_8_neon: 43.2 27.0 25.5
pred8x8_plane_8_neon: 112.2 86.2 88.2
pred8x8_top_dc_8_neon: 40.7 23.0 21.2
pred8x8_vertical_8_neon: 27.2 15.5 14.0
pred16x16_dc_8_neon: 91.0 73.2 70.5
pred16x16_dc_128_8_neon: 43.0 34.7 30.7
pred16x16_horizontal_8_neon: 86.0 49.7 44.7
pred16x16_left_dc_8_neon: 87.0 67.2 67.5
pred16x16_plane_8_neon: 236.0 175.7 173.0
pred16x16_top_dc_8_neon: 53.2 39.0 41.7
pred16x16_vertical_8_neon: 41.7 29.7 31.0
After:
pred8x8_dc_8_neon: 59.0 46.7 42.5
pred8x8_dc_128_8_neon: 28.2 18.0 14.0
pred8x8_horizontal_8_neon: 34.2 29.2 18.5
pred8x8_left_dc_8_neon: 51.0 38.2 32.7
pred8x8_mad_cow_dc_0l0_8_neon: 46.7 28.2 26.2
pred8x8_mad_cow_dc_0lt_8_neon: 55.2 33.7 37.5
pred8x8_mad_cow_dc_l0t_8_neon: 51.2 31.7 37.2
pred8x8_mad_cow_dc_l00_8_neon: 41.7 27.5 26.0
pred8x8_plane_8_neon: 111.5 86.5 89.5
pred8x8_top_dc_8_neon: 39.0 23.2 21.0
pred8x8_vertical_8_neon: 27.2 16.0 14.0
pred16x16_dc_8_neon: 85.0 70.2 70.5
pred16x16_dc_128_8_neon: 42.0 30.0 30.7
pred16x16_horizontal_8_neon: 66.5 49.5 42.5
pred16x16_left_dc_8_neon: 81.0 66.5 67.5
pred16x16_plane_8_neon: 235.0 175.7 173.0
pred16x16_top_dc_8_neon: 52.0 39.0 41.7
pred16x16_vertical_8_neon: 40.2 33.2 31.0
Despite this, a number of these functions still are slower than
what e.g. GCC 7 generates - this shows the relative speedup of the
neon codepaths over the compiler generated ones:
Cortex A53 A72 A73
pred8x8_dc_8_neon: 0.86 0.65 1.04
pred8x8_dc_128_8_neon: 0.59 0.44 0.62
pred8x8_horizontal_8_neon: 1.51 0.58 1.30
pred8x8_left_dc_8_neon: 0.72 0.56 0.89
pred8x8_mad_cow_dc_0l0_8_neon: 0.93 0.93 1.37
pred8x8_mad_cow_dc_0lt_8_neon: 1.37 1.41 1.68
pred8x8_mad_cow_dc_l0t_8_neon: 1.21 1.17 1.32
pred8x8_mad_cow_dc_l00_8_neon: 1.24 1.19 1.60
pred8x8_plane_8_neon: 3.36 3.58 3.76
pred8x8_top_dc_8_neon: 0.97 0.99 1.43
pred8x8_vertical_8_neon: 0.86 0.78 1.18
pred16x16_dc_8_neon: 1.20 1.06 1.49
pred16x16_dc_128_8_neon: 0.83 0.95 0.99
pred16x16_horizontal_8_neon: 1.78 0.96 1.59
pred16x16_left_dc_8_neon: 1.06 0.96 1.32
pred16x16_plane_8_neon: 5.78 6.49 7.19
pred16x16_top_dc_8_neon: 1.48 1.53 1.94
pred16x16_vertical_8_neon: 1.39 1.34 1.98
In particular, on Cortex A72, many of these functions are slower
than the compiler generated code, while they're more beneficial on
e.g. the Cortex A73.
Signed-off-by: Martin Storsjö <martin@martin.st>
Makes SIMD-optimized 8x8 and 16x16 idcts for 8 and 10 bit depth
available on aarch64.
For a UHD HDR (10 bit) sample video these were consuming the most time
and this optimization reduced overall decode time from 19.4s to 16.4s,
approximately 15% speedup.
Test sample was the first 300 frames of "LG 4K HDR Demo - New York.ts",
running on Apple M1.
Signed-off-by: Josh Dekker <josh@itanimul.li>
153372 UNITS in postfilter_c, 65536 runs, 0 skips
73164 UNITS in postfilter_neon, 65536 runs, 0 skips -> 2.1x speedup
80591 UNITS in deemphasis_c, 131072 runs, 0 skips
43969 UNITS in deemphasis_neon, 131072 runs, 0 skips -> 1.83x speedup
Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x realtime)
Deemphasis SIMD based on the following unrolling:
const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
float state = coeff;
for (int i = 0; i < len; i += 4) {
y[0] = x[0] + c1*state;
y[1] = x[1] + c2*state + c1*x[0];
y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];
state = y[3];
y += 4;
x += 4;
}
Unlike the x86 version, duplication is used instead of pslldq so
the structure and tables are different.
* commit '49f9c4272c4029b57ff300d908ba03c6332fc9c4':
aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon
Merged-by: James Almer <jamrial@gmail.com>
* commit 'cc7ba00c35faf0478f1f56215e926f70ccb31282':
aarch64: vp8: Port missing epel8 functions from arm version
Merged-by: James Almer <jamrial@gmail.com>
* commit '52c9b0a6c0d02cff6caebcf6989e565e05b55200':
aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version
Merged-by: James Almer <jamrial@gmail.com>
* commit 'f1011ea28a4048ddec97794ca3e9901474fe055f':
aarch64: vp8: Reorder the function pointer inits to match the arm original
Merged-by: James Almer <jamrial@gmail.com>
* commit '85bfaa4949f4afcde19061def3e8a18988964858':
aarch64: vp8: Use the proper aarch64 form for conditional branches
Merged-by: James Almer <jamrial@gmail.com>
* commit '0801853e640624537db386727b36fa97aa6258e7':
libavcodec: vp8 neon optimizations for aarch64
See 833fed5253
Merged-by: James Almer <jamrial@gmail.com>