This work is sponsored by, and copyright, Google.
These are ported from the ARM version; thanks to the larger
amount of registers available, we can do the loop filters with
16 pixels at a time. The implementation is fully templated, with
a single macro which can generate versions for both 8 and
16 pixels wide, for both 4, 8 and 16 pixels loop filters
(and the 4/8 mixed versions as well).
For the 8 pixel wide versions, it is pretty close in speed (the
v_4_8 and v_8_8 filters are the best examples of this; the h_4_8
and h_8_8 filters seem to get some gain in the load/transpose/store
part). For the 16 pixels wide ones, we get a speedup of around
1.2-1.4x compared to the 32 bit version.
Examples of runtimes vs the 32 bit version, on a Cortex A53:
ARM AArch64
vp9_loop_filter_h_4_8_neon: 144.0 127.2
vp9_loop_filter_h_8_8_neon: 207.0 182.5
vp9_loop_filter_h_16_8_neon: 415.0 328.7
vp9_loop_filter_h_16_16_neon: 672.0 558.6
vp9_loop_filter_mix2_h_44_16_neon: 302.0 203.5
vp9_loop_filter_mix2_h_48_16_neon: 365.0 305.2
vp9_loop_filter_mix2_h_84_16_neon: 365.0 305.2
vp9_loop_filter_mix2_h_88_16_neon: 376.0 305.2
vp9_loop_filter_mix2_v_44_16_neon: 193.2 128.2
vp9_loop_filter_mix2_v_48_16_neon: 246.7 218.4
vp9_loop_filter_mix2_v_84_16_neon: 248.0 218.5
vp9_loop_filter_mix2_v_88_16_neon: 302.0 218.2
vp9_loop_filter_v_4_8_neon: 89.0 88.7
vp9_loop_filter_v_8_8_neon: 141.0 137.7
vp9_loop_filter_v_16_8_neon: 295.0 272.7
vp9_loop_filter_v_16_16_neon: 546.0 453.7
The speedup vs C code in checkasm tests is around 2-7x, which is
pretty much the same as for the 32 bit version. Even if these functions
are faster than their 32 bit equivalent, the C version that we compare
to also became around 1.3-1.7x faster than the C version in 32 bit.
Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 4-5x.
Examples of runtimes vs C on a Cortex A57 (for a slightly older version
of the patch):
A57 gcc-5.3 neon
loop_filter_h_4_8_neon: 256.6 93.4
loop_filter_h_8_8_neon: 307.3 139.1
loop_filter_h_16_8_neon: 340.1 254.1
loop_filter_h_16_16_neon: 827.0 407.9
loop_filter_mix2_h_44_16_neon: 524.5 155.4
loop_filter_mix2_h_48_16_neon: 644.5 173.3
loop_filter_mix2_h_84_16_neon: 630.5 222.0
loop_filter_mix2_h_88_16_neon: 697.3 222.0
loop_filter_mix2_v_44_16_neon: 598.5 100.6
loop_filter_mix2_v_48_16_neon: 651.5 127.0
loop_filter_mix2_v_84_16_neon: 591.5 167.1
loop_filter_mix2_v_88_16_neon: 855.1 166.7
loop_filter_v_4_8_neon: 271.7 65.3
loop_filter_v_8_8_neon: 312.5 106.9
loop_filter_v_16_8_neon: 473.3 206.5
loop_filter_v_16_16_neon: 976.1 327.8
The speed-up compared to the C functions is 2.5 to 6 and the cortex-a57
is again 30-50% faster than the cortex-a53.
This is an adapted cherry-pick from libav commits
9d2afd1eb8c5cc0633062430e66326dbf98c99e0 and
31756abe29eb039a11c59a42cb12e0cc2aef3b97.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
This work is sponsored by, and copyright, Google.
These are ported from the ARM version; thanks to the larger
amount of registers available, we can do the 16x16 and 32x32
transforms in slices 8 pixels wide instead of 4. This gives
a speedup of around 1.4x compared to the 32 bit version.
The fact that aarch64 doesn't have the same d/q register
aliasing makes some of the macros quite a bit simpler as well.
Examples of runtimes vs the 32 bit version, on a Cortex A53:
ARM AArch64
vp9_inv_adst_adst_4x4_add_neon: 90.0 87.7
vp9_inv_adst_adst_8x8_add_neon: 400.0 354.7
vp9_inv_adst_adst_16x16_add_neon: 2526.5 1827.2
vp9_inv_dct_dct_4x4_add_neon: 74.0 72.7
vp9_inv_dct_dct_8x8_add_neon: 271.0 256.7
vp9_inv_dct_dct_16x16_add_neon: 1960.7 1372.7
vp9_inv_dct_dct_32x32_add_neon: 11988.9 8088.3
vp9_inv_wht_wht_4x4_add_neon: 63.0 57.7
The speedup vs C code (2-4x) is smaller than in the 32 bit case,
mostly because the C code ends up significantly faster (around
1.6x faster, with GCC 5.4) when built for aarch64.
Examples of runtimes vs C on a Cortex A57 (for a slightly older version
of the patch):
A57 gcc-5.3 neon
vp9_inv_adst_adst_4x4_add_neon: 152.2 60.0
vp9_inv_adst_adst_8x8_add_neon: 948.2 288.0
vp9_inv_adst_adst_16x16_add_neon: 4830.4 1380.5
vp9_inv_dct_dct_4x4_add_neon: 153.0 58.6
vp9_inv_dct_dct_8x8_add_neon: 789.2 180.2
vp9_inv_dct_dct_16x16_add_neon: 3639.6 917.1
vp9_inv_dct_dct_32x32_add_neon: 20462.1 4985.0
vp9_inv_wht_wht_4x4_add_neon: 91.0 49.8
The asm is around factor 3-4 faster than C on the cortex-a57 and the asm
is around 30-50% faster on the a57 compared to the a53.
This is an adapted cherry-pick from libav commit
3c9546dfafcdfe8e7860aff9ebbf609318220f29.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
This work is sponsored by, and copyright, Google.
These are ported from the ARM version; it is essentially a 1:1
port with no extra added features, but with some hand tuning
(especially for the plain copy/avg functions). The ARM version
isn't very register starved to begin with, so there's not much
to be gained from having more spare registers here - we only
avoid having to clobber callee-saved registers.
Examples of runtimes vs the 32 bit version, on a Cortex A53:
ARM AArch64
vp9_avg4_neon: 27.2 23.7
vp9_avg8_neon: 56.5 54.7
vp9_avg16_neon: 169.9 167.4
vp9_avg32_neon: 585.8 585.2
vp9_avg64_neon: 2460.3 2294.7
vp9_avg_8tap_smooth_4h_neon: 132.7 125.2
vp9_avg_8tap_smooth_4hv_neon: 478.8 442.0
vp9_avg_8tap_smooth_4v_neon: 126.0 93.7
vp9_avg_8tap_smooth_8h_neon: 241.7 234.2
vp9_avg_8tap_smooth_8hv_neon: 690.9 646.5
vp9_avg_8tap_smooth_8v_neon: 245.0 205.5
vp9_avg_8tap_smooth_64h_neon: 11273.2 11280.1
vp9_avg_8tap_smooth_64hv_neon: 22980.6 22184.1
vp9_avg_8tap_smooth_64v_neon: 11549.7 10781.1
vp9_put4_neon: 18.0 17.2
vp9_put8_neon: 40.2 37.7
vp9_put16_neon: 97.4 99.5
vp9_put32_neon/armv8: 346.0 307.4
vp9_put64_neon/armv8: 1319.0 1107.5
vp9_put_8tap_smooth_4h_neon: 126.7 118.2
vp9_put_8tap_smooth_4hv_neon: 465.7 434.0
vp9_put_8tap_smooth_4v_neon: 113.0 86.5
vp9_put_8tap_smooth_8h_neon: 229.7 221.6
vp9_put_8tap_smooth_8hv_neon: 658.9 621.3
vp9_put_8tap_smooth_8v_neon: 215.0 187.5
vp9_put_8tap_smooth_64h_neon: 10636.7 10627.8
vp9_put_8tap_smooth_64hv_neon: 21076.8 21026.9
vp9_put_8tap_smooth_64v_neon: 9635.0 9632.4
These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.
The speedup vs C code is pretty much the same as for the 32 bit
case; on the A53 it's around 6-13x for ther larger 8tap filters.
The exact speedup varies a little, since the C versions generally
don't end up exactly as slow/fast as on 32 bit.
This is an adapted cherry-pick from libav commit
383d96aa2229f644d9bd77b821ed3a309da5e9fc.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
* commit 'cdb1665f70def544ddab3e3ed3763ef99c8b3873':
aarch64: Make transpose_4x4H do a regular transpose
Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
Restore alphabetical order in lists, break overly long lines, do some
prettyprinting, add some explanatory section comments, group parts
together that belong together logically.
Previously, ff_h264_idct_add_neon (originally in the arm version) used
a non-regular transpose in order to be able to use more instructions
that deal with registers as 128 bit register pairs. The aarch64
translation doesn't do it to the same extent, but brought along the
same structure since it was a straight translation.
This reshuffles ff_h264_idct_add_neon, bringing it closer to
the C implementation, making the transpose_4x4H macro do a regular
transpose, usable for other algorithms as well.
Previously, the third and fourth output from transpose_4x4H were
swapped, and prior to cc29d96d5a, the same inputs as well. In
addition to just swapping the outputs, also renumber the intermediate
registers for better readability (making the register order match
transpose_4x8B).
This runs with the same number of cycles as before.
Signed-off-by: Martin Storsjö <martin@martin.st>
* commit '2008f76054906e9ff6bf744800af0e5a5bfe61be':
dca: remove unused decode_hf function and quant_d tables
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
Fix related register order issue in ff_h264_idct_add_neon.
Found-by: zjh8890 <243186085@qq.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
~25% faster dts decoding overall. The checkasm CPU cycles numbers are
not that useful since synth_filter_float() calls FFTContext.imdct_half().
cortex-a57 cortex-a53
synth_filter_float_c: 1866.2 3490.9
synth_filter_float_neon: 915.0 1531.5
With fftc.imdct_half forced to imdct_half_neon:
cortex-a57 cortex-a53
synth_filter_float_c: 1718.4 3025.3
synth_filter_float_neon: 926.2 1530.1
The transpose_4x4H is wrong which cost me much time to find this bug. The orders of r2 and r3 are wrong,
this bug waste me much time while I make aarch64 arm instruction which used the function.
* commit '3d5d46233cd81f78138a6d7418d480af04d3f6c8':
opus: Factor out imdct15 into a standalone component
Conflicts:
configure
libavcodec/opus_celt.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '780cd20b00a69e26bbfffbb8eec16fbe999ea793':
aarch64: Use .data.rel.ro for const data with relocations
Merged-by: Michael Niedermayer <michaelni@gmx.at>
This reverts commit c00365b46d464ce47716315c1801818d811bdb9a
in addition to using a different section.
Signed-off-by: Martin Storsjö <martin@martin.st>
* commit 'c00365b46d464ce47716315c1801818d811bdb9a':
aarch64: Make the function pointer tables position independent
Merged-by: Michael Niedermayer <michaelni@gmx.at>
This allows running the code on android, where 64 bit binaries with
text relocations aren't allowed to be loaded.
Signed-off-by: Martin Storsjö <martin@martin.st>
llvm's integrated assembler does not accept spaces as macro argument
delimiter when targeting darwin. Using a explicit delimiter is a good
idea in principle since it makes case like 'macro 4 -2' vs 'macro 4 - 2'
clear.
* commit 'f23d26a6864128001b03876b0b92fffe131f2060':
h264: avoid using uninitialized memory in NEON chroma mc
Merged-by: Michael Niedermayer <michaelni@gmx.at>