1
0
mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-12 19:18:44 +02:00
Commit Graph

282 Commits

Author SHA1 Message Date
Andre Kempe
861285c146 arm64: Fix wrong BTI landing pad
This patch fixes a wrong type of BTI landing pad when branching to
functions instantiated via the fft*_neon macro.

Although the previously employed paciasp instruction serves as a landing
pad, for the ways that this function is invoked it is the wrong type, resulting
in an unexpected termination of the running process.

Signed-off-by: André Kempe <andre.kempe@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-26 10:26:49 +03:00
Ben Avison
6eee650289 avcodec/vc1: Arm 64-bit NEON unescape fast path
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

vc1dsp.vc1_unescape_buffer_c: 655617.7
vc1dsp.vc1_unescape_buffer_neon: 118237.0

Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01 10:03:34 +03:00
Ben Avison
5379412ed0 avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

idctdsp.add_pixels_clamped_c: 313.3
idctdsp.add_pixels_clamped_neon: 24.3
idctdsp.put_pixels_clamped_c: 220.3
idctdsp.put_pixels_clamped_neon: 15.5
idctdsp.put_signed_pixels_clamped_c: 210.5
idctdsp.put_signed_pixels_clamped_neon: 19.5

Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01 10:03:34 +03:00
Ben Avison
501fdc017d avcodec/vc1: Arm 64-bit NEON inverse transform fast paths
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

vc1dsp.vc1_inv_trans_4x4_c: 158.2
vc1dsp.vc1_inv_trans_4x4_neon: 65.7
vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5
vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5
vc1dsp.vc1_inv_trans_4x8_c: 335.2
vc1dsp.vc1_inv_trans_4x8_neon: 106.2
vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2
vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5
vc1dsp.vc1_inv_trans_8x4_c: 365.7
vc1dsp.vc1_inv_trans_8x4_neon: 97.2
vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7
vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5
vc1dsp.vc1_inv_trans_8x8_c: 547.7
vc1dsp.vc1_inv_trans_8x8_neon: 137.0
vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2
vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5

Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01 10:03:34 +03:00
Ben Avison
c62bbd4d20 avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C
version can still outperform the NEON version in specific cases. The balance
between different code paths is stream-dependent, but in practice the best
case happens about 5% of the time, the worst case happens about 40% of the
time, and the complexity of the remaining cases fall somewhere in between.
Therefore, taking the average of the best and worst case timings is
probably a conservative estimate of the degree by which the NEON code
improves performance.

vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7
vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5
vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5
vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7
vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2
vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2
vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2
vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2
vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0
vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7
vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7
vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5
vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7
vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0
vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7
vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0
vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2
vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7
vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0
vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2
vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0
vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0
vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2
vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2

Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01 10:03:33 +03:00
Martin Storsjö
a78f136f3f configure: Use a separate config_components.h header for $ALL_COMPONENTS
This avoids unnecessary rebuilds of most source files if only the
list of enabled components has changed, but not the other properties
of the build, set in config.h.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-03-16 14:12:49 +02:00
Andre Kempe
248986a0db arm64: Add Armv8.3-A PAC support to assembly files
This patch adds optional support for Arm Pointer Authentication Codes.

PAC support is turned on or off at compile time using additional
compiler flags. Unless any of these is enabled explicitly, no additional
code will be emitted at all.

Signed-off-by: André Kempe <andre.kempe@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-03-09 15:04:25 +02:00
Andreas Rheinhardt
52e9113695 avcodec/aarch64/idct: Add missing stddef
Fixes checkheaders on aarch64.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-02-21 13:10:04 +01:00
Martin Storsjö
402784ba9f aarch64: h264dsp: Fix incorrectly indented code
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-02-11 10:49:12 +02:00
Martin Storsjö
24b93022fe aarch64: Disable ff_hevc_sao_band_filter_8x8_8_neon out of precaution
While this function on its own passes all of fate-hevc, there's
indications that the function might need to handle widths that
aren't a multiple of 8 (noted in commit
f63f9be37c, which later was
reverted).

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-01-07 22:33:27 +02:00
Martin Storsjö
16fba44b4d Revert "lavc/aarch64: add hevc sao edge 16x16"
This reverts commit a9214a2ca3, as
it breaks fate-hevc.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-01-07 22:33:23 +02:00
Martin Storsjö
df48b1d06f Revert "lavc/aarch64: add hevc sao edge 8x8"
This reverts commit c97ffc1a77, as
it breaks fate-hevc.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-01-07 22:33:19 +02:00
Martin Storsjö
cafed377eb Revert "lavc/aarch64: add hevc sao band 8x8 tiling"
This reverts commit f63f9be37c, as
it breaks fate-hevc.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-01-07 22:33:14 +02:00
J. Dekker
f63f9be37c lavc/aarch64: add hevc sao band 8x8 tiling
bench on AWS Graviton:

hevc_sao_band_8x8_8_c: 317.5
hevc_sao_band_8x8_8_neon: 97.5
hevc_sao_band_16x16_8_c: 1115.0
hevc_sao_band_16x16_8_neon: 322.7
hevc_sao_band_32x32_8_c: 4599.2
hevc_sao_band_32x32_8_neon: 1246.2
hevc_sao_band_48x48_8_c: 10021.7
hevc_sao_band_48x48_8_neon: 2740.5
hevc_sao_band_64x64_8_c: 17635.0
hevc_sao_band_64x64_8_neon: 4875.7

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-01-04 14:32:26 +01:00
J. Dekker
89a2ed4a8b lavc/aarch64: clean-up sao band 8x8 function formatting
Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-01-04 14:31:56 +01:00
J. Dekker
c97ffc1a77 lavc/aarch64: add hevc sao edge 8x8
bench on AWS Graviton:

hevc_sao_edge_8x8_8_c: 516.0
hevc_sao_edge_8x8_8_neon: 81.0

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-01-04 14:31:56 +01:00
J. Dekker
a9214a2ca3 lavc/aarch64: add hevc sao edge 16x16
bench on AWS Graviton:

hevc_sao_edge_16x16_8_c: 1857.0
hevc_sao_edge_16x16_8_neon: 211.0
hevc_sao_edge_32x32_8_c: 7802.2
hevc_sao_edge_32x32_8_neon: 808.2
hevc_sao_edge_48x48_8_c: 16764.2
hevc_sao_edge_48x48_8_neon: 1796.5
hevc_sao_edge_64x64_8_c: 32647.5
hevc_sao_edge_64x64_8_neon: 3118.5

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-01-04 14:31:56 +01:00
Jonathan Wright
08b4716a9e aarch64: Add Armv8.5-A BTI support
Add Branch Target Identifiers (BTIs) to all functions defined in
AArch64 assembly files. Most of the BTI landing pads are added
automatically by the 'function' macro.

BTI support is turned on or off at compile time based on the presence
of the __ARM_FEATURE_BTI_DEFAULT feature macro.

A binary compiled with BTI support can be executed on an Armv8-A
processor without BTI support because the instructions are defined in
NOP space.

Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Elijah Ahmad <elijah.ahmad@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-11-16 13:43:56 +02:00
Jonathan Wright
6f04cf54f5 aarch64: Use ret x<n> instead of br x<n> where possible
Change AArch64 assembly code to use:
  ret     x<n>
instead of:
  br      x<n>

"ret x<n>" is already used in a lot of places so this patch makes it
consistent across the code base. This does not change behavior or
performance.

In addition, this change reduces the number of landing pads needed in
a subsequent patch to support the Armv8.5-A Branch Target
Identification (BTI) security feature.

Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-11-16 13:43:56 +02:00
Martin Storsjö
fd3bd5c492 aarch64: h264qpel: Do vertical filtering without transposing
This gives rather big speedups on these functions:

Before:
put_h264_qpel_8_mc01_8_neon:     241.0   131.5   138.7
put_h264_qpel_8_mc02_8_neon:     214.7   121.2   127.5
put_h264_qpel_8_mc03_8_neon:     242.5   131.2   135.7
put_h264_qpel_8_mc11_8_neon:     421.2   218.7   251.0
put_h264_qpel_8_mc12_8_neon:     878.0   509.5   537.5
put_h264_qpel_8_mc13_8_neon:     423.7   217.0   252.0
put_h264_qpel_8_mc21_8_neon:     858.2   479.5   514.0
put_h264_qpel_8_mc22_8_neon:     649.7   385.2   403.0
put_h264_qpel_8_mc23_8_neon:     860.2   476.5   517.7
put_h264_qpel_8_mc31_8_neon:     437.2   219.5   252.5
put_h264_qpel_8_mc32_8_neon:     892.5   510.5   546.0
put_h264_qpel_8_mc33_8_neon:     438.2   218.5   257.0
put_h264_qpel_16_mc01_8_neon:    944.2   509.7   546.7
put_h264_qpel_16_mc02_8_neon:    878.7   469.5   509.7
put_h264_qpel_16_mc03_8_neon:    945.7   510.7   557.0
put_h264_qpel_16_mc11_8_neon:   1663.2   858.5   979.5
put_h264_qpel_16_mc12_8_neon:   3510.2  2027.7  2112.7
put_h264_qpel_16_mc13_8_neon:   1664.7   857.5   980.5
put_h264_qpel_16_mc21_8_neon:   3366.2  1928.5  2030.5
put_h264_qpel_16_mc22_8_neon:   2584.7  1514.7  1590.2
put_h264_qpel_16_mc23_8_neon:   3367.7  1927.7  2035.0
put_h264_qpel_16_mc31_8_neon:   1716.7   849.7   997.0
put_h264_qpel_16_mc32_8_neon:   3564.0  2044.2  3835.2
put_h264_qpel_16_mc33_8_neon:   1717.7   863.0   989.5

After:
put_h264_qpel_8_mc01_8_neon:     136.0    73.7    76.0
put_h264_qpel_8_mc02_8_neon:     108.7    65.0    64.0
put_h264_qpel_8_mc03_8_neon:     137.5    72.7    73.0
put_h264_qpel_8_mc11_8_neon:     316.2   159.0   188.5
put_h264_qpel_8_mc12_8_neon:     653.0   375.5   384.7
put_h264_qpel_8_mc13_8_neon:     318.7   165.5   189.5
put_h264_qpel_8_mc21_8_neon:     739.2   385.7   432.5
put_h264_qpel_8_mc22_8_neon:     530.7   295.5   309.5
put_h264_qpel_8_mc23_8_neon:     741.2   393.7   421.0
put_h264_qpel_8_mc31_8_neon:     332.2   162.5   190.0
put_h264_qpel_8_mc32_8_neon:     667.5   378.2   390.5
put_h264_qpel_8_mc33_8_neon:     332.7   166.5   195.5
put_h264_qpel_16_mc01_8_neon:    524.2   285.2   294.0
put_h264_qpel_16_mc02_8_neon:    454.7   252.2   250.2
put_h264_qpel_16_mc03_8_neon:    525.7   286.0   283.0
put_h264_qpel_16_mc11_8_neon:   1243.2   630.7   726.7
put_h264_qpel_16_mc12_8_neon:   2610.2  1479.7  1481.2
put_h264_qpel_16_mc13_8_neon:   1250.5   631.7   727.7
put_h264_qpel_16_mc21_8_neon:   2890.2  1571.2  1679.7
put_h264_qpel_16_mc22_8_neon:   2108.7  1177.5  1223.5
put_h264_qpel_16_mc23_8_neon:   2891.7  1578.7  1667.7
put_h264_qpel_16_mc31_8_neon:   1296.7   630.5   752.5
put_h264_qpel_16_mc32_8_neon:   2664.0  1483.2  1503.5
put_h264_qpel_16_mc33_8_neon:   1297.7   632.5   747.2

I.e. overall a 20%-60% reduction in runtime of these
functions.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-10-18 14:27:58 +03:00
Martin Storsjö
2d5a7f6d00 arm/aarch64: Improve scheduling in the avg form of h264_qpel
Don't use the loaded registers directly, avoiding stalls on in
order cores. Use vrhadd.u8 with q registers where easily possible.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-10-18 14:27:36 +03:00
Zhao Zhili
378ad2f8fd lavc/aarch64: fix relocation out of range error
Use a temporary label instead of global function symbol for b.gt.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-09-25 21:55:29 +03:00
Mikhail Nitenko
d3e56b56ae lavc/aarch64: add pred functions for 10-bit
Benchmarks:                        A53     A72
pred8x8_dc_10_c:                   64.2    49.5
pred8x8_dc_10_neon:                62.0    53.7
pred8x8_dc_128_10_c:               26.0    14.0
pred8x8_dc_128_10_neon:            30.7    17.5
pred8x8_horizontal_10_c:           60.0    27.7
pred8x8_horizontal_10_neon:        38.0    34.0
pred8x8_left_dc_10_c:              42.5    27.5
pred8x8_left_dc_10_neon:           51.0    41.2
pred8x8_mad_cow_dc_0l0_10_c:       55.7    37.2
pred8x8_mad_cow_dc_0l0_10_neon:    50.2    35.2
pred8x8_mad_cow_dc_0lt_10_c:       89.2    67.0
pred8x8_mad_cow_dc_0lt_10_neon:    52.2    46.7
pred8x8_mad_cow_dc_l0t_10_c:       74.7    51.0
pred8x8_mad_cow_dc_l0t_10_neon:    50.5    45.2
pred8x8_mad_cow_dc_l00_10_c:       58.0    38.0
pred8x8_mad_cow_dc_l00_10_neon:    42.5    37.5
pred8x8_plane_10_c:               354.0   288.7
pred8x8_plane_10_neon:            141.0   101.2
pred8x8_top_dc_10_c:               44.5    30.5
pred8x8_top_dc_10_neon:            40.0    31.0
pred8x8_vertical_10_c:             27.5    14.5
pred8x8_vertical_10_neon:          21.0    17.5
pred16x16_plane_10_c:            1242.0  1070.5
pred16x16_plane_10_neon:          324.0   196.7

Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-21 00:06:26 +03:00
Mikhail Nitenko
43ca887bc2 lavc/aarch64: h264, add chroma loop filters for 10bit
Benchmarks:                                             A53     A72
h264_h_loop_filter_chroma422_10bpp_c:                  282.7   114.2
h264_h_loop_filter_chroma422_10bpp_neon:               109.5    78.5
h264_h_loop_filter_chroma_10bpp_c:                     165.0    81.5
h264_h_loop_filter_chroma_10bpp_neon:                  120.0    76.7
h264_h_loop_filter_chroma_intra422_10bpp_c:            323.7   124.2
h264_h_loop_filter_chroma_intra422_10bpp_neon:         155.0   102.7
h264_h_loop_filter_chroma_intra_10bpp_c:               121.0    49.5
h264_h_loop_filter_chroma_intra_10bpp_neon:             79.7    53.7
h264_h_loop_filter_chroma_mbaff422_10bpp_c:            188.5    75.0
h264_h_loop_filter_chroma_mbaff422_10bpp_neon:         120.0    75.5
h264_h_loop_filter_chroma_mbaff_intra422_10bpp_c:      116.7    46.0
h264_h_loop_filter_chroma_mbaff_intra422_10bpp_neon:    79.7    53.7
h264_h_loop_filter_chroma_mbaff_intra_10bpp_c:          63.0    27.2
h264_h_loop_filter_chroma_mbaff_intra_10bpp_neon:       48.5    34.0
h264_v_loop_filter_chroma_10bpp_c:                     258.7   135.5
h264_v_loop_filter_chroma_10bpp_neon:                   71.2    51.0
h264_v_loop_filter_chroma_intra_10bpp_c:               158.0    70.7
h264_v_loop_filter_chroma_intra_10bpp_neon:             48.7    31.5

Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-21 00:06:26 +03:00
Mikhail Nitenko
756d2e087a lavc/aarch64: move transpose_4x8H to neon.S
transpose_4x8H was declared in vp9lpf_16bpp_neon, however this macro is
not unique to vp9 and could be used elsewhere.

Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-21 00:06:26 +03:00
Martin Storsjö
c60b76d0c8 aarch64: h264dsp: Fix indentation of some functions to match the rest
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-08 23:18:43 +03:00
Martin Storsjö
e86ec831b0 aarch64: h264dsp: Remove unnecessary sign extensions
These became unnecessary when the stride arguments were changed from
int to ptrdiff_t in bc26fe8927
(0576ef466d) and
d5d699ab6e
(aa844dc46f).

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-08 23:18:43 +03:00
Andreas Rheinhardt
afc95a10ac avcodec/h264dsp, h264idct: Fix lengths of array parameters
Fixes many -Warray-parameter warnings from GCC 11.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-08-08 17:44:57 +02:00
Martin Storsjö
f27e3ccf06 aarch64: hevc_idct: Fix overflows in idct_dc
This is marginally slower, but correct for all input values.
The previous implementation failed with certain input seeds, e.g.
"checkasm --test=hevc_idct 98".

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-05-22 00:08:03 +03:00
Andreas Rheinhardt
7c1f347b18 avcodec: Remove deprecated old encode/decode APIs
Deprecated in commits 7fc329e2dd
and 31f6a4b4b8.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2021-04-27 10:43:12 -03:00
Andreas Rheinhardt
f3c197b129 Include attributes.h directly
Some files currently rely on libavutil/cpu.h to include it for them;
yet said file won't use include it any more after the currently
deprecated functions are removed, so include attributes.h directly.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-04-19 14:34:10 +02:00
Mikhail Nitenko
84ac1440b2 lavc/aarch64: add pred16x16 10-bit functions
Benchmarks:                      A53     A72
pred16x16_dc_10_c:              136.0   124.0
pred16x16_dc_10_neon:           121.2   106.0
pred16x16_horizontal_10_c:      155.0    73.2
pred16x16_horizontal_10_neon:    82.2    67.7
pred16x16_top_dc_10_c:          106.0    93.7
pred16x16_top_dc_10_neon:        87.7    77.2
pred16x16_vertical_10_c:         83.0    67.7
pred16x16_vertical_10_neon:      54.2    61.7

Some functions work slower than C and are left commented out.
2021-04-19 09:01:14 +02:00
Mikhail Nitenko
6b2e7dc828 lavc/aarch64: change h264pred_init structure
Change structure to allow the addition of other bit depths.
2021-04-19 09:00:58 +02:00
Martin Storsjö
870bfe16a1 aarch64: h264pred: Optimize the inner loop of existing 8 bit functions
Move the loop counter decrement further from the branch instruction,
this hides the latency of the decrement.

In loops that first load, then store (the horizontal prediction cases),
do the decrement after the load (where the next instruction would
stall a bit anyway, waiting for the result of the load).

In loops that store twice using the same destination register,
also do the decrement between the two stores (as the second store
would need to wait for the updated destination register from the
first instruction).

In loops that store twice to two different destination registers,
do the decrement before both stores, to do it as soon before the
branch as possible.

This gives minor (1-2 cycle) speedups in most cases (modulo measurement
noise), but the horizontal prediction functions get a rather notable
speedup on the Cortex A53.

Before:                     Cortex A53     A72     A73
pred8x8_dc_8_neon:                60.7    46.2    39.2
pred8x8_dc_128_8_neon:            30.7    18.0    14.0
pred8x8_horizontal_8_neon:        42.2    29.2    18.5
pred8x8_left_dc_8_neon:           52.7    36.2    32.2
pred8x8_mad_cow_dc_0l0_8_neon:    48.2    27.7    25.7
pred8x8_mad_cow_dc_0lt_8_neon:    52.5    33.2    34.7
pred8x8_mad_cow_dc_l0t_8_neon:    52.5    31.7    33.2
pred8x8_mad_cow_dc_l00_8_neon:    43.2    27.0    25.5
pred8x8_plane_8_neon:            112.2    86.2    88.2
pred8x8_top_dc_8_neon:            40.7    23.0    21.2
pred8x8_vertical_8_neon:          27.2    15.5    14.0
pred16x16_dc_8_neon:              91.0    73.2    70.5
pred16x16_dc_128_8_neon:          43.0    34.7    30.7
pred16x16_horizontal_8_neon:      86.0    49.7    44.7
pred16x16_left_dc_8_neon:         87.0    67.2    67.5
pred16x16_plane_8_neon:          236.0   175.7   173.0
pred16x16_top_dc_8_neon:          53.2    39.0    41.7
pred16x16_vertical_8_neon:        41.7    29.7    31.0

After:
pred8x8_dc_8_neon:                59.0    46.7    42.5
pred8x8_dc_128_8_neon:            28.2    18.0    14.0
pred8x8_horizontal_8_neon:        34.2    29.2    18.5
pred8x8_left_dc_8_neon:           51.0    38.2    32.7
pred8x8_mad_cow_dc_0l0_8_neon:    46.7    28.2    26.2
pred8x8_mad_cow_dc_0lt_8_neon:    55.2    33.7    37.5
pred8x8_mad_cow_dc_l0t_8_neon:    51.2    31.7    37.2
pred8x8_mad_cow_dc_l00_8_neon:    41.7    27.5    26.0
pred8x8_plane_8_neon:            111.5    86.5    89.5
pred8x8_top_dc_8_neon:            39.0    23.2    21.0
pred8x8_vertical_8_neon:          27.2    16.0    14.0
pred16x16_dc_8_neon:              85.0    70.2    70.5
pred16x16_dc_128_8_neon:          42.0    30.0    30.7
pred16x16_horizontal_8_neon:      66.5    49.5    42.5
pred16x16_left_dc_8_neon:         81.0    66.5    67.5
pred16x16_plane_8_neon:          235.0   175.7   173.0
pred16x16_top_dc_8_neon:          52.0    39.0    41.7
pred16x16_vertical_8_neon:        40.2    33.2    31.0

Despite this, a number of these functions still are slower than
what e.g. GCC 7 generates - this shows the relative speedup of the
neon codepaths over the compiler generated ones:

                           Cortex A53    A72    A73
pred8x8_dc_8_neon:               0.86   0.65   1.04
pred8x8_dc_128_8_neon:           0.59   0.44   0.62
pred8x8_horizontal_8_neon:       1.51   0.58   1.30
pred8x8_left_dc_8_neon:          0.72   0.56   0.89
pred8x8_mad_cow_dc_0l0_8_neon:   0.93   0.93   1.37
pred8x8_mad_cow_dc_0lt_8_neon:   1.37   1.41   1.68
pred8x8_mad_cow_dc_l0t_8_neon:   1.21   1.17   1.32
pred8x8_mad_cow_dc_l00_8_neon:   1.24   1.19   1.60
pred8x8_plane_8_neon:            3.36   3.58   3.76
pred8x8_top_dc_8_neon:           0.97   0.99   1.43
pred8x8_vertical_8_neon:         0.86   0.78   1.18
pred16x16_dc_8_neon:             1.20   1.06   1.49
pred16x16_dc_128_8_neon:         0.83   0.95   0.99
pred16x16_horizontal_8_neon:     1.78   0.96   1.59
pred16x16_left_dc_8_neon:        1.06   0.96   1.32
pred16x16_plane_8_neon:          5.78   6.49   7.19
pred16x16_top_dc_8_neon:         1.48   1.53   1.94
pred16x16_vertical_8_neon:       1.39   1.34   1.98

In particular, on Cortex A72, many of these functions are slower
than the compiler generated code, while they're more beneficial on
e.g. the Cortex A73.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-04-14 15:23:44 +03:00
James Almer
f1a894f9d3 avcodec: add missing FF_API_OLD_ENCDEC wrappers to xmm clobber functions
Signed-off-by: James Almer <jamrial@gmail.com>
2021-02-26 19:26:31 -03:00
Josh Dekker
7ac41e0db2 lavc/aarch64: add HEVC sao_band NEON
Only works for 8x8.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:12:01 +01:00
Josh Dekker
75c2ddfa61 lavc/aarch64: add HEVC idct_dc NEON
Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:12:01 +01:00
Reimar Döffinger
00c916ef61 lavc/aarch64: port HEVC add_residual NEON
Speedup is fairly small, around 1.5%, but these are fairly simple.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:11:57 +01:00
Reimar Döffinger
30f80d855b lavc/aarch64: port HEVC SIMD idct NEON
Makes SIMD-optimized 8x8 and 16x16 idcts for 8 and 10 bit depth
available on aarch64.
For a UHD HDR (10 bit) sample video these were consuming the most time
and this optimization reduced overall decode time from 19.4s to 16.4s,
approximately 15% speedup.
Test sample was the first 300 frames of "LG 4K HDR Demo - New York.ts",
running on Apple M1.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:11:53 +01:00
Anton Khirnov
c8c2dfbc37 lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h
That is a more appropriate place for it.
2021-01-01 14:11:01 +01:00
Martin Storsjö
7168adedbc libavcodec: aarch64: Add a NEON implementation of pixblockdsp
Cortex A53    A72    A73
get_pixels_c:                140.7   87.7   72.5
get_pixels_neon:              46.0   20.0   19.5
get_pixels_unaligned_c:      140.7   87.7   73.0
get_pixels_unaligned_neon:    49.2   20.2   26.2
diff_pixels_c:               209.7  133.7  138.7
diff_pixels_neon:             54.2   31.7   23.5
diff_pixels_unaligned_c:     209.7  134.2  139.0
diff_pixels_unaligned_neon:   68.0   27.7   41.7

Signed-off-by: Martin Storsjö <martin@martin.st>
2020-05-15 23:37:55 +03:00
Carl Eugen Hoyos
34d7c8d942 lavc/aarch64: Remove unneeded file vp9mc_aarch64.c 2020-03-11 14:36:07 +01:00
Carl Eugen Hoyos
951bd25572 lavc/aarch64: Fix suffix of new file vp9mc_aarch64. 2020-03-11 14:29:22 +01:00
Carl Eugen Hoyos
213c796561 lavc/aarch64: Fix compilation with --disable-neon
Fixes ticket #8565.
2020-03-11 14:16:48 +01:00
Carl Eugen Hoyos
9a21754904 lavc/aarch64: Move non-neon vp9 copy functions out of neon source file.
Fixes part of ticket #8565.
2020-03-11 14:16:40 +01:00
Lynne
aac382e9e5 aarch64/opusdsp: do not clobber register v8
A part of v8-v15 needs to be preserved across calls.
2019-08-15 13:29:22 +01:00
Lynne
f62ee527cb aarch64/asm-offsets: remove old CELT offsets
They're not used and they're incorrect.
2019-05-14 23:41:24 +01:00
Lynne
4d2f62150d aarch64/opusdsp: implement NEON accelerated postfilter and deemphasis
153372 UNITS in postfilter_c,   65536 runs,      0 skips
73164 UNITS in postfilter_neon,   65536 runs,      0 skips -> 2.1x speedup

80591 UNITS in deemphasis_c,  131072 runs,      0 skips
43969 UNITS in deemphasis_neon,  131072 runs,      0 skips -> 1.83x speedup

Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x realtime)

Deemphasis SIMD based on the following unrolling:
const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
float state = coeff;

for (int i = 0; i < len; i += 4) {
    y[0] = x[0] + c1*state;
    y[1] = x[1] + c2*state + c1*x[0];
    y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
    y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];

    state = y[3];
    y += 4;
    x += 4;
}

Unlike the x86 version, duplication is used instead of pslldq so
the structure and tables are different.
2019-04-10 01:08:54 +02:00
James Almer
92219ef4ac Merge commit '186bd30aa3b6c2b29b4dbf18278700b572068b1e'
* commit '186bd30aa3b6c2b29b4dbf18278700b572068b1e':
  h264/arm64: implement missing 4:2:2 chroma loop filter neon functions

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:29:41 -03:00
James Almer
5c363d3e59 Merge commit '7e42d5f0ab2aeac811fd01e122627c9198b13f01'
* commit '7e42d5f0ab2aeac811fd01e122627c9198b13f01':
  aarch64: vp8: Optimize vp8_idct_add_neon for aarch64

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:22:29 -03:00