1
0
mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-23 12:43:46 +02:00
Commit Graph

392 Commits

Author SHA1 Message Date
Rémi Denis-Courmont
82cb4b1c05 lavc/aarch64: remove bogus HAVE_VFP guard
The IMDCT offset is only relevant for NEON optimisations. There are no
VFP optimisations here that would justify the HAVE_VFP flag. In
practice, this makes no difference because HAVE_NEON is practically
always true for standard Armv8 platforms.
2023-07-15 22:56:30 +03:00
Logan Lyu
9557bf26b3 lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_hv
put_hevc_epel_uni_w_hv4_8_c: 254.6
put_hevc_epel_uni_w_hv4_8_i8mm: 102.9
put_hevc_epel_uni_w_hv6_8_c: 411.6
put_hevc_epel_uni_w_hv6_8_i8mm: 221.6
put_hevc_epel_uni_w_hv8_8_c: 669.4
put_hevc_epel_uni_w_hv8_8_i8mm: 214.9
put_hevc_epel_uni_w_hv12_8_c: 1412.6
put_hevc_epel_uni_w_hv12_8_i8mm: 481.4
put_hevc_epel_uni_w_hv16_8_c: 2425.4
put_hevc_epel_uni_w_hv16_8_i8mm: 647.4
put_hevc_epel_uni_w_hv24_8_c: 5384.1
put_hevc_epel_uni_w_hv24_8_i8mm: 1450.6
put_hevc_epel_uni_w_hv32_8_c: 9470.9
put_hevc_epel_uni_w_hv32_8_i8mm: 2497.1
put_hevc_epel_uni_w_hv48_8_c: 20930.1
put_hevc_epel_uni_w_hv48_8_i8mm: 5635.9
put_hevc_epel_uni_w_hv64_8_c: 36682.9
put_hevc_epel_uni_w_hv64_8_i8mm: 9712.6

Signed-off-by: Martin Storsjö <martin@martin.st>
2023-07-14 21:19:12 +03:00
Logan Lyu
d48c89701c lavc/aarch64: new optimization for 8-bit hevc_epel_h
put_hevc_epel_h4_8_c: 67.1
put_hevc_epel_h4_8_i8mm: 21.1
put_hevc_epel_h6_8_c: 147.1
put_hevc_epel_h6_8_i8mm: 45.1
put_hevc_epel_h8_8_c: 237.4
put_hevc_epel_h8_8_i8mm: 72.1
put_hevc_epel_h12_8_c: 527.4
put_hevc_epel_h12_8_i8mm: 115.4
put_hevc_epel_h16_8_c: 943.6
put_hevc_epel_h16_8_i8mm: 153.9
put_hevc_epel_h24_8_c: 2105.4
put_hevc_epel_h24_8_i8mm: 384.4
put_hevc_epel_h32_8_c: 3631.4
put_hevc_epel_h32_8_i8mm: 519.9
put_hevc_epel_h48_8_c: 8082.1
put_hevc_epel_h48_8_i8mm: 1110.4
put_hevc_epel_h64_8_c: 14400.6
put_hevc_epel_h64_8_i8mm: 2057.1

put_hevc_qpel_h4_8_c: 124.9
put_hevc_qpel_h4_8_neon: 43.1
put_hevc_qpel_h4_8_i8mm: 33.1
put_hevc_qpel_h6_8_c: 269.4
put_hevc_qpel_h6_8_neon: 90.6
put_hevc_qpel_h6_8_i8mm: 61.4
put_hevc_qpel_h8_8_c: 477.6
put_hevc_qpel_h8_8_neon: 82.1
put_hevc_qpel_h8_8_i8mm: 99.9
put_hevc_qpel_h12_8_c: 1062.4
put_hevc_qpel_h12_8_neon: 226.9
put_hevc_qpel_h12_8_i8mm: 170.9
put_hevc_qpel_h16_8_c: 1880.6
put_hevc_qpel_h16_8_neon: 302.9
put_hevc_qpel_h16_8_i8mm: 251.4
put_hevc_qpel_h24_8_c: 4221.9
put_hevc_qpel_h24_8_neon: 893.9
put_hevc_qpel_h24_8_i8mm: 626.1
put_hevc_qpel_h32_8_c: 7437.6
put_hevc_qpel_h32_8_neon: 1189.9
put_hevc_qpel_h32_8_i8mm: 959.1
put_hevc_qpel_h48_8_c: 16838.4
put_hevc_qpel_h48_8_neon: 2727.9
put_hevc_qpel_h48_8_i8mm: 2163.9
put_hevc_qpel_h64_8_c: 29982.1
put_hevc_qpel_h64_8_neon: 4777.6

Signed-off-by: Martin Storsjö <martin@martin.st>
2023-07-14 21:19:12 +03:00
Logan Lyu
668eb4c00e lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_v
put_hevc_epel_uni_w_v4_8_c: 116.1
put_hevc_epel_uni_w_v4_8_neon: 48.6
put_hevc_epel_uni_w_v6_8_c: 248.9
put_hevc_epel_uni_w_v6_8_neon: 80.6
put_hevc_epel_uni_w_v8_8_c: 383.9
put_hevc_epel_uni_w_v8_8_neon: 91.9
put_hevc_epel_uni_w_v12_8_c: 806.1
put_hevc_epel_uni_w_v12_8_neon: 202.9
put_hevc_epel_uni_w_v16_8_c: 1411.1
put_hevc_epel_uni_w_v16_8_neon: 289.9
put_hevc_epel_uni_w_v24_8_c: 3168.9
put_hevc_epel_uni_w_v24_8_neon: 619.4
put_hevc_epel_uni_w_v32_8_c: 5632.9
put_hevc_epel_uni_w_v32_8_neon: 1161.1
put_hevc_epel_uni_w_v48_8_c: 12406.1
put_hevc_epel_uni_w_v48_8_neon: 2476.4
put_hevc_epel_uni_w_v64_8_c: 22001.4
put_hevc_epel_uni_w_v64_8_neon: 4343.9

Signed-off-by: Martin Storsjö <martin@martin.st>
2023-07-14 21:19:12 +03:00
Logan Lyu
0c604b1913 lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_h
put_hevc_epel_uni_w_h4_8_c: 126.1
put_hevc_epel_uni_w_h4_8_i8mm: 41.6
put_hevc_epel_uni_w_h6_8_c: 222.9
put_hevc_epel_uni_w_h6_8_i8mm: 91.4
put_hevc_epel_uni_w_h8_8_c: 374.4
put_hevc_epel_uni_w_h8_8_i8mm: 102.1
put_hevc_epel_uni_w_h12_8_c: 806.1
put_hevc_epel_uni_w_h12_8_i8mm: 225.6
put_hevc_epel_uni_w_h16_8_c: 1414.4
put_hevc_epel_uni_w_h16_8_i8mm: 333.4
put_hevc_epel_uni_w_h24_8_c: 3128.6
put_hevc_epel_uni_w_h24_8_i8mm: 713.1
put_hevc_epel_uni_w_h32_8_c: 5519.1
put_hevc_epel_uni_w_h32_8_i8mm: 1118.1
put_hevc_epel_uni_w_h48_8_c: 12364.4
put_hevc_epel_uni_w_h48_8_i8mm: 2541.1
put_hevc_epel_uni_w_h64_8_c: 21925.9
put_hevc_epel_uni_w_h64_8_i8mm: 4383.6

Signed-off-by: Martin Storsjö <martin@martin.st>
2023-07-14 21:19:12 +03:00
Logan Lyu
e652e7dcda lavc/aarch64: new optimization for 8-bit hevc_pel_uni_pixels
put_hevc_pel_uni_pixels4_8_c: 35.9
put_hevc_pel_uni_pixels4_8_neon: 7.6
put_hevc_pel_uni_pixels6_8_c: 46.1
put_hevc_pel_uni_pixels6_8_neon: 20.6
put_hevc_pel_uni_pixels8_8_c: 53.4
put_hevc_pel_uni_pixels8_8_neon: 11.6
put_hevc_pel_uni_pixels12_8_c: 89.1
put_hevc_pel_uni_pixels12_8_neon: 25.9
put_hevc_pel_uni_pixels16_8_c: 106.4
put_hevc_pel_uni_pixels16_8_neon: 20.4
put_hevc_pel_uni_pixels24_8_c: 137.6
put_hevc_pel_uni_pixels24_8_neon: 47.1
put_hevc_pel_uni_pixels32_8_c: 173.6
put_hevc_pel_uni_pixels32_8_neon: 54.1
put_hevc_pel_uni_pixels48_8_c: 268.1
put_hevc_pel_uni_pixels48_8_neon: 117.1
put_hevc_pel_uni_pixels64_8_c: 346.1
put_hevc_pel_uni_pixels64_8_neon: 205.9

Signed-off-by: Martin Storsjö <martin@martin.st>
2023-07-14 21:19:12 +03:00
Logan Lyu
e79686be96 lavc/aarch64: new optimization for 8-bit hevc_qpel_h hevc_qpel_uni_w_hv
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-06-06 12:50:18 +03:00
Logan Lyu
15972cce8c lavc/aarch64: new optimization for 8-bit hevc_qpel_uni_w_h
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-06-06 12:50:18 +03:00
Logan Lyu
0b7356c1b4 lavc/aarch64: new optimization for 8-bit hevc_pel_uni_w_pixels and qpel_uni_w_v
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-06-06 12:50:18 +03:00
xufuji456
bd2f00f665 codec/aarch64/hevc: add transform_luma_neon
got 56% speed up (run_count=1000, CPU=Cortex A53)
transform_4x4_luma_neon: 45 transform_4x4_luma_c: 103

Signed-off-by: xufuji456 <839789740@qq.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-04-14 12:07:57 +03:00
xufuji456
00a062b8d5 codec/aarch64/hevc:add idct_32x32_neon
got 73% speed up (run_count=1000, CPU=Cortex A53)
idct_32x32_neon: 4826 idct_32x32_c: 18236
idct_32x32_neon: 4824 idct_32x32_c: 18149
idct_32x32_neon: 4937 idct_32x32_c: 18333

Signed-off-by: Martin Storsjö <martin@martin.st>
2023-04-12 15:58:09 +03:00
J. Dekker
b564ad8eac lavc/aarch64: add hevc deblock chroma 8-12bit
Benched on Ampere Altra:

hevc_h_loop_filter_chroma8_c: 367.7
hevc_h_loop_filter_chroma8_neon: 31.0
hevc_h_loop_filter_chroma10_c: 396.7
hevc_h_loop_filter_chroma10_neon: 27.5
hevc_h_loop_filter_chroma12_c: 377.0
hevc_h_loop_filter_chroma12_neon: 31.7
hevc_v_loop_filter_chroma8_c: 369.0
hevc_v_loop_filter_chroma8_neon: 55.0
hevc_v_loop_filter_chroma10_c: 389.0
hevc_v_loop_filter_chroma10_neon: 54.0
hevc_v_loop_filter_chroma12_c: 389.5
hevc_v_loop_filter_chroma12_neon: 53.0

Signed-off-by: J. Dekker <jdek@itanimul.li>
2023-04-06 06:54:26 +02:00
J. Dekker
37cde570bc lavc/aarch64: add clip N macro
Signed-off-by: J. Dekker <jdek@itanimul.li>
2023-03-22 14:48:13 +01:00
xufuji456
4b4de07721 libavcodec/hevc: add hevc idct4x4 neon of aarch64
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-02-28 13:12:52 +02:00
Martin Storsjö
ec7fa13eb0 aarch64: hevcdsp_idct: Reuse preexisting macros for transposes
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-02-28 11:48:54 +02:00
Lynne
e0661fc805
dca_core: convert to lavu/tx
Thanks to Martin Storsjö <martin@martin.st> for fixing and testing the
arm32 and aarch64 changes.
2022-11-06 14:39:36 +01:00
J. Dekker
9bed814e1d lavc/aarch64: add hevc horizontal qpel/uni/bi
checkasm --benchmark on Ampere Altra (Neoverse N1):

put_hevc_qpel_bi_h4_8_c: 170.7
put_hevc_qpel_bi_h4_8_neon: 64.5
put_hevc_qpel_bi_h6_8_c: 373.7
put_hevc_qpel_bi_h6_8_neon: 130.2
put_hevc_qpel_bi_h8_8_c: 662.0
put_hevc_qpel_bi_h8_8_neon: 138.5
put_hevc_qpel_bi_h12_8_c: 1529.5
put_hevc_qpel_bi_h12_8_neon: 422.0
put_hevc_qpel_bi_h16_8_c: 2735.5
put_hevc_qpel_bi_h16_8_neon: 560.5
put_hevc_qpel_bi_h24_8_c: 6015.7
put_hevc_qpel_bi_h24_8_neon: 1636.0
put_hevc_qpel_bi_h32_8_c: 10779.0
put_hevc_qpel_bi_h32_8_neon: 2204.5
put_hevc_qpel_bi_h48_8_c: 24375.0
put_hevc_qpel_bi_h48_8_neon: 4984.0
put_hevc_qpel_bi_h64_8_c: 42768.0
put_hevc_qpel_bi_h64_8_neon: 8795.7
put_hevc_qpel_h4_8_c: 149.0
put_hevc_qpel_h4_8_neon: 55.7
put_hevc_qpel_h6_8_c: 321.2
put_hevc_qpel_h6_8_neon: 106.0
put_hevc_qpel_h8_8_c: 578.7
put_hevc_qpel_h8_8_neon: 133.2
put_hevc_qpel_h12_8_c: 1279.0
put_hevc_qpel_h12_8_neon: 391.7
put_hevc_qpel_h16_8_c: 2286.2
put_hevc_qpel_h16_8_neon: 519.7
put_hevc_qpel_h24_8_c: 5100.7
put_hevc_qpel_h24_8_neon: 1546.2
put_hevc_qpel_h32_8_c: 9022.0
put_hevc_qpel_h32_8_neon: 2060.2
put_hevc_qpel_h48_8_c: 20293.5
put_hevc_qpel_h48_8_neon: 4656.7
put_hevc_qpel_h64_8_c: 36037.0
put_hevc_qpel_h64_8_neon: 8262.7
put_hevc_qpel_uni_h4_8_c: 162.2
put_hevc_qpel_uni_h4_8_neon: 61.7
put_hevc_qpel_uni_h6_8_c: 355.2
put_hevc_qpel_uni_h6_8_neon: 114.2
put_hevc_qpel_uni_h8_8_c: 651.0
put_hevc_qpel_uni_h8_8_neon: 135.7
put_hevc_qpel_uni_h12_8_c: 1412.5
put_hevc_qpel_uni_h12_8_neon: 402.7
put_hevc_qpel_uni_h16_8_c: 2551.0
put_hevc_qpel_uni_h16_8_neon: 533.5
put_hevc_qpel_uni_h24_8_c: 5782.2
put_hevc_qpel_uni_h24_8_neon: 1578.7
put_hevc_qpel_uni_h32_8_c: 10586.5
put_hevc_qpel_uni_h32_8_neon: 2102.2
put_hevc_qpel_uni_h48_8_c: 23812.0
put_hevc_qpel_uni_h48_8_neon: 4739.5
put_hevc_qpel_uni_h64_8_c: 42958.7
put_hevc_qpel_uni_h64_8_neon: 8366.5

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-10-25 14:56:38 +02:00
Reimar Döffinger
38cd829dce
aarch64: Implement stack spilling in a consistent way.
Currently it is done in several different ways, which
might cause needless dependencies or in case of
tx_float_neon.S is incorrect.

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
2022-10-11 09:12:02 +02:00
Grzegorz Bernacki
8f4b000c37 lavc/aarch64: Add neon implementation for vsse_intra8
Provide optimized implementation for vsse_intra8 for arm64.

Performance tests are shown below.
- vsse_5_c: 87.7
- vsse_5_neon: 26.2

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Co-authored-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-10-04 13:24:20 +03:00
Grzegorz Bernacki
bad67cb9fd lavc/aarch64: Provide optimized implementation of vsse8 for arm64.
Provide optimized implementation of vsse8 for arm64.

Performance comparison tests are shown below.
- vsse_1_c: 141.5
- vsse_1_neon: 32.5

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-10-04 13:24:20 +03:00
Grzegorz Bernacki
faea56c9c7 lavc/aarch64: Provide neon implementation of nsse8
Add vectorized implementation of nsse8 function.

Performance comparison tests are shown below.
- nsse_1_c: 256.0
- nsse_1_neon: 82.7

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-10-04 13:24:20 +03:00
Grzegorz Bernacki
f401a2af21 lavc/aarch64: Add neon implementation for pix_abs8 functions.
Provide optimized implementation of pix_abs8 function for arm64.

Performance comparison tests are shown below:
pix_abs_1_1_c: 162.5
pix_abs_1_1_neon: 27.0
pix_abs_1_2_c: 174.0
pix_abs_1_2_neon: 23.5
pix_abs_1_3_c: 203.2
pix_abs_1_3_neon: 34.7

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Co-authored-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-10-04 13:24:04 +03:00
Martin Storsjö
8089fe072e aarch64: me_cmp: Avoid using the non-unrolled codepath for the minimum unroll size
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-29 10:29:11 +03:00
Martin Storsjö
6f2ad7f951 aarch64: me_cmp: Avoid redundant loads in ff_pix_abs16_y2_neon
This avoids one redundant load per row; pix3 from the previous
iteration can be used as pix2 in the next one.

Before:       Cortex A53    A72    A73
pix_abs_0_2_neon:  138.0   59.7   48.0
After:
pix_abs_0_2_neon:  109.7   50.2   39.5

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-29 10:29:10 +03:00
Andreas Rheinhardt
9beba05311 avcodec/fmtconvert: Remove unused AVCodecContext parameter
Unused since d74a8cb7e4.

Reviewed-by: Rémi Denis-Courmont <remi@remlab.net>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-09-21 20:26:40 +02:00
Hubert Mazur
b2732115dd lavc/aarch64: Add neon implementation for pix_median_abs8
Provide optimized implementation for pix_median_abs8 function.

Performance comparison tests are shown below.
- median_sad_1_c: 277.0
- median_sad_1_neon: 82.0

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-21 12:57:56 +03:00
Hubert Mazur
e9a6170213 lavc/aarch64: Add neon implementation for vsad8_intra
Provide optimized implementation for vsad8_intra function.

Performance comparison tests are shown below.
- vsad_5_c: 94.7
- vsad_5_neon: 20.7

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-21 12:57:56 +03:00
Hubert Mazur
0ee535b1db lavc/aarch64: Add neon implementation for pix_median_abs16
Provide optimized implementation for pix_median_abs16 function.

Performance comparison tests are shown below.
 - median_sad_0_c: 720.5
 - median_sad_0_neon: 127.2

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-21 12:57:56 +03:00
Rémi Denis-Courmont
b52034270a lavc/vorbisdsp: use ptrdiff_t rather than intptr_t
... for a difference between pointers.
2022-09-19 13:51:00 -03:00
Andreas Rheinhardt
a54e53a1c4 avcodec/vp8dsp: Constify src in vp8_mc_func
Reviewed-by: Peter Ross <pross@xvid.org>
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-09-11 20:57:51 +02:00
Hubert Mazur
06b98e396a lavc/aarch64: Provide neon implementation of nsse16
Add vectorized implementation of nsse16 function.

Performance comparison tests are shown below.
- nsse_0_c: 682.2
- nsse_0_neon: 116.5

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Co-authored-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-09 10:19:46 +03:00
Hubert Mazur
908abe8032 lavc/aarch64: Add neon implementation for vsse_intra16
Provide optimized implementation for vsse_intra16 for arm64.

Performance tests are shown below.
- vsse_4_c: 155.2
- vsse_4_neon: 36.2

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-09 10:19:46 +03:00
Hubert Mazur
ce03ea3e79 lavc/aarch64: Add neon implementation for vsad_intra16
Provide optimized implementation for vsad_intra16 function for arm64.

Performance comparison tests are shown below.
- vsad_4_c: 177.5
- vsad_4_neon: 23.5

Benchmarks and tests are run with checkasm tool on AWS Gravtion 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-09 10:19:46 +03:00
Hubert Mazur
c495a4b32d lavc/aarch64: Add neon implementation of vsse16
Provide optimized implementation of vsse16 for arm64.

Performance comparison tests are shown below.
- vsse_0_c: 257.7
- vsse_0_neon: 59.2

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-09 10:19:46 +03:00
Hubert Mazur
200f5e578f lavc/aarch64: Add neon implementation for vsad16
Provide optimized implementation of vsad16 function for arm64.

Performance comparison tests are shown below.
- vsad_0_c: 285.2
- vsad_0_neon: 39.5

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Co-authored-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-09 10:19:46 +03:00
Lynne
f99d15cca0
arm/fft: disable NEON optimizations for 131072pt transforms
This has been broken since the start, and it was only discovered
when I started testing my replacement for the FFT.
Disable it, since there's no point in fixing slower code that's about
to be removed anyway.

The vfp version is not affected.
2022-08-29 07:13:43 +02:00
J. Dekker
ce2f47318b lavc/aarch64: hevc_add_res add 12bit variants
hevc_add_res_4x4_12_c: 46.0
hevc_add_res_4x4_12_neon: 18.7
hevc_add_res_8x8_12_c: 194.7
hevc_add_res_8x8_12_neon: 25.2
hevc_add_res_16x16_12_c: 716.0
hevc_add_res_16x16_12_neon: 69.7
hevc_add_res_32x32_12_c: 3820.7
hevc_add_res_32x32_12_neon: 261.0

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-08-18 15:04:43 +02:00
Martin Storsjö
48be6616d0 aarch64: me_cmp: Remove a leftover unnecessary instruction
This was missed in a2e45ad407.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:14:53 +03:00
Hubert Mazur
70efa4d011 lavc/aarch64: Add neon implementation for pix_abs8
Provide optimized implementation of pix_abs8 function for arm64.

Performance comparison tests are shown below.
- pix_abs_1_0_c: 101.2
- pix_abs_1_0_neon: 22.5
- sad_1_c: 101.2
- sad_1_neon: 22.5

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
Hubert Mazur
74312e80d7 lavc/aarch64: Add neon implementation for sse8
Provide optimized implementation of sse8 function for arm64.

Performance comparison tests are shown below.
- sse_1_c: 130.7
- sse_1_neon: 29.7

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
Hubert Mazur
a2e45ad407 lavc/aarch64: Add neon implementation for pix_abs16_y2
Provide optimized implementation of pix_abs16_y2 function for arm64.

Performance comparison tests are shown below.
pix_abs_0_2_c: 317.2
pix_abs_0_2_neon: 37.5

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
Hubert Mazur
d7abb7d143 lavc/aarch64: Add neon implementation for sse4
Provide neon implementation for sse4 function.

Performance comparison tests are shown below.
- sse_2_c: 80.7
- sse_2_neon: 31.0

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
Hubert Mazur
ad251fd262 lavc/aarch64: Add neon implementation for sse16
Provide neon implementation for sse16 function.

Performance comparison tests are shown below.
- sse_0_c: 268.2
- sse_0_neon: 43.5

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
Martin Storsjö
60109d5b3d aarch64: me_cmp: Fix the indentation of function declarations
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
J. Dekker
aa9eabb7a5 lavc/aarch64: reformat add_res funcs
Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-08-16 14:00:34 +02:00
Andreas Rheinhardt
333b32af8e avcodec/h264chroma: Constify src in h264_chroma_mc_func
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-08-05 03:02:13 +02:00
Andreas Rheinhardt
b3bbbb14d0 avcodec/hevcdsp: Constify src pointers
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-08-05 02:54:04 +02:00
Andreas Rheinhardt
abb85429f3 avcodec/me_cmp: Constify me_cmp_func buffer parameters
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-07-31 03:31:53 +02:00
Andreas Rheinhardt
af43da3e4d avcodec/videodsp: Constify buf in VideoDSPContext.prefetch
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-07-31 03:14:34 +02:00
Martin Storsjö
4136405c86 aarch64: me_cmp: Don't do uaddlv once per iteration
The max height is currently documented as 16; the max difference per
pixel is 255, and a .8h element can easily contain 16*255, thus keep
accumulating in two .8h vectors, and just do the final accumulationat the
end. This should work for heights up to 256.

This requires a minor register renumbering in ff_pix_abs16_xy2_neon.

Before:       Cortex A53    A72    A73   Graviton 3
pix_abs_0_0_neon:   97.7   47.0   37.5   22.7
pix_abs_0_1_neon:  154.0   59.0   52.0   25.0
pix_abs_0_3_neon:  179.7   96.7   87.5   41.2
After:
pix_abs_0_0_neon:   96.0   39.2   31.2   22.0
pix_abs_0_1_neon:  150.7   59.7   46.2   23.7
pix_abs_0_3_neon:  175.7   83.7   81.7   38.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-16 17:26:17 +03:00
Martin Storsjö
68a03f6424 aarch64: me_cmp: Switch from uabd to uabal in ff_pix_abs16_xy2_neon
Using absolute-difference-accumulate does use twice the amount of
absolute-difference instructions, but avoids the need for the
uaddl and add instructions, reducing the total number of instructions
by 3.

These can be interleaved in the rest of the calculation, to avoid
tight dependencies at the end. Unfortunately, this is marginally
slower on Cortex A53, but faster on A72 and A73.

Before:       Cortex A53    A72    A73   Graviton 3
pix_abs_0_3_neon:  175.7  109.2   92.0   41.2
After:
pix_abs_0_3_neon:  179.7   96.7   87.5   41.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-16 17:25:54 +03:00
Martin Storsjö
b46de9aba4 aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon
Before:       Cortex A53    A72    A73   Graviton 3
pix_abs_0_3_neon:  183.7  112.7   97.5   41.2
After:
pix_abs_0_3_neon:  175.7  109.2   92.0   41.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-16 17:25:44 +03:00
Martin Storsjö
02e7853fd9 libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon
Checkasm doesn't currently test this codepath.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-16 17:25:11 +03:00
Hubert Mazur
01e190dc99 lavc/aarch64: Add pix_abs16_x2 neon implementation
Provide neon implementation for pix_abs16_x2 function.

Performance tests of implementation are below.
 - pix_abs_0_1_c: 283.5
 - pix_abs_0_1_neon: 39.0

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-13 23:25:22 +03:00
Hubert Mazur
eb7ab3928f lavc/aarch64: Hook up the existing ff_pix_abs16_neon to the sad[0] function pointer
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-11 23:58:28 +03:00
Swinney, Jonathan
c471cc7474 lavc/aarch64: motion estimation functions in neon
- ff_pix_abs16_neon
 - ff_pix_abs16_xy2_neon

In direct micro benchmarks of these ff functions verses their C implementations,
these functions performed as follows on AWS Graviton 3.

ff_pix_abs16_neon:
pix_abs_0_0_c: 141.1
pix_abs_0_0_neon: 19.6

ff_pix_abs16_xy2_neon:
pix_abs_0_3_c: 269.1
pix_abs_0_3_neon: 39.3

Tested with:
./tests/checkasm/checkasm --test=motion --bench --disable-linux-perf

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-06-28 00:51:39 +03:00
J. Dekker
3c694967f8 lavc/aarch64: hevc_sao reschedule slightly
Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-05-26 08:10:41 +02:00
J. Dekker
2e832be322 lavc/aarch64: add hevc sao edge 8x8
bench on AWS Graviton:

hevc_sao_edge_8x8_8_c: 516.0
hevc_sao_edge_8x8_8_neon: 81.0

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-05-25 08:04:46 +02:00
J. Dekker
92f67e4017 lavc/aarch64: add hevc sao edge 16x16
bench on AWS Graviton:

hevc_sao_edge_16x16_8_c: 1857.0
hevc_sao_edge_16x16_8_neon: 211.0
hevc_sao_edge_32x32_8_c: 7802.2
hevc_sao_edge_32x32_8_neon: 808.2
hevc_sao_edge_48x48_8_c: 16764.2
hevc_sao_edge_48x48_8_neon: 1796.5
hevc_sao_edge_64x64_8_c: 32647.5
hevc_sao_edge_64x64_8_neon: 3118.5

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-05-25 08:04:39 +02:00
J. Dekker
d957ee34a6 lavc/aarch64: fix hevc sao band filter
The SAO band filter can be called with non-multiples of 8, we round up
to the nearest multiple of 8 to account for this.

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-05-25 08:04:35 +02:00
Andre Kempe
861285c146 arm64: Fix wrong BTI landing pad
This patch fixes a wrong type of BTI landing pad when branching to
functions instantiated via the fft*_neon macro.

Although the previously employed paciasp instruction serves as a landing
pad, for the ways that this function is invoked it is the wrong type, resulting
in an unexpected termination of the running process.

Signed-off-by: André Kempe <andre.kempe@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-26 10:26:49 +03:00
Ben Avison
6eee650289 avcodec/vc1: Arm 64-bit NEON unescape fast path
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

vc1dsp.vc1_unescape_buffer_c: 655617.7
vc1dsp.vc1_unescape_buffer_neon: 118237.0

Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01 10:03:34 +03:00
Ben Avison
5379412ed0 avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

idctdsp.add_pixels_clamped_c: 313.3
idctdsp.add_pixels_clamped_neon: 24.3
idctdsp.put_pixels_clamped_c: 220.3
idctdsp.put_pixels_clamped_neon: 15.5
idctdsp.put_signed_pixels_clamped_c: 210.5
idctdsp.put_signed_pixels_clamped_neon: 19.5

Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01 10:03:34 +03:00
Ben Avison
501fdc017d avcodec/vc1: Arm 64-bit NEON inverse transform fast paths
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

vc1dsp.vc1_inv_trans_4x4_c: 158.2
vc1dsp.vc1_inv_trans_4x4_neon: 65.7
vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5
vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5
vc1dsp.vc1_inv_trans_4x8_c: 335.2
vc1dsp.vc1_inv_trans_4x8_neon: 106.2
vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2
vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5
vc1dsp.vc1_inv_trans_8x4_c: 365.7
vc1dsp.vc1_inv_trans_8x4_neon: 97.2
vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7
vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5
vc1dsp.vc1_inv_trans_8x8_c: 547.7
vc1dsp.vc1_inv_trans_8x8_neon: 137.0
vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2
vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5

Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01 10:03:34 +03:00
Ben Avison
c62bbd4d20 avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C
version can still outperform the NEON version in specific cases. The balance
between different code paths is stream-dependent, but in practice the best
case happens about 5% of the time, the worst case happens about 40% of the
time, and the complexity of the remaining cases fall somewhere in between.
Therefore, taking the average of the best and worst case timings is
probably a conservative estimate of the degree by which the NEON code
improves performance.

vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7
vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5
vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5
vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7
vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2
vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2
vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2
vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2
vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0
vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7
vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7
vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5
vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7
vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0
vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7
vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0
vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2
vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7
vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0
vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2
vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0
vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0
vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2
vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2

Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01 10:03:33 +03:00
Martin Storsjö
a78f136f3f configure: Use a separate config_components.h header for $ALL_COMPONENTS
This avoids unnecessary rebuilds of most source files if only the
list of enabled components has changed, but not the other properties
of the build, set in config.h.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-03-16 14:12:49 +02:00
Andre Kempe
248986a0db arm64: Add Armv8.3-A PAC support to assembly files
This patch adds optional support for Arm Pointer Authentication Codes.

PAC support is turned on or off at compile time using additional
compiler flags. Unless any of these is enabled explicitly, no additional
code will be emitted at all.

Signed-off-by: André Kempe <andre.kempe@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-03-09 15:04:25 +02:00
Andreas Rheinhardt
52e9113695 avcodec/aarch64/idct: Add missing stddef
Fixes checkheaders on aarch64.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-02-21 13:10:04 +01:00
Martin Storsjö
402784ba9f aarch64: h264dsp: Fix incorrectly indented code
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-02-11 10:49:12 +02:00
Martin Storsjö
24b93022fe aarch64: Disable ff_hevc_sao_band_filter_8x8_8_neon out of precaution
While this function on its own passes all of fate-hevc, there's
indications that the function might need to handle widths that
aren't a multiple of 8 (noted in commit
f63f9be37c, which later was
reverted).

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-01-07 22:33:27 +02:00
Martin Storsjö
16fba44b4d Revert "lavc/aarch64: add hevc sao edge 16x16"
This reverts commit a9214a2ca3, as
it breaks fate-hevc.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-01-07 22:33:23 +02:00
Martin Storsjö
df48b1d06f Revert "lavc/aarch64: add hevc sao edge 8x8"
This reverts commit c97ffc1a77, as
it breaks fate-hevc.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-01-07 22:33:19 +02:00
Martin Storsjö
cafed377eb Revert "lavc/aarch64: add hevc sao band 8x8 tiling"
This reverts commit f63f9be37c, as
it breaks fate-hevc.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-01-07 22:33:14 +02:00
J. Dekker
f63f9be37c lavc/aarch64: add hevc sao band 8x8 tiling
bench on AWS Graviton:

hevc_sao_band_8x8_8_c: 317.5
hevc_sao_band_8x8_8_neon: 97.5
hevc_sao_band_16x16_8_c: 1115.0
hevc_sao_band_16x16_8_neon: 322.7
hevc_sao_band_32x32_8_c: 4599.2
hevc_sao_band_32x32_8_neon: 1246.2
hevc_sao_band_48x48_8_c: 10021.7
hevc_sao_band_48x48_8_neon: 2740.5
hevc_sao_band_64x64_8_c: 17635.0
hevc_sao_band_64x64_8_neon: 4875.7

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-01-04 14:32:26 +01:00
J. Dekker
89a2ed4a8b lavc/aarch64: clean-up sao band 8x8 function formatting
Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-01-04 14:31:56 +01:00
J. Dekker
c97ffc1a77 lavc/aarch64: add hevc sao edge 8x8
bench on AWS Graviton:

hevc_sao_edge_8x8_8_c: 516.0
hevc_sao_edge_8x8_8_neon: 81.0

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-01-04 14:31:56 +01:00
J. Dekker
a9214a2ca3 lavc/aarch64: add hevc sao edge 16x16
bench on AWS Graviton:

hevc_sao_edge_16x16_8_c: 1857.0
hevc_sao_edge_16x16_8_neon: 211.0
hevc_sao_edge_32x32_8_c: 7802.2
hevc_sao_edge_32x32_8_neon: 808.2
hevc_sao_edge_48x48_8_c: 16764.2
hevc_sao_edge_48x48_8_neon: 1796.5
hevc_sao_edge_64x64_8_c: 32647.5
hevc_sao_edge_64x64_8_neon: 3118.5

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-01-04 14:31:56 +01:00
Jonathan Wright
08b4716a9e aarch64: Add Armv8.5-A BTI support
Add Branch Target Identifiers (BTIs) to all functions defined in
AArch64 assembly files. Most of the BTI landing pads are added
automatically by the 'function' macro.

BTI support is turned on or off at compile time based on the presence
of the __ARM_FEATURE_BTI_DEFAULT feature macro.

A binary compiled with BTI support can be executed on an Armv8-A
processor without BTI support because the instructions are defined in
NOP space.

Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Elijah Ahmad <elijah.ahmad@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-11-16 13:43:56 +02:00
Jonathan Wright
6f04cf54f5 aarch64: Use ret x<n> instead of br x<n> where possible
Change AArch64 assembly code to use:
  ret     x<n>
instead of:
  br      x<n>

"ret x<n>" is already used in a lot of places so this patch makes it
consistent across the code base. This does not change behavior or
performance.

In addition, this change reduces the number of landing pads needed in
a subsequent patch to support the Armv8.5-A Branch Target
Identification (BTI) security feature.

Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-11-16 13:43:56 +02:00
Martin Storsjö
fd3bd5c492 aarch64: h264qpel: Do vertical filtering without transposing
This gives rather big speedups on these functions:

Before:
put_h264_qpel_8_mc01_8_neon:     241.0   131.5   138.7
put_h264_qpel_8_mc02_8_neon:     214.7   121.2   127.5
put_h264_qpel_8_mc03_8_neon:     242.5   131.2   135.7
put_h264_qpel_8_mc11_8_neon:     421.2   218.7   251.0
put_h264_qpel_8_mc12_8_neon:     878.0   509.5   537.5
put_h264_qpel_8_mc13_8_neon:     423.7   217.0   252.0
put_h264_qpel_8_mc21_8_neon:     858.2   479.5   514.0
put_h264_qpel_8_mc22_8_neon:     649.7   385.2   403.0
put_h264_qpel_8_mc23_8_neon:     860.2   476.5   517.7
put_h264_qpel_8_mc31_8_neon:     437.2   219.5   252.5
put_h264_qpel_8_mc32_8_neon:     892.5   510.5   546.0
put_h264_qpel_8_mc33_8_neon:     438.2   218.5   257.0
put_h264_qpel_16_mc01_8_neon:    944.2   509.7   546.7
put_h264_qpel_16_mc02_8_neon:    878.7   469.5   509.7
put_h264_qpel_16_mc03_8_neon:    945.7   510.7   557.0
put_h264_qpel_16_mc11_8_neon:   1663.2   858.5   979.5
put_h264_qpel_16_mc12_8_neon:   3510.2  2027.7  2112.7
put_h264_qpel_16_mc13_8_neon:   1664.7   857.5   980.5
put_h264_qpel_16_mc21_8_neon:   3366.2  1928.5  2030.5
put_h264_qpel_16_mc22_8_neon:   2584.7  1514.7  1590.2
put_h264_qpel_16_mc23_8_neon:   3367.7  1927.7  2035.0
put_h264_qpel_16_mc31_8_neon:   1716.7   849.7   997.0
put_h264_qpel_16_mc32_8_neon:   3564.0  2044.2  3835.2
put_h264_qpel_16_mc33_8_neon:   1717.7   863.0   989.5

After:
put_h264_qpel_8_mc01_8_neon:     136.0    73.7    76.0
put_h264_qpel_8_mc02_8_neon:     108.7    65.0    64.0
put_h264_qpel_8_mc03_8_neon:     137.5    72.7    73.0
put_h264_qpel_8_mc11_8_neon:     316.2   159.0   188.5
put_h264_qpel_8_mc12_8_neon:     653.0   375.5   384.7
put_h264_qpel_8_mc13_8_neon:     318.7   165.5   189.5
put_h264_qpel_8_mc21_8_neon:     739.2   385.7   432.5
put_h264_qpel_8_mc22_8_neon:     530.7   295.5   309.5
put_h264_qpel_8_mc23_8_neon:     741.2   393.7   421.0
put_h264_qpel_8_mc31_8_neon:     332.2   162.5   190.0
put_h264_qpel_8_mc32_8_neon:     667.5   378.2   390.5
put_h264_qpel_8_mc33_8_neon:     332.7   166.5   195.5
put_h264_qpel_16_mc01_8_neon:    524.2   285.2   294.0
put_h264_qpel_16_mc02_8_neon:    454.7   252.2   250.2
put_h264_qpel_16_mc03_8_neon:    525.7   286.0   283.0
put_h264_qpel_16_mc11_8_neon:   1243.2   630.7   726.7
put_h264_qpel_16_mc12_8_neon:   2610.2  1479.7  1481.2
put_h264_qpel_16_mc13_8_neon:   1250.5   631.7   727.7
put_h264_qpel_16_mc21_8_neon:   2890.2  1571.2  1679.7
put_h264_qpel_16_mc22_8_neon:   2108.7  1177.5  1223.5
put_h264_qpel_16_mc23_8_neon:   2891.7  1578.7  1667.7
put_h264_qpel_16_mc31_8_neon:   1296.7   630.5   752.5
put_h264_qpel_16_mc32_8_neon:   2664.0  1483.2  1503.5
put_h264_qpel_16_mc33_8_neon:   1297.7   632.5   747.2

I.e. overall a 20%-60% reduction in runtime of these
functions.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-10-18 14:27:58 +03:00
Martin Storsjö
2d5a7f6d00 arm/aarch64: Improve scheduling in the avg form of h264_qpel
Don't use the loaded registers directly, avoiding stalls on in
order cores. Use vrhadd.u8 with q registers where easily possible.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-10-18 14:27:36 +03:00
Zhao Zhili
378ad2f8fd lavc/aarch64: fix relocation out of range error
Use a temporary label instead of global function symbol for b.gt.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-09-25 21:55:29 +03:00
Mikhail Nitenko
d3e56b56ae lavc/aarch64: add pred functions for 10-bit
Benchmarks:                        A53     A72
pred8x8_dc_10_c:                   64.2    49.5
pred8x8_dc_10_neon:                62.0    53.7
pred8x8_dc_128_10_c:               26.0    14.0
pred8x8_dc_128_10_neon:            30.7    17.5
pred8x8_horizontal_10_c:           60.0    27.7
pred8x8_horizontal_10_neon:        38.0    34.0
pred8x8_left_dc_10_c:              42.5    27.5
pred8x8_left_dc_10_neon:           51.0    41.2
pred8x8_mad_cow_dc_0l0_10_c:       55.7    37.2
pred8x8_mad_cow_dc_0l0_10_neon:    50.2    35.2
pred8x8_mad_cow_dc_0lt_10_c:       89.2    67.0
pred8x8_mad_cow_dc_0lt_10_neon:    52.2    46.7
pred8x8_mad_cow_dc_l0t_10_c:       74.7    51.0
pred8x8_mad_cow_dc_l0t_10_neon:    50.5    45.2
pred8x8_mad_cow_dc_l00_10_c:       58.0    38.0
pred8x8_mad_cow_dc_l00_10_neon:    42.5    37.5
pred8x8_plane_10_c:               354.0   288.7
pred8x8_plane_10_neon:            141.0   101.2
pred8x8_top_dc_10_c:               44.5    30.5
pred8x8_top_dc_10_neon:            40.0    31.0
pred8x8_vertical_10_c:             27.5    14.5
pred8x8_vertical_10_neon:          21.0    17.5
pred16x16_plane_10_c:            1242.0  1070.5
pred16x16_plane_10_neon:          324.0   196.7

Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-21 00:06:26 +03:00
Mikhail Nitenko
43ca887bc2 lavc/aarch64: h264, add chroma loop filters for 10bit
Benchmarks:                                             A53     A72
h264_h_loop_filter_chroma422_10bpp_c:                  282.7   114.2
h264_h_loop_filter_chroma422_10bpp_neon:               109.5    78.5
h264_h_loop_filter_chroma_10bpp_c:                     165.0    81.5
h264_h_loop_filter_chroma_10bpp_neon:                  120.0    76.7
h264_h_loop_filter_chroma_intra422_10bpp_c:            323.7   124.2
h264_h_loop_filter_chroma_intra422_10bpp_neon:         155.0   102.7
h264_h_loop_filter_chroma_intra_10bpp_c:               121.0    49.5
h264_h_loop_filter_chroma_intra_10bpp_neon:             79.7    53.7
h264_h_loop_filter_chroma_mbaff422_10bpp_c:            188.5    75.0
h264_h_loop_filter_chroma_mbaff422_10bpp_neon:         120.0    75.5
h264_h_loop_filter_chroma_mbaff_intra422_10bpp_c:      116.7    46.0
h264_h_loop_filter_chroma_mbaff_intra422_10bpp_neon:    79.7    53.7
h264_h_loop_filter_chroma_mbaff_intra_10bpp_c:          63.0    27.2
h264_h_loop_filter_chroma_mbaff_intra_10bpp_neon:       48.5    34.0
h264_v_loop_filter_chroma_10bpp_c:                     258.7   135.5
h264_v_loop_filter_chroma_10bpp_neon:                   71.2    51.0
h264_v_loop_filter_chroma_intra_10bpp_c:               158.0    70.7
h264_v_loop_filter_chroma_intra_10bpp_neon:             48.7    31.5

Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-21 00:06:26 +03:00
Mikhail Nitenko
756d2e087a lavc/aarch64: move transpose_4x8H to neon.S
transpose_4x8H was declared in vp9lpf_16bpp_neon, however this macro is
not unique to vp9 and could be used elsewhere.

Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-21 00:06:26 +03:00
Martin Storsjö
c60b76d0c8 aarch64: h264dsp: Fix indentation of some functions to match the rest
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-08 23:18:43 +03:00
Martin Storsjö
e86ec831b0 aarch64: h264dsp: Remove unnecessary sign extensions
These became unnecessary when the stride arguments were changed from
int to ptrdiff_t in bc26fe8927
(0576ef466d) and
d5d699ab6e
(aa844dc46f).

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-08 23:18:43 +03:00
Andreas Rheinhardt
afc95a10ac avcodec/h264dsp, h264idct: Fix lengths of array parameters
Fixes many -Warray-parameter warnings from GCC 11.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-08-08 17:44:57 +02:00
Martin Storsjö
f27e3ccf06 aarch64: hevc_idct: Fix overflows in idct_dc
This is marginally slower, but correct for all input values.
The previous implementation failed with certain input seeds, e.g.
"checkasm --test=hevc_idct 98".

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-05-22 00:08:03 +03:00
Andreas Rheinhardt
7c1f347b18 avcodec: Remove deprecated old encode/decode APIs
Deprecated in commits 7fc329e2dd
and 31f6a4b4b8.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2021-04-27 10:43:12 -03:00
Andreas Rheinhardt
f3c197b129 Include attributes.h directly
Some files currently rely on libavutil/cpu.h to include it for them;
yet said file won't use include it any more after the currently
deprecated functions are removed, so include attributes.h directly.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-04-19 14:34:10 +02:00
Mikhail Nitenko
84ac1440b2 lavc/aarch64: add pred16x16 10-bit functions
Benchmarks:                      A53     A72
pred16x16_dc_10_c:              136.0   124.0
pred16x16_dc_10_neon:           121.2   106.0
pred16x16_horizontal_10_c:      155.0    73.2
pred16x16_horizontal_10_neon:    82.2    67.7
pred16x16_top_dc_10_c:          106.0    93.7
pred16x16_top_dc_10_neon:        87.7    77.2
pred16x16_vertical_10_c:         83.0    67.7
pred16x16_vertical_10_neon:      54.2    61.7

Some functions work slower than C and are left commented out.
2021-04-19 09:01:14 +02:00
Mikhail Nitenko
6b2e7dc828 lavc/aarch64: change h264pred_init structure
Change structure to allow the addition of other bit depths.
2021-04-19 09:00:58 +02:00
Martin Storsjö
870bfe16a1 aarch64: h264pred: Optimize the inner loop of existing 8 bit functions
Move the loop counter decrement further from the branch instruction,
this hides the latency of the decrement.

In loops that first load, then store (the horizontal prediction cases),
do the decrement after the load (where the next instruction would
stall a bit anyway, waiting for the result of the load).

In loops that store twice using the same destination register,
also do the decrement between the two stores (as the second store
would need to wait for the updated destination register from the
first instruction).

In loops that store twice to two different destination registers,
do the decrement before both stores, to do it as soon before the
branch as possible.

This gives minor (1-2 cycle) speedups in most cases (modulo measurement
noise), but the horizontal prediction functions get a rather notable
speedup on the Cortex A53.

Before:                     Cortex A53     A72     A73
pred8x8_dc_8_neon:                60.7    46.2    39.2
pred8x8_dc_128_8_neon:            30.7    18.0    14.0
pred8x8_horizontal_8_neon:        42.2    29.2    18.5
pred8x8_left_dc_8_neon:           52.7    36.2    32.2
pred8x8_mad_cow_dc_0l0_8_neon:    48.2    27.7    25.7
pred8x8_mad_cow_dc_0lt_8_neon:    52.5    33.2    34.7
pred8x8_mad_cow_dc_l0t_8_neon:    52.5    31.7    33.2
pred8x8_mad_cow_dc_l00_8_neon:    43.2    27.0    25.5
pred8x8_plane_8_neon:            112.2    86.2    88.2
pred8x8_top_dc_8_neon:            40.7    23.0    21.2
pred8x8_vertical_8_neon:          27.2    15.5    14.0
pred16x16_dc_8_neon:              91.0    73.2    70.5
pred16x16_dc_128_8_neon:          43.0    34.7    30.7
pred16x16_horizontal_8_neon:      86.0    49.7    44.7
pred16x16_left_dc_8_neon:         87.0    67.2    67.5
pred16x16_plane_8_neon:          236.0   175.7   173.0
pred16x16_top_dc_8_neon:          53.2    39.0    41.7
pred16x16_vertical_8_neon:        41.7    29.7    31.0

After:
pred8x8_dc_8_neon:                59.0    46.7    42.5
pred8x8_dc_128_8_neon:            28.2    18.0    14.0
pred8x8_horizontal_8_neon:        34.2    29.2    18.5
pred8x8_left_dc_8_neon:           51.0    38.2    32.7
pred8x8_mad_cow_dc_0l0_8_neon:    46.7    28.2    26.2
pred8x8_mad_cow_dc_0lt_8_neon:    55.2    33.7    37.5
pred8x8_mad_cow_dc_l0t_8_neon:    51.2    31.7    37.2
pred8x8_mad_cow_dc_l00_8_neon:    41.7    27.5    26.0
pred8x8_plane_8_neon:            111.5    86.5    89.5
pred8x8_top_dc_8_neon:            39.0    23.2    21.0
pred8x8_vertical_8_neon:          27.2    16.0    14.0
pred16x16_dc_8_neon:              85.0    70.2    70.5
pred16x16_dc_128_8_neon:          42.0    30.0    30.7
pred16x16_horizontal_8_neon:      66.5    49.5    42.5
pred16x16_left_dc_8_neon:         81.0    66.5    67.5
pred16x16_plane_8_neon:          235.0   175.7   173.0
pred16x16_top_dc_8_neon:          52.0    39.0    41.7
pred16x16_vertical_8_neon:        40.2    33.2    31.0

Despite this, a number of these functions still are slower than
what e.g. GCC 7 generates - this shows the relative speedup of the
neon codepaths over the compiler generated ones:

                           Cortex A53    A72    A73
pred8x8_dc_8_neon:               0.86   0.65   1.04
pred8x8_dc_128_8_neon:           0.59   0.44   0.62
pred8x8_horizontal_8_neon:       1.51   0.58   1.30
pred8x8_left_dc_8_neon:          0.72   0.56   0.89
pred8x8_mad_cow_dc_0l0_8_neon:   0.93   0.93   1.37
pred8x8_mad_cow_dc_0lt_8_neon:   1.37   1.41   1.68
pred8x8_mad_cow_dc_l0t_8_neon:   1.21   1.17   1.32
pred8x8_mad_cow_dc_l00_8_neon:   1.24   1.19   1.60
pred8x8_plane_8_neon:            3.36   3.58   3.76
pred8x8_top_dc_8_neon:           0.97   0.99   1.43
pred8x8_vertical_8_neon:         0.86   0.78   1.18
pred16x16_dc_8_neon:             1.20   1.06   1.49
pred16x16_dc_128_8_neon:         0.83   0.95   0.99
pred16x16_horizontal_8_neon:     1.78   0.96   1.59
pred16x16_left_dc_8_neon:        1.06   0.96   1.32
pred16x16_plane_8_neon:          5.78   6.49   7.19
pred16x16_top_dc_8_neon:         1.48   1.53   1.94
pred16x16_vertical_8_neon:       1.39   1.34   1.98

In particular, on Cortex A72, many of these functions are slower
than the compiler generated code, while they're more beneficial on
e.g. the Cortex A73.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-04-14 15:23:44 +03:00
James Almer
f1a894f9d3 avcodec: add missing FF_API_OLD_ENCDEC wrappers to xmm clobber functions
Signed-off-by: James Almer <jamrial@gmail.com>
2021-02-26 19:26:31 -03:00
Josh Dekker
7ac41e0db2 lavc/aarch64: add HEVC sao_band NEON
Only works for 8x8.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:12:01 +01:00
Josh Dekker
75c2ddfa61 lavc/aarch64: add HEVC idct_dc NEON
Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:12:01 +01:00
Reimar Döffinger
00c916ef61 lavc/aarch64: port HEVC add_residual NEON
Speedup is fairly small, around 1.5%, but these are fairly simple.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:11:57 +01:00
Reimar Döffinger
30f80d855b lavc/aarch64: port HEVC SIMD idct NEON
Makes SIMD-optimized 8x8 and 16x16 idcts for 8 and 10 bit depth
available on aarch64.
For a UHD HDR (10 bit) sample video these were consuming the most time
and this optimization reduced overall decode time from 19.4s to 16.4s,
approximately 15% speedup.
Test sample was the first 300 frames of "LG 4K HDR Demo - New York.ts",
running on Apple M1.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:11:53 +01:00
Anton Khirnov
c8c2dfbc37 lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h
That is a more appropriate place for it.
2021-01-01 14:11:01 +01:00