FFmpeg

mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-23 12:43:46 +02:00

Author	SHA1	Message	Date
Rémi Denis-Courmont	82cb4b1c05	lavc/aarch64: remove bogus HAVE_VFP guard The IMDCT offset is only relevant for NEON optimisations. There are no VFP optimisations here that would justify the HAVE_VFP flag. In practice, this makes no difference because HAVE_NEON is practically always true for standard Armv8 platforms.	2023-07-15 22:56:30 +03:00
Logan Lyu	9557bf26b3	lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_hv put_hevc_epel_uni_w_hv4_8_c: 254.6 put_hevc_epel_uni_w_hv4_8_i8mm: 102.9 put_hevc_epel_uni_w_hv6_8_c: 411.6 put_hevc_epel_uni_w_hv6_8_i8mm: 221.6 put_hevc_epel_uni_w_hv8_8_c: 669.4 put_hevc_epel_uni_w_hv8_8_i8mm: 214.9 put_hevc_epel_uni_w_hv12_8_c: 1412.6 put_hevc_epel_uni_w_hv12_8_i8mm: 481.4 put_hevc_epel_uni_w_hv16_8_c: 2425.4 put_hevc_epel_uni_w_hv16_8_i8mm: 647.4 put_hevc_epel_uni_w_hv24_8_c: 5384.1 put_hevc_epel_uni_w_hv24_8_i8mm: 1450.6 put_hevc_epel_uni_w_hv32_8_c: 9470.9 put_hevc_epel_uni_w_hv32_8_i8mm: 2497.1 put_hevc_epel_uni_w_hv48_8_c: 20930.1 put_hevc_epel_uni_w_hv48_8_i8mm: 5635.9 put_hevc_epel_uni_w_hv64_8_c: 36682.9 put_hevc_epel_uni_w_hv64_8_i8mm: 9712.6 Signed-off-by: Martin Storsjö <martin@martin.st>	2023-07-14 21:19:12 +03:00
Logan Lyu	d48c89701c	lavc/aarch64: new optimization for 8-bit hevc_epel_h put_hevc_epel_h4_8_c: 67.1 put_hevc_epel_h4_8_i8mm: 21.1 put_hevc_epel_h6_8_c: 147.1 put_hevc_epel_h6_8_i8mm: 45.1 put_hevc_epel_h8_8_c: 237.4 put_hevc_epel_h8_8_i8mm: 72.1 put_hevc_epel_h12_8_c: 527.4 put_hevc_epel_h12_8_i8mm: 115.4 put_hevc_epel_h16_8_c: 943.6 put_hevc_epel_h16_8_i8mm: 153.9 put_hevc_epel_h24_8_c: 2105.4 put_hevc_epel_h24_8_i8mm: 384.4 put_hevc_epel_h32_8_c: 3631.4 put_hevc_epel_h32_8_i8mm: 519.9 put_hevc_epel_h48_8_c: 8082.1 put_hevc_epel_h48_8_i8mm: 1110.4 put_hevc_epel_h64_8_c: 14400.6 put_hevc_epel_h64_8_i8mm: 2057.1 put_hevc_qpel_h4_8_c: 124.9 put_hevc_qpel_h4_8_neon: 43.1 put_hevc_qpel_h4_8_i8mm: 33.1 put_hevc_qpel_h6_8_c: 269.4 put_hevc_qpel_h6_8_neon: 90.6 put_hevc_qpel_h6_8_i8mm: 61.4 put_hevc_qpel_h8_8_c: 477.6 put_hevc_qpel_h8_8_neon: 82.1 put_hevc_qpel_h8_8_i8mm: 99.9 put_hevc_qpel_h12_8_c: 1062.4 put_hevc_qpel_h12_8_neon: 226.9 put_hevc_qpel_h12_8_i8mm: 170.9 put_hevc_qpel_h16_8_c: 1880.6 put_hevc_qpel_h16_8_neon: 302.9 put_hevc_qpel_h16_8_i8mm: 251.4 put_hevc_qpel_h24_8_c: 4221.9 put_hevc_qpel_h24_8_neon: 893.9 put_hevc_qpel_h24_8_i8mm: 626.1 put_hevc_qpel_h32_8_c: 7437.6 put_hevc_qpel_h32_8_neon: 1189.9 put_hevc_qpel_h32_8_i8mm: 959.1 put_hevc_qpel_h48_8_c: 16838.4 put_hevc_qpel_h48_8_neon: 2727.9 put_hevc_qpel_h48_8_i8mm: 2163.9 put_hevc_qpel_h64_8_c: 29982.1 put_hevc_qpel_h64_8_neon: 4777.6 Signed-off-by: Martin Storsjö <martin@martin.st>	2023-07-14 21:19:12 +03:00
Logan Lyu	668eb4c00e	lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_v put_hevc_epel_uni_w_v4_8_c: 116.1 put_hevc_epel_uni_w_v4_8_neon: 48.6 put_hevc_epel_uni_w_v6_8_c: 248.9 put_hevc_epel_uni_w_v6_8_neon: 80.6 put_hevc_epel_uni_w_v8_8_c: 383.9 put_hevc_epel_uni_w_v8_8_neon: 91.9 put_hevc_epel_uni_w_v12_8_c: 806.1 put_hevc_epel_uni_w_v12_8_neon: 202.9 put_hevc_epel_uni_w_v16_8_c: 1411.1 put_hevc_epel_uni_w_v16_8_neon: 289.9 put_hevc_epel_uni_w_v24_8_c: 3168.9 put_hevc_epel_uni_w_v24_8_neon: 619.4 put_hevc_epel_uni_w_v32_8_c: 5632.9 put_hevc_epel_uni_w_v32_8_neon: 1161.1 put_hevc_epel_uni_w_v48_8_c: 12406.1 put_hevc_epel_uni_w_v48_8_neon: 2476.4 put_hevc_epel_uni_w_v64_8_c: 22001.4 put_hevc_epel_uni_w_v64_8_neon: 4343.9 Signed-off-by: Martin Storsjö <martin@martin.st>	2023-07-14 21:19:12 +03:00
Logan Lyu	0c604b1913	lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_h put_hevc_epel_uni_w_h4_8_c: 126.1 put_hevc_epel_uni_w_h4_8_i8mm: 41.6 put_hevc_epel_uni_w_h6_8_c: 222.9 put_hevc_epel_uni_w_h6_8_i8mm: 91.4 put_hevc_epel_uni_w_h8_8_c: 374.4 put_hevc_epel_uni_w_h8_8_i8mm: 102.1 put_hevc_epel_uni_w_h12_8_c: 806.1 put_hevc_epel_uni_w_h12_8_i8mm: 225.6 put_hevc_epel_uni_w_h16_8_c: 1414.4 put_hevc_epel_uni_w_h16_8_i8mm: 333.4 put_hevc_epel_uni_w_h24_8_c: 3128.6 put_hevc_epel_uni_w_h24_8_i8mm: 713.1 put_hevc_epel_uni_w_h32_8_c: 5519.1 put_hevc_epel_uni_w_h32_8_i8mm: 1118.1 put_hevc_epel_uni_w_h48_8_c: 12364.4 put_hevc_epel_uni_w_h48_8_i8mm: 2541.1 put_hevc_epel_uni_w_h64_8_c: 21925.9 put_hevc_epel_uni_w_h64_8_i8mm: 4383.6 Signed-off-by: Martin Storsjö <martin@martin.st>	2023-07-14 21:19:12 +03:00
Logan Lyu	e652e7dcda	lavc/aarch64: new optimization for 8-bit hevc_pel_uni_pixels put_hevc_pel_uni_pixels4_8_c: 35.9 put_hevc_pel_uni_pixels4_8_neon: 7.6 put_hevc_pel_uni_pixels6_8_c: 46.1 put_hevc_pel_uni_pixels6_8_neon: 20.6 put_hevc_pel_uni_pixels8_8_c: 53.4 put_hevc_pel_uni_pixels8_8_neon: 11.6 put_hevc_pel_uni_pixels12_8_c: 89.1 put_hevc_pel_uni_pixels12_8_neon: 25.9 put_hevc_pel_uni_pixels16_8_c: 106.4 put_hevc_pel_uni_pixels16_8_neon: 20.4 put_hevc_pel_uni_pixels24_8_c: 137.6 put_hevc_pel_uni_pixels24_8_neon: 47.1 put_hevc_pel_uni_pixels32_8_c: 173.6 put_hevc_pel_uni_pixels32_8_neon: 54.1 put_hevc_pel_uni_pixels48_8_c: 268.1 put_hevc_pel_uni_pixels48_8_neon: 117.1 put_hevc_pel_uni_pixels64_8_c: 346.1 put_hevc_pel_uni_pixels64_8_neon: 205.9 Signed-off-by: Martin Storsjö <martin@martin.st>	2023-07-14 21:19:12 +03:00
Logan Lyu	e79686be96	lavc/aarch64: new optimization for 8-bit hevc_qpel_h hevc_qpel_uni_w_hv Signed-off-by: Martin Storsjö <martin@martin.st>	2023-06-06 12:50:18 +03:00
Logan Lyu	15972cce8c	lavc/aarch64: new optimization for 8-bit hevc_qpel_uni_w_h Signed-off-by: Martin Storsjö <martin@martin.st>	2023-06-06 12:50:18 +03:00
Logan Lyu	0b7356c1b4	lavc/aarch64: new optimization for 8-bit hevc_pel_uni_w_pixels and qpel_uni_w_v Signed-off-by: Martin Storsjö <martin@martin.st>	2023-06-06 12:50:18 +03:00
xufuji456	bd2f00f665	codec/aarch64/hevc: add transform_luma_neon got 56% speed up (run_count=1000, CPU=Cortex A53) transform_4x4_luma_neon: 45 transform_4x4_luma_c: 103 Signed-off-by: xufuji456 <839789740@qq.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-04-14 12:07:57 +03:00
xufuji456	00a062b8d5	codec/aarch64/hevc:add idct_32x32_neon got 73% speed up (run_count=1000, CPU=Cortex A53) idct_32x32_neon: 4826 idct_32x32_c: 18236 idct_32x32_neon: 4824 idct_32x32_c: 18149 idct_32x32_neon: 4937 idct_32x32_c: 18333 Signed-off-by: Martin Storsjö <martin@martin.st>	2023-04-12 15:58:09 +03:00
J. Dekker	b564ad8eac	lavc/aarch64: add hevc deblock chroma 8-12bit Benched on Ampere Altra: hevc_h_loop_filter_chroma8_c: 367.7 hevc_h_loop_filter_chroma8_neon: 31.0 hevc_h_loop_filter_chroma10_c: 396.7 hevc_h_loop_filter_chroma10_neon: 27.5 hevc_h_loop_filter_chroma12_c: 377.0 hevc_h_loop_filter_chroma12_neon: 31.7 hevc_v_loop_filter_chroma8_c: 369.0 hevc_v_loop_filter_chroma8_neon: 55.0 hevc_v_loop_filter_chroma10_c: 389.0 hevc_v_loop_filter_chroma10_neon: 54.0 hevc_v_loop_filter_chroma12_c: 389.5 hevc_v_loop_filter_chroma12_neon: 53.0 Signed-off-by: J. Dekker <jdek@itanimul.li>	2023-04-06 06:54:26 +02:00
J. Dekker	37cde570bc	lavc/aarch64: add clip N macro Signed-off-by: J. Dekker <jdek@itanimul.li>	2023-03-22 14:48:13 +01:00
xufuji456	4b4de07721	libavcodec/hevc: add hevc idct4x4 neon of aarch64 Signed-off-by: Martin Storsjö <martin@martin.st>	2023-02-28 13:12:52 +02:00
Martin Storsjö	ec7fa13eb0	aarch64: hevcdsp_idct: Reuse preexisting macros for transposes Signed-off-by: Martin Storsjö <martin@martin.st>	2023-02-28 11:48:54 +02:00
Lynne	e0661fc805	dca_core: convert to lavu/tx Thanks to Martin Storsjö <martin@martin.st> for fixing and testing the arm32 and aarch64 changes.	2022-11-06 14:39:36 +01:00
J. Dekker	9bed814e1d	lavc/aarch64: add hevc horizontal qpel/uni/bi checkasm --benchmark on Ampere Altra (Neoverse N1): put_hevc_qpel_bi_h4_8_c: 170.7 put_hevc_qpel_bi_h4_8_neon: 64.5 put_hevc_qpel_bi_h6_8_c: 373.7 put_hevc_qpel_bi_h6_8_neon: 130.2 put_hevc_qpel_bi_h8_8_c: 662.0 put_hevc_qpel_bi_h8_8_neon: 138.5 put_hevc_qpel_bi_h12_8_c: 1529.5 put_hevc_qpel_bi_h12_8_neon: 422.0 put_hevc_qpel_bi_h16_8_c: 2735.5 put_hevc_qpel_bi_h16_8_neon: 560.5 put_hevc_qpel_bi_h24_8_c: 6015.7 put_hevc_qpel_bi_h24_8_neon: 1636.0 put_hevc_qpel_bi_h32_8_c: 10779.0 put_hevc_qpel_bi_h32_8_neon: 2204.5 put_hevc_qpel_bi_h48_8_c: 24375.0 put_hevc_qpel_bi_h48_8_neon: 4984.0 put_hevc_qpel_bi_h64_8_c: 42768.0 put_hevc_qpel_bi_h64_8_neon: 8795.7 put_hevc_qpel_h4_8_c: 149.0 put_hevc_qpel_h4_8_neon: 55.7 put_hevc_qpel_h6_8_c: 321.2 put_hevc_qpel_h6_8_neon: 106.0 put_hevc_qpel_h8_8_c: 578.7 put_hevc_qpel_h8_8_neon: 133.2 put_hevc_qpel_h12_8_c: 1279.0 put_hevc_qpel_h12_8_neon: 391.7 put_hevc_qpel_h16_8_c: 2286.2 put_hevc_qpel_h16_8_neon: 519.7 put_hevc_qpel_h24_8_c: 5100.7 put_hevc_qpel_h24_8_neon: 1546.2 put_hevc_qpel_h32_8_c: 9022.0 put_hevc_qpel_h32_8_neon: 2060.2 put_hevc_qpel_h48_8_c: 20293.5 put_hevc_qpel_h48_8_neon: 4656.7 put_hevc_qpel_h64_8_c: 36037.0 put_hevc_qpel_h64_8_neon: 8262.7 put_hevc_qpel_uni_h4_8_c: 162.2 put_hevc_qpel_uni_h4_8_neon: 61.7 put_hevc_qpel_uni_h6_8_c: 355.2 put_hevc_qpel_uni_h6_8_neon: 114.2 put_hevc_qpel_uni_h8_8_c: 651.0 put_hevc_qpel_uni_h8_8_neon: 135.7 put_hevc_qpel_uni_h12_8_c: 1412.5 put_hevc_qpel_uni_h12_8_neon: 402.7 put_hevc_qpel_uni_h16_8_c: 2551.0 put_hevc_qpel_uni_h16_8_neon: 533.5 put_hevc_qpel_uni_h24_8_c: 5782.2 put_hevc_qpel_uni_h24_8_neon: 1578.7 put_hevc_qpel_uni_h32_8_c: 10586.5 put_hevc_qpel_uni_h32_8_neon: 2102.2 put_hevc_qpel_uni_h48_8_c: 23812.0 put_hevc_qpel_uni_h48_8_neon: 4739.5 put_hevc_qpel_uni_h64_8_c: 42958.7 put_hevc_qpel_uni_h64_8_neon: 8366.5 Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-10-25 14:56:38 +02:00
Reimar Döffinger	38cd829dce	aarch64: Implement stack spilling in a consistent way. Currently it is done in several different ways, which might cause needless dependencies or in case of tx_float_neon.S is incorrect. Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>	2022-10-11 09:12:02 +02:00
Grzegorz Bernacki	8f4b000c37	lavc/aarch64: Add neon implementation for vsse_intra8 Provide optimized implementation for vsse_intra8 for arm64. Performance tests are shown below. - vsse_5_c: 87.7 - vsse_5_neon: 26.2 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-10-04 13:24:20 +03:00
Grzegorz Bernacki	bad67cb9fd	lavc/aarch64: Provide optimized implementation of vsse8 for arm64. Provide optimized implementation of vsse8 for arm64. Performance comparison tests are shown below. - vsse_1_c: 141.5 - vsse_1_neon: 32.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-10-04 13:24:20 +03:00
Grzegorz Bernacki	faea56c9c7	lavc/aarch64: Provide neon implementation of nsse8 Add vectorized implementation of nsse8 function. Performance comparison tests are shown below. - nsse_1_c: 256.0 - nsse_1_neon: 82.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-10-04 13:24:20 +03:00
Grzegorz Bernacki	f401a2af21	lavc/aarch64: Add neon implementation for pix_abs8 functions. Provide optimized implementation of pix_abs8 function for arm64. Performance comparison tests are shown below: pix_abs_1_1_c: 162.5 pix_abs_1_1_neon: 27.0 pix_abs_1_2_c: 174.0 pix_abs_1_2_neon: 23.5 pix_abs_1_3_c: 203.2 pix_abs_1_3_neon: 34.7 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-10-04 13:24:04 +03:00
Martin Storsjö	8089fe072e	aarch64: me_cmp: Avoid using the non-unrolled codepath for the minimum unroll size Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-29 10:29:11 +03:00
Martin Storsjö	6f2ad7f951	aarch64: me_cmp: Avoid redundant loads in ff_pix_abs16_y2_neon This avoids one redundant load per row; pix3 from the previous iteration can be used as pix2 in the next one. Before: Cortex A53 A72 A73 pix_abs_0_2_neon: 138.0 59.7 48.0 After: pix_abs_0_2_neon: 109.7 50.2 39.5 Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-29 10:29:10 +03:00
Andreas Rheinhardt	9beba05311	avcodec/fmtconvert: Remove unused AVCodecContext parameter Unused since `d74a8cb7e4`. Reviewed-by: Rémi Denis-Courmont <remi@remlab.net> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-09-21 20:26:40 +02:00
Hubert Mazur	b2732115dd	lavc/aarch64: Add neon implementation for pix_median_abs8 Provide optimized implementation for pix_median_abs8 function. Performance comparison tests are shown below. - median_sad_1_c: 277.0 - median_sad_1_neon: 82.0 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-21 12:57:56 +03:00
Hubert Mazur	e9a6170213	lavc/aarch64: Add neon implementation for vsad8_intra Provide optimized implementation for vsad8_intra function. Performance comparison tests are shown below. - vsad_5_c: 94.7 - vsad_5_neon: 20.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-21 12:57:56 +03:00
Hubert Mazur	0ee535b1db	lavc/aarch64: Add neon implementation for pix_median_abs16 Provide optimized implementation for pix_median_abs16 function. Performance comparison tests are shown below. - median_sad_0_c: 720.5 - median_sad_0_neon: 127.2 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-21 12:57:56 +03:00
Rémi Denis-Courmont	b52034270a	lavc/vorbisdsp: use ptrdiff_t rather than intptr_t ... for a difference between pointers.	2022-09-19 13:51:00 -03:00
Andreas Rheinhardt	a54e53a1c4	avcodec/vp8dsp: Constify src in vp8_mc_func Reviewed-by: Peter Ross <pross@xvid.org> Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-09-11 20:57:51 +02:00
Hubert Mazur	06b98e396a	lavc/aarch64: Provide neon implementation of nsse16 Add vectorized implementation of nsse16 function. Performance comparison tests are shown below. - nsse_0_c: 682.2 - nsse_0_neon: 116.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-09 10:19:46 +03:00
Hubert Mazur	908abe8032	lavc/aarch64: Add neon implementation for vsse_intra16 Provide optimized implementation for vsse_intra16 for arm64. Performance tests are shown below. - vsse_4_c: 155.2 - vsse_4_neon: 36.2 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-09 10:19:46 +03:00
Hubert Mazur	ce03ea3e79	lavc/aarch64: Add neon implementation for vsad_intra16 Provide optimized implementation for vsad_intra16 function for arm64. Performance comparison tests are shown below. - vsad_4_c: 177.5 - vsad_4_neon: 23.5 Benchmarks and tests are run with checkasm tool on AWS Gravtion 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-09 10:19:46 +03:00
Hubert Mazur	c495a4b32d	lavc/aarch64: Add neon implementation of vsse16 Provide optimized implementation of vsse16 for arm64. Performance comparison tests are shown below. - vsse_0_c: 257.7 - vsse_0_neon: 59.2 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-09 10:19:46 +03:00
Hubert Mazur	200f5e578f	lavc/aarch64: Add neon implementation for vsad16 Provide optimized implementation of vsad16 function for arm64. Performance comparison tests are shown below. - vsad_0_c: 285.2 - vsad_0_neon: 39.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-09 10:19:46 +03:00
Lynne	f99d15cca0	arm/fft: disable NEON optimizations for 131072pt transforms This has been broken since the start, and it was only discovered when I started testing my replacement for the FFT. Disable it, since there's no point in fixing slower code that's about to be removed anyway. The vfp version is not affected.	2022-08-29 07:13:43 +02:00
J. Dekker	ce2f47318b	lavc/aarch64: hevc_add_res add 12bit variants hevc_add_res_4x4_12_c: 46.0 hevc_add_res_4x4_12_neon: 18.7 hevc_add_res_8x8_12_c: 194.7 hevc_add_res_8x8_12_neon: 25.2 hevc_add_res_16x16_12_c: 716.0 hevc_add_res_16x16_12_neon: 69.7 hevc_add_res_32x32_12_c: 3820.7 hevc_add_res_32x32_12_neon: 261.0 Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-08-18 15:04:43 +02:00
Martin Storsjö	48be6616d0	aarch64: me_cmp: Remove a leftover unnecessary instruction This was missed in `a2e45ad407`. Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:14:53 +03:00
Hubert Mazur	70efa4d011	lavc/aarch64: Add neon implementation for pix_abs8 Provide optimized implementation of pix_abs8 function for arm64. Performance comparison tests are shown below. - pix_abs_1_0_c: 101.2 - pix_abs_1_0_neon: 22.5 - sad_1_c: 101.2 - sad_1_neon: 22.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:07:26 +03:00
Hubert Mazur	74312e80d7	lavc/aarch64: Add neon implementation for sse8 Provide optimized implementation of sse8 function for arm64. Performance comparison tests are shown below. - sse_1_c: 130.7 - sse_1_neon: 29.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:07:26 +03:00
Hubert Mazur	a2e45ad407	lavc/aarch64: Add neon implementation for pix_abs16_y2 Provide optimized implementation of pix_abs16_y2 function for arm64. Performance comparison tests are shown below. pix_abs_0_2_c: 317.2 pix_abs_0_2_neon: 37.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:07:26 +03:00
Hubert Mazur	d7abb7d143	lavc/aarch64: Add neon implementation for sse4 Provide neon implementation for sse4 function. Performance comparison tests are shown below. - sse_2_c: 80.7 - sse_2_neon: 31.0 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:07:26 +03:00
Hubert Mazur	ad251fd262	lavc/aarch64: Add neon implementation for sse16 Provide neon implementation for sse16 function. Performance comparison tests are shown below. - sse_0_c: 268.2 - sse_0_neon: 43.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:07:26 +03:00
Martin Storsjö	60109d5b3d	aarch64: me_cmp: Fix the indentation of function declarations Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:07:26 +03:00
J. Dekker	aa9eabb7a5	lavc/aarch64: reformat add_res funcs Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-08-16 14:00:34 +02:00
Andreas Rheinhardt	333b32af8e	avcodec/h264chroma: Constify src in h264_chroma_mc_func Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-08-05 03:02:13 +02:00
Andreas Rheinhardt	b3bbbb14d0	avcodec/hevcdsp: Constify src pointers Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-08-05 02:54:04 +02:00
Andreas Rheinhardt	abb85429f3	avcodec/me_cmp: Constify me_cmp_func buffer parameters Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-07-31 03:31:53 +02:00
Andreas Rheinhardt	af43da3e4d	avcodec/videodsp: Constify buf in VideoDSPContext.prefetch Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-07-31 03:14:34 +02:00
Martin Storsjö	4136405c86	aarch64: me_cmp: Don't do uaddlv once per iteration The max height is currently documented as 16; the max difference per pixel is 255, and a .8h element can easily contain 16*255, thus keep accumulating in two .8h vectors, and just do the final accumulationat the end. This should work for heights up to 256. This requires a minor register renumbering in ff_pix_abs16_xy2_neon. Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_0_neon: 97.7 47.0 37.5 22.7 pix_abs_0_1_neon: 154.0 59.0 52.0 25.0 pix_abs_0_3_neon: 179.7 96.7 87.5 41.2 After: pix_abs_0_0_neon: 96.0 39.2 31.2 22.0 pix_abs_0_1_neon: 150.7 59.7 46.2 23.7 pix_abs_0_3_neon: 175.7 83.7 81.7 38.2 Signed-off-by: Martin Storsjö <martin@martin.st>	2022-07-16 17:26:17 +03:00
Martin Storsjö	68a03f6424	aarch64: me_cmp: Switch from uabd to uabal in ff_pix_abs16_xy2_neon Using absolute-difference-accumulate does use twice the amount of absolute-difference instructions, but avoids the need for the uaddl and add instructions, reducing the total number of instructions by 3. These can be interleaved in the rest of the calculation, to avoid tight dependencies at the end. Unfortunately, this is marginally slower on Cortex A53, but faster on A72 and A73. Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_3_neon: 175.7 109.2 92.0 41.2 After: pix_abs_0_3_neon: 179.7 96.7 87.5 41.2 Signed-off-by: Martin Storsjö <martin@martin.st>	2022-07-16 17:25:54 +03:00
Martin Storsjö	b46de9aba4	aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_3_neon: 183.7 112.7 97.5 41.2 After: pix_abs_0_3_neon: 175.7 109.2 92.0 41.2 Signed-off-by: Martin Storsjö <martin@martin.st>	2022-07-16 17:25:44 +03:00
Martin Storsjö	02e7853fd9	libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon Checkasm doesn't currently test this codepath. Signed-off-by: Martin Storsjö <martin@martin.st>	2022-07-16 17:25:11 +03:00
Hubert Mazur	01e190dc99	lavc/aarch64: Add pix_abs16_x2 neon implementation Provide neon implementation for pix_abs16_x2 function. Performance tests of implementation are below. - pix_abs_0_1_c: 283.5 - pix_abs_0_1_neon: 39.0 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-07-13 23:25:22 +03:00
Hubert Mazur	eb7ab3928f	lavc/aarch64: Hook up the existing ff_pix_abs16_neon to the sad[0] function pointer Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-07-11 23:58:28 +03:00
Swinney, Jonathan	c471cc7474	lavc/aarch64: motion estimation functions in neon - ff_pix_abs16_neon - ff_pix_abs16_xy2_neon In direct micro benchmarks of these ff functions verses their C implementations, these functions performed as follows on AWS Graviton 3. ff_pix_abs16_neon: pix_abs_0_0_c: 141.1 pix_abs_0_0_neon: 19.6 ff_pix_abs16_xy2_neon: pix_abs_0_3_c: 269.1 pix_abs_0_3_neon: 39.3 Tested with: ./tests/checkasm/checkasm --test=motion --bench --disable-linux-perf Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-06-28 00:51:39 +03:00
J. Dekker	3c694967f8	lavc/aarch64: hevc_sao reschedule slightly Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-05-26 08:10:41 +02:00
J. Dekker	2e832be322	lavc/aarch64: add hevc sao edge 8x8 bench on AWS Graviton: hevc_sao_edge_8x8_8_c: 516.0 hevc_sao_edge_8x8_8_neon: 81.0 Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-05-25 08:04:46 +02:00
J. Dekker	92f67e4017	lavc/aarch64: add hevc sao edge 16x16 bench on AWS Graviton: hevc_sao_edge_16x16_8_c: 1857.0 hevc_sao_edge_16x16_8_neon: 211.0 hevc_sao_edge_32x32_8_c: 7802.2 hevc_sao_edge_32x32_8_neon: 808.2 hevc_sao_edge_48x48_8_c: 16764.2 hevc_sao_edge_48x48_8_neon: 1796.5 hevc_sao_edge_64x64_8_c: 32647.5 hevc_sao_edge_64x64_8_neon: 3118.5 Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-05-25 08:04:39 +02:00
J. Dekker	d957ee34a6	lavc/aarch64: fix hevc sao band filter The SAO band filter can be called with non-multiples of 8, we round up to the nearest multiple of 8 to account for this. Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-05-25 08:04:35 +02:00
Andre Kempe	861285c146	arm64: Fix wrong BTI landing pad This patch fixes a wrong type of BTI landing pad when branching to functions instantiated via the fft*_neon macro. Although the previously employed paciasp instruction serves as a landing pad, for the ways that this function is invoked it is the wrong type, resulting in an unexpected termination of the running process. Signed-off-by: André Kempe <andre.kempe@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-04-26 10:26:49 +03:00
Ben Avison	6eee650289	avcodec/vc1: Arm 64-bit NEON unescape fast path checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_unescape_buffer_c: 655617.7 vc1dsp.vc1_unescape_buffer_neon: 118237.0 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-04-01 10:03:34 +03:00
Ben Avison	5379412ed0	avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. idctdsp.add_pixels_clamped_c: 313.3 idctdsp.add_pixels_clamped_neon: 24.3 idctdsp.put_pixels_clamped_c: 220.3 idctdsp.put_pixels_clamped_neon: 15.5 idctdsp.put_signed_pixels_clamped_c: 210.5 idctdsp.put_signed_pixels_clamped_neon: 19.5 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-04-01 10:03:34 +03:00
Ben Avison	501fdc017d	avcodec/vc1: Arm 64-bit NEON inverse transform fast paths checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_inv_trans_4x4_c: 158.2 vc1dsp.vc1_inv_trans_4x4_neon: 65.7 vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5 vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5 vc1dsp.vc1_inv_trans_4x8_c: 335.2 vc1dsp.vc1_inv_trans_4x8_neon: 106.2 vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2 vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5 vc1dsp.vc1_inv_trans_8x4_c: 365.7 vc1dsp.vc1_inv_trans_8x4_neon: 97.2 vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7 vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5 vc1dsp.vc1_inv_trans_8x8_c: 547.7 vc1dsp.vc1_inv_trans_8x8_neon: 137.0 vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2 vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-04-01 10:03:34 +03:00
Ben Avison	c62bbd4d20	avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C version can still outperform the NEON version in specific cases. The balance between different code paths is stream-dependent, but in practice the best case happens about 5% of the time, the worst case happens about 40% of the time, and the complexity of the remaining cases fall somewhere in between. Therefore, taking the average of the best and worst case timings is probably a conservative estimate of the degree by which the NEON code improves performance. vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7 vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5 vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5 vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7 vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2 vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2 vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2 vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2 vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0 vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7 vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7 vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5 vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7 vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0 vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7 vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0 vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2 vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7 vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0 vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2 vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0 vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0 vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2 vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-04-01 10:03:33 +03:00
Martin Storsjö	a78f136f3f	configure: Use a separate config_components.h header for $ALL_COMPONENTS This avoids unnecessary rebuilds of most source files if only the list of enabled components has changed, but not the other properties of the build, set in config.h. Signed-off-by: Martin Storsjö <martin@martin.st>	2022-03-16 14:12:49 +02:00
Andre Kempe	248986a0db	arm64: Add Armv8.3-A PAC support to assembly files This patch adds optional support for Arm Pointer Authentication Codes. PAC support is turned on or off at compile time using additional compiler flags. Unless any of these is enabled explicitly, no additional code will be emitted at all. Signed-off-by: André Kempe <andre.kempe@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-03-09 15:04:25 +02:00
Andreas Rheinhardt	52e9113695	avcodec/aarch64/idct: Add missing stddef Fixes checkheaders on aarch64. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-02-21 13:10:04 +01:00
Martin Storsjö	402784ba9f	aarch64: h264dsp: Fix incorrectly indented code Signed-off-by: Martin Storsjö <martin@martin.st>	2022-02-11 10:49:12 +02:00
Martin Storsjö	24b93022fe	aarch64: Disable ff_hevc_sao_band_filter_8x8_8_neon out of precaution While this function on its own passes all of fate-hevc, there's indications that the function might need to handle widths that aren't a multiple of 8 (noted in commit `f63f9be37c`, which later was reverted). Signed-off-by: Martin Storsjö <martin@martin.st>	2022-01-07 22:33:27 +02:00
Martin Storsjö	16fba44b4d	Revert "lavc/aarch64: add hevc sao edge 16x16" This reverts commit `a9214a2ca3`, as it breaks fate-hevc. Signed-off-by: Martin Storsjö <martin@martin.st>	2022-01-07 22:33:23 +02:00
Martin Storsjö	df48b1d06f	Revert "lavc/aarch64: add hevc sao edge 8x8" This reverts commit `c97ffc1a77`, as it breaks fate-hevc. Signed-off-by: Martin Storsjö <martin@martin.st>	2022-01-07 22:33:19 +02:00
Martin Storsjö	cafed377eb	Revert "lavc/aarch64: add hevc sao band 8x8 tiling" This reverts commit `f63f9be37c`, as it breaks fate-hevc. Signed-off-by: Martin Storsjö <martin@martin.st>	2022-01-07 22:33:14 +02:00
J. Dekker	f63f9be37c	lavc/aarch64: add hevc sao band 8x8 tiling bench on AWS Graviton: hevc_sao_band_8x8_8_c: 317.5 hevc_sao_band_8x8_8_neon: 97.5 hevc_sao_band_16x16_8_c: 1115.0 hevc_sao_band_16x16_8_neon: 322.7 hevc_sao_band_32x32_8_c: 4599.2 hevc_sao_band_32x32_8_neon: 1246.2 hevc_sao_band_48x48_8_c: 10021.7 hevc_sao_band_48x48_8_neon: 2740.5 hevc_sao_band_64x64_8_c: 17635.0 hevc_sao_band_64x64_8_neon: 4875.7 Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-01-04 14:32:26 +01:00
J. Dekker	89a2ed4a8b	lavc/aarch64: clean-up sao band 8x8 function formatting Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-01-04 14:31:56 +01:00
J. Dekker	c97ffc1a77	lavc/aarch64: add hevc sao edge 8x8 bench on AWS Graviton: hevc_sao_edge_8x8_8_c: 516.0 hevc_sao_edge_8x8_8_neon: 81.0 Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-01-04 14:31:56 +01:00
J. Dekker	a9214a2ca3	lavc/aarch64: add hevc sao edge 16x16 bench on AWS Graviton: hevc_sao_edge_16x16_8_c: 1857.0 hevc_sao_edge_16x16_8_neon: 211.0 hevc_sao_edge_32x32_8_c: 7802.2 hevc_sao_edge_32x32_8_neon: 808.2 hevc_sao_edge_48x48_8_c: 16764.2 hevc_sao_edge_48x48_8_neon: 1796.5 hevc_sao_edge_64x64_8_c: 32647.5 hevc_sao_edge_64x64_8_neon: 3118.5 Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-01-04 14:31:56 +01:00
Jonathan Wright	08b4716a9e	aarch64: Add Armv8.5-A BTI support Add Branch Target Identifiers (BTIs) to all functions defined in AArch64 assembly files. Most of the BTI landing pads are added automatically by the 'function' macro. BTI support is turned on or off at compile time based on the presence of the __ARM_FEATURE_BTI_DEFAULT feature macro. A binary compiled with BTI support can be executed on an Armv8-A processor without BTI support because the instructions are defined in NOP space. Signed-off-by: Jonathan Wright <jonathan.wright@arm.com> Signed-off-by: Elijah Ahmad <elijah.ahmad@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2021-11-16 13:43:56 +02:00
Jonathan Wright	6f04cf54f5	aarch64: Use ret x<n> instead of br x<n> where possible Change AArch64 assembly code to use: ret x<n> instead of: br x<n> "ret x<n>" is already used in a lot of places so this patch makes it consistent across the code base. This does not change behavior or performance. In addition, this change reduces the number of landing pads needed in a subsequent patch to support the Armv8.5-A Branch Target Identification (BTI) security feature. Signed-off-by: Jonathan Wright <jonathan.wright@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2021-11-16 13:43:56 +02:00
Martin Storsjö	fd3bd5c492	aarch64: h264qpel: Do vertical filtering without transposing This gives rather big speedups on these functions: Before: put_h264_qpel_8_mc01_8_neon: 241.0 131.5 138.7 put_h264_qpel_8_mc02_8_neon: 214.7 121.2 127.5 put_h264_qpel_8_mc03_8_neon: 242.5 131.2 135.7 put_h264_qpel_8_mc11_8_neon: 421.2 218.7 251.0 put_h264_qpel_8_mc12_8_neon: 878.0 509.5 537.5 put_h264_qpel_8_mc13_8_neon: 423.7 217.0 252.0 put_h264_qpel_8_mc21_8_neon: 858.2 479.5 514.0 put_h264_qpel_8_mc22_8_neon: 649.7 385.2 403.0 put_h264_qpel_8_mc23_8_neon: 860.2 476.5 517.7 put_h264_qpel_8_mc31_8_neon: 437.2 219.5 252.5 put_h264_qpel_8_mc32_8_neon: 892.5 510.5 546.0 put_h264_qpel_8_mc33_8_neon: 438.2 218.5 257.0 put_h264_qpel_16_mc01_8_neon: 944.2 509.7 546.7 put_h264_qpel_16_mc02_8_neon: 878.7 469.5 509.7 put_h264_qpel_16_mc03_8_neon: 945.7 510.7 557.0 put_h264_qpel_16_mc11_8_neon: 1663.2 858.5 979.5 put_h264_qpel_16_mc12_8_neon: 3510.2 2027.7 2112.7 put_h264_qpel_16_mc13_8_neon: 1664.7 857.5 980.5 put_h264_qpel_16_mc21_8_neon: 3366.2 1928.5 2030.5 put_h264_qpel_16_mc22_8_neon: 2584.7 1514.7 1590.2 put_h264_qpel_16_mc23_8_neon: 3367.7 1927.7 2035.0 put_h264_qpel_16_mc31_8_neon: 1716.7 849.7 997.0 put_h264_qpel_16_mc32_8_neon: 3564.0 2044.2 3835.2 put_h264_qpel_16_mc33_8_neon: 1717.7 863.0 989.5 After: put_h264_qpel_8_mc01_8_neon: 136.0 73.7 76.0 put_h264_qpel_8_mc02_8_neon: 108.7 65.0 64.0 put_h264_qpel_8_mc03_8_neon: 137.5 72.7 73.0 put_h264_qpel_8_mc11_8_neon: 316.2 159.0 188.5 put_h264_qpel_8_mc12_8_neon: 653.0 375.5 384.7 put_h264_qpel_8_mc13_8_neon: 318.7 165.5 189.5 put_h264_qpel_8_mc21_8_neon: 739.2 385.7 432.5 put_h264_qpel_8_mc22_8_neon: 530.7 295.5 309.5 put_h264_qpel_8_mc23_8_neon: 741.2 393.7 421.0 put_h264_qpel_8_mc31_8_neon: 332.2 162.5 190.0 put_h264_qpel_8_mc32_8_neon: 667.5 378.2 390.5 put_h264_qpel_8_mc33_8_neon: 332.7 166.5 195.5 put_h264_qpel_16_mc01_8_neon: 524.2 285.2 294.0 put_h264_qpel_16_mc02_8_neon: 454.7 252.2 250.2 put_h264_qpel_16_mc03_8_neon: 525.7 286.0 283.0 put_h264_qpel_16_mc11_8_neon: 1243.2 630.7 726.7 put_h264_qpel_16_mc12_8_neon: 2610.2 1479.7 1481.2 put_h264_qpel_16_mc13_8_neon: 1250.5 631.7 727.7 put_h264_qpel_16_mc21_8_neon: 2890.2 1571.2 1679.7 put_h264_qpel_16_mc22_8_neon: 2108.7 1177.5 1223.5 put_h264_qpel_16_mc23_8_neon: 2891.7 1578.7 1667.7 put_h264_qpel_16_mc31_8_neon: 1296.7 630.5 752.5 put_h264_qpel_16_mc32_8_neon: 2664.0 1483.2 1503.5 put_h264_qpel_16_mc33_8_neon: 1297.7 632.5 747.2 I.e. overall a 20%-60% reduction in runtime of these functions. Signed-off-by: Martin Storsjö <martin@martin.st>	2021-10-18 14:27:58 +03:00
Martin Storsjö	2d5a7f6d00	arm/aarch64: Improve scheduling in the avg form of h264_qpel Don't use the loaded registers directly, avoiding stalls on in order cores. Use vrhadd.u8 with q registers where easily possible. Signed-off-by: Martin Storsjö <martin@martin.st>	2021-10-18 14:27:36 +03:00
Zhao Zhili	378ad2f8fd	lavc/aarch64: fix relocation out of range error Use a temporary label instead of global function symbol for b.gt. Signed-off-by: Martin Storsjö <martin@martin.st>	2021-09-25 21:55:29 +03:00
Mikhail Nitenko	d3e56b56ae	lavc/aarch64: add pred functions for 10-bit Benchmarks: A53 A72 pred8x8_dc_10_c: 64.2 49.5 pred8x8_dc_10_neon: 62.0 53.7 pred8x8_dc_128_10_c: 26.0 14.0 pred8x8_dc_128_10_neon: 30.7 17.5 pred8x8_horizontal_10_c: 60.0 27.7 pred8x8_horizontal_10_neon: 38.0 34.0 pred8x8_left_dc_10_c: 42.5 27.5 pred8x8_left_dc_10_neon: 51.0 41.2 pred8x8_mad_cow_dc_0l0_10_c: 55.7 37.2 pred8x8_mad_cow_dc_0l0_10_neon: 50.2 35.2 pred8x8_mad_cow_dc_0lt_10_c: 89.2 67.0 pred8x8_mad_cow_dc_0lt_10_neon: 52.2 46.7 pred8x8_mad_cow_dc_l0t_10_c: 74.7 51.0 pred8x8_mad_cow_dc_l0t_10_neon: 50.5 45.2 pred8x8_mad_cow_dc_l00_10_c: 58.0 38.0 pred8x8_mad_cow_dc_l00_10_neon: 42.5 37.5 pred8x8_plane_10_c: 354.0 288.7 pred8x8_plane_10_neon: 141.0 101.2 pred8x8_top_dc_10_c: 44.5 30.5 pred8x8_top_dc_10_neon: 40.0 31.0 pred8x8_vertical_10_c: 27.5 14.5 pred8x8_vertical_10_neon: 21.0 17.5 pred16x16_plane_10_c: 1242.0 1070.5 pred16x16_plane_10_neon: 324.0 196.7 Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2021-08-21 00:06:26 +03:00
Mikhail Nitenko	43ca887bc2	lavc/aarch64: h264, add chroma loop filters for 10bit Benchmarks: A53 A72 h264_h_loop_filter_chroma422_10bpp_c: 282.7 114.2 h264_h_loop_filter_chroma422_10bpp_neon: 109.5 78.5 h264_h_loop_filter_chroma_10bpp_c: 165.0 81.5 h264_h_loop_filter_chroma_10bpp_neon: 120.0 76.7 h264_h_loop_filter_chroma_intra422_10bpp_c: 323.7 124.2 h264_h_loop_filter_chroma_intra422_10bpp_neon: 155.0 102.7 h264_h_loop_filter_chroma_intra_10bpp_c: 121.0 49.5 h264_h_loop_filter_chroma_intra_10bpp_neon: 79.7 53.7 h264_h_loop_filter_chroma_mbaff422_10bpp_c: 188.5 75.0 h264_h_loop_filter_chroma_mbaff422_10bpp_neon: 120.0 75.5 h264_h_loop_filter_chroma_mbaff_intra422_10bpp_c: 116.7 46.0 h264_h_loop_filter_chroma_mbaff_intra422_10bpp_neon: 79.7 53.7 h264_h_loop_filter_chroma_mbaff_intra_10bpp_c: 63.0 27.2 h264_h_loop_filter_chroma_mbaff_intra_10bpp_neon: 48.5 34.0 h264_v_loop_filter_chroma_10bpp_c: 258.7 135.5 h264_v_loop_filter_chroma_10bpp_neon: 71.2 51.0 h264_v_loop_filter_chroma_intra_10bpp_c: 158.0 70.7 h264_v_loop_filter_chroma_intra_10bpp_neon: 48.7 31.5 Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2021-08-21 00:06:26 +03:00
Mikhail Nitenko	756d2e087a	lavc/aarch64: move transpose_4x8H to neon.S transpose_4x8H was declared in vp9lpf_16bpp_neon, however this macro is not unique to vp9 and could be used elsewhere. Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2021-08-21 00:06:26 +03:00
Martin Storsjö	c60b76d0c8	aarch64: h264dsp: Fix indentation of some functions to match the rest Signed-off-by: Martin Storsjö <martin@martin.st>	2021-08-08 23:18:43 +03:00
Martin Storsjö	e86ec831b0	aarch64: h264dsp: Remove unnecessary sign extensions These became unnecessary when the stride arguments were changed from int to ptrdiff_t in `bc26fe8927` (`0576ef466d`) and `d5d699ab6e` (`aa844dc46f`). Signed-off-by: Martin Storsjö <martin@martin.st>	2021-08-08 23:18:43 +03:00
Andreas Rheinhardt	afc95a10ac	avcodec/h264dsp, h264idct: Fix lengths of array parameters Fixes many -Warray-parameter warnings from GCC 11. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2021-08-08 17:44:57 +02:00
Martin Storsjö	f27e3ccf06	aarch64: hevc_idct: Fix overflows in idct_dc This is marginally slower, but correct for all input values. The previous implementation failed with certain input seeds, e.g. "checkasm --test=hevc_idct 98". Signed-off-by: Martin Storsjö <martin@martin.st>	2021-05-22 00:08:03 +03:00
Andreas Rheinhardt	7c1f347b18	avcodec: Remove deprecated old encode/decode APIs Deprecated in commits `7fc329e2dd` and `31f6a4b4b8`. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com> Signed-off-by: James Almer <jamrial@gmail.com>	2021-04-27 10:43:12 -03:00
Andreas Rheinhardt	f3c197b129	Include attributes.h directly Some files currently rely on libavutil/cpu.h to include it for them; yet said file won't use include it any more after the currently deprecated functions are removed, so include attributes.h directly. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2021-04-19 14:34:10 +02:00
Mikhail Nitenko	84ac1440b2	lavc/aarch64: add pred16x16 10-bit functions Benchmarks: A53 A72 pred16x16_dc_10_c: 136.0 124.0 pred16x16_dc_10_neon: 121.2 106.0 pred16x16_horizontal_10_c: 155.0 73.2 pred16x16_horizontal_10_neon: 82.2 67.7 pred16x16_top_dc_10_c: 106.0 93.7 pred16x16_top_dc_10_neon: 87.7 77.2 pred16x16_vertical_10_c: 83.0 67.7 pred16x16_vertical_10_neon: 54.2 61.7 Some functions work slower than C and are left commented out.	2021-04-19 09:01:14 +02:00
Mikhail Nitenko	6b2e7dc828	lavc/aarch64: change h264pred_init structure Change structure to allow the addition of other bit depths.	2021-04-19 09:00:58 +02:00
Martin Storsjö	870bfe16a1	aarch64: h264pred: Optimize the inner loop of existing 8 bit functions Move the loop counter decrement further from the branch instruction, this hides the latency of the decrement. In loops that first load, then store (the horizontal prediction cases), do the decrement after the load (where the next instruction would stall a bit anyway, waiting for the result of the load). In loops that store twice using the same destination register, also do the decrement between the two stores (as the second store would need to wait for the updated destination register from the first instruction). In loops that store twice to two different destination registers, do the decrement before both stores, to do it as soon before the branch as possible. This gives minor (1-2 cycle) speedups in most cases (modulo measurement noise), but the horizontal prediction functions get a rather notable speedup on the Cortex A53. Before: Cortex A53 A72 A73 pred8x8_dc_8_neon: 60.7 46.2 39.2 pred8x8_dc_128_8_neon: 30.7 18.0 14.0 pred8x8_horizontal_8_neon: 42.2 29.2 18.5 pred8x8_left_dc_8_neon: 52.7 36.2 32.2 pred8x8_mad_cow_dc_0l0_8_neon: 48.2 27.7 25.7 pred8x8_mad_cow_dc_0lt_8_neon: 52.5 33.2 34.7 pred8x8_mad_cow_dc_l0t_8_neon: 52.5 31.7 33.2 pred8x8_mad_cow_dc_l00_8_neon: 43.2 27.0 25.5 pred8x8_plane_8_neon: 112.2 86.2 88.2 pred8x8_top_dc_8_neon: 40.7 23.0 21.2 pred8x8_vertical_8_neon: 27.2 15.5 14.0 pred16x16_dc_8_neon: 91.0 73.2 70.5 pred16x16_dc_128_8_neon: 43.0 34.7 30.7 pred16x16_horizontal_8_neon: 86.0 49.7 44.7 pred16x16_left_dc_8_neon: 87.0 67.2 67.5 pred16x16_plane_8_neon: 236.0 175.7 173.0 pred16x16_top_dc_8_neon: 53.2 39.0 41.7 pred16x16_vertical_8_neon: 41.7 29.7 31.0 After: pred8x8_dc_8_neon: 59.0 46.7 42.5 pred8x8_dc_128_8_neon: 28.2 18.0 14.0 pred8x8_horizontal_8_neon: 34.2 29.2 18.5 pred8x8_left_dc_8_neon: 51.0 38.2 32.7 pred8x8_mad_cow_dc_0l0_8_neon: 46.7 28.2 26.2 pred8x8_mad_cow_dc_0lt_8_neon: 55.2 33.7 37.5 pred8x8_mad_cow_dc_l0t_8_neon: 51.2 31.7 37.2 pred8x8_mad_cow_dc_l00_8_neon: 41.7 27.5 26.0 pred8x8_plane_8_neon: 111.5 86.5 89.5 pred8x8_top_dc_8_neon: 39.0 23.2 21.0 pred8x8_vertical_8_neon: 27.2 16.0 14.0 pred16x16_dc_8_neon: 85.0 70.2 70.5 pred16x16_dc_128_8_neon: 42.0 30.0 30.7 pred16x16_horizontal_8_neon: 66.5 49.5 42.5 pred16x16_left_dc_8_neon: 81.0 66.5 67.5 pred16x16_plane_8_neon: 235.0 175.7 173.0 pred16x16_top_dc_8_neon: 52.0 39.0 41.7 pred16x16_vertical_8_neon: 40.2 33.2 31.0 Despite this, a number of these functions still are slower than what e.g. GCC 7 generates - this shows the relative speedup of the neon codepaths over the compiler generated ones: Cortex A53 A72 A73 pred8x8_dc_8_neon: 0.86 0.65 1.04 pred8x8_dc_128_8_neon: 0.59 0.44 0.62 pred8x8_horizontal_8_neon: 1.51 0.58 1.30 pred8x8_left_dc_8_neon: 0.72 0.56 0.89 pred8x8_mad_cow_dc_0l0_8_neon: 0.93 0.93 1.37 pred8x8_mad_cow_dc_0lt_8_neon: 1.37 1.41 1.68 pred8x8_mad_cow_dc_l0t_8_neon: 1.21 1.17 1.32 pred8x8_mad_cow_dc_l00_8_neon: 1.24 1.19 1.60 pred8x8_plane_8_neon: 3.36 3.58 3.76 pred8x8_top_dc_8_neon: 0.97 0.99 1.43 pred8x8_vertical_8_neon: 0.86 0.78 1.18 pred16x16_dc_8_neon: 1.20 1.06 1.49 pred16x16_dc_128_8_neon: 0.83 0.95 0.99 pred16x16_horizontal_8_neon: 1.78 0.96 1.59 pred16x16_left_dc_8_neon: 1.06 0.96 1.32 pred16x16_plane_8_neon: 5.78 6.49 7.19 pred16x16_top_dc_8_neon: 1.48 1.53 1.94 pred16x16_vertical_8_neon: 1.39 1.34 1.98 In particular, on Cortex A72, many of these functions are slower than the compiler generated code, while they're more beneficial on e.g. the Cortex A73. Signed-off-by: Martin Storsjö <martin@martin.st>	2021-04-14 15:23:44 +03:00
James Almer	f1a894f9d3	avcodec: add missing FF_API_OLD_ENCDEC wrappers to xmm clobber functions Signed-off-by: James Almer <jamrial@gmail.com>	2021-02-26 19:26:31 -03:00
Josh Dekker	7ac41e0db2	lavc/aarch64: add HEVC sao_band NEON Only works for 8x8. Signed-off-by: Josh Dekker <josh@itanimul.li>	2021-02-18 14:12:01 +01:00
Josh Dekker	75c2ddfa61	lavc/aarch64: add HEVC idct_dc NEON Signed-off-by: Josh Dekker <josh@itanimul.li>	2021-02-18 14:12:01 +01:00
Reimar Döffinger	00c916ef61	lavc/aarch64: port HEVC add_residual NEON Speedup is fairly small, around 1.5%, but these are fairly simple. Signed-off-by: Josh Dekker <josh@itanimul.li>	2021-02-18 14:11:57 +01:00
Reimar Döffinger	30f80d855b	lavc/aarch64: port HEVC SIMD idct NEON Makes SIMD-optimized 8x8 and 16x16 idcts for 8 and 10 bit depth available on aarch64. For a UHD HDR (10 bit) sample video these were consuming the most time and this optimization reduced overall decode time from 19.4s to 16.4s, approximately 15% speedup. Test sample was the first 300 frames of "LG 4K HDR Demo - New York.ts", running on Apple M1. Signed-off-by: Josh Dekker <josh@itanimul.li>	2021-02-18 14:11:53 +01:00
Anton Khirnov	c8c2dfbc37	lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h That is a more appropriate place for it.	2021-01-01 14:11:01 +01:00

1 2 3 4 5 ...

392 Commits