FFmpeg

mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-23 12:43:46 +02:00

Author	SHA1	Message	Date
Martin Storsjö	8089fe072e	aarch64: me_cmp: Avoid using the non-unrolled codepath for the minimum unroll size Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-29 10:29:11 +03:00
Martin Storsjö	6f2ad7f951	aarch64: me_cmp: Avoid redundant loads in ff_pix_abs16_y2_neon This avoids one redundant load per row; pix3 from the previous iteration can be used as pix2 in the next one. Before: Cortex A53 A72 A73 pix_abs_0_2_neon: 138.0 59.7 48.0 After: pix_abs_0_2_neon: 109.7 50.2 39.5 Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-29 10:29:10 +03:00
Andreas Rheinhardt	9beba05311	avcodec/fmtconvert: Remove unused AVCodecContext parameter Unused since `d74a8cb7e4`. Reviewed-by: Rémi Denis-Courmont <remi@remlab.net> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-09-21 20:26:40 +02:00
Hubert Mazur	b2732115dd	lavc/aarch64: Add neon implementation for pix_median_abs8 Provide optimized implementation for pix_median_abs8 function. Performance comparison tests are shown below. - median_sad_1_c: 277.0 - median_sad_1_neon: 82.0 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-21 12:57:56 +03:00
Hubert Mazur	e9a6170213	lavc/aarch64: Add neon implementation for vsad8_intra Provide optimized implementation for vsad8_intra function. Performance comparison tests are shown below. - vsad_5_c: 94.7 - vsad_5_neon: 20.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-21 12:57:56 +03:00
Hubert Mazur	0ee535b1db	lavc/aarch64: Add neon implementation for pix_median_abs16 Provide optimized implementation for pix_median_abs16 function. Performance comparison tests are shown below. - median_sad_0_c: 720.5 - median_sad_0_neon: 127.2 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-21 12:57:56 +03:00
Rémi Denis-Courmont	b52034270a	lavc/vorbisdsp: use ptrdiff_t rather than intptr_t ... for a difference between pointers.	2022-09-19 13:51:00 -03:00
Andreas Rheinhardt	a54e53a1c4	avcodec/vp8dsp: Constify src in vp8_mc_func Reviewed-by: Peter Ross <pross@xvid.org> Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-09-11 20:57:51 +02:00
Hubert Mazur	06b98e396a	lavc/aarch64: Provide neon implementation of nsse16 Add vectorized implementation of nsse16 function. Performance comparison tests are shown below. - nsse_0_c: 682.2 - nsse_0_neon: 116.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-09 10:19:46 +03:00
Hubert Mazur	908abe8032	lavc/aarch64: Add neon implementation for vsse_intra16 Provide optimized implementation for vsse_intra16 for arm64. Performance tests are shown below. - vsse_4_c: 155.2 - vsse_4_neon: 36.2 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-09 10:19:46 +03:00
Hubert Mazur	ce03ea3e79	lavc/aarch64: Add neon implementation for vsad_intra16 Provide optimized implementation for vsad_intra16 function for arm64. Performance comparison tests are shown below. - vsad_4_c: 177.5 - vsad_4_neon: 23.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-09 10:19:46 +03:00
Hubert Mazur	c495a4b32d	lavc/aarch64: Add neon implementation of vsse16 Provide optimized implementation of vsse16 for arm64. Performance comparison tests are shown below. - vsse_0_c: 257.7 - vsse_0_neon: 59.2 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-09 10:19:46 +03:00
Hubert Mazur	200f5e578f	lavc/aarch64: Add neon implementation for vsad16 Provide optimized implementation of vsad16 function for arm64. Performance comparison tests are shown below. - vsad_0_c: 285.2 - vsad_0_neon: 39.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-09 10:19:46 +03:00
Lynne	f99d15cca0	arm/fft: disable NEON optimizations for 131072pt transforms This has been broken since the start, and it was only discovered when I started testing my replacement for the FFT. Disable it, since there's no point in fixing slower code that's about to be removed anyway. The vfp version is not affected.	2022-08-29 07:13:43 +02:00
J. Dekker	ce2f47318b	lavc/aarch64: hevc_add_res add 12bit variants hevc_add_res_4x4_12_c: 46.0 hevc_add_res_4x4_12_neon: 18.7 hevc_add_res_8x8_12_c: 194.7 hevc_add_res_8x8_12_neon: 25.2 hevc_add_res_16x16_12_c: 716.0 hevc_add_res_16x16_12_neon: 69.7 hevc_add_res_32x32_12_c: 3820.7 hevc_add_res_32x32_12_neon: 261.0 Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-08-18 15:04:43 +02:00
Martin Storsjö	48be6616d0	aarch64: me_cmp: Remove a leftover unnecessary instruction This was missed in `a2e45ad407`. Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:14:53 +03:00
Hubert Mazur	70efa4d011	lavc/aarch64: Add neon implementation for pix_abs8 Provide optimized implementation of pix_abs8 function for arm64. Performance comparison tests are shown below. - pix_abs_1_0_c: 101.2 - pix_abs_1_0_neon: 22.5 - sad_1_c: 101.2 - sad_1_neon: 22.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:07:26 +03:00
Hubert Mazur	74312e80d7	lavc/aarch64: Add neon implementation for sse8 Provide optimized implementation of sse8 function for arm64. Performance comparison tests are shown below. - sse_1_c: 130.7 - sse_1_neon: 29.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:07:26 +03:00
Hubert Mazur	a2e45ad407	lavc/aarch64: Add neon implementation for pix_abs16_y2 Provide optimized implementation of pix_abs16_y2 function for arm64. Performance comparison tests are shown below. pix_abs_0_2_c: 317.2 pix_abs_0_2_neon: 37.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:07:26 +03:00
Hubert Mazur	d7abb7d143	lavc/aarch64: Add neon implementation for sse4 Provide neon implementation for sse4 function. Performance comparison tests are shown below. - sse_2_c: 80.7 - sse_2_neon: 31.0 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:07:26 +03:00
Hubert Mazur	ad251fd262	lavc/aarch64: Add neon implementation for sse16 Provide neon implementation for sse16 function. Performance comparison tests are shown below. - sse_0_c: 268.2 - sse_0_neon: 43.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:07:26 +03:00
Martin Storsjö	60109d5b3d	aarch64: me_cmp: Fix the indentation of function declarations Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-18 12:07:26 +03:00
J. Dekker	aa9eabb7a5	lavc/aarch64: reformat add_res funcs Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-08-16 14:00:34 +02:00
Andreas Rheinhardt	333b32af8e	avcodec/h264chroma: Constify src in h264_chroma_mc_func Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-08-05 03:02:13 +02:00
Andreas Rheinhardt	b3bbbb14d0	avcodec/hevcdsp: Constify src pointers Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-08-05 02:54:04 +02:00
Andreas Rheinhardt	abb85429f3	avcodec/me_cmp: Constify me_cmp_func buffer parameters Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-07-31 03:31:53 +02:00
Andreas Rheinhardt	af43da3e4d	avcodec/videodsp: Constify buf in VideoDSPContext.prefetch Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-07-31 03:14:34 +02:00
Martin Storsjö	4136405c86	aarch64: me_cmp: Don't do uaddlv once per iteration The max height is currently documented as 16; the max difference per pixel is 255, and a .8h element can easily contain 16*255, thus keep accumulating in two .8h vectors, and just do the final accumulation at the end. This should work for heights up to 256. This requires a minor register renumbering in ff_pix_abs16_xy2_neon. Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_0_neon: 97.7 47.0 37.5 22.7 pix_abs_0_1_neon: 154.0 59.0 52.0 25.0 pix_abs_0_3_neon: 179.7 96.7 87.5 41.2 After: pix_abs_0_0_neon: 96.0 39.2 31.2 22.0 pix_abs_0_1_neon: 150.7 59.7 46.2 23.7 pix_abs_0_3_neon: 175.7 83.7 81.7 38.2 Signed-off-by: Martin Storsjö <martin@martin.st>	2022-07-16 17:26:17 +03:00
Martin Storsjö	68a03f6424	aarch64: me_cmp: Switch from uabd to uabal in ff_pix_abs16_xy2_neon Using absolute-difference-accumulate does use twice the amount of absolute-difference instructions, but avoids the need for the uaddl and add instructions, reducing the total number of instructions by 3. These can be interleaved in the rest of the calculation, to avoid tight dependencies at the end. Unfortunately, this is marginally slower on Cortex A53, but faster on A72 and A73. Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_3_neon: 175.7 109.2 92.0 41.2 After: pix_abs_0_3_neon: 179.7 96.7 87.5 41.2 Signed-off-by: Martin Storsjö <martin@martin.st>	2022-07-16 17:25:54 +03:00
Martin Storsjö	b46de9aba4	aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_3_neon: 183.7 112.7 97.5 41.2 After: pix_abs_0_3_neon: 175.7 109.2 92.0 41.2 Signed-off-by: Martin Storsjö <martin@martin.st>	2022-07-16 17:25:44 +03:00
Martin Storsjö	02e7853fd9	libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon Checkasm doesn't currently test this codepath. Signed-off-by: Martin Storsjö <martin@martin.st>	2022-07-16 17:25:11 +03:00
Hubert Mazur	01e190dc99	lavc/aarch64: Add pix_abs16_x2 neon implementation Provide neon implementation for pix_abs16_x2 function. Performance tests of implementation are below. - pix_abs_0_1_c: 283.5 - pix_abs_0_1_neon: 39.0 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-07-13 23:25:22 +03:00
Hubert Mazur	eb7ab3928f	lavc/aarch64: Hook up the existing ff_pix_abs16_neon to the sad[0] function pointer Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-07-11 23:58:28 +03:00
Swinney, Jonathan	c471cc7474	lavc/aarch64: motion estimation functions in neon - ff_pix_abs16_neon - ff_pix_abs16_xy2_neon In direct micro benchmarks of these ff functions versus their C implementations, these functions performed as follows on AWS Graviton 3. ff_pix_abs16_neon: pix_abs_0_0_c: 141.1 pix_abs_0_0_neon: 19.6 ff_pix_abs16_xy2_neon: pix_abs_0_3_c: 269.1 pix_abs_0_3_neon: 39.3 Tested with: ./tests/checkasm/checkasm --test=motion --bench --disable-linux-perf Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-06-28 00:51:39 +03:00
J. Dekker	3c694967f8	lavc/aarch64: hevc_sao reschedule slightly Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-05-26 08:10:41 +02:00
J. Dekker	2e832be322	lavc/aarch64: add hevc sao edge 8x8 bench on AWS Graviton: hevc_sao_edge_8x8_8_c: 516.0 hevc_sao_edge_8x8_8_neon: 81.0 Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-05-25 08:04:46 +02:00
J. Dekker	92f67e4017	lavc/aarch64: add hevc sao edge 16x16 bench on AWS Graviton: hevc_sao_edge_16x16_8_c: 1857.0 hevc_sao_edge_16x16_8_neon: 211.0 hevc_sao_edge_32x32_8_c: 7802.2 hevc_sao_edge_32x32_8_neon: 808.2 hevc_sao_edge_48x48_8_c: 16764.2 hevc_sao_edge_48x48_8_neon: 1796.5 hevc_sao_edge_64x64_8_c: 32647.5 hevc_sao_edge_64x64_8_neon: 3118.5 Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-05-25 08:04:39 +02:00
J. Dekker	d957ee34a6	lavc/aarch64: fix hevc sao band filter The SAO band filter can be called with non-multiples of 8, we round up to the nearest multiple of 8 to account for this. Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-05-25 08:04:35 +02:00
Andre Kempe	861285c146	arm64: Fix wrong BTI landing pad This patch fixes a wrong type of BTI landing pad when branching to functions instantiated via the fft*_neon macro. Although the previously employed paciasp instruction serves as a landing pad, for the ways that this function is invoked it is the wrong type, resulting in an unexpected termination of the running process. Signed-off-by: André Kempe <andre.kempe@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-04-26 10:26:49 +03:00
Ben Avison	6eee650289	avcodec/vc1: Arm 64-bit NEON unescape fast path checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_unescape_buffer_c: 655617.7 vc1dsp.vc1_unescape_buffer_neon: 118237.0 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-04-01 10:03:34 +03:00
Ben Avison	5379412ed0	avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. idctdsp.add_pixels_clamped_c: 313.3 idctdsp.add_pixels_clamped_neon: 24.3 idctdsp.put_pixels_clamped_c: 220.3 idctdsp.put_pixels_clamped_neon: 15.5 idctdsp.put_signed_pixels_clamped_c: 210.5 idctdsp.put_signed_pixels_clamped_neon: 19.5 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-04-01 10:03:34 +03:00
Ben Avison	501fdc017d	avcodec/vc1: Arm 64-bit NEON inverse transform fast paths checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_inv_trans_4x4_c: 158.2 vc1dsp.vc1_inv_trans_4x4_neon: 65.7 vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5 vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5 vc1dsp.vc1_inv_trans_4x8_c: 335.2 vc1dsp.vc1_inv_trans_4x8_neon: 106.2 vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2 vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5 vc1dsp.vc1_inv_trans_8x4_c: 365.7 vc1dsp.vc1_inv_trans_8x4_neon: 97.2 vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7 vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5 vc1dsp.vc1_inv_trans_8x8_c: 547.7 vc1dsp.vc1_inv_trans_8x8_neon: 137.0 vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2 vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-04-01 10:03:34 +03:00
Ben Avison	c62bbd4d20	avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C version can still outperform the NEON version in specific cases. The balance between different code paths is stream-dependent, but in practice the best case happens about 5% of the time, the worst case happens about 40% of the time, and the complexity of the remaining cases fall somewhere in between. Therefore, taking the average of the best and worst case timings is probably a conservative estimate of the degree by which the NEON code improves performance. vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7 vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5 vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5 vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7 vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2 vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2 vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2 vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2 vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0 vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7 vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7 vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5 vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7 vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0 vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7 vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0 vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2 vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7 vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0 vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2 vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0 vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0 vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2 vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-04-01 10:03:33 +03:00
Martin Storsjö	a78f136f3f	configure: Use a separate config_components.h header for $ALL_COMPONENTS This avoids unnecessary rebuilds of most source files if only the list of enabled components has changed, but not the other properties of the build, set in config.h. Signed-off-by: Martin Storsjö <martin@martin.st>	2022-03-16 14:12:49 +02:00
Andre Kempe	248986a0db	arm64: Add Armv8.3-A PAC support to assembly files This patch adds optional support for Arm Pointer Authentication Codes. PAC support is turned on or off at compile time using additional compiler flags. Unless any of these is enabled explicitly, no additional code will be emitted at all. Signed-off-by: André Kempe <andre.kempe@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-03-09 15:04:25 +02:00
Andreas Rheinhardt	52e9113695	avcodec/aarch64/idct: Add missing stddef Fixes checkheaders on aarch64. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2022-02-21 13:10:04 +01:00
Martin Storsjö	402784ba9f	aarch64: h264dsp: Fix incorrectly indented code Signed-off-by: Martin Storsjö <martin@martin.st>	2022-02-11 10:49:12 +02:00
Martin Storsjö	24b93022fe	aarch64: Disable ff_hevc_sao_band_filter_8x8_8_neon out of precaution While this function on its own passes all of fate-hevc, there's indications that the function might need to handle widths that aren't a multiple of 8 (noted in commit `f63f9be37c`, which later was reverted). Signed-off-by: Martin Storsjö <martin@martin.st>	2022-01-07 22:33:27 +02:00
Martin Storsjö	16fba44b4d	Revert "lavc/aarch64: add hevc sao edge 16x16" This reverts commit `a9214a2ca3`, as it breaks fate-hevc. Signed-off-by: Martin Storsjö <martin@martin.st>	2022-01-07 22:33:23 +02:00
Martin Storsjö	df48b1d06f	Revert "lavc/aarch64: add hevc sao edge 8x8" This reverts commit `c97ffc1a77`, as it breaks fate-hevc. Signed-off-by: Martin Storsjö <martin@martin.st>	2022-01-07 22:33:19 +02:00
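
Commit 4136405c86 above argues that a 16-bit lane can accumulate per-row absolute differences for up to 256 rows, so the widening reduction only has to run once at the end. The following is a minimal scalar sketch of that reasoning in C, not FFmpeg's actual code: the function name sad16_lanes and the lanes array are hypothetical, standing in for the 16 lanes of the two .8h vectors the commit message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model, not FFmpeg code: each of the 16 columns gets its
 * own 16-bit accumulator, mirroring the lanes of two .8h vectors. */
static unsigned sad16_lanes(const uint8_t *pix1, const uint8_t *pix2,
                            ptrdiff_t stride, int h)
{
    uint16_t lanes[16] = { 0 };
    unsigned sum = 0;

    /* One lane receives at most 255 per row, so 256 rows give at most
     * 256 * 255 = 65280, which still fits in a uint16_t (max 65535). */
    assert(h >= 0 && h <= 256);

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            lanes[x] += (uint16_t)abs(pix1[x] - pix2[x]);
        pix1 += stride;
        pix2 += stride;
    }

    /* Single widening reduction at the end, the role uaddlv plays. */
    for (int x = 0; x < 16; x++)
        sum += lanes[x];
    return sum;
}
```

With the documented maximum height of 16, each lane holds at most 16 * 255 = 4080, far below the 16-bit limit, which is why the per-iteration reduction could be dropped.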

320 Commits