FFmpeg

mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-07 11:13:41 +02:00

Author	SHA1	Message	Date
Martin Storsjö	9f10cff610	aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter This work is sponsored by, and copyright, Google. This is similar to the arm version, but due to the larger registers on aarch64, we can do 8 pixels at a time for all filter sizes. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_loop_filter_h_4_8_10bpp_neon: 213.2 172.6 vp9_loop_filter_h_8_8_10bpp_neon: 281.2 244.2 vp9_loop_filter_h_16_8_10bpp_neon: 657.0 444.5 vp9_loop_filter_h_16_16_10bpp_neon: 1280.4 877.7 vp9_loop_filter_mix2_h_44_16_10bpp_neon: 397.7 358.0 vp9_loop_filter_mix2_h_48_16_10bpp_neon: 465.7 429.0 vp9_loop_filter_mix2_h_84_16_10bpp_neon: 465.7 428.0 vp9_loop_filter_mix2_h_88_16_10bpp_neon: 533.7 499.0 vp9_loop_filter_mix2_v_44_16_10bpp_neon: 271.5 244.0 vp9_loop_filter_mix2_v_48_16_10bpp_neon: 330.0 305.0 vp9_loop_filter_mix2_v_84_16_10bpp_neon: 329.0 306.0 vp9_loop_filter_mix2_v_88_16_10bpp_neon: 386.0 365.0 vp9_loop_filter_v_4_8_10bpp_neon: 150.0 115.2 vp9_loop_filter_v_8_8_10bpp_neon: 209.0 175.5 vp9_loop_filter_v_16_8_10bpp_neon: 492.7 345.2 vp9_loop_filter_v_16_16_10bpp_neon: 951.0 682.7 This is significantly faster than the ARM version in almost all cases except for the mix2 functions. Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 2-3x. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-24 22:36:11 +02:00
Martin Storsjö	ceb36b8178	aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfm This work is sponsored by, and copyright, Google. Compared to the arm version, on aarch64 we can keep the full 8x8 transform in registers, and for 16x16 and 32x32, we can process it in slices of 4 pixels instead of 2. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_inv_adst_adst_4x4_sub4_add_10_neon: 111.0 109.7 vp9_inv_adst_adst_8x8_sub8_add_10_neon: 914.0 733.5 vp9_inv_adst_adst_16x16_sub16_add_10_neon: 5184.0 3745.7 vp9_inv_dct_dct_4x4_sub1_add_10_neon: 65.0 65.7 vp9_inv_dct_dct_4x4_sub4_add_10_neon: 100.0 96.7 vp9_inv_dct_dct_8x8_sub1_add_10_neon: 111.0 119.7 vp9_inv_dct_dct_8x8_sub8_add_10_neon: 618.0 494.7 vp9_inv_dct_dct_16x16_sub1_add_10_neon: 295.1 284.6 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 2303.2 1883.9 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 2984.8 2189.3 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 3890.0 2799.4 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1044.4 1012.7 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 13333.7 9695.1 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 18531.3 12459.8 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 24470.7 16160.2 vp9_inv_wht_wht_4x4_sub4_add_10_neon: 83.0 79.7 The larger transforms are significantly faster than the corresponding ARM versions. The speedup vs C code is smaller than in 32 bit mode, probably because the 64 bit intermediates in the C code can be expressed more efficiently in aarch64. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-24 22:36:08 +02:00
Martin Storsjö	638eceed47	aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC This work is sponsored by, and copyright, Google. This has mostly got the same differences to the 8 bit version as in the arm version. For the horizontal filters, we do 16 pixels in parallel as well. For the 8 pixel wide vertical filters, we can accumulate 4 rows before storing, just as in the 8 bit version. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_avg4_10bpp_neon: 35.7 30.7 vp9_avg8_10bpp_neon: 93.5 84.7 vp9_avg16_10bpp_neon: 324.4 296.6 vp9_avg32_10bpp_neon: 1236.5 1148.2 vp9_avg64_10bpp_neon: 4639.6 4571.1 vp9_avg_8tap_smooth_4h_10bpp_neon: 130.0 128.0 vp9_avg_8tap_smooth_4hv_10bpp_neon: 440.0 440.5 vp9_avg_8tap_smooth_4v_10bpp_neon: 114.0 105.5 vp9_avg_8tap_smooth_8h_10bpp_neon: 327.0 314.0 vp9_avg_8tap_smooth_8hv_10bpp_neon: 918.7 865.4 vp9_avg_8tap_smooth_8v_10bpp_neon: 330.0 300.2 vp9_avg_8tap_smooth_16h_10bpp_neon: 1187.5 1155.5 vp9_avg_8tap_smooth_16hv_10bpp_neon: 2663.1 2591.0 vp9_avg_8tap_smooth_16v_10bpp_neon: 1107.4 1078.3 vp9_avg_8tap_smooth_64h_10bpp_neon: 17754.6 17454.7 vp9_avg_8tap_smooth_64hv_10bpp_neon: 33285.2 33001.5 vp9_avg_8tap_smooth_64v_10bpp_neon: 16066.9 16048.6 vp9_put4_10bpp_neon: 25.5 21.7 vp9_put8_10bpp_neon: 56.0 52.0 vp9_put16_10bpp_neon/armv8: 183.0 163.1 vp9_put32_10bpp_neon/armv8: 678.6 563.1 vp9_put64_10bpp_neon/armv8: 2679.9 2195.8 vp9_put_8tap_smooth_4h_10bpp_neon: 120.0 118.0 vp9_put_8tap_smooth_4hv_10bpp_neon: 435.2 435.0 vp9_put_8tap_smooth_4v_10bpp_neon: 107.0 98.2 vp9_put_8tap_smooth_8h_10bpp_neon: 303.0 290.0 vp9_put_8tap_smooth_8hv_10bpp_neon: 893.7 828.7 vp9_put_8tap_smooth_8v_10bpp_neon: 305.5 263.5 vp9_put_8tap_smooth_16h_10bpp_neon: 1089.1 1059.2 vp9_put_8tap_smooth_16hv_10bpp_neon: 2578.8 2452.4 vp9_put_8tap_smooth_16v_10bpp_neon: 1009.5 933.5 vp9_put_8tap_smooth_64h_10bpp_neon: 16223.4 15918.6 vp9_put_8tap_smooth_64hv_10bpp_neon: 32153.0 31016.2 vp9_put_8tap_smooth_64v_10bpp_neon: 14516.5 13748.1 These are generally about as fast as the corresponding ARM routines on the same CPU (at least on the A53), in most cases marginally faster. The speedup vs C code is around 4-9x. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-24 22:36:05 +02:00
Martin Storsjö	48ad3fe1be	aarch64: vp9dsp: Restructure the bpp checks This work is sponsored by, and copyright, Google. This is more in line with how it will be extended for more bitdepths. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-24 22:36:02 +02:00
Martin Storsjö	0ba0187535	aarch64: vp9mc: Fix a comment to refer to a register with the right name This is cherrypicked from libav commit `85ad5ea72c`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:43 +01:00
Martin Storsjö	02cfb9a16e	aarch64: vp9dsp: Fix vertical alignment in the init file This is cherrypicked from libav commit `65074791e8`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:40 +01:00
Martin Storsjö	8b11a89c06	aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 This work is sponsored by, and copyright, Google. Previously all subpartitions except the eob=1 (DC) case ran with the same runtime: vp9_inv_dct_dct_16x16_sub16_add_neon: 1373.2 vp9_inv_dct_dct_32x32_sub32_add_neon: 8089.0 By skipping individual 8x16 or 8x32 pixel slices in the first pass, we reduce the runtime of these functions like this: vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3 vp9_inv_dct_dct_16x16_sub2_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub8_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub12_add_neon: 1372.1 vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.1 vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1 vp9_inv_dct_dct_32x32_sub2_add_neon: 5190.2 vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0 vp9_inv_dct_dct_32x32_sub8_add_neon: 5183.1 vp9_inv_dct_dct_32x32_sub12_add_neon: 6161.5 vp9_inv_dct_dct_32x32_sub16_add_neon: 6155.5 vp9_inv_dct_dct_32x32_sub20_add_neon: 7136.3 vp9_inv_dct_dct_32x32_sub24_add_neon: 7128.4 vp9_inv_dct_dct_32x32_sub28_add_neon: 8098.9 vp9_inv_dct_dct_32x32_sub32_add_neon: 8098.8 I.e. in general a very minor overhead for the full subpartition case due to the additional cmps, but a significant speedup for the cases when we only need to process a small part of the actual input data. This is cherrypicked from libav commits `cad42fadcd` and `a0c443a398`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:32 +01:00
Martin Storsjö	37cb224e3e	aarch64: vp9itxfm: Don't repeatedly set x9 when nothing overwrites it This is cherrypicked from libav commit `2f99117f6f`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:25 +01:00
Martin Storsjö	4a5874ea8d	arm/aarch64: vp9itxfm: Fix indentation of macro arguments This is cherrypicked from libav commit `721bc37522`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:19 +01:00
Martin Storsjö	a95e7de41d	aarch64: vp9itxfm: Use w3 instead of x3 for the int eob parameter The clobbering tests in checkasm are only invoked when testing correctness, so this bug didn't show up when benchmarking the dc-only version. This is cherrypicked from libav commit `4d960a1185`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:16 +01:00
Janne Grunau	cb220eeef9	aarch64: vp9: loop filter: replace 'orr; cbn?z' with 'adds; b.{eq,ne}; The latter is 1 cycle faster on a cortex-53 and since the operands are bytewise (or larger) bitmask (impossible to overflow to zero) both are equivalent. This is cherrypicked from libav commit `e7ae8f7a71`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:10 +01:00
Janne Grunau	62ea07d797	aarch64: vp9: use alternative returns in the core loop filter function Since aarch64 has enough free general purpose registers use them to branch to the appropiate storage code. 1-2 cycles faster for the functions using loop_filter 8/16, ... on a cortex-a53. Mixed results (up to 2 cycles faster/slower) on a cortex-a57. This is cherrypicked from libav commit `d7595de0b2`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:06 +01:00
Rostislav Pehlivanov	4fdacf4cdb	imdct15: remove the AArch64 assembly Prep work for the next commit, which will add a new FFT algorithm which makes the iMDCT over 3x faster than it is currently (standalone, the FFT is with some framesizes over 10x faster). The new FFT algorithm uses the already thouroughly SIMD'd power of two FFT which already has SIMD for AArch64, so users of that platform will still see an improvement. The previous FFT+SIMD was barely 2.5x faster than the C versions on these platforms. Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>	2017-01-05 22:32:02 +00:00
Martin Storsjö	da5c8284c0	aarch64: h264idct: Use the offset parameter to movrel Signed-off-by: Martin Storsjö <martin@martin.st> (cherry picked from commit `6a62795d40`) Cherry pick Suggested-by: Martin Storsjö This should fix the build failure on macosx Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2016-12-08 18:11:07 +01:00
Martin Storsjö	f1212e472b	aarch64: vp9: Implement NEON loop filters This work is sponsored by, and copyright, Google. These are ported from the ARM version; thanks to the larger amount of registers available, we can do the loop filters with 16 pixels at a time. The implementation is fully templated, with a single macro which can generate versions for both 8 and 16 pixels wide, for both 4, 8 and 16 pixels loop filters (and the 4/8 mixed versions as well). For the 8 pixel wide versions, it is pretty close in speed (the v_4_8 and v_8_8 filters are the best examples of this; the h_4_8 and h_8_8 filters seem to get some gain in the load/transpose/store part). For the 16 pixels wide ones, we get a speedup of around 1.2-1.4x compared to the 32 bit version. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_loop_filter_h_4_8_neon: 144.0 127.2 vp9_loop_filter_h_8_8_neon: 207.0 182.5 vp9_loop_filter_h_16_8_neon: 415.0 328.7 vp9_loop_filter_h_16_16_neon: 672.0 558.6 vp9_loop_filter_mix2_h_44_16_neon: 302.0 203.5 vp9_loop_filter_mix2_h_48_16_neon: 365.0 305.2 vp9_loop_filter_mix2_h_84_16_neon: 365.0 305.2 vp9_loop_filter_mix2_h_88_16_neon: 376.0 305.2 vp9_loop_filter_mix2_v_44_16_neon: 193.2 128.2 vp9_loop_filter_mix2_v_48_16_neon: 246.7 218.4 vp9_loop_filter_mix2_v_84_16_neon: 248.0 218.5 vp9_loop_filter_mix2_v_88_16_neon: 302.0 218.2 vp9_loop_filter_v_4_8_neon: 89.0 88.7 vp9_loop_filter_v_8_8_neon: 141.0 137.7 vp9_loop_filter_v_16_8_neon: 295.0 272.7 vp9_loop_filter_v_16_16_neon: 546.0 453.7 The speedup vs C code in checkasm tests is around 2-7x, which is pretty much the same as for the 32 bit version. Even if these functions are faster than their 32 bit equivalent, the C version that we compare to also became around 1.3-1.7x faster than the C version in 32 bit. Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 4-5x. Examples of runtimes vs C on a Cortex A57 (for a slightly older version of the patch): A57 gcc-5.3 neon loop_filter_h_4_8_neon: 256.6 93.4 loop_filter_h_8_8_neon: 307.3 139.1 loop_filter_h_16_8_neon: 340.1 254.1 loop_filter_h_16_16_neon: 827.0 407.9 loop_filter_mix2_h_44_16_neon: 524.5 155.4 loop_filter_mix2_h_48_16_neon: 644.5 173.3 loop_filter_mix2_h_84_16_neon: 630.5 222.0 loop_filter_mix2_h_88_16_neon: 697.3 222.0 loop_filter_mix2_v_44_16_neon: 598.5 100.6 loop_filter_mix2_v_48_16_neon: 651.5 127.0 loop_filter_mix2_v_84_16_neon: 591.5 167.1 loop_filter_mix2_v_88_16_neon: 855.1 166.7 loop_filter_v_4_8_neon: 271.7 65.3 loop_filter_v_8_8_neon: 312.5 106.9 loop_filter_v_16_8_neon: 473.3 206.5 loop_filter_v_16_16_neon: 976.1 327.8 The speed-up compared to the C functions is 2.5 to 6 and the cortex-a57 is again 30-50% faster than the cortex-a53. This is an adapted cherry-pick from libav commits `9d2afd1eb8` and `31756abe29`. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	2016-11-15 15:10:03 -05:00
Martin Storsjö	f43079e11c	aarch64: vp9: Add NEON itxfm routines This work is sponsored by, and copyright, Google. These are ported from the ARM version; thanks to the larger amount of registers available, we can do the 16x16 and 32x32 transforms in slices 8 pixels wide instead of 4. This gives a speedup of around 1.4x compared to the 32 bit version. The fact that aarch64 doesn't have the same d/q register aliasing makes some of the macros quite a bit simpler as well. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_inv_adst_adst_4x4_add_neon: 90.0 87.7 vp9_inv_adst_adst_8x8_add_neon: 400.0 354.7 vp9_inv_adst_adst_16x16_add_neon: 2526.5 1827.2 vp9_inv_dct_dct_4x4_add_neon: 74.0 72.7 vp9_inv_dct_dct_8x8_add_neon: 271.0 256.7 vp9_inv_dct_dct_16x16_add_neon: 1960.7 1372.7 vp9_inv_dct_dct_32x32_add_neon: 11988.9 8088.3 vp9_inv_wht_wht_4x4_add_neon: 63.0 57.7 The speedup vs C code (2-4x) is smaller than in the 32 bit case, mostly because the C code ends up significantly faster (around 1.6x faster, with GCC 5.4) when built for aarch64. Examples of runtimes vs C on a Cortex A57 (for a slightly older version of the patch): A57 gcc-5.3 neon vp9_inv_adst_adst_4x4_add_neon: 152.2 60.0 vp9_inv_adst_adst_8x8_add_neon: 948.2 288.0 vp9_inv_adst_adst_16x16_add_neon: 4830.4 1380.5 vp9_inv_dct_dct_4x4_add_neon: 153.0 58.6 vp9_inv_dct_dct_8x8_add_neon: 789.2 180.2 vp9_inv_dct_dct_16x16_add_neon: 3639.6 917.1 vp9_inv_dct_dct_32x32_add_neon: 20462.1 4985.0 vp9_inv_wht_wht_4x4_add_neon: 91.0 49.8 The asm is around factor 3-4 faster than C on the cortex-a57 and the asm is around 30-50% faster on the a57 compared to the a53. This is an adapted cherry-pick from libav commit `3c9546dfaf`. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	2016-11-15 15:10:03 -05:00
Martin Storsjö	1f7801c2bc	aarch64: vp9: Add NEON optimizations of VP9 MC functions This work is sponsored by, and copyright, Google. These are ported from the ARM version; it is essentially a 1:1 port with no extra added features, but with some hand tuning (especially for the plain copy/avg functions). The ARM version isn't very register starved to begin with, so there's not much to be gained from having more spare registers here - we only avoid having to clobber callee-saved registers. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_avg4_neon: 27.2 23.7 vp9_avg8_neon: 56.5 54.7 vp9_avg16_neon: 169.9 167.4 vp9_avg32_neon: 585.8 585.2 vp9_avg64_neon: 2460.3 2294.7 vp9_avg_8tap_smooth_4h_neon: 132.7 125.2 vp9_avg_8tap_smooth_4hv_neon: 478.8 442.0 vp9_avg_8tap_smooth_4v_neon: 126.0 93.7 vp9_avg_8tap_smooth_8h_neon: 241.7 234.2 vp9_avg_8tap_smooth_8hv_neon: 690.9 646.5 vp9_avg_8tap_smooth_8v_neon: 245.0 205.5 vp9_avg_8tap_smooth_64h_neon: 11273.2 11280.1 vp9_avg_8tap_smooth_64hv_neon: 22980.6 22184.1 vp9_avg_8tap_smooth_64v_neon: 11549.7 10781.1 vp9_put4_neon: 18.0 17.2 vp9_put8_neon: 40.2 37.7 vp9_put16_neon: 97.4 99.5 vp9_put32_neon/armv8: 346.0 307.4 vp9_put64_neon/armv8: 1319.0 1107.5 vp9_put_8tap_smooth_4h_neon: 126.7 118.2 vp9_put_8tap_smooth_4hv_neon: 465.7 434.0 vp9_put_8tap_smooth_4v_neon: 113.0 86.5 vp9_put_8tap_smooth_8h_neon: 229.7 221.6 vp9_put_8tap_smooth_8hv_neon: 658.9 621.3 vp9_put_8tap_smooth_8v_neon: 215.0 187.5 vp9_put_8tap_smooth_64h_neon: 10636.7 10627.8 vp9_put_8tap_smooth_64hv_neon: 21076.8 21026.9 vp9_put_8tap_smooth_64v_neon: 9635.0 9632.4 These are generally about as fast as the corresponding ARM routines on the same CPU (at least on the A53), in most cases marginally faster. The speedup vs C code is pretty much the same as for the 32 bit case; on the A53 it's around 6-13x for ther larger 8tap filters. The exact speedup varies a little, since the C versions generally don't end up exactly as slow/fast as on 32 bit. This is an adapted cherry-pick from libav commit `383d96aa22`. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	2016-11-15 15:10:03 -05:00
James Almer	42111e8543	avcodec: fix arguments on xmm/neon clobber test wrappers Signed-off-by: James Almer <jamrial@gmail.com>	2016-10-02 02:15:47 -03:00
James Almer	449f263f9f	avcodec: add missing xmm/neon clobber test wrappers for the new encode API Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: James Almer <jamrial@gmail.com>	2016-10-01 14:08:50 -03:00
Xiaolei Yu	5a70e56f2f	avcodec: fix vc1dsp dependencies	2016-09-25 13:11:45 +02:00
James Almer	293484fa5e	avcodec: add missing xmm/neon clobber test wrappers for the new decode API Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: James Almer <jamrial@gmail.com>	2016-07-03 18:04:30 -03:00
Clément Bœsch	dfd0c0f981	lavc/neontest: fix constness in arm/aarch64 avcodec_open2() wrappers	2016-06-25 13:41:13 +02:00
Clément Bœsch	8ef57a0d61	Merge commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb' * commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb': cosmetics: Fix spelling mistakes Merged-by: Clément Bœsch <u@pkh.me>	2016-06-21 21:55:34 +02:00
James Almer	c8c14d0ffc	aarch64/synth_filter: fix compilation Signed-off-by: James Almer <jamrial@gmail.com>	2016-05-10 23:33:12 -03:00
Derek Buitenhuis	ca5ec2bf51	Merge commit '01621202aad7e27b2a05c71d9ad7a19dfcbe17ec' * commit '01621202aad7e27b2a05c71d9ad7a19dfcbe17ec': build: miscellaneous cosmetics Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>	2016-05-09 16:25:28 +01:00
Vittorio Giovara	41ed7ab45f	cosmetics: Fix spelling mistakes Signed-off-by: Diego Biurrun <diego@biurrun.de>	2016-05-04 18:16:21 +02:00
Derek Buitenhuis	87b8e95008	Merge commit 'cdb1665f70def544ddab3e3ed3763ef99c8b3873' * commit 'cdb1665f70def544ddab3e3ed3763ef99c8b3873': aarch64: Make transpose_4x4H do a regular transpose Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>	2016-04-24 12:51:42 +01:00
Derek Buitenhuis	197fa698c6	Merge commit '97aec6e75ef36ed0402653519daa8e1fc8ddb555' * commit '97aec6e75ef36ed0402653519daa8e1fc8ddb555': fft: arm: Drop unnecessary #include, add missing ones Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>	2016-04-12 15:43:09 +01:00
Diego Biurrun	01621202aa	build: miscellaneous cosmetics Restore alphabetical order in lists, break overly long lines, do some prettyprinting, add some explanatory section comments, group parts together that belong together logically.	2016-04-07 15:26:08 +02:00
Martin Storsjö	cdb1665f70	aarch64: Make transpose_4x4H do a regular transpose Previously, ff_h264_idct_add_neon (originally in the arm version) used a non-regular transpose in order to be able to use more instructions that deal with registers as 128 bit register pairs. The aarch64 translation doesn't do it to the same extent, but brought along the same structure since it was a straight translation. This reshuffles ff_h264_idct_add_neon, bringing it closer to the C implementation, making the transpose_4x4H macro do a regular transpose, usable for other algorithms as well. Previously, the third and fourth output from transpose_4x4H were swapped, and prior to `cc29d96d5a`, the same inputs as well. In addition to just swapping the outputs, also renumber the intermediate registers for better readability (making the register order match transpose_4x8B). This runs with the same number of cycles as before. Signed-off-by: Martin Storsjö <martin@martin.st>	2016-03-26 21:25:56 +02:00
Diego Biurrun	1a094af638	fft: Split MDCT bits off from FFT	2016-03-01 10:18:28 +01:00
Diego Biurrun	97aec6e75e	fft: arm: Drop unnecessary #include, add missing ones	2016-02-26 14:34:58 +01:00
foo86	ae5b2c5250	avcodec/dca: add new decoder based on libdcadec	2016-01-31 17:09:38 +01:00
foo86	4608996772	avcodec/dca: remove old decoder Remove all files and functions which are not going to be reused, and disable all functions and FATE tests temporarily which will be.	2016-01-31 17:09:38 +01:00
James Almer	209f50e16b	avcodec/synth_filter: split off remaining code from dcadec files Signed-off-by: James Almer <jamrial@gmail.com>	2016-01-25 14:57:38 -03:00
Hendrik Leppkes	d03da3e240	Merge commit '2008f76054906e9ff6bf744800af0e5a5bfe61be' * commit '2008f76054906e9ff6bf744800af0e5a5bfe61be': dca: remove unused decode_hf function and quant_d tables Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>	2016-01-02 13:17:48 +01:00
Hendrik Leppkes	e97e2588ca	Merge commit 'a0fc780a2093784e8664f88205ee1b215e109cee' * commit 'a0fc780a2093784e8664f88205ee1b215e109cee': arm64: int32_to_float_fmul neon asm Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>	2016-01-02 11:21:16 +01:00
Hendrik Leppkes	10e075c138	Merge commit '705f5e5e155f6f280a360af220fc5b30cfcee702' * commit '705f5e5e155f6f280a360af220fc5b30cfcee702': arm64: port synth_filter_float_neon from arm Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>	2016-01-02 11:14:28 +01:00
Hendrik Leppkes	de3a33784c	Merge commit 'c33c1fa8af2b2e82418a06901b6ad17b3d61b73e' * commit 'c33c1fa8af2b2e82418a06901b6ad17b3d61b73e': arm64: convert dcadsp neon asm from arm Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>	2016-01-02 11:10:24 +01:00
Alexandra Hájková	2008f76054	dca: remove unused decode_hf function and quant_d tables They were superseded with their integer equivalents. Rename integer decode_hf to decode_hf.	2015-12-24 13:58:18 +01:00
Janne Grunau	cc29d96d5a	arm64: fix inverted register order in transpose_4x4H Fix related register order issue in ff_h264_idct_add_neon. Found-by: zjh8890 <243186085@qq.com>	2015-12-21 13:44:20 +01:00
Janne Grunau	2dba0407fd	avcodec/arm64: fix inverted register order in transpose_4x4H Fix related register order issue in ff_h264_idct_add_neon. Found-by: zjh8890 <243186085@qq.com> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2015-12-19 03:58:46 +01:00
Michael Niedermayer	95b59bfb9d	Revert "avcodec/aarch64/neon.S: Update neon.s for transpose_4x4H" The change was not correct and broke H264 This reverts commit `cd83f899c9`.	2015-12-17 21:26:37 +01:00
Janne Grunau	a0fc780a20	arm64: int32_to_float_fmul neon asm 3% faster dts decoding on a cortex-a57. cortex-a57 cortex-a53 int32_to_float_fmul_array8_c: 1270.9 4475.6 int32_to_float_fmul_array8_neon: 328.6 569.2 int32_to_float_fmul_scalar_c: 928.5 4119.6 int32_to_float_fmul_scalar_neon: 309.1 524.1	2015-12-14 16:45:02 +01:00
Janne Grunau	705f5e5e15	arm64: port synth_filter_float_neon from arm ~25% faster dts decoding overall. The checkasm CPU cycles numbers are not that useful since synth_filter_float() calls FFTContext.imdct_half(). cortex-a57 cortex-a53 synth_filter_float_c: 1866.2 3490.9 synth_filter_float_neon: 915.0 1531.5 With fftc.imdct_half forced to imdct_half_neon: cortex-a57 cortex-a53 synth_filter_float_c: 1718.4 3025.3 synth_filter_float_neon: 926.2 1530.1	2015-12-14 16:45:01 +01:00
Janne Grunau	c33c1fa8af	arm64: convert dcadsp neon asm from arm ~2% faster dts decoding overall. cortex-a57 cortex-a53 dca_decode_hf_c: 474.8 1659.9 dca_decode_hf_neon: 225.2 301.1 dca_lfe_fir0_c: 913.2 1537.7 dca_lfe_fir0_neon: 286.8 451.9 dca_lfe_fir1_c: 848.7 1711.5 dca_lfe_fir1_neon: 387.1 506.4	2015-12-14 16:45:01 +01:00
zjh8890	c18176bd55	avcodec/aarch64/neon.S: Update neon.s for transpose_4x4H The transpose_4x4H is wrong which cost me much time to find this bug. The orders of r2 and r3 are wrong, this bug waste me much time while I make aarch64 arm instruction which used the function.	2015-12-12 14:20:01 +01:00
Michael Niedermayer	5d5f8b29b4	Merge commit 'f56d8d8dd72b1ab52aa814c5a0fccabf8040ef68' * commit 'f56d8d8dd72b1ab52aa814c5a0fccabf8040ef68': h264: aarch64: intra prediction optimisations Conflicts: libavcodec/h264pred.c Merged-by: Michael Niedermayer <michael@niedermayer.cc>	2015-07-21 01:39:30 +02:00
Janne Grunau	f56d8d8dd7	h264: aarch64: intra prediction optimisations	2015-07-20 23:10:29 +02:00
Janne Grunau	c2de2cf0d2	arm64: constify src in h264qpel dsp function definitions	2015-06-24 08:41:32 +02:00

1 2

96 Commits