FFmpeg

mirror of https://github.com/FFmpeg/FFmpeg.git synced 2025-02-09 14:14:39 +02:00

Author	SHA1	Message	Date
Martin Storsjö	ecd343aa1f	arm: vp9itxfm: Only reload the idct coeffs for the iadst_idct combination This avoids reloading them if they haven't been clobbered, if the first pass also was idct. This is similar to what was done in the aarch64 version. This is cherrypicked from libav commit 3c87039a404c5659ae9bf7454a04e186532eb40b. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:27 +01:00
Martin Storsjö	f69dd26df5	arm: vp9itxfm: Rename a macro parameter to fit better Since the same parameter is used for both input and output, the name inout is more fitting. This matches the naming used below in the dmbutterfly macro. This is cherrypicked from libav commit 79566ec8c77969d5f9be533de04b1349834cca62. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:21 +01:00
Martin Storsjö	4a5874ea8d	arm/aarch64: vp9itxfm: Fix indentation of macro arguments This is cherrypicked from libav commit 721bc37522c5c1d6a8c3cea5e9c3fcde8d256c05. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:19 +01:00
Janne Grunau	a71cd8439f	arm: vp9itxfm: Simplify the stack alignment code This is one instruction less for thumb, and only has 1/2 arm/thumb specific instructions. This is cherrypicked from libav commit e5b0fc170f85b00f7dd0ac514918fb5c95253d39. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:12 +01:00
Hendrik Leppkes	2818aaaba0	Merge commit '5f74bd31a9bd1ac7655103b11743c12d38e0419f' * commit '5f74bd31a9bd1ac7655103b11743c12d38e0419f': vp8/armv6: mc: avoid boolean expression in calculation Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>	2016-11-17 15:05:07 +01:00
Martin Storsjö	6bec60a683	arm: vp9: Add NEON loop filters

This work is sponsored by, and copyright, Google.

The implementation tries to have smart handling of cases where no pixels need the full filtering for the 8/16 width filters, skipping both calculation and writeback of the unmodified pixels in those cases. The actual effect of this is hard to test with checkasm though, since it tests the full filtering, and the benefit depends on how many filtered blocks use the shortcut.

Examples of relative speedup compared to the C version, from checkasm:

                             Cortex   A7    A8    A9   A53
vp9_loop_filter_h_4_8_neon:         2.72  2.68  1.78  3.15
vp9_loop_filter_h_8_8_neon:         2.36  2.38  1.70  2.91
vp9_loop_filter_h_16_8_neon:        1.80  1.89  1.45  2.01
vp9_loop_filter_h_16_16_neon:       2.81  2.78  2.18  3.16
vp9_loop_filter_mix2_h_44_16_neon:  2.65  2.67  1.93  3.05
vp9_loop_filter_mix2_h_48_16_neon:  2.46  2.38  1.81  2.85
vp9_loop_filter_mix2_h_84_16_neon:  2.50  2.41  1.73  2.85
vp9_loop_filter_mix2_h_88_16_neon:  2.77  2.66  1.96  3.23
vp9_loop_filter_mix2_v_44_16_neon:  4.28  4.46  3.22  5.70
vp9_loop_filter_mix2_v_48_16_neon:  3.92  4.00  3.03  5.19
vp9_loop_filter_mix2_v_84_16_neon:  3.97  4.31  2.98  5.33
vp9_loop_filter_mix2_v_88_16_neon:  3.91  4.19  3.06  5.18
vp9_loop_filter_v_4_8_neon:         4.53  4.47  3.31  6.05
vp9_loop_filter_v_8_8_neon:         3.58  3.99  2.92  5.17
vp9_loop_filter_v_16_8_neon:        3.40  3.50  2.81  4.68
vp9_loop_filter_v_16_16_neon:       4.66  4.41  3.74  6.02

The speedup vs C code is around 2-6x. The numbers are quite inconclusive though, since the checkasm test runs multiple filterings on top of each other, so later rounds might end up with different codepaths (different decisions on which filter to apply, based on input pixel differences). Disabling the early-exit in the asm doesn't give a fair comparison either though, since the C code only does the necessary calculations for each row.

Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 4-9x.

This is pretty similar in runtime to the corresponding routines in libvpx. (This is comparing vpx_lpf_vertical_16_neon, vpx_lpf_horizontal_edge_8_neon and vpx_lpf_horizontal_edge_16_neon to vp9_loop_filter_h_16_8_neon, vp9_loop_filter_v_16_8_neon and vp9_loop_filter_v_16_16_neon - note that the naming of horizontal and vertical is flipped between the libraries.)

In order to have stable, comparable numbers, the early exits in both asm versions were disabled, forcing the full filtering codepath.

                                 Cortex       A7      A8      A9     A53
vp9_loop_filter_h_16_8_neon:               597.2   472.0   482.4   415.0
libvpx vpx_lpf_vertical_16_neon:           626.0   464.5   470.7   445.0
vp9_loop_filter_v_16_8_neon:               500.2   422.5   429.7   295.0
libvpx vpx_lpf_horizontal_edge_8_neon:     586.5   414.5   415.6   383.2
vp9_loop_filter_v_16_16_neon:              905.0   784.7   791.5   546.0
libvpx vpx_lpf_horizontal_edge_16_neon:   1060.2   751.7   743.5   685.2

Our version is consistently faster on A7 and A53, marginally slower on A8, and sometimes faster, sometimes slower on A9 (marginally slower in all three tests in this particular test run).

This is an adapted cherry-pick from libav commit dd299a2d6d4d1af9528ed35a8131c35946be5973.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	2016-11-15 15:10:03 -05:00
Martin Storsjö	b4dc7c341e	arm: vp9: Add NEON itxfm routines

This work is sponsored by, and copyright, Google.

For the transforms up to 8x8, we can fit all the data (including temporaries) in registers and just do a straightforward transform of all the data. For 16x16, we do a transform of 4x16 pixels in 4 slices, using a temporary buffer. For 32x32, we transform 4x32 pixels at a time, in two steps of 4x16 pixels each.

Examples of relative speedup compared to the C version, from checkasm:

                            Cortex   A7    A8    A9   A53
vp9_inv_adst_adst_4x4_add_neon:    3.39  5.83  4.17  4.01
vp9_inv_adst_adst_8x8_add_neon:    3.79  4.86  4.23  3.98
vp9_inv_adst_adst_16x16_add_neon:  3.33  4.36  4.11  4.16
vp9_inv_dct_dct_4x4_add_neon:      4.06  6.16  4.59  4.46
vp9_inv_dct_dct_8x8_add_neon:      4.61  6.01  4.98  4.86
vp9_inv_dct_dct_16x16_add_neon:    3.35  3.44  3.36  3.79
vp9_inv_dct_dct_32x32_add_neon:    3.89  3.50  3.79  4.42
vp9_inv_wht_wht_4x4_add_neon:      3.22  5.13  3.53  3.77

Thus, the speedup vs C code is around 3-6x.

This is mostly marginally faster than the corresponding routines in libvpx on most cores, tested with their 32x32 idct (compared to vpx_idct32x32_1024_add_neon). These numbers are slightly in libvpx's favour since their version doesn't clear the input buffer like ours does (although the effect of that on the total runtime probably is negligible.)

                             Cortex        A7       A8       A9      A53
vp9_inv_dct_dct_32x32_add_neon:       18436.8  16874.1  14235.1  11988.9
libvpx vpx_idct32x32_1024_add_neon    20789.0  13344.3  15049.9  13030.5

Only on the Cortex A8, the libvpx function is faster. On the other cores, ours is slightly faster even though ours has got source block clearing integrated.

This is an adapted cherry-pick from libav commits a67ae67083151f2f9595a1f2d17b601da19b939e and 52d196fb30fb6628921b5f1b31e7bd11eb7e1d9a.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	2016-11-15 15:10:03 -05:00
Martin Storsjö	68caef9d48	arm: vp9: Add NEON optimizations of VP9 MC functions

This work is sponsored by, and copyright, Google.

The filter coefficients are signed values, where the product of the multiplication with one individual filter coefficient doesn't overflow a 16 bit signed value (the largest filter coefficient is 127). But when the products are accumulated, the resulting sum can overflow the 16 bit signed range. Instead of accumulating in 32 bit, we accumulate the largest product (either index 3 or 4) last with a saturated addition.

(The VP8 MC asm does something similar, but slightly simpler, by accumulating each half of the filter separately. In the VP9 MC filters, each half of the filter can also overflow though, so the largest component has to be handled individually.)

Examples of relative speedup compared to the C version, from checkasm:

                        Cortex      A7     A8     A9    A53
vp9_avg4_neon:                    1.71   1.15   1.42   1.49
vp9_avg8_neon:                    2.51   3.63   3.14   2.58
vp9_avg16_neon:                   2.95   6.76   3.01   2.84
vp9_avg32_neon:                   3.29   6.64   2.85   3.00
vp9_avg64_neon:                   3.47   6.67   3.14   2.80
vp9_avg_8tap_smooth_4h_neon:      3.22   4.73   2.76   4.67
vp9_avg_8tap_smooth_4hv_neon:     3.67   4.76   3.28   4.71
vp9_avg_8tap_smooth_4v_neon:      5.52   7.60   4.60   6.31
vp9_avg_8tap_smooth_8h_neon:      6.22   9.04   5.12   9.32
vp9_avg_8tap_smooth_8hv_neon:     6.38   8.21   5.72   8.17
vp9_avg_8tap_smooth_8v_neon:      9.22  12.66   8.15  11.10
vp9_avg_8tap_smooth_64h_neon:     7.02  10.23   5.54  11.58
vp9_avg_8tap_smooth_64hv_neon:    6.76   9.46   5.93   9.40
vp9_avg_8tap_smooth_64v_neon:    10.76  14.13   9.46  13.37
vp9_put4_neon:                    1.11   1.47   1.00   1.21
vp9_put8_neon:                    1.23   2.17   1.94   1.48
vp9_put16_neon:                   1.63   4.02   1.73   1.97
vp9_put32_neon:                   1.56   4.92   2.00   1.96
vp9_put64_neon:                   2.10   5.28   2.03   2.35
vp9_put_8tap_smooth_4h_neon:      3.11   4.35   2.63   4.35
vp9_put_8tap_smooth_4hv_neon:     3.67   4.69   3.25   4.71
vp9_put_8tap_smooth_4v_neon:      5.45   7.27   4.49   6.52
vp9_put_8tap_smooth_8h_neon:      5.97   8.18   4.81   8.56
vp9_put_8tap_smooth_8hv_neon:     6.39   7.90   5.64   8.15
vp9_put_8tap_smooth_8v_neon:      9.03  11.84   8.07  11.51
vp9_put_8tap_smooth_64h_neon:     6.78   9.48   4.88  10.89
vp9_put_8tap_smooth_64hv_neon:    6.99   8.87   5.94   9.56
vp9_put_8tap_smooth_64v_neon:    10.69  13.30   9.43  14.34

For the larger 8tap filters, the speedup vs C code is around 5-14x.

This is significantly faster than libvpx's implementation of the same functions, at least when comparing the put_8tap_smooth_64 functions (compared to vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon from libvpx).

Absolute runtimes from checkasm:

                          Cortex        A7       A8       A9      A53
vp9_put_8tap_smooth_64h_neon:      20150.3  14489.4  19733.6  10863.7
libvpx vpx_convolve8_horiz_neon:   52623.3  19736.4  21907.7  25027.7
vp9_put_8tap_smooth_64v_neon:      14455.0  12303.9  13746.4   9628.9
libvpx vpx_convolve8_vert_neon:    42090.0  17706.2  17659.9  16941.2

Thus, on the A9, the horizontal filter is only marginally faster than libvpx, while our version is significantly faster on the other cores, and the vertical filter is significantly faster on all cores. The difference is especially large on the A7.

The libvpx implementation does the accumulation in 32 bit, which probably explains most of the differences.

This is an adapted cherry-pick from libav commits ffbd1d2b0002576ef0d976a41ff959c635373fdc, 392caa65df3efa8b2d48a80f08a6af4892c61c08, 557c1675cf0e803b2fee43b4c8b58433842c84d0 and 11623217e3c9b859daee544e31acdd0821b61039.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	2016-11-15 15:10:03 -05:00
Hendrik Leppkes	51f5542c77	Merge commit 'e8b96a77010dd62624c3c65c357d7ae3b397ceaa' * commit 'e8b96a77010dd62624c3c65c357d7ae3b397ceaa': arm: Fix a typo in a comment Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>	2016-11-14 15:21:49 +01:00
James Almer	42111e8543	avcodec: fix arguments on xmm/neon clobber test wrappers Signed-off-by: James Almer <jamrial@gmail.com>	2016-10-02 02:15:47 -03:00
James Almer	449f263f9f	avcodec: add missing xmm/neon clobber test wrappers for the new encode API Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: James Almer <jamrial@gmail.com>	2016-10-01 14:08:50 -03:00
Diego Biurrun	e4a94d8b36	h264chroma: Change type of stride parameters to ptrdiff_t This avoids SIMD-optimized functions having to sign-extend their stride argument manually to be able to do pointer arithmetic.	2016-09-29 14:48:04 +02:00
Diego Biurrun	2ec9fa5ec6	idct: Change type of array stride parameters to ptrdiff_t ptrdiff_t is the correct type for array strides and similar.	2016-09-29 14:48:03 +02:00
Diego Biurrun	92c5755a18	hpeldsp: arm: Update comments left behind in 25841dfe806a13de526ae09c11149ab1f83555a8	2016-09-29 14:48:03 +02:00
Anton Khirnov	de2ae3c1fa	lavc: add clobber tests for the new encoding/decoding API	2016-09-28 10:01:52 +02:00
Xiaolei Yu	5a70e56f2f	avcodec: fix vc1dsp dependencies	2016-09-25 13:11:45 +02:00
Anton Khirnov	683da86aab	audiodsp: reorder arguments for vector_clipf This will make the x86 asm simpler. ARM conversion by Martin Storsjö <martin@martin.st> and Janne Grunau <janne-libav@jannau.net>	2016-09-22 09:47:52 +02:00
Anton Khirnov	eea9857bfd	blockdsp: drop the high_bit_depth parameter It has no effect, since the code is supposed to operate the same way for any bit depth.	2016-09-22 09:47:52 +02:00
Diego Biurrun	de452e5037	pixblockdsp: Change type of stride parameters to ptrdiff_t This avoids SIMD-optimized functions having to sign-extend their line size argument manually to be able to do pointer arithmetic. Also adjust parameter names to be "stride" everywhere.	2016-09-14 14:12:36 +02:00
Diego Biurrun	721d57e608	vp56: Separate VP5 and VP6 dsp initialization VP5 has no arch-specific optimizations (nor will it get some in the future), so it makes no sense to try to share dsp init code with VP6.	2016-08-26 11:50:22 +02:00
Diego Biurrun	802727b538	vp8: Update some assembly comments left unchanged in bd66f073fe7286bd3c	2016-08-26 11:36:53 +02:00
Diego Biurrun	d9d26a3674	vp56: Change type of stride parameters to ptrdiff_t This avoids SIMD-optimized functions having to sign-extend their line size argument manually to be able to do pointer arithmetic.	2016-08-26 11:36:26 +02:00
Diego Biurrun	6892df9294	vp3: Change type of stride parameters to ptrdiff_t This avoids SIMD-optimized functions having to sign-extend their stride argument manually to be able to do pointer arithmetic. Also adjust parameter names to be "stride" everywhere.	2016-08-26 11:36:26 +02:00
Diego Biurrun	014852e932	simple_idct: arm: Drop disabled code variant	2016-08-17 12:21:54 +02:00
Janne Grunau	5f74bd31a9	vp8/armv6: mc: avoid boolean expression in calculation GNU as evaluates true as '-1' while Apple's variant and llvm's internal assembler evaluate it as '1'. The best way to avoid this madness is to eliminate boolean expressions instead of trying to fix it with preprocessor directives. Use a direct formula to calculate the required temporary space on the stack in ff_put_vp8_{epel,bilin}{4,8,16}_h[246]v[246]_armv6(). Fixes a checkasm segfault in vp8dsp.mc when using llvm's internal assembler for a non-Apple target.	2016-07-10 13:35:41 +02:00
Martin Storsjö	e8b96a7701	arm: Fix a typo in a comment Signed-off-by: Martin Storsjö <martin@martin.st>	2016-07-06 22:58:51 +03:00
James Almer	293484fa5e	avcodec: add missing xmm/neon clobber test wrappers for the new decode API Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: James Almer <jamrial@gmail.com>	2016-07-03 18:04:30 -03:00
Clément Bœsch	4a081f224e	libavcodec: fix constness in clobber test avcodec_open2() wrappers Signed-off-by: Martin Storsjö <martin@martin.st>	2016-06-26 21:34:04 +03:00
Clément Bœsch	dfd0c0f981	lavc/neontest: fix constness in arm/aarch64 avcodec_open2() wrappers	2016-06-25 13:41:13 +02:00
Clément Bœsch	5d48e4eafa	Merge commit 'a6a750c7ef240b72ce01e9653343a0ddf247d196' * commit 'a6a750c7ef240b72ce01e9653343a0ddf247d196': tests: Move all test programs to a subdirectory Merged-by: Clément Bœsch <clement@stupeflix.com>	2016-06-22 13:44:34 +02:00
Clément Bœsch	8ef57a0d61	Merge commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb' * commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb': cosmetics: Fix spelling mistakes Merged-by: Clément Bœsch <u@pkh.me>	2016-06-21 21:55:34 +02:00
Diego Biurrun	a6a750c7ef	tests: Move all test programs to a subdirectory	2016-05-13 14:55:56 +02:00
Derek Buitenhuis	ca5ec2bf51	Merge commit '01621202aad7e27b2a05c71d9ad7a19dfcbe17ec' * commit '01621202aad7e27b2a05c71d9ad7a19dfcbe17ec': build: miscellaneous cosmetics Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>	2016-05-09 16:25:28 +01:00
Vittorio Giovara	41ed7ab45f	cosmetics: Fix spelling mistakes Signed-off-by: Diego Biurrun <diego@biurrun.de>	2016-05-04 18:16:21 +02:00
James Almer	d7815df402	arm/rdft_init: fix license header Signed-off-by: James Almer <jamrial@gmail.com>	2016-04-12 15:01:19 -03:00
Derek Buitenhuis	2605967f7e	Merge commit '4c297249ac0f513a610a62691ce96d6b62f65b94' * commit '4c297249ac0f513a610a62691ce96d6b62f65b94': rdft: arm: Split RDFT initialization into a separate file Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>	2016-04-12 15:43:34 +01:00
Derek Buitenhuis	197fa698c6	Merge commit '97aec6e75ef36ed0402653519daa8e1fc8ddb555' * commit '97aec6e75ef36ed0402653519daa8e1fc8ddb555': fft: arm: Drop unnecessary #include, add missing ones Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>	2016-04-12 15:43:09 +01:00
Diego Biurrun	01621202aa	build: miscellaneous cosmetics Restore alphabetical order in lists, break overly long lines, do some prettyprinting, add some explanatory section comments, group parts together that belong together logically.	2016-04-07 15:26:08 +02:00
Diego Biurrun	1a094af638	fft: Split MDCT bits off from FFT	2016-03-01 10:18:28 +01:00
Diego Biurrun	4c297249ac	rdft: arm: Split RDFT initialization into a separate file	2016-02-26 14:34:58 +01:00
Diego Biurrun	97aec6e75e	fft: arm: Drop unnecessary #include, add missing ones	2016-02-26 14:34:58 +01:00
Derek Buitenhuis	b056482ef3	Merge commit '15a24614aef5836af3cd2c7cc3b2b737eee6bf3c' * commit '15a24614aef5836af3cd2c7cc3b2b737eee6bf3c': build: Add vc1dsp component for more fine-grained dependencies Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>	2016-02-24 18:21:38 +00:00
Diego Biurrun	15a24614ae	build: Add vc1dsp component for more fine-grained dependencies	2016-02-19 20:38:18 +01:00
foo86	ae5b2c5250	avcodec/dca: add new decoder based on libdcadec	2016-01-31 17:09:38 +01:00
foo86	4608996772	avcodec/dca: remove old decoder Remove all files and functions which are not going to be reused, and disable all functions and FATE tests temporarily which will be.	2016-01-31 17:09:38 +01:00
James Almer	209f50e16b	avcodec/synth_filter: split off remaining code from dcadec files Signed-off-by: James Almer <jamrial@gmail.com>	2016-01-25 14:57:38 -03:00
Hendrik Leppkes	d03da3e240	Merge commit '2008f76054906e9ff6bf744800af0e5a5bfe61be' * commit '2008f76054906e9ff6bf744800af0e5a5bfe61be': dca: remove unused decode_hf function and quant_d tables Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>	2016-01-02 13:17:48 +01:00
Hendrik Leppkes	e23c3a13e3	Merge commit '90b1b9350c0a97c4065ae9054b83e57f48a0de1f' * commit '90b1b9350c0a97c4065ae9054b83e57f48a0de1f': arm: add ff_int32_to_float_fmul_array8_neon Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>	2016-01-02 11:21:36 +01:00
Hendrik Leppkes	e754c8e8ca	Merge commit 'e2710e790c09e49e86baa58c6063af0097cc8cb0' * commit 'e2710e790c09e49e86baa58c6063af0097cc8cb0': arm: add a cpu flag for the VFPv2 vector mode Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>	2016-01-02 11:01:29 +01:00
Alexandra Hájková	2008f76054	dca: remove unused decode_hf function and quant_d tables They were superseded with their integer equivalents. Rename integer decode_hf to decode_hf.	2015-12-24 13:58:18 +01:00
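
The message of commit 68caef9d48 in the list above describes accumulating the 8-tap filter products in 16 bit and adding the largest product last with a saturated addition. The following is a minimal scalar C sketch of that idea, not the NEON code from the commit. The function names, the tap values in example_taps and the self-check are hypothetical and chosen only for illustration, and the sketch assumes that the true sum of the seven smaller products fits in a signed 16 bit value.

```c
#include <stdint.h>

/* Saturating signed 16 bit addition, a scalar stand-in for a saturating SIMD add. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Round, shift right by 7 and clip to 0..255; non-positive sums always clip to 0. */
static uint8_t round_shift_clip(int32_t sum)
{
    int32_t v;
    if (sum <= 0) return 0;
    v = (sum + 64) >> 7;
    return v > 255 ? 255 : (uint8_t)v;
}

/* Reference variant: accumulate all eight products in 32 bit. */
static uint8_t filter8_ref(const uint8_t *src, const int8_t *f)
{
    int32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += src[i] * f[i];
    return round_shift_clip(sum);
}

/* 16 bit variant: wrapping accumulation of the seven smaller products,
 * then one saturating add of the largest product (index `largest`). */
static uint8_t filter8_sat16(const uint8_t *src, const int8_t *f, int largest)
{
    uint16_t acc = 0; /* wraps modulo 2^16, like a non-saturating 16 bit lane */
    int32_t partial;
    int16_t sum;
    for (int i = 0; i < 8; i++)
        if (i != largest)
            acc = (uint16_t)(acc + (uint16_t)(src[i] * f[i]));
    /* Reinterpret the wrapped value as signed; assumed to equal the true partial sum. */
    partial = acc >= 0x8000 ? (int32_t)acc - 0x10000 : (int32_t)acc;
    sum = sat_add16((int16_t)partial, (int16_t)(src[largest] * f[largest]));
    return round_shift_clip(sum);
}

/* Hypothetical taps for illustration: they sum to 128, largest at index 3. */
static const int8_t example_taps[8] = { -1, 6, -19, 78, 78, -19, 6, -1 };

/* Compare both variants on pseudo-random pixel rows; returns the mismatch count. */
int count_mismatches(int rounds)
{
    uint32_t state = 1;
    int bad = 0;
    for (int r = 0; r < rounds; r++) {
        uint8_t px[8];
        for (int i = 0; i < 8; i++) {
            state = state * 1664525u + 1013904223u; /* simple LCG */
            px[i] = (uint8_t)(state >> 24);
        }
        if (filter8_ref(px, example_taps) != filter8_sat16(px, example_taps, 3))
            bad++;
    }
    return bad;
}
```

With these example taps the two variants should agree for every input: whenever the 16 bit sum saturates, the true sum is already large enough that both results clip to 255 after rounding. Calling count_mismatches() with any number of rounds is one way to check that on a given machine.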


850 Commits