FFmpeg

mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-07 11:13:41 +02:00

Author	SHA1	Message	Date
Ronald S. Bultje	d32d0593f1	vp9: disable more pmulhrsw optimizations in idct16/32. For idct16, only when called from an adst16x16 variant, so impact is minor. For idct32, for all, so relatively major impact.	2015-05-14 14:15:27 -04:00
Ronald S. Bultje	96d30c3495	vp9: disable all pmulhrsw in 8/16 iadst x86 optimizations. They all overflow in various samples that are considered valid input.	2015-05-14 13:39:37 -04:00
Ronald S. Bultje	3de13d5212	vp9: remove another optimization branch in iadst16 which causes overflows. See sample vp90-2-14-resize-fp-tiles-16-8.webm from the vp9 test vector set to reproduce the issue.	2015-04-24 16:54:31 +02:00
Ronald S. Bultje	d02d04a18f	vp9: remove one optimization branch in iadst16 which causes overflows. See sample vp90-2-14-resize-fp-tiles-16-8-4-2-1.webm from the vp9 test vector set which reproduces the issue. This probably costs a few cycles, but I don't think there's an easy way to work around that. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	2015-04-22 21:37:10 +02:00
James Almer	92d903afaa	x86/vp9dsp: fix clobbering of xmm6 on IDCT sse2 functions Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: James Almer <jamrial@gmail.com>	2015-02-08 00:50:39 -03:00
Ronald S. Bultje	0a7964dca5	vp9/x86: save one register on 32bit idct32x32. Fixes build on win32. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	2014-12-16 02:51:26 +01:00
Ronald S. Bultje	fd77fbb390	vp9/x86: 32bit and sse2 support for vp9 inverse transform assembly Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	2014-12-15 00:38:05 +01:00
Christophe Gisquet	4e128ab0b1	x86: vpx/h264/hevc/mpeg2: share constants Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	2014-08-06 18:36:31 +02:00
Ronald S. Bultje	c9e6325ed9	vp9/x86: use explicit register for relative stack references. Before this patch, we explicitly modify rsp, which isn't necessarily universally acceptable, since the space under the stack pointer might be modified in things like signal handlers. Therefore, use an explicit register to hold the stack pointer relative to the bottom of the stack (i.e. rsp). This will also clear out valgrind errors about the use of uninitialized data that started occurring after the idct16x16/ssse3 optimizations were first merged.	2014-01-24 19:25:25 -05:00
Ronald S. Bultje	97474d527f	vp9/x86: iwht4x4 (lossless) mmx.	2014-01-24 19:25:25 -05:00
Ronald S. Bultje	d43efa68bd	vp9/x86: 4x4 iadst SIMD (ssse3) variants. Cycle measurements for intra itxfm_4x4_add on ped1080p.webm: idct_idct: 66 -> 67 cycles (noise measurement); idct_iadst: 199 -> 79 cycles; iadst_idct: 165 -> 70 cycles; iadst_iadst: 183 -> 82 cycles.	2014-01-24 19:25:25 -05:00
Ronald S. Bultje	baf47020cd	vp9/x86: 8x8 iadst SIMD (ssse3/avx) variants. Cycle measurements for intra itxfm_8x8_add on ped1080p.webm: idct_idct: 133 -> 135 cycles (noise measurement); idct_iadst: 900 -> 241 cycles; iadst_idct: 864 -> 215 cycles; iadst_iadst: 973 -> 310 cycles.	2014-01-24 19:25:25 -05:00
Ronald S. Bultje	8173d1ffc0	vp9/x86: 16x16 iadst_idct, idct_iadst and iadst_iadst (ssse3+avx). Sample timings on ped1080p.webm (of the ssse3 functions): iadst_idct: 4672 -> 1175 cycles; idct_iadst: 4736 -> 1263 cycles; iadst_iadst: 4924 -> 1438 cycles. Total decoding time changed from 6.565s to 6.413s.	2014-01-16 13:49:31 +01:00
Clément Bœsch	8b4190da93	vp9/x86: add AVX for itxfm and lpf. 4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips; 3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips; 3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips; 2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips; 23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips; 19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips; 4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips; 3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips; 967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips; 887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips.	2014-01-15 15:54:03 +01:00
Clément Bœsch	e11ceea68f	vp9/x86: factor out some code in VP9_UNPACK_MULSUB_2W_4X.	2014-01-12 20:19:00 +01:00
Clément Bœsch	c9aa0b8f70	vp9/x86: remove reg redundancy in VP9_MULSUB_2W_2X.	2014-01-12 20:18:55 +01:00
Clément Bœsch	7c55ee6168	vp9/x86: merge IDCT coef macros.	2014-01-12 20:18:44 +01:00
Ronald S. Bultje	c6fe984f2f	vp9/x86: make STORE_2X2 macro local. Prevents this assembler warning: libavcodec/x86/vp9itxfm.asm:1208: warning: (VP9_IDCT32_1D:309) redefining multi-line macro `STORE_2X2' Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	2014-01-08 14:07:15 +01:00
Ronald S. Bultje	04a187fb2a	vp9/x86: idct_32x32_add_ssse3 sub-8x8-idct. Runtime of the full 32x32 idct goes from 2446 to 2441 cycles (intra) or from 1425 to 1306 cycles (inter). Overall runtime is not significantly affected.	2014-01-07 20:43:35 -05:00
Ronald S. Bultje	37b001d14d	vp9/x86: idct_32x32_add_ssse3 sub-16x16-idct. Runtime of all IDCTs together goes from 3327 to 2473 cycles (intra, i.e. ~35% faster) or from 2312 to 1448 cycles (inter, i.e. ~60% faster). Total decode time of ped1080p.webm goes from 8.086sec to 7.974sec (1.4% faster).	2014-01-07 20:43:34 -05:00
Ronald S. Bultje	e84d14df10	vp9/x86: idct_32x32_add_ssse3. Sub-IDCTs will follow later. ped1080p.webm goes from 9.295s to 8.191s (13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter) to 403 (intra) or 329 (inter) cycles for the DC-only form, and from 23755 (intra) or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC form. Averaged over all 32x32s together, it goes from 23393 (intra) or 16612 (inter) to 3449 (intra) or 2392 (inter), i.e. about 7x faster (all tests done on ped1080p.webm).	2014-01-07 20:43:30 -05:00
Ronald S. Bultje	0d9375fc90	vp9/x86: 16x16 sub-IDCT for top-left 8x8 subblock (eob <= 38). Sub8x8 speed (w/o dc-only case) goes from ~750 cycles (inter) or ~735 cycles (intra) to ~415 cycles (inter) or ~430 cycles (intra). Average overall 16x16 idct speed goes from ~635 cycles (inter) or ~720 cycles (intra) to ~415 cycles (inter) or ~545 (intra) - all measurements done using ped1080p.webm.	2013-12-26 07:40:25 -05:00
Ronald S. Bultje	8d4c616fc0	vp9/x86: idct_add_16x16_ssse3. Currently only dc-only and full 16x16. Other subforms will follow in the near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3 seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes from ~4050 to ~745 cycles.	2013-12-14 12:13:26 -05:00
Ronald S. Bultje	92436e8ad9	vp9: implement top/left half (4x4) sub-8x8-IDCT. For that specific case (eob>3&&eob<=12), runtime of idct8x8 goes from 668 to 477 cycles. For all idct8x8, runtime goes from 521 to 490 cycles.	2013-12-07 12:39:36 -05:00
Ronald S. Bultje	b2045c44a9	vp9: split pre-load of 11585x2 out of 1d idct macro. This allows us to load it only once, instead of twice, in this function.	2013-12-07 12:39:36 -05:00
Ronald S. Bultje	f9a0d4c6e0	vp9: minor refactorings in idct ssse3 assembly. Make register usage in macros explicit; change mulsub_2w_4x to use 2 instead of 3 temp registers.	2013-12-07 12:39:35 -05:00
Ronald S. Bultje	8729964b99	vp9: split x86 assembly in two files. (And in future, loopfilter or intra pred could be put in their own respective files also.)	2013-12-07 12:39:35 -05:00

27 Commits
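
Several of the commits above disable pmulhrsw-based paths because they overflow on input that is considered valid. The Python sketch below is a scalar model, not code taken from FFmpeg, of one plausible way such an overflow can arise. It assumes the usual VP9 rounding multiply `(x * c + 8192) >> 14` for the reference, and models pmulhrsw as `(a * b + 0x4000) >> 15` with the doubled constant (the "11585x2" mentioned in b2045c44a9). The function names and the sample values are hypothetical and chosen only for illustration.

```python
COS_PI_4_Q14 = 11585  # round(cos(pi/4) * 2**14), the constant named in the log


def to_int16(v):
    """Wrap an integer to signed 16 bits, as a packed-word SIMD add would."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def pmulhrsw(a, b):
    """Scalar model of one 16-bit lane of the SSSE3 pmulhrsw instruction."""
    return to_int16((a * b + 0x4000) >> 15)


def ref_round_mul(x, c):
    """Reference rounding multiply, computed at full integer width."""
    return (x * c + 8192) >> 14


def ref_butterfly(a, b):
    """Reference: the sum a + b is kept at full width before multiplying."""
    return ref_round_mul(a + b, COS_PI_4_Q14)


def simd_butterfly(a, b):
    """Assumed SIMD shortcut: the sum wraps to 16 bits before pmulhrsw."""
    return pmulhrsw(to_int16(a + b), 2 * COS_PI_4_Q14)


if __name__ == "__main__":
    # While the multiplicand fits in 16 bits, both forms agree.
    print(ref_butterfly(1000, 2000), simd_butterfly(1000, 2000))
    # Hypothetical inputs whose sum exceeds the int16 range: the forms diverge.
    print(ref_butterfly(20000, 15000), simd_butterfly(20000, 15000))
```

In this model the two forms are bit-identical whenever the multiplicand fits in a signed 16-bit word. They diverge only when an intermediate sum wraps, even though the final reference result would itself fit in 16 bits. That is consistent with the commit messages, which describe the problem only as an overflow.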