Diego Biurrun
fd502f4f5f
build: Generalize yasm/nasm-related variable names
...
None of them are specific to the YASM assembler.
(Cherry-picked from libav commit 39e208f4d4756367c7cd2d581847e0c1b8a429c1)
Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-21 17:00:29 -03:00
Ronald S. Bultje
f8c019944d
vp9: re-split the decoder/format/dsp interface header files.
...
The advantage here is that the internal software decoder interface is
not exposed to the DSP functions or the hardware accelerations.
2017-03-28 18:04:26 -04:00
Clément Bœsch
1c9f4b5078
lavc/vp9: split into vp9{block,data,mvs}
...
This is following Libav layout to ease merges.
2017-03-27 21:38:21 +02:00
Ronald S. Bultje
83a139e3d8
vp9: add avx2 iadst16 implementations.
...
Also a small cosmetic change to the avx2 idct16 version to make it
explicit that one of the arguments to the write-out macros is unused
for >=avx2 (it uses pmovzxbw instead of punpcklbw).
2016-11-15 11:01:36 -05:00
Ronald S. Bultje
a4edaa0270
vp9: add mxext versions of the single-block (w=8,npx=8) h/v loopfilters.
...
Each takes about 0.1% of runtime in my profiles, and they didn't have
any SIMD yet so far (we only had simd for npx=16 double-block versions).
2016-07-26 15:59:07 -04:00
Ronald S. Bultje
7ca422bb1b
vp9: add mxext versions of the single-block (w=4,npx=8) h/v loopfilters.
...
Each takes about 0.5% of runtime in my profiles, and they didn't have
any SIMD yet so far (we only had simd for npx=16 double-block versions).
2016-07-26 15:59:07 -04:00
Ronald S. Bultje
726501a34e
vp9: add 32x32 idct AVX2 implementation.
...
About 1.8x speedup compared to AVX version for full IDCT. Other
sub-IDCT scenarios also see speedups. Full --bench output for
idct_32x32_add_{bpp}_${subidct}_${opt} (50k cycles):
nop: 16.5
vp9_inv_dct_dct_32x32_add_8_1_c: 2284.4
vp9_inv_dct_dct_32x32_add_8_1_sse2: 145.0
vp9_inv_dct_dct_32x32_add_8_1_ssse3: 137.4
vp9_inv_dct_dct_32x32_add_8_1_avx: 137.1
vp9_inv_dct_dct_32x32_add_8_1_avx2: 73.2
vp9_inv_dct_dct_32x32_add_8_2_c: 14680.8
vp9_inv_dct_dct_32x32_add_8_2_sse2: 2617.2
vp9_inv_dct_dct_32x32_add_8_2_ssse3: 982.9
vp9_inv_dct_dct_32x32_add_8_2_avx: 958.5
vp9_inv_dct_dct_32x32_add_8_2_avx2: 704.2
vp9_inv_dct_dct_32x32_add_8_4_c: 14443.1
vp9_inv_dct_dct_32x32_add_8_4_sse2: 2717.1
vp9_inv_dct_dct_32x32_add_8_4_ssse3: 965.7
vp9_inv_dct_dct_32x32_add_8_4_avx: 1000.7
vp9_inv_dct_dct_32x32_add_8_4_avx2: 717.1
vp9_inv_dct_dct_32x32_add_8_8_c: 14436.4
vp9_inv_dct_dct_32x32_add_8_8_sse2: 2671.8
vp9_inv_dct_dct_32x32_add_8_8_ssse3: 1038.5
vp9_inv_dct_dct_32x32_add_8_8_avx: 983.0
vp9_inv_dct_dct_32x32_add_8_8_avx2: 729.4
vp9_inv_dct_dct_32x32_add_8_16_c: 14614.7
vp9_inv_dct_dct_32x32_add_8_16_sse2: 2701.7
vp9_inv_dct_dct_32x32_add_8_16_ssse3: 1334.4
vp9_inv_dct_dct_32x32_add_8_16_avx: 1276.7
vp9_inv_dct_dct_32x32_add_8_16_avx2: 719.5
vp9_inv_dct_dct_32x32_add_8_32_c: 14363.6
vp9_inv_dct_dct_32x32_add_8_32_sse2: 2575.6
vp9_inv_dct_dct_32x32_add_8_32_ssse3: 2633.9
vp9_inv_dct_dct_32x32_add_8_32_avx: 2539.6
vp9_inv_dct_dct_32x32_add_8_32_avx2: 1395.0
2016-07-26 15:59:07 -04:00
Ronald S. Bultje
f0a2b6249b
vp9: add 16x16 idct avx2 (8-bit).
...
checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows
that it's about 1.65x as fast as the AVX version for the full IDCT, and
similar speedups for the sub-IDCTs:
nop: 24.6
vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
vp9_inv_dct_dct_16x16_add_8_1_ssse3: 484.4
vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
vp9_inv_dct_dct_16x16_add_8_2_c: 6665.7
vp9_inv_dct_dct_16x16_add_8_2_sse2: 646.9
vp9_inv_dct_dct_16x16_add_8_2_ssse3: 455.2
vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
vp9_inv_dct_dct_16x16_add_8_4_c: 7022.7
vp9_inv_dct_dct_16x16_add_8_4_sse2: 647.4
vp9_inv_dct_dct_16x16_add_8_4_ssse3: 467.1
vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
vp9_inv_dct_dct_16x16_add_8_8_c: 6800.4
vp9_inv_dct_dct_16x16_add_8_8_sse2: 598.6
vp9_inv_dct_dct_16x16_add_8_8_ssse3: 465.7
vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
vp9_inv_dct_dct_16x16_add_8_16_c: 6626.6
vp9_inv_dct_dct_16x16_add_8_16_sse2: 599.5
vp9_inv_dct_dct_16x16_add_8_16_ssse3: 475.0
vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4
2016-07-11 10:14:58 -04:00
James Almer
70d685a77f
x86: use the new helper macros where useful
...
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-02-14 20:00:21 -03:00
Ganesh Ajjanagadde
38f4e973ef
all: fix -Wextra-semi reported on clang
...
This fixes extra semicolons that clang 3.7 on GNU/Linux warns about.
These were trigggered when built under -Wpedantic, which essentially
checks for strict ISO compliance in numerous ways.
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
2015-10-24 17:58:17 -04:00
Ronald S. Bultje
1c3be32533
vp9: add 10/12bpp mmxext-optimized iwht_iwht_4x4 function.
2015-10-13 11:05:57 -04:00
Ronald S. Bultje
344d519040
vp9: add subpel MC SIMD for 10/12bpp.
2015-09-16 21:11:34 -04:00
Ronald S. Bultje
77f359670f
vp9: add fullpel (avg) MC SIMD for 10/12bpp.
2015-09-16 21:11:34 -04:00
Ronald S. Bultje
6354ff0383
vp9: add fullpel (put) MC SIMD for 10/12bpp.
2015-09-16 21:11:34 -04:00
Ronald S. Bultje
fd8b90f5f6
vp9: fix overflow in 8x8 topleft 32x32 idct ssse3 version.
...
Also disable the mmx/iwht optimization when the bitexact flag is set.
With synthetically coded coefficients (i.e. these that lead to a
residual well outside the [-255,255] range), our optimizations will
overflow. It doesn't make sense to fix the overflows, since they can
only occur on synthetic input, not on real fwht-generated input. Thus,
add a bitexact flag that disables this optimization.
2015-09-10 07:51:16 -04:00
James Almer
c16e99e3b3
x86: check for AV_CPU_FLAG_AVXSLOW where useful
...
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2015-06-01 00:15:35 +02:00
Michael Niedermayer
cc77bb09e4
avcodec/x86/vp9dsp_init: Fix mix of declaration and statement
...
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2015-05-07 14:33:10 +02:00
Ronald S. Bultje
b224b165cb
vp9: add keyframe profile 2/3 support.
2015-05-06 15:10:41 -04:00
Ronald S. Bultje
afd8c464b7
vp9/x86: make filter_16_h work on 32-bit.
2014-12-27 16:55:16 -05:00
Ronald S. Bultje
b26bc3520f
vp9/x86: make filter_48/84/88_h work on 32-bit.
2014-12-27 16:55:15 -05:00
Ronald S. Bultje
8a1cff1c35
vp9/x86: make filter_44_h work on 32-bit.
2014-12-27 16:55:15 -05:00
Ronald S. Bultje
047088b8c6
vp9/x86: make filter_16_v work on 32-bit.
2014-12-27 16:55:14 -05:00
Ronald S. Bultje
0cc9c23ea1
vp9/x86: make filter_48/84_v work on 32-bit.
2014-12-27 16:55:14 -05:00
Ronald S. Bultje
6433a9133f
vp9/x86: make filter_88_v work on 32-bit.
2014-12-27 16:55:14 -05:00
Ronald S. Bultje
75f8e52089
vp9/x86: make filter_44_v work on 32-bit.
2014-12-27 16:55:13 -05:00
James Almer
32c836cb11
x86/vp9: remove duplicate function prototypes
...
Fixes "redundant redeclaration" warnings.
Signed-off-by: James Almer <jamrial@gmail.com>
2014-12-23 00:56:51 -03:00
Ronald S. Bultje
bdc1e3e3b2
vp9/x86: intra prediction sse2/32bit support.
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-12-19 14:07:19 +01:00
Ronald S. Bultje
cae893f692
vp9/x86: sse2 MC assembly.
...
Also a slight change to the ssse3 code, which prevents a theoretical
overflow in the sharp filter.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-12-15 02:34:05 +01:00
Ronald S. Bultje
fd77fbb390
vp9/x86: 32bit and sse2 support for vp9 inverse transform assembly
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-12-15 00:38:05 +01:00
James Almer
6b2caa321f
x86/vp9: add AVX and AVX2 MC
...
Roughly 25% faster MC than ssse3 for blocksizes 32 and 64.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2014-09-22 22:35:03 -03:00
James Almer
fc8db12a73
x86/vp9: inital AVX2 intra_pred
...
tos3k-vp9-b10000.webm on a Core i5-4200U @1.6GHz
1219 decicycles in ff_vp9_ipred_dc_32x32_ssse3, 131070 runs, 2 skips
439 decicycles in ff_vp9_ipred_dc_32x32_avx2, 131070 runs, 2 skips
3570 decicycles in ff_vp9_ipred_dc_top_32x32_ssse3, 4096 runs, 0 skips
2494 decicycles in ff_vp9_ipred_dc_top_32x32_avx2, 4096 runs, 0 skips
1419 decicycles in ff_vp9_ipred_dc_left_32x32_ssse3, 16384 runs, 0 skips
717 decicycles in ff_vp9_ipred_dc_left_32x32_avx2, 16384 runs, 0 skips
2737 decicycles in ff_vp9_ipred_tm_32x32_avx, 1024 runs, 0 skips
2088 decicycles in ff_vp9_ipred_tm_32x32_avx2, 1024 runs, 0 skips
3090 decicycles in ff_vp9_ipred_v_32x32_avx, 512 runs, 0 skips
2226 decicycles in ff_vp9_ipred_v_32x32_avx2, 512 runs, 0 skips
1565 decicycles in ff_vp9_ipred_h_32x32_avx, 1024 runs, 0 skips
922 decicycles in ff_vp9_ipred_h_32x32_avx2, 1024 runs, 0 skips
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-08 02:37:20 +02:00
Clément Bœsch
c4148a6668
x86/vp9mc: add vp9 namespace.
2014-03-29 18:13:15 +01:00
Ronald S. Bultje
fdb093c4e4
vp9/x86: intra prediction SIMD.
...
Partially based on h264_intrapred. (I hope to eventually merge these
two intrapred implementations back together.)
2014-02-17 13:39:00 +01:00
Clément Bœsch
91d85bb167
x86/vp9lpf: add ff_vp9_loop_filter_[vh]_44_16_{sse2,ssse3,avx}.
2014-02-05 07:21:06 +01:00
Clément Bœsch
c5dd73b890
x86/vp9lpf: add ff_vp9_loop_filter_h_{48,84}_16_{sse2,ssse3,avx}().
...
5.40s → 5.30s overall decode time with -threads 1 on ped1080p.webm
(i7 920, ssse3)
2014-01-30 19:34:13 +01:00
James Almer
644c32ea4b
x86/vp9lpf: add ff_vp9_loop_filter_[vh]_88_16_sse2()
...
Similar gains as the ssse3 version once again
Signed-off-by: James Almer <jamrial@gmail.com>
2014-01-28 09:30:55 +01:00
Clément Bœsch
222c46c531
x86/vp9lpf: add ff_vp9_loop_filter_[vh]_88_16_{ssse3,avx}.
...
9680 decicycles in loop_filter_v_88_16_c, 4193765 runs, 539 skips
9233 decicycles in loop_filter_h_88_16_c, 4193751 runs, 553 skips
1929 decicycles in ff_vp9_loop_filter_v_88_16_ssse3, 4194118 runs, 186 skips
2738 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193861 runs, 443 skips
5.978 → 5.417 overall decode time on ped1080p.webm (-threads 1)
Adding SSE2 support should be relatively trivial (just a matter of
changing the pshufb [mask_mix] with something else), patch welcome.
2014-01-28 07:36:38 +01:00
Ronald S. Bultje
97474d527f
vp9/x86: iwht4x4 (lossless) mmx.
2014-01-24 19:25:25 -05:00
Ronald S. Bultje
d43efa68bd
vp9/x86: 4x4 iadst SIMD (ssse3) variants.
...
Cycle measurements for intra itxfm_4x4_add on ped1080p.webm:
idct_idct: 66 -> 67 cycles (noise measurement)
idct_iadst: 199 -> 79 cycles
iadst_idct: 165 -> 70 cycles
iadst_iadst: 183 -> 82 cycles
2014-01-24 19:25:25 -05:00
Ronald S. Bultje
baf47020cd
vp9/x86: 8x8 iadst SIMD (ssse3/avx) variants.
...
Cycle measurements for intra itxfm_8x8_add on ped1080p.webm:
idct_idct: 133 -> 135 cycles (noise measurement)
idct_iadst: 900 -> 241 cycles
iadst_idct: 864 -> 215 cycles
iadst_iadst: 973 -> 310 cycles
2014-01-24 19:25:25 -05:00
James Almer
26800e3864
vp9/x86: rename ff_avg[48]_sse to ff_avg[48]_mmxext
...
pavgb is an sse integer instruction, so the mmxext flag is enough
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-18 17:08:25 +01:00
James Almer
d2a7314f1e
vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_sse2().
...
Similar gains in performance as the SSSE3 version
Signed-off-by: James Almer <jamrial@gmail.com>
2014-01-17 14:16:38 +01:00
Ronald S. Bultje
8173d1ffc0
vp9/x86: 16x16 iadst_idct, idct_iadst and iadst_iadst (ssse3+avx).
...
Sample timings on ped1080p.webm (of the ssse3 functions):
iadst_idct: 4672 -> 1175 cycles
idct_iadst: 4736 -> 1263 cycles
iadst_iadst: 4924 -> 1438 cycles
Total decoding time changed from 6.565s to 6.413s.
2014-01-16 13:49:31 +01:00
Clément Bœsch
8b4190da93
vp9/x86: add AVX for itxfm and lpf.
...
4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips
3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips
23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips
4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips
967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
2014-01-15 15:54:03 +01:00
Clément Bœsch
af68bd1c06
vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_ssse3().
...
16662 decicycles in loop_filter_h_16_16_c, 8387355 runs, 1253 skips
17510 decicycles in loop_filter_v_16_16_c, 8387516 runs, 1092 skips
4941 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 8387887 runs, 721 skips
3899 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 8387980 runs, 628 skips
Overall decode time goes from:
./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null - 8.10s user 0.02s system 99% cpu 8.126 total
to:
./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null - 6.15s user 0.04s system 99% cpu 6.199 total
(46 to 61 fps)
2014-01-12 20:20:24 +01:00
Michael Niedermayer
390452bab6
Merge commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b'
...
* commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b':
x86: avcodec: Add a bunch of missing #includes for av_cold
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-09 20:24:15 +01:00
Diego Biurrun
b0be1ae792
x86: avcodec: Add a bunch of missing #includes for av_cold
2014-01-09 15:09:07 +01:00
Ronald S. Bultje
e84d14df10
vp9/x86: idct_32x32_add_ssse3.
...
Sub-IDCTs will follow later. ped1080.webm goes from 9.295s to 8.191s
(13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter)
to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra)
or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC
form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra)
or 2392 (inter) for all 32x32s together, i.e. about ~7x faster (all
tests done on ped1080p.webm).
2014-01-07 20:43:30 -05:00
Ronald S. Bultje
18175baa54
vp9/x86: 16px MC functions (64bit only).
...
Cycle counts for large MCs (old -> new on ped1080p.webm, mx!=0&&my!=0):
16x8: 876 -> 870 (0.7%)
16x16: 1444 -> 1435 (0.7%)
16x32: 2784 -> 2748 (1.3%)
32x16: 2455 -> 2349 (4.5%)
32x32: 4641 -> 4084 (13.6%)
32x64: 9200 -> 7834 (17.4%)
64x32: 8980 -> 7197 (24.8%)
64x64: 17330 -> 13796 (25.6%)
Total decoding time goes from 9.326sec to 9.182sec.
2013-12-26 21:05:10 -05:00
Ronald S. Bultje
8d4c616fc0
vp9/x86: idct_add_16x16_ssse3.
...
Currently only dc-only and full 16x16. Other subforms will follow in the
near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3
seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes
from ~4050 to ~745 cycles.
2013-12-14 12:13:26 -05:00