Paul B Mahol
6d09d6edbc
avcodec/magicyuv: add 10 bit support
...
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2016-12-20 13:32:15 +01:00
James Darnley
acdd2d805d
avcodec/h264: resolve assert being triggered when stack is not aligned
...
32-bit msvc.
2016-12-07 22:32:19 +01:00
James Darnley
728651df06
avcodec/h264: mmx2, sse2, avx 10-bit 4:2:2 h chroma deblock/loop filter
...
Yorkfield:
- mmx2: 2.53x (504 vs. 199 cycles)
- sse2: 3.83x (504 vs. 131 cycles)
Nehalem:
- mmx2: 2.42x (365 vs. 151 cycles)
- sse2: 3.56x (365 vs. 103 cycles)
Skylake:
- mmx2: 1.81x (308 vs. 170 cycles)
- sse2: 2.84x (308 vs. 108 cycles)
- avx: 2.93x (308 vs. 105 cycles)
2016-12-07 00:29:13 +01:00
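The speedup factors quoted in the message above read as the ratio of the baseline cycle count to the optimized one. A minimal sketch of that arithmetic, assuming the first number in each "X vs. Y cycles" pair is the baseline; the `speedup` helper is a hypothetical name, not part of FFmpeg. Because the cycle counts in the message are rounded, a recomputed ratio can differ from the quoted one in the last digit.

```python
def speedup(baseline_cycles, optimized_cycles):
    """Return how many times faster the optimized version is."""
    return baseline_cycles / optimized_cycles

# Cycle counts copied from the commit message above.
yorkfield_mmx2 = speedup(504, 199)
skylake_avx = speedup(308, 105)

print(f"Yorkfield mmx2: {yorkfield_mmx2:.2f}x")
print(f"Skylake avx:    {skylake_avx:.2f}x")
```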
James Darnley
add21d0bb3
avcodec/h264: mmx2, sse2, avx 10-bit h chroma deblock/loop filter
...
Yorkfield:
- mmx2: 2.45x (279 vs. 114 cycles)
- sse2: 3.36x (279 vs. 83 cycles)
Nehalem:
- mmx2: 2.10x (192 vs. 92 cycles)
- sse2: 2.84x (192 vs. 68 cycles)
Skylake:
- mmx2: 1.75x (170 vs. 97 cycles)
- sse2: 2.47x (170 vs. 69 cycles)
- avx: 2.47x (170 vs. 69 cycles)
2016-12-07 00:29:13 +01:00
James Darnley
58ca2ef62e
whitespace changes after last commit
2016-12-07 00:29:13 +01:00
James Darnley
f33714a694
avcodec/h264: clean up and expand x86 function definitions
2016-12-07 00:29:13 +01:00
James Darnley
13d71c28cc
avcodec/h264: sse2 and avx 4:2:2 idct add8 10-bit functions
...
Yorkfield:
- sse2:
- complex: 4.13x faster (1514 vs. 367 cycles)
- simple: 4.38x faster (1836 vs. 419 cycles)
Skylake:
- sse2:
- complex: 3.61x faster (936 vs. 260 cycles)
- simple: 3.97x faster (1126 vs. 284 cycles)
- avx (versus sse2):
- complex: 1.07x faster (260 vs. 244 cycles)
- simple: 1.03x faster (284 vs. 274 cycles)
2016-11-30 22:58:28 +01:00
James Darnley
1dae7ffa0b
avcodec/h264: mmx 4:2:2 idct add8 function
...
2.87 times faster (1830 vs. 638 cycles)
2016-11-30 22:58:27 +01:00
James Darnley
815ea8c6cc
avcodec/h264: mmxext 4:2:2 chroma intra deblock/loop filter
...
2.1 times faster (401 vs. 194 cycles)
2016-11-30 22:58:27 +01:00
James Almer
2de1c79b61
x86/vp9itxfm: add missing AVX2 guards
...
Fixes compilation with Yasm 1.1.0 and older.
Signed-off-by: James Almer <jamrial@gmail.com>
2016-11-18 17:01:11 -03:00
Ronald S. Bultje
83a139e3d8
vp9: add avx2 iadst16 implementations.
...
Also a small cosmetic change to the avx2 idct16 version to make it
explicit that one of the arguments to the write-out macros is unused
for >=avx2 (it uses pmovzxbw instead of punpcklbw).
2016-11-15 11:01:36 -05:00
Hendrik Leppkes
db854c6c4a
Merge commit '4a081f224e12f4227ae966bcbdd5384f22121ecf'
...
* commit '4a081f224e12f4227ae966bcbdd5384f22121ecf':
libavcodec: fix constness in clobber test avcodec_open2() wrappers
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
2016-11-13 17:30:33 +01:00
Andreas Cadhalpun
c8a6eb58d7
doc: fix spelling errors
...
Thanks to Mathieu Malaterre <malat@debian.org> for reporting the
Que/Queue typo. (https://bugs.debian.org/839542 )
Reviewed-by: Lou Logan <lou@lrcd.com>
Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com>
2016-10-21 23:58:47 +02:00
Rostislav Pehlivanov
d2ae5f77c6
aacenc: add SIMD optimizations for abs_pow34 and quantization
...
Performance improvements:
quant_bands:
with: 681 decicycles in quant_bands, 8388453 runs, 155 skips
without: 1190 decicycles in quant_bands, 8388386 runs, 222 skips
Around 42% for the function
Twoloop coder:
abs_pow34:
with/without: 7.82s/8.17s
Around 4% for the entire encoder
Both:
with/without: 7.15s/8.17s
Around 12% for the entire encoder
Fast coder:
abs_pow34:
with/without: 3.40s/3.77s
Around 10% for the entire encoder
Both:
with/without: 3.02s/3.77s
Around 20% faster for the entire encoder
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
Tested-by: Michael Niedermayer <michael@niedermayer.cc>
Reviewed-by: James Almer <jamrial@gmail.com>
2016-10-18 21:41:18 +01:00
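The "Around N%" figures in the message above appear to be the relative reduction in time or cycles, that is (without - with) / without. A small sketch under that assumption, using numbers from the message; `percent_saved` is a hypothetical helper name.

```python
def percent_saved(without, with_opt):
    """Relative reduction, in percent, when the optimization is enabled."""
    return 100.0 * (without - with_opt) / without

# Values copied from the commit message above.
quant_bands = percent_saved(1190, 681)    # decicycles per call
twoloop_both = percent_saved(8.17, 7.15)  # seconds, twoloop coder
fast_both = percent_saved(3.77, 3.02)     # seconds, fast coder

for label, value in [("quant_bands", quant_bands),
                     ("twoloop, both", twoloop_both),
                     ("fast, both", fast_both)]:
    print(f"{label}: about {value:.1f}% less")
```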
James Almer
42111e8543
avcodec: fix arguments on xmm/neon clobber test wrappers
...
Signed-off-by: James Almer <jamrial@gmail.com>
2016-10-02 02:15:47 -03:00
James Almer
449f263f9f
avcodec: add missing xmm/neon clobber test wrappers for the new encode API
...
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-10-01 14:08:50 -03:00
Hendrik Leppkes
5ae0ad001a
x86/h264_weight: use appropriate register size for weight parameters
...
Fixes trac 5579
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Acked-by: Michael Niedermayer <michael@niedermayer.cc>
2016-09-23 16:40:57 +02:00
Michael Niedermayer
bc26fe8927
avcodec/h264: Use ptrdiff_t for (bi)weight functions
...
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2016-09-23 04:10:44 +02:00
James Almer
d950279cbf
avcodec/ttadsp: cosmetics
...
Clean some header includes and use the same naming scheme as
in ttaencdsp
Signed-off-by: James Almer <jamrial@gmail.com>
2016-08-06 18:27:01 -03:00
James Almer
efc9d5c4bc
x86/ttaenc: add ff_ttaenc_filter_process_{ssse3,sse4}
...
Signed-off-by: James Almer <jamrial@gmail.com>
2016-08-02 15:48:04 -03:00
Clément Bœsch
15b26e88cb
Merge commit '9df889a5f116c1ee78c2f239e0ba599c492431aa'
...
* commit '9df889a5f116c1ee78c2f239e0ba599c492431aa':
h264: rename h264.[ch] to h264dec.[ch]
Merged-by: Clément Bœsch <u@pkh.me>
2016-07-29 11:01:36 +02:00
Ronald S. Bultje
a4edaa0270
vp9: add mmxext versions of the single-block (w=8,npx=8) h/v loopfilters.
...
Each takes about 0.1% of runtime in my profiles, and they had no SIMD
so far (we only had SIMD for the npx=16 double-block versions).
2016-07-26 15:59:07 -04:00
Ronald S. Bultje
7ca422bb1b
vp9: add mmxext versions of the single-block (w=4,npx=8) h/v loopfilters.
...
Each takes about 0.5% of runtime in my profiles, and they had no SIMD
so far (we only had SIMD for the npx=16 double-block versions).
2016-07-26 15:59:07 -04:00
Ronald S. Bultje
726501a34e
vp9: add 32x32 idct AVX2 implementation.
...
About 1.8x speedup compared to AVX version for full IDCT. Other
sub-IDCT scenarios also see speedups. Full --bench output for
idct_32x32_add_${bpp}_${subidct}_${opt} (50k cycles):
nop: 16.5
vp9_inv_dct_dct_32x32_add_8_1_c: 2284.4
vp9_inv_dct_dct_32x32_add_8_1_sse2: 145.0
vp9_inv_dct_dct_32x32_add_8_1_ssse3: 137.4
vp9_inv_dct_dct_32x32_add_8_1_avx: 137.1
vp9_inv_dct_dct_32x32_add_8_1_avx2: 73.2
vp9_inv_dct_dct_32x32_add_8_2_c: 14680.8
vp9_inv_dct_dct_32x32_add_8_2_sse2: 2617.2
vp9_inv_dct_dct_32x32_add_8_2_ssse3: 982.9
vp9_inv_dct_dct_32x32_add_8_2_avx: 958.5
vp9_inv_dct_dct_32x32_add_8_2_avx2: 704.2
vp9_inv_dct_dct_32x32_add_8_4_c: 14443.1
vp9_inv_dct_dct_32x32_add_8_4_sse2: 2717.1
vp9_inv_dct_dct_32x32_add_8_4_ssse3: 965.7
vp9_inv_dct_dct_32x32_add_8_4_avx: 1000.7
vp9_inv_dct_dct_32x32_add_8_4_avx2: 717.1
vp9_inv_dct_dct_32x32_add_8_8_c: 14436.4
vp9_inv_dct_dct_32x32_add_8_8_sse2: 2671.8
vp9_inv_dct_dct_32x32_add_8_8_ssse3: 1038.5
vp9_inv_dct_dct_32x32_add_8_8_avx: 983.0
vp9_inv_dct_dct_32x32_add_8_8_avx2: 729.4
vp9_inv_dct_dct_32x32_add_8_16_c: 14614.7
vp9_inv_dct_dct_32x32_add_8_16_sse2: 2701.7
vp9_inv_dct_dct_32x32_add_8_16_ssse3: 1334.4
vp9_inv_dct_dct_32x32_add_8_16_avx: 1276.7
vp9_inv_dct_dct_32x32_add_8_16_avx2: 719.5
vp9_inv_dct_dct_32x32_add_8_32_c: 14363.6
vp9_inv_dct_dct_32x32_add_8_32_sse2: 2575.6
vp9_inv_dct_dct_32x32_add_8_32_ssse3: 2633.9
vp9_inv_dct_dct_32x32_add_8_32_avx: 2539.6
vp9_inv_dct_dct_32x32_add_8_32_avx2: 1395.0
2016-07-26 15:59:07 -04:00
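The AVX-to-AVX2 ratios behind the "about 1.8x" claim can be recomputed from the bench output above. A rough sketch that parses a few of the lines copied from the message and divides the avx figure by the avx2 figure for each sub-IDCT size; the `parse` helper and the choice of rows are illustrative only.

```python
# A subset of the --bench lines from the commit message above.
bench = """
vp9_inv_dct_dct_32x32_add_8_1_avx: 137.1
vp9_inv_dct_dct_32x32_add_8_1_avx2: 73.2
vp9_inv_dct_dct_32x32_add_8_16_avx: 1276.7
vp9_inv_dct_dct_32x32_add_8_16_avx2: 719.5
vp9_inv_dct_dct_32x32_add_8_32_avx: 2539.6
vp9_inv_dct_dct_32x32_add_8_32_avx2: 1395.0
"""

def parse(text):
    """Map (sub_idct, opt) -> value from '..._<bpp>_<subidct>_<opt>: value' lines."""
    table = {}
    for line in text.strip().splitlines():
        name, value = line.split(":")
        *_, subidct, opt = name.split("_")
        table[(int(subidct), opt)] = float(value)
    return table

table = parse(bench)
# For every sub-IDCT that has an avx2 entry, divide the avx figure by it.
ratios = {
    sub: table[(sub, "avx")] / table[(sub, "avx2")]
    for sub, opt in table if opt == "avx2"
}
for sub in sorted(ratios):
    print(f"sub-IDCT {sub:2d}: avx2 is {ratios[sub]:.2f}x as fast as avx")
```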
James Almer
7a15cf42ee
x86/diracdsp: make ff_put_signed_rect_clamped_10_sse4 work on x86_32
...
Reviewed-by: Rostislav Pehlivanov <atomnuker@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-07-20 13:43:38 -03:00
Rostislav Pehlivanov
df1dc52195
diracdsp_init: add missing ARCH_X86_64 check
...
That SIMD is still x86_64 only for now.
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
2016-07-12 00:39:12 +01:00
Rostislav Pehlivanov
bd61f3c6bf
diracdsp: add SIMD for the 10 bit version of put_signed_rect_clamped
...
Signed-off-by: Rostislav Pehlivanov <rpehlivanov@obe.tv>
2016-07-11 23:33:24 +01:00
Rostislav Pehlivanov
80721cc1ff
diracdsp: add dequantization SIMD
...
Currently unused, to be used in the following commits.
Signed-off-by: Rostislav Pehlivanov <rpehlivanov@obe.tv>
2016-07-11 23:30:11 +01:00
Ronald S. Bultje
f0a2b6249b
vp9: add 16x16 idct avx2 (8-bit).
...
checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows
that it's about 1.65x as fast as the AVX version for the full IDCT, and
similar speedups for the sub-IDCTs:
nop: 24.6
vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
vp9_inv_dct_dct_16x16_add_8_1_ssse3: 484.4
vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
vp9_inv_dct_dct_16x16_add_8_2_c: 6665.7
vp9_inv_dct_dct_16x16_add_8_2_sse2: 646.9
vp9_inv_dct_dct_16x16_add_8_2_ssse3: 455.2
vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
vp9_inv_dct_dct_16x16_add_8_4_c: 7022.7
vp9_inv_dct_dct_16x16_add_8_4_sse2: 647.4
vp9_inv_dct_dct_16x16_add_8_4_ssse3: 467.1
vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
vp9_inv_dct_dct_16x16_add_8_8_c: 6800.4
vp9_inv_dct_dct_16x16_add_8_8_sse2: 598.6
vp9_inv_dct_dct_16x16_add_8_8_ssse3: 465.7
vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
vp9_inv_dct_dct_16x16_add_8_16_c: 6626.6
vp9_inv_dct_dct_16x16_add_8_16_sse2: 599.5
vp9_inv_dct_dct_16x16_add_8_16_ssse3: 475.0
vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4
2016-07-11 10:14:58 -04:00
Clément Bœsch
84ecbbfb27
Merge commit 'f1a9eee41c4b5ea35db9ff0088ce4e6f1e187f2c'
...
* commit 'f1a9eee41c4b5ea35db9ff0088ce4e6f1e187f2c':
x86: Add missing movsxd for the int stride parameter
Merged-by: Clément Bœsch <u@pkh.me>
2016-07-09 14:52:23 +02:00
James Almer
645489cf90
x86/dcadsp: optimize lfe_fir0_float_fma3 on x86_32
...
About 10% faster.
Signed-off-by: James Almer <jamrial@gmail.com>
2016-07-05 17:48:20 -03:00
James Almer
293484fa5e
avcodec: add missing xmm/neon clobber test wrappers for the new decode API
...
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-07-03 18:04:30 -03:00
Matthieu Bouron
9eb3da2f99
asm: FF_-prefix internal macros used in inline assembly
...
See merge commit '39d6d3618d48625decaff7d9bdbb45b44ef2a805'.
2016-06-27 17:21:18 +02:00
Clément Bœsch
4a081f224e
libavcodec: fix constness in clobber test avcodec_open2() wrappers
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2016-06-26 21:34:04 +03:00
Hendrik Leppkes
c142dc203e
Merge commit 'dc40a70c5755bccfb1a1349639943e1f408bea50'
...
* commit 'dc40a70c5755bccfb1a1349639943e1f408bea50':
Drop unnecessary libavutil/x86/asm.h #includes
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
2016-06-26 15:53:00 +02:00
Clément Bœsch
5d48e4eafa
Merge commit 'a6a750c7ef240b72ce01e9653343a0ddf247d196'
...
* commit 'a6a750c7ef240b72ce01e9653343a0ddf247d196':
tests: Move all test programs to a subdirectory
Merged-by: Clément Bœsch <clement@stupeflix.com>
2016-06-22 13:44:34 +02:00
Clément Bœsch
8ef57a0d61
Merge commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb'
...
* commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb':
cosmetics: Fix spelling mistakes
Merged-by: Clément Bœsch <u@pkh.me>
2016-06-21 21:55:34 +02:00
Anton Khirnov
9df889a5f1
h264: rename h264.[ch] to h264dec.[ch]
...
This is more consistent with the naming of other decoders.
2016-06-21 11:11:26 +02:00
Martin Storsjö
f1a9eee41c
x86: Add missing movsxd for the int stride parameter
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2016-06-17 00:11:21 +03:00
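Why a sign extension matters for an int stride: on x86-64 a 32-bit int argument occupies only the low half of a 64-bit register, and the upper half is not guaranteed to hold anything meaningful. The sketch below models that in Python, with a hypothetical `movsxd` function imitating the instruction and an arbitrary garbage value in the upper bits; it is an illustration of the idea, not the assembly from the commit.

```python
def movsxd(reg64):
    """Sign-extend the low 32 bits of a 64-bit register value."""
    low = reg64 & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low

stride = -64  # negative strides are common for bottom-up images
# Low 32 bits hold the int; the upper 32 bits are arbitrary leftovers.
reg = (0xDEADBEEF << 32) | (stride & 0xFFFFFFFF)

print(hex(reg))     # using this directly as a 64-bit offset would be wrong
print(movsxd(reg))  # the intended stride
```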
James Almer
ede4ec1f8f
x86/aacpsdsp: optimize add_squares loop
...
Signed-off-by: James Almer <jamrial@gmail.com>
2016-06-14 12:41:23 -03:00
James Almer
82dbfccaf0
x86/aacdec: use HADDPS macro
...
Signed-off-by: James Almer <jamrial@gmail.com>
2016-06-08 14:18:18 -03:00
Diego Biurrun
dc40a70c57
Drop unnecessary libavutil/x86/asm.h #includes
2016-05-28 19:18:26 +02:00
Diego Biurrun
1e9c5bf4c1
asm: FF_-prefix internal macros used in inline assembly
...
These macros conflict with system macros on Solaris, producing
truckloads of warnings about macro redefinition.
2016-05-28 19:18:26 +02:00
Diego Biurrun
a6a750c7ef
tests: Move all test programs to a subdirectory
2016-05-13 14:55:56 +02:00
Christophe Gisquet
9630b3fc06
x86: lossless audio: SSE4 madd 32bits
...
The only user so far is 24-bit wmalossless. The few samples tested show an
order of 8, so more unrolling or an avx2 version would not make sense.
Timings: 68 -> 49 cycles
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2016-05-07 23:28:48 +02:00
Vittorio Giovara
41ed7ab45f
cosmetics: Fix spelling mistakes
...
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2016-05-04 18:16:21 +02:00
Derek Buitenhuis
d496d52d02
Merge commit '73ff983e8dd22ccee166403d0bbbc9c1cd543622'
...
* commit '73ff983e8dd22ccee166403d0bbbc9c1cd543622':
fft: x86: cosmetics: Drop silly comments, add comment, whitespace
Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
2016-04-12 15:42:21 +01:00
Diego Biurrun
01621202aa
build: miscellaneous cosmetics
...
Restore alphabetical order in lists, break overly long lines, do some
prettyprinting, add some explanatory section comments, group parts
together that belong together logically.
2016-04-07 15:26:08 +02:00
Michael Niedermayer
305344d89e
avcodec/fft: Add revtab32 for FFTs with more than 65536 samples
...
x86 optimizations are used only for the cases they support (<=65536 samples)
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2016-03-04 16:05:47 +01:00
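The 65536-sample threshold suggests that the existing reverse table stores 16-bit indices, which is an assumption here rather than something the message states. A sketch of why a wider table is then needed: a permutation table for n samples holds indices up to n - 1, and for n = 2^17 those no longer fit in 16 bits. Plain bit reversal is used as a stand-in; the actual permutation in libavcodec may differ.

```python
UINT16_MAX = 0xFFFF

def bit_reverse_table(nbits):
    """Bit-reversal permutation for an FFT of 2**nbits samples."""
    n = 1 << nbits
    table = []
    for i in range(n):
        rev = 0
        for b in range(nbits):
            if i & (1 << b):
                rev |= 1 << (nbits - 1 - b)
        table.append(rev)
    return table

def fits_in_uint16(nbits):
    """True if every index in the table is representable in 16 bits."""
    return max(bit_reverse_table(nbits)) <= UINT16_MAX

print(fits_in_uint16(16))  # 65536 samples
print(fits_in_uint16(17))  # 131072 samples
```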
Michael Niedermayer
ae76b84221
avcodec: Extend fft to size 2^17
...
Asked-for-by: durandal_1707
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2016-03-04 13:51:42 +01:00