mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-07 11:13:41 +02:00
327 Commits

Author SHA1 Message Date
Evgeny Pavlov
cb1479faca avfilter/vf_ssim: Fix x86 assembly code for SSIM calculation
This commit fixes bug #10495

The code had several bugs in the post-loop compensation code:
- The test assembly instruction performs a bitwise AND and
generates the flags used by the jz branch instruction. A wrong test
condition led to incorrect branching
- Incorrect compensation code for some branches

Signed-off-by: Evgeny Pavlov <lucenticus@gmail.com>
2023-08-21 17:04:51 +02:00
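
As a rough illustration of the flag behaviour described in the commit above, the following sketch models how `test` and `jz` interact. The function names and the `width` value are hypothetical and do not come from the FFmpeg source; this is not the actual SSIM assembly.

```python
def zf_after_test(a: int, b: int, bits: int = 32) -> bool:
    """Model the zero flag (ZF) after x86 `test a, b`.

    `test` computes a AND b, discards the result, and sets ZF
    if and only if that result is zero.
    """
    mask = (1 << bits) - 1
    return (a & b & mask) == 0


def jz_taken(a: int, b: int) -> bool:
    """`jz` jumps exactly when ZF is set, i.e. when a and b share no set bits."""
    return zf_after_test(a, b)


# Hypothetical value, used only for illustration.
width = 6  # binary 110

# Bit 0 of 6 is clear, so `test width, 1` sets ZF and `jz` jumps.
print(jz_taken(width, 1))
# Bit 1 of 6 is set, so `test width, 2` clears ZF and execution falls through.
print(jz_taken(width, 2))
```

Testing against the wrong mask therefore selects the wrong branch even when the value being tested is correct.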
James Almer
aca8ceb870 x86/vf_bwdif_init: limit AVX2 functions using 256bit vectors to cpus known to be fast with it
Signed-off-by: James Almer <jamrial@gmail.com>
2023-03-25 13:27:20 -03:00
James Darnley
073ec3b9da avfilter/bwdif: add avx2 filter_line function
8-bit:
2.24x faster (1925±1.3 vs. 859±2.2 decicycles) compared with ssse3
10-bit:
2.00x faster (1703±1.7 vs. 853±2.0 decicycles) compared with ssse3
2023-03-25 02:38:17 +01:00
James Darnley
b503b5a0cf avfilter/bwdif: move filter_line init to a dedicated function 2023-03-25 02:38:17 +01:00
Lynne
bbe95f7353
x86: replace explicit REP_RETs with RETs
From x86inc:
> On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either
> a branch or a branch target. So switch to a 2-byte form of ret in that case.
> We can automatically detect "follows a branch", but not a branch target.
> (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)

x86inc can automatically determine whether to use REP_RET rather than
RET in most of these cases, so impact is minimal. Additionally, a few
REP_RETs were used unnecessarily, despite the return being nowhere near a
branch.

The only CPUs affected were AMD K10s, made between 2007 and 2011, 16
years ago and 12 years ago, respectively.

In the future, everyone involved with x86inc should consider dropping
REP_RETs altogether.
2023-02-01 04:23:55 +01:00
Wang, Bin
459527108a libavfilter/x86/vf_convolution: fix sobel swap issue on WIN64
Reviewed by: James Almer <jamrial@gmail.com>
Signed-off-by: Wang, Bin <bin.wang@intel.com>
2022-11-21 12:28:25 +08:00
bwang30
3ab11dc5bb libavfilter/x86/vf_convolution: add sobel filter optimization and unit test with intel AVX512 VNNI
This commit enables assembly code using Intel AVX512 VNNI and adds a unit test for the sobel filter.

sobel_c: 4537
sobel_avx512icl: 2136

Signed-off-by: bwang30 <bin.wang@intel.com>
Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
2022-11-14 10:04:16 +08:00
Paul B Mahol
00b03331a0 avfilter/vf_threshold: fix handling of zero threshold 2022-10-27 10:23:24 +02:00
Andreas Rheinhardt
ed42a51930 avfilter/x86/vf_bwdif: Remove obsolete MMXEXT functions
The only systems which benefit from these are truly ancient
32-bit x86s, as all other systems use at least the SSE2 versions
(this includes all x64 CPUs, which is why this code is restricted
to x86-32).

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-06-22 13:38:14 +02:00
Andreas Rheinhardt
7c3c1d938f avfilter/x86/vf_idet: Remove obsolete MMX(EXT) functions
The only systems which benefit from these are truly ancient
32-bit x86s, as all other systems use at least the SSE2 versions
(this includes all x64 CPUs, which is why this code is restricted
to x86-32).

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-06-22 13:38:01 +02:00
Andreas Rheinhardt
4d7128be9a avfilter/x86/vf_yadif: Remove obsolete MMXEXT functions
The only systems which benefit from these are truly ancient
32-bit x86s, as all other systems use at least the SSE2 versions
(this includes all x64 CPUs, which is why this code is restricted
to x86-32).

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-06-22 13:37:48 +02:00
Andreas Rheinhardt
77b2a422a0 avfilter/x86/vf_eq_init: Remove obsolete MMXEXT function
x64 always has MMX, MMXEXT, SSE and SSE2 and this means
that some functions for MMX, MMXEXT and 3dnow are always
overridden by other functions (unless one e.g. explicitly
disables SSE2) for x64. So given that the only systems that
benefit from process_mmxext are truly ancient 32-bit x86s
it is removed.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-06-22 13:36:31 +02:00
Andreas Rheinhardt
c5dd2fdc09 avfilter/x86/vf_noise: Remove obsolete MMX function
x64 always has MMX, MMXEXT, SSE and SSE2 and this means
that some functions for MMX, MMXEXT and 3dnow are always
overridden by other functions (unless one e.g. explicitly
disables SSE2) for x64. So given that the only systems that
benefit from line_noise_mmx are truly ancient 32-bit x86s
it is removed.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-06-22 13:32:08 +02:00
Andreas Rheinhardt
0df18f29ae avfilter/af_afir: Only keep DSP stuff in header
Only the AudioFIRDSPContext and the functions for its initialization
are needed outside of lavfi/af_afir.c.
Also rename the header to af_afirdsp.h to reflect the change.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-05-06 05:19:49 +02:00
Paul B Mahol
28d011516b avfilter/x86/vf_limiter: use movu, dst may not be always aligned
Happens with pad filter after limiter.
2022-03-24 09:44:09 +01:00
Marton Balint
5b3732227e avfilter/x86/vf_blend: use unaligned movs for output
Fixes crashes with:

ffmpeg -f lavfi -i allyuv=d=1 -vf tblend=difference128,pad=5000:ih:1 -f null x

Signed-off-by: Marton Balint <cus@passwd.hu>
2022-03-21 00:50:44 +01:00
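
The crash command above chains a filter into `pad` with a non-zero x offset. The sketch below shows why such an offset can break aligned stores, assuming (this is an assumption, not stated in the commit) that the upstream filter writes into a buffer whose start has been shifted by the pad offset. The base address and helper names are hypothetical.

```python
def is_aligned(address: int, alignment: int) -> bool:
    """True if `address` is a multiple of `alignment` bytes."""
    return address % alignment == 0


def output_pointer(base: int, x_offset: int, bytes_per_pixel: int = 1) -> int:
    """Address of the first written pixel when the row starts x_offset pixels in."""
    return base + x_offset * bytes_per_pixel


# Hypothetical, well-aligned buffer start.
base = 4096

# An aligned 16-byte SSE store needs a 16-byte aligned address.
print(is_aligned(output_pointer(base, 0), 16))  # no offset: aligned
print(is_aligned(output_pointer(base, 1), 16))  # x offset 1, as in pad=5000:ih:1: not aligned
```

An unaligned move (movu) accepts any address, which is why switching the output stores avoids the fault.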
Paul B Mahol
dae95b3ffd avfilter/vf_maskedmerge: fix rounding when masking 2022-03-03 09:57:53 +01:00
Paul B Mahol
047c362d3c avfilter/vf_nlmeans: add x86 SIMD 2021-11-11 21:54:46 +01:00
James Almer
39f3c98bb1 x86/vf_lut3d: use three operand form for some instructions
Fixes compilation with old yasm.

Signed-off-by: James Almer <jamrial@gmail.com>
2021-10-14 18:09:38 -03:00
Mark Reid
3ee7250116 avfilter/vf_lut3d: fix building with --disable-optimizations 2021-10-13 18:01:21 +02:00
Mark Reid
716b396740 avfilter/vf_lut3d: add x86-optimized tetrahedral interpolation
I spotted an interesting pattern that I didn't see before that leads to the implementation being faster.
The bit shifting table I was using before is no longer needed, and I was able to remove quite a few lines.
I also added use of FMA in the AVX2 version.

f32 1920x1080 1 thread with prelut
c impl
1434012700 UNITS in lut3d->interp,       1 runs,      0 skips
1434035335 UNITS in lut3d->interp,       2 runs,      0 skips
1423615347 UNITS in lut3d->interp,       4 runs,      0 skips
1426268863 UNITS in lut3d->interp,       8 runs,      0 skips

sse2
905484420 UNITS in lut3d->interp,       1 runs,      0 skips
905659010 UNITS in lut3d->interp,       2 runs,      0 skips
915167140 UNITS in lut3d->interp,       4 runs,      0 skips
915834222 UNITS in lut3d->interp,       8 runs,      0 skips

avx
574794860 UNITS in lut3d->interp,       1 runs,      0 skips
581035090 UNITS in lut3d->interp,       2 runs,      0 skips
584116720 UNITS in lut3d->interp,       4 runs,      0 skips
581460290 UNITS in lut3d->interp,       8 runs,      0 skips

avx2
301698880 UNITS in lut3d->interp,       1 runs,      0 skips
301982880 UNITS in lut3d->interp,       2 runs,      0 skips
306962430 UNITS in lut3d->interp,       4 runs,      0 skips
305472025 UNITS in lut3d->interp,       8 runs,      0 skips

gbrap16 1920x1080 1 thread with prelut
c impl
1480894840 UNITS in lut3d->interp,       1 runs,      0 skips
1502922990 UNITS in lut3d->interp,       2 runs,      0 skips
1496114307 UNITS in lut3d->interp,       4 runs,      0 skips
1492554551 UNITS in lut3d->interp,       8 runs,      0 skips

sse2
980777180 UNITS in lut3d->interp,       1 runs,      0 skips
986121520 UNITS in lut3d->interp,       2 runs,      0 skips
986489840 UNITS in lut3d->interp,       4 runs,      0 skips
998832248 UNITS in lut3d->interp,       8 runs,      0 skips

avx
622212360 UNITS in lut3d->interp,       1 runs,      0 skips
622981160 UNITS in lut3d->interp,       2 runs,      0 skips
645396315 UNITS in lut3d->interp,       4 runs,      0 skips
641057075 UNITS in lut3d->interp,       8 runs,      0 skips

avx2
321336400 UNITS in lut3d->interp,       1 runs,      0 skips
321268920 UNITS in lut3d->interp,       2 runs,      0 skips
323459895 UNITS in lut3d->interp,       4 runs,      0 skips
324949967 UNITS in lut3d->interp,       8 runs,      0 skips
2021-10-10 22:23:48 +02:00
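
To read the f32 numbers above as speedups, one option is to average the four reported lines per implementation and divide by the C baseline. This is only a sketch: averaging the lines this way is an assumption made here, since each line is itself a timer report over a different number of runs.

```python
from statistics import mean

# UNITS values copied from the "f32 1920x1080 1 thread with prelut" runs above.
f32_runs = {
    "c": [1434012700, 1434035335, 1423615347, 1426268863],
    "sse2": [905484420, 905659010, 915167140, 915834222],
    "avx": [574794860, 581035090, 584116720, 581460290],
    "avx2": [301698880, 301982880, 306962430, 305472025],
}


def speedups(runs, baseline="c"):
    """Return mean(baseline) / mean(implementation) for every implementation."""
    base = mean(runs[baseline])
    return {name: base / mean(values) for name, values in runs.items()}


for name, ratio in speedups(f32_runs).items():
    print(f"{name}: {ratio:.2f}x")
```

Under this reading, the f32 path comes out at roughly 1.6x for sse2, 2.5x for avx and 4.7x for avx2 relative to the C implementation.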
Wu Jianhua
e26c4d252f avfilter/x86/vf_blend: unify indentation format
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
2021-10-03 09:15:55 +02:00
Wu Jianhua
7bbad32d5a libavfilter/x86/vf_gblur: correct the order of loop step
When the width of the processed block minus 1 is a multiple of
the aligned number, the instruction jle .bscale_scalar skipped
the Optimized Loop Step, which led to incorrect sampling when
more than one step was specified. Move the Optimized Loop Step
after .bscale_scalar to ensure the loop step is always
executed.

Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
2021-09-18 12:38:01 +02:00
Wu Jianhua
fcf10c925d libavfilter/x86/vf_gblur: fixed the fate-test failed on MacOS
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
2021-09-18 12:37:56 +02:00
Wu Jianhua
4041c1029b libavfilter/x86/vf_gblur: add localbuf and ff_horiz_slice_avx2/512()
We introduce ff_horiz_slice_avx2/512(), implemented with a new algorithm.
In a nutshell, the new algorithm does three things: gathering data from
8/16 rows, blurring the data, and scattering the data back to the image
buffer. A customized 8x8/16x16 transpose is used to avoid the huge overhead
of gather and scatter instructions; it depends on a newly added temporary
buffer called localbuf.

Performance data:
ff_horiz_slice_avx2(old): 109.89
ff_horiz_slice_avx2(new): 666.67
ff_horiz_slice_avx512: 1000

Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
Co-authored-by: Jin Jun <jun.i.jin@intel.com>
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
2021-08-29 19:58:33 +02:00
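
The gather, blur, scatter idea from the commit above can be sketched in plain Python. This is a simplified model, not the FFmpeg implementation: it assumes a bare forward/backward recursive pass with a single coefficient `nu`, omits boundary scaling and multiple steps, and all function names are hypothetical.

```python
def transpose(block):
    """Turn a list of rows into a list of columns (and back again)."""
    return [list(col) for col in zip(*block)]


def blur_rows_scalar(rows, nu):
    """Reference: run the recursive pass over each row, one sample at a time."""
    out = [list(r) for r in rows]
    for row in out:
        for x in range(1, len(row)):           # forward pass
            row[x] += nu * row[x - 1]
        for x in range(len(row) - 1, 0, -1):   # backward pass
            row[x - 1] += nu * row[x]
    return out


def blur_rows_transposed(rows, nu):
    """Same filter on a transposed local buffer.

    After the transpose, localbuf[x] holds sample x of every row, so each
    list comprehension stands in for one SIMD operation across 8/16 rows.
    """
    localbuf = transpose(rows)  # plays the role of the "gather" step
    for x in range(1, len(localbuf)):
        localbuf[x] = [c + nu * p for c, p in zip(localbuf[x], localbuf[x - 1])]
    for x in range(len(localbuf) - 1, 0, -1):
        localbuf[x - 1] = [p + nu * c for p, c in zip(localbuf[x - 1], localbuf[x])]
    return transpose(localbuf)  # plays the role of the "scatter" step


rows = [[float(r + x) for x in range(5)] for r in range(8)]
print(blur_rows_scalar(rows, 0.5) == blur_rows_transposed(rows, 0.5))
```

The recursion runs along x, so a single row cannot be vectorized along its own length; transposing lets the independent rows fill the vector lanes instead.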
Wu Jianhua
68a2722aee libavfilter/x86/vf_gblur: add ff_verti_slice_avx2/512()
The new vertical slice with AVX2/512 acceleration can significantly
improve the performance of Gaussian Filter 2D.

Performance data:
ff_verti_slice_c: 32.57
ff_verti_slice_avx2: 476.19
ff_verti_slice_avx512: 833.33

Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
Co-authored-by: Jin Jun <jun.i.jin@intel.com>
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
2021-08-29 19:58:33 +02:00
Wu Jianhua
4a5e24721c libavfilter/x86/vf_gblur: add ff_postscale_slice_avx512()
Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
Co-authored-by: Jin Jun <jun.i.jin@intel.com>
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
2021-08-29 19:58:33 +02:00
Paul B Mahol
0068b3d0f0 avfilter/avf_showcqt: switch to TX FFT from avutil 2021-07-27 21:16:28 +02:00
Andreas Rheinhardt
4608f7cc6a Remove unnecessary mem.h inclusions
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-07-22 14:47:57 +02:00
James Almer
1628409b18 x86/vf_gblur: fix reg name in UNIX64 prologue
Signed-off-by: James Almer <jamrial@gmail.com>
2021-02-17 15:51:28 -03:00
James Almer
2b4da1cb8c x86/vf_gblur: fix postscale_slice prologue
The x86_32 ABI does not pass float arguments directly in xmm regs, and the Win64
ABI uses only the first four regs for this purpose.

Signed-off-by: James Almer <jamrial@gmail.com>
2021-02-17 13:33:20 -03:00
Paul B Mahol
44cf3a2b16 avfilter/x86/vf_gblur: add postscale SIMD 2021-02-16 21:12:11 +01:00
Paul B Mahol
c6ce18be08 avfilter/vf_convolution: add 16-column operation for filter_column()
Based on patch by Xu Jun <xujunzz@sjtu.edu.cn>
2021-02-13 14:45:48 +01:00
Paul B Mahol
95183d25e8 avfilter/vf_atadenoise: add sigma options 2021-01-22 16:21:22 +01:00
Paul B Mahol
eaba6cecfb avfilter/vf_v360: add mitchell interpolation 2020-10-04 19:23:52 +02:00
Paul B Mahol
fda5363c80 avfilter/x86/vf_convolution_init: there is asm only for 8bit depth 2020-09-15 08:13:04 +02:00
Limin Wang
71ec3e4583 Revert "avfilter/yadif: simplify the code for better readability"
This reverts commit 2a9b934675.
2020-08-27 07:30:30 +08:00
Limin Wang
2a9b934675 avfilter/yadif: simplify the code for better readability
Signed-off-by: Limin Wang <lance.lmwang@gmail.com>
2020-08-26 14:21:11 +08:00
James Almer
320694ff84 x86/vf_blend: fix warnings about trailing empty parameters
Finishes fixing ticket #8771

Signed-off-by: James Almer <jamrial@gmail.com>
2020-07-12 11:30:23 -03:00
Paul B Mahol
8e1354c95d avfilter/x86/vf_v360_init: add missing cases 2020-04-02 12:25:37 +02:00
Paul B Mahol
e4809e12ea avfilter/vf_v360: add SIMD for lagrange9 interpolation 2020-04-02 12:25:37 +02:00
Martin Storsjö
0815a22dcc vf_ssim: Fix loading doubles to float registers on i386
This fixes the tests filter-refcmp-ssim-yuv and filter-refcmp-ssim-rgb
on i386 after breaking in fcc0424c93.

Signed-off-by: Martin Storsjö <martin@martin.st>
2020-02-05 14:38:26 +02:00
Paul B Mahol
fcc0424c93 avfilter/vf_ssim: improve precision
Use doubles for accumulating floats.
2020-02-04 18:28:04 +01:00
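
The precision gain from accumulating floats in a double can be seen in a small experiment. The sketch below emulates single-precision rounding with the standard `struct` module; the input values are arbitrary and unrelated to real SSIM data.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Accumulate in single precision: the running sum is rounded after each add."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc


def sum_f64(values):
    """Accumulate the same single-precision inputs in a double accumulator."""
    acc = 0.0
    for v in values:
        acc += to_f32(v)
    return acc


values = [0.1] * 100000
reference = 100000 * to_f32(0.1)
print("float accumulator error: ", abs(sum_f32(values) - reference))
print("double accumulator error:", abs(sum_f64(values) - reference))
```

The inputs are identical in both cases; only the accumulator width changes, which matches the commit's description of using doubles for accumulating floats.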
Paul B Mahol
3bf28d40e5 avfilter/vf_v360: change remaps to int16_t type 2020-01-19 19:54:29 +01:00
Marton Balint
1f8e43938b avfilter/x86/vf_interlace: always use unaligned movs
Fixes crashes in command lines such as:

ffmpeg -f lavfi -i testsrc2=704x576:r=50,interlace,pad=720:576:8 -f null none

Related to ticket #6491.

Signed-off-by: Marton Balint <cus@passwd.hu>
2019-12-15 00:23:03 +01:00
Paul B Mahol
ac0f5f4c17 avfilter/vf_maskedclamp: add x86 SIMD 2019-10-23 16:20:21 +02:00
James Almer
738bc3e742 x86/vf_transpose: make ff_transpose_8x8_16_sse2 work on x86_32
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2019-10-22 13:51:13 -03:00
James Almer
27bae5aaca x86/vf_transpose: fix cpuflags check
Signed-off-by: James Almer <jamrial@gmail.com>
2019-10-21 17:01:39 -03:00
Paul B Mahol
ccd9bca15a avfilter/vf_transpose: add x86 SIMD 2019-10-21 20:37:51 +02:00
Paul B Mahol
f7f4691f9f avfilter/x86/vf_atadenoise: fix comment 2019-10-21 17:56:45 +02:00