1
0
mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-28 20:53:54 +02:00
Commit Graph

590 Commits

Author SHA1 Message Date
Lynne
892f64ad9b
x86/tx_float: remove HAVE_AVX2_EXTERNAL checks
It'll always be enabled.
Thanks, nasm.
2024-10-06 01:32:49 +02:00
Lynne
b17a240c8d
Revert "x86/tx_float: set all operands for shufps"
This reverts commit 74f5fb6db8.
2024-10-06 01:32:49 +02:00
Lynne
24c5a58e55
Revert "x86/tx_float: add missing check for AVX2"
This reverts commit f4097e4c1f.
2024-10-06 01:32:48 +02:00
Lynne
bf643f989b
Revert "x86/tx_float: add missing preprocessor wrapper for AVX2 functions"
This reverts commit 750f378bec.
2024-10-06 01:32:48 +02:00
Lynne
b890482d05
Revert "x86/tx_float: change a condition in a preprocessor check"
This reverts commit 0d8f43c74d.
2024-10-06 01:32:47 +02:00
James Almer
9e7a93c6fd x86/intreadwrite: add SSE2 optimized AV_COPY128U
Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-29 23:17:52 -03:00
James Almer
70c6b904be x86/intreadwrite: add missing casts to pointer arguments
Should make strict compilers happy.

Also, make AV_COPY128 use integer operations while at it. Removing the
inclusion of immintrin.h ensures a lot less intrinsic related headers are
included as well, which fixes a clash of defines with some Clang versions.

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-11 18:24:26 -03:00
James Almer
1a86a7a48d x86/intreadwrite: fix include of config.h
Should fix make checkheaders.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-10 13:52:52 -03:00
James Almer
15056dd650 x86/intreadwrite.h: add missing preprocessor checks
Removed by accident in the previous commits. This makes the code only run when
compiled with GCC and Clang like before. Support for other compilers like msvc
can be added later.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-10 13:49:21 -03:00
James Almer
bd1bcb07e0 x86/intreadwrite: use intrinsics instead of inline asm for AV_COPY128
This has the benefit of removing any SSE -> AVX penalty that may happen when
the compiler emits VEX encoded instructions.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-10 13:25:44 -03:00
James Almer
4a04cca69a x86/intreadwrite: use intrinsics instead of inline asm for AV_ZERO128
When called inside a loop, the inline asm version results in one pxor
unnecessarely emitted per iteration, as the contents of the __asm__() block are
opaque to the compiler's instruction scheduler.
This is not the case with intrinsics, where pxor will be emitted once with any
half decent compiler.

This also has the benefit of removing any SSE -> AVX penalty that may happen
when the compiler emits VEX encoded instructions.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-10 13:25:44 -03:00
James Almer
4b57ea8fc7 avutil/common: assert that bit position in av_zero_extend is valid
Signed-off-by: James Almer <jamrial@gmail.com>
2024-06-13 20:36:09 -03:00
James Almer
39c90d6466 avutil: rename av_mod_uintp2 to av_zero_extend
It's more descriptive of what it does.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-06-13 20:35:57 -03:00
Rémi Denis-Courmont
0231097d1b lavu/x86: remove GCC 4.4- stuff
Since the C11 support is required, those GCC versions can no longer be
supported anyhow. (Clang pretends to be GCC 4.4, but it looks like the
code was intended for old GCC specifically.)
2024-06-13 21:16:16 +03:00
James Almer
a14440867c x86/float_dsp: add SSE2 and AVX versions of scalarproduct_double
Signed-off-by: James Almer <jamrial@gmail.com>
2024-06-03 22:14:55 -03:00
Andreas Rheinhardt
790f793844 avutil/common: Don't auto-include mem.h
There are lots of files that don't need it: The number of object
files that actually need it went down from 2011 to 884 here.

Keep it for external users in order to not cause breakages.

Also improve the other headers a bit while just at it.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2024-03-31 00:08:43 +01:00
Henrik Gramner
afa471d0ef x86: Update x86inc.asm
Make things up-to-date with upstream.

https://code.videolan.org/videolan/x86inc.asm
2024-03-24 14:53:57 +01:00
Henrik Gramner
c3d3f0e697 avutil/x86util: Fix broken pre-SSE4.1 PMINSD emulation
Fixes yadif-16 which allows FATE to pass.

Broken since 2904db9045 (2017).
2024-03-17 13:52:27 +01:00
Andreas Rheinhardt
c00cd007e8 configure: Remove av_restrict
All versions of MSVC that support C11 (namely >= v19.27)
also support the restrict keyword, therefore av_restrict
is no longer necessary since 75697836b1.

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2024-03-15 12:51:15 +01:00
Martin Storsjö
7ec2354c38 x86: Remove inline MMX assembly that clobbers the FPU state
These inline implementations of AV_COPY64, AV_SWAP64 and AV_ZERO64
are known to clobber the FPU state - which has to be restored
with the 'emms' instruction afterwards.

This was known and signaled with the FF_COPY_SWAP_ZERO_USES_MMX
define, which calling code seems to have been supposed to check,
in order to call emms_c() after using them. See
0b1972d409,
29c4c0886d and
df215e5758 for history on earlier
fixes in the same area.

However, new code can use these AV_*64() macros without knowing
about the need to call emms_c().

Just get rid of these dangerous inline assembly snippets; this
doesn't make any difference for 64 bit architectures anyway.

Signed-off-by: Martin Storsjö <martin@martin.st>
2024-02-09 23:55:52 +02:00
Lynne
9af87828bd
x86/tx_init: propely indicate the extended available transform sizes
Forgot to do this with the previous commit.

Actually makes the assembly being used.

Still the fastest FFT in the world, 15% faster than FFTW on the
largest available size.
2024-02-09 18:08:42 +01:00
Lynne
bd3e71b21e
x86/tx_float: enable SIMD for sizes over 131072
The tables for the new sizes were added last year due
to being required for SDR.
However, the assembly was never updated to use them.
2024-02-07 15:20:48 +01:00
Henrik Gramner
ed8ddf0bd3 x86inc: Add REPX macro to repeat instructions/operations
When operating on large blocks of data it's common to repeatedly use
an instruction on multiple registers. Using the REPX macro makes it
easy to quickly write dense code to achieve this without having to
explicitly duplicate the same instruction over and over.

For example,

    REPX {paddw x, m4}, m0, m1, m2, m3
    REPX {mova [r0+16*x], m5}, 0, 1, 2, 3

will expand to

    paddw       m0, m4
    paddw       m1, m4
    paddw       m2, m4
    paddw       m3, m4
    mova [r0+16*0], m5
    mova [r0+16*1], m5
    mova [r0+16*2], m5
    mova [r0+16*3], m5

Commit taken from x264:
6d10612ab0

Signed-off-by: Frank Plowman <post@frankplowman.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2023-11-08 13:49:08 +01:00
Andreas Rheinhardt
5b85ca5317 avutil/x86/pixelutils: Empty MMX state in ff_pixelutils_sad_8x8_mmxext
We currently mostly do not empty the MMX state in our MMX
DSP functions; instead we only do so before code that might
be using x87 code. This is a violation of the System V i386 ABI
(and maybe of other ABIs, too):
"The CPU shall be in x87 mode upon entry to a function. Therefore,
every function that uses the MMX registers is required to issue an
emms or femms instruction after using MMX registers, before returning
or calling another function." (See 2.2.1 in [1])
This patch does not intend to change all these functions to abide
by the ABI; it only does so for ff_pixelutils_sad_8x8_mmxext, as this
function can by called by external users, because it is exported
via the pixelutils API. Without this, the following fragment will
assert (on x86/x64):
    uint8_t src1[8 * 8], src2[8 * 8];
    av_pixelutils_sad_fn fn = av_pixelutils_get_sad_fn(3, 3, 0, NULL);
    fn(src1, 8, src2, 8);
    av_assert0_fpu();

[1]: https://raw.githubusercontent.com/wiki/hjl-tools/x86-psABI/intel386-psABI-1.1.pdf

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2023-11-04 01:26:03 +01:00
Andreas Rheinhardt
f8503b4c33 avutil/internal: Don't auto-include emms.h
Instead include emms.h wherever it is needed.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2023-09-04 11:04:45 +02:00
Lynne
bbe95f7353
x86: replace explicit REP_RETs with RETs
From x86inc:
> On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either
> a branch or a branch target. So switch to a 2-byte form of ret in that case.
> We can automatically detect "follows a branch", but not a branch target.
> (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)

x86inc can automatically determine whether to use REP_RET rather than
REP in most of these cases, so impact is minimal. Additionally, a few
REP_RETs were used unnecessary, despite the return being nowhere near a
branch.

The only CPUs affected were AMD K10s, made between 2007 and 2011, 16
years ago and 12 years ago, respectively.

In the future, everyone involved with x86inc should consider dropping
REP_RETs altogether.
2023-02-01 04:23:55 +01:00
Lynne
90c17a05aa
x86/tx_float: fix stray change in 15xM FFT and replace imul->lea
Thanks to rorgoroth for bisecting and kurosu for the lea suggestion.
2022-11-28 16:58:12 +01:00
Lynne
87bae6b018
lavu/tx: refactor to explicitly track and convert lookup table order
Necessary for generalizing PFAs.
2022-11-24 15:58:34 +01:00
Lynne
fab97faf02
x86/tx_float: implement striding in fft_15xM 2022-11-24 15:58:32 +01:00
Lynne
92100eee5b
x86/tx_float_init: properly specify the supported factors of 15xM FFTs
Only powers of two are currently supported.
2022-11-24 15:58:32 +01:00
Lynne
cc1df4045e
x86/tx_float: add a standalone 15-point AVX2 transform
Enables its use everywhere else in the framework.
2022-11-24 15:58:31 +01:00
Lynne
877e575b5d
x86/tx_float: optimize and macro out FFT15 2022-11-24 15:58:31 +01:00
Johannes Kauffmann
a11e745b97 lavu/fixed_dsp: add missing av_restrict qualifiers
The butterflies_fixed function pointer declaration specifies av_restrict
for the first two pointer arguments. So the corresponding function
definitions should honor this declaration.

MSVC emits warning C4113 for this.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2022-10-04 10:56:12 +02:00
Lynne
f21899db7d
x86/tx_float: enable AVX-only split-radix FFT codelets
Sandy Bridge, Ivy Bridge and Bulldozer cores don't support FMA3.
2022-09-24 04:16:55 +02:00
James Almer
d2f482965f x86/tx_float: fix some symbol names
Should fix compilation on MacOS

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-23 18:53:05 -03:00
James Almer
0d8f43c74d x86/tx_float: change a condition in a preprocessor check
Fixes compilation with yasm.

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-23 16:05:07 -03:00
James Almer
750f378bec x86/tx_float: add missing preprocessor wrapper for AVX2 functions
Fixes compilation with old assemblers.

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-23 15:15:20 -03:00
Lynne
74e8541bab
x86/tx_float: generalize iMDCT
To support non-aligned buffers during the post-transform step, just iterate
backwards over the array.

This allows using the 15xN-point FFT, with which the speed is 2.1 times
faster than our old libavcodec implementation.
2022-09-23 12:35:28 +02:00
Lynne
ace42cf581
x86/tx_float: add 15xN PFA FFT AVX SIMD
~4x faster than the C version.
The shuffles in the 15pt dim1 are seriously expensive. Not happy with it,
but I'm contempt.

Can be easily converted to pure AVX by removing all vpermpd/vpermps
instructions.
2022-09-23 12:35:27 +02:00
Lynne
3241e9225c
x86/tx_float: adjust internal ASM call ABI again
There are many ways to go about it, and this one seems optimal for both
MDCTs and PFA FFTs without requiring excessive instructions or stack usage.
2022-09-23 12:33:35 +02:00
Lynne
4ba68639ca
x86/tx_float: add asm call versions of the 2pt and 4pt transforms
Verified to be working.
2022-09-19 06:01:06 +02:00
Lynne
892548e6a1
x86/tx_float: fully support 128bit regs in LOAD64_LUT
The gather path didn't support 128bit registers.
It's not faster on Zen 3, but it's here for completeness.
2022-09-19 06:01:04 +02:00
Lynne
af42bb3d61
x86/tx_float: simplify and describe the intra-asm call convention 2022-09-19 06:01:02 +02:00
James Almer
bda3a9faf4 x86/float_dsp: use three operand form for some instructions
Fixes compilation with old yasm

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-13 13:50:09 -03:00
Paul B Mahol
72acff9f59 avutil/x86/float_dsp: add fma3 for scalarproduct 2022-09-13 17:43:15 +02:00
Andreas Rheinhardt
29c4c0886d avutil/x86/intreadwrite: Add ability to detect whether MMX code is used
It can be used to call emms_c() only when needed.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-09-11 21:08:04 +02:00
James Almer
f4097e4c1f x86/tx_float: add missing check for AVX2
Fixes compilation with old yasm.

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-06 14:06:33 -03:00
James Almer
74f5fb6db8 x86/tx_float: set all operands for shufps
Fixes compilation with AVX2 enabled yasm.

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-06 14:06:03 -03:00
Martin Storsjö
e4759fa951 x86/tx_float: Fix building for platforms with a symbol prefix
This fixes building for x86 macOS (both i386 and x86_64) and
i386 windows.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-06 18:46:39 +03:00
Lynne
4537d9554d
x86/tx_float: implement inverse MDCT AVX2 assembly
This commit implements an iMDCT in pure assembly.

This is capable of processing any mod-8 transforms, rather than just
power of two, but since power of two is all we have assembly for
currently, that's what's supported.
It would really benefit if we could somehow use the C code to decide
which function to jump into, but exposing function labels from assebly
into C is anything but easy.
The post-transform loop could probably be improved.

This was somewhat annoying to write, as we must support arbitrary
strides during runtime. There's a fast branch for stride == 4 bytes
and a slower one which uses vgatherdps.

Zen 3 benchmarks for stride == 4 for old (av_imdct_half) vs new (av_tx):

128pt:
   2811 decicycles in         av_tx (imdct),16775916 runs,   1300 skips
   3082 decicycles in         av_imdct_half,16776751 runs,    465 skips

256pt:
   4920 decicycles in         av_tx (imdct),16775820 runs,   1396 skips
   5378 decicycles in         av_imdct_half,16776411 runs,    805 skips

512pt:
   9668 decicycles in         av_tx (imdct),16775774 runs,   1442 skips
  10626 decicycles in         av_imdct_half,16775647 runs,   1569 skips

1024pt:
  19812 decicycles in         av_tx (imdct),16777144 runs,     72 skips
  23036 decicycles in         av_imdct_half,16777167 runs,     49 skips
2022-09-06 04:21:46 +02:00