This commit adds a pure x86 assembly SIMD version of the FFT in libavutil/tx.
The design of this pure assembly FFT is pretty unconventional.
On the lowest level, instead of splitting the complex numbers into
real and imaginary parts, we keep complex numbers together but split
them in terms of parity. This saves a number of shuffles in each transform,
but more importantly, it splits each transform into two independent
paths, which we process using separate registers in parallel.
This allows us to keep all units saturated and lets us use all available
registers to avoid dependencies.
Moreover, it allows us to double the granularity of our per-load permutation,
skipping many expensive lookups and allowing us to use just 4 loads per register,
rather than 8, or in case FMA3 (and by extension, AVX2), use the vgatherdpd
instruction, which is at least as fast as 4 separate loads on old hardware,
and quite a bit faster on modern CPUs).
Higher up, we go for a bottom-up construction of large transforms, foregoing
the traditional per-transform call-return recursion chains. Instead, we always
start at the bottom-most basis transform (in this case, a 32-point transform),
and continue constructing larger and larger transforms until we return to the
top-most transform.
This way, we only touch the stack 3 times per a complete target transform:
once for the 1/2 length transform and two times for the 1/4 length transform.
The combination algorithm we use is a standard Split-Radix algorithm,
as used in our C code. Although a version with less operations exists
(Steven G. Johnson and Matteo Frigo's "A modified split-radix FFT with fewer
arithmetic operations", IEEE Trans. Signal Process. 55 (1), 111–119 (2007),
which is the one FFTW uses), it only has 2% less operations and requires at least 4x
the binary code (due to it needing 4 different paths to do a single transform).
That version also has other issues which prevent it from being implemented
with SIMD code as efficiently, which makes it lose the marginal gains it offered,
and cannot be performed bottom-up, requiring many recursive call-return chains,
whose overhead adds up.
We go through a lot of effort to minimize load/stores by keeping as much in
registers in between construcring transforms. This saves us around 32 cycles,
on paper, but in reality a lot more due to load/store aliasing (a load from a
memory location cannot be issued while there's a store pending, and there are
only so many (2 for Zen 3) load/store units in a CPU).
Also, we interleave coefficients during the last stage to save on a store+load
per register.
Each of the smallest, basis transforms (4, 8 and 16-point in our case)
has been extremely optimized. Our 8-point transform is barely 20 instructions
in total, beating our old implementation 8-point transform by 1 instruction.
Our 2x8-point transform is 23 instructions, beating our old implementation by
6 instruction and needing 50% less cycles. Our 16-point transform's combination
code takes slightly more instructions than our old implementation, but makes up
for it by requiring a lot less arithmetic operations.
Overall, the transform was optimized for the timings of Zen 3, which at the
time of writing has the most IPC from all documented CPUs. Shuffles were
preferred over arithmetic operations due to their 1/0.5 latency/throughput.
On average, this code is 30% faster than our old libavcodec implementation.
It's able to trade blows with the previously-untouchable FFTW on small transforms,
and due to its tiny size and better prediction, outdoes FFTW on larger transforms
by 11% on the largest currently supported size.
This sadly required making changes to the code itself,
due to the same context needing to be reused for both versions.
The lookup table had to be duplicated for both versions.
This commit refactors the power-of-two FFT, making it faster and
halving the size of all tables, making the code much smaller on
all systems.
This removes the big/small pass split, because on modern systems
the "big" pass is always faster, and even on older machines there
is no measurable speed difference.
av_set_cpu_flags_mask() has been deprecated in the commit which merged
it: 6df42f98746be06c883ce683563e07c9a2af983f; av_parse_cpu_flags() has
been deprecated in 4b529edff8934c258af95e5acc51f84deea66262.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Some files currently rely on libavutil/cpu.h to include it for them;
yet said file won't use include it any more after the currently
deprecated functions are removed, so include attributes.h directly.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
av_frame_copy() is allowed to return values >= 0 on success, whereas
the documentation of av_frame_ref() states that the return value is 0 on
success. Ergo the latter must not just return the former's return value.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
av_cpu_count() intends to emit a debug message containing the number of
logical cores when called the first time. The check currently works with
a static volatile int; yet this does not help at all in case of
concurrent accesses by multiple threads. So replace this with an
atomic_int.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
av_adler32_update() is used by av_hash_update() which will be switched
to size_t at the next bump. So it also has to be made to use size_t.
This is also necessary for framecrcenc.c, because the size of side data
will become a size_t, too.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
av_bprint_finalize() can still fail even when it has been checked that
the AVBPrint is currently complete: Namely if the string was so short
that it fit into the AVBPrint's internal buffer.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
Fixes: Integer overflow and division by 0
Fixes: poc-202102-div.mov
Found-by: 1vanChen of NSFOCUS Security Team
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
clang errors when compiling with C++11 about needing spaces between
literal and identifier
Signed-off-by: Christopher Degawa <ccom@randomderp.com>
Signed-off-by: James Almer <jamrial@gmail.com>
Base escaping only escapes values required for base character data
according to part 2.4 of XML, and if additional flags are added
single and double quotes can additionally be escaped in order
to handle single and double quoted attributes.
Co-authored-by: Jan Ekström <jan.ekstrom@24i.com>
Signed-off-by: Jan Ekström <jan.ekstrom@24i.com>
Fixes: signed integer overflow: -9223372053736 * 1000000 cannot be represented in type 'long'
Fixes: 26910/clusterfuzz-testcase-minimized-ffmpeg_dem_CONCAT_fuzzer-6607924558430208
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
out[lut[i]] = in[i] lookups were 4.04 times(!) slower than
out[i] = in[lut[i]] lookups for an out-of-place FFT of length 4096.
The permutes remain unchanged for anything but out-of-place monolithic
FFT, as those benefit quite a lot from the current order (it means
there's only 1 lookup necessary to add to an offset, rather than
a full gather).
The code was based around non-power-of-two FFTs, so this wasn't
benchmarked early on.
No buffer will be fetched from the pool after it's uninitialized, so there's
no benefit from waiting until every single buffer has been returned to it
before freeing them all.
This should free some memory in certain scenarios, which can be beneficial in
low memory systems.
Based on a patch by Jonas Karlman.
Reviewed-by: Anton Khirnov <anton@khirnov.net>
Signed-off-by: James Almer <jamrial@gmail.com>
This is much less precise than the cycle counter register, but
the cycle counter register is not available on apple platforms
(and on linux, it requires a kernel module for allowing user mode
access).
Signed-off-by: Martin Storsjö <martin@martin.st>
This commit adds support for in-place FFT transforms. Since our
internal transforms were all in-place anyway, this only changes
the permutation on the input.
Unfortunately, research papers were of no help here. All focused
on dry hardware implementations, where permutes are free, or on
software implementations where binary bloat is of no concern so
storing dozen times the transforms for each permutation and version
is not considered bad practice.
Still, for a pure C implementation, it's only around 28% slower
than the multi-megabyte FFTW3 in unaligned mode.
Unlike a closed permutation like with PFA, split-radix FFT bit-reversals
contain multiple NOPs, multiple simple swaps, and a few chained swaps,
so regular single-loop single-state permute loops were not possible.
Instead, we filter out parts of the input indices which are redundant.
This allows for a single branch, and with some clever AVX512 asm,
could possibly be SIMD'd without refactoring.
The inplace_idx array is guaranteed to never be larger than the
revtab array, and in practice only requires around log2(len) entries.
The power-of-two MDCTs can be done in-place as well. And it's
possible to eliminate a copy in the compound MDCTs too, however
it'll be slower than doing them out of place, and we'd need to dirty
the input array.
This patch also fixes a -Wtautological-constant-out-of-range-compare
warning from Clang and a -Wtype-limits warning from GCC on systems
where size_t is 64bits and unsigned 32bits. The reason for this seems
to be that variable (whose value derives from sizeof() and can therefore
be known at compile-time) is used instead of using sizeof() directly in
the comparison.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
libavutil/common.h is a public header that provides generic math
functions whereas libavutil/intmath.h is a private header that contains
plattform-specific optimized versions of said math functions. common.h
includes intmath.h (when building the FFmpeg libraries) so that the
optimized versions are used for them.
This interdependency sometimes causes trouble: intmath.h once contained
an inlined ff_sqrt function that relied upon av_log2_16bit. In case there
was no optimized logarithm available on this plattform, intmath.h needed
to include common.h to get the generic implementation and this has been
done after the optimized versions (if any) have been provided so that
common.h used the optimized versions; it also needed to be done before
ff_sqrt. Yet when intmath.h was included from common.h and if an ordinary
inclusion guard was used by common.h, the #include "common.h" in intmath.h
was a no-op and therefore av_log2_16bit was still unknown at the end of
intmath.h (and also in ff_sqrt) if no optimized version was available.
Before a955b5965825631986ba854d007d4e934e466c7d this was solved by
duplicating the #ifndef av_log2_16bit check after the inclusion of
common.h in intmath.h; said commit instead moved these checks to the
end of common.h, outside the inclusion guards and made common.h include
itself to get these unguarded defines. This is still the current
state of affairs.
Yet this is unnecessary since 9734b8ba56d05e970c353dfd5baafa43fdb08024
as said commit removed ff_sqrt as well as the #include "common.h" from
intmath.h. Therefore this commit moves everything inside the inclusion
guards and makes common.h not include itself.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>