Up until now, both the msmpeg4 decoders and encoders initialized several
RLTables common to them (the decoders also initialized the VLCs of these
RLTables). This is an obstacle to making these codecs init-threadsafe.
So move this initialization to ff_msmpeg4_common_init() that already
contains this initialization code. This allows to reuse the AVOnce used
for initializing ff_v2_dc_lum/chroma_table which automatically makes
initializing these RLTables thread-safe.
Reviewed-by: Anton Khirnov <anton@khirnov.net>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
Up until now the RLTable ff_mpeg4_rl_intra was initialized by both mpeg4
decoder and encoder (except the VLCs that are only used by the decoder).
This is an obstacle to making these codecs init-threadsafe, so move
initializing this to a single function that is guarded by a dedicated
AVOnce.
Reviewed-by: Anton Khirnov <anton@khirnov.net>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
This already makes several encoders (namely FLV, H.263, H.263+ and
RealVideo 1.0 and 2.0 and SVQ1) that use this init-threadsafe.
It also makes the Snow encoder init-threadsafe; it was already marked
as such since commit d49210788b0836d56dd872d517fe73f83b080101, because
it was thought to be harmless if one and the same object was
initialized by multiple threads at the same time.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
The documentation of the get_encode_buffer() callback does not require
to zero the padding; therefore we do it in ff_get_encode_buffer().
This also constitutes an implicit check for whether the buffer is
actually allocated with padding.
The memset in avcodec_default_get_encode_buffer() is now redundant and
has been removed.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Also improve readability by keeping a pointer to the IVIBandDesc that is
currently freed.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This encoder sets the min_size in ff_alloc_packet2(), so it can not rely
on av_packet_make_refcounted() to zero the padding.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The ASS margins are utilized to generate percentual values, as
the usage of cell-based sizing and offsetting seems to be not too
well supported by renderers.
Signed-off-by: Jan Ekström <jan.ekstrom@24i.com>
Attempts to utilize the TTML cell resolution as a mapping to the
reference resolution, and maps font size to cell size. Additionally
sets the display and text alignment according to the ASS alignment
number.
Signed-off-by: Jan Ekström <jan.ekstrom@24i.com>
This way the encoder may pass on the following values to the muxer:
1) Additional root "tt" element attributes, such as the subtitle
canvas reference size.
2) Anything before the body element of the document, such as regions
in the head element, which can configure styles.
Signed-off-by: Jan Ekström <jan.ekstrom@24i.com>
Format is still used by modders of these old games.
Reviewed-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Signed-off-by: Aidan Richmond <aidan.is@hotmail.co.uk>
Signed-off-by: Zane van Iperen <zane@zanevaniperen.com>
ADTS frames may contain up to 768 bytes per channel. With 16 channels,
this is 12k, which cannot fit into the maximum 8k buffer.
Signed-off-by: Chris Ribble <chris.ribble@resi.io>
With JPEG-LS PAL8 samples, the JPEG-LS extension parameters signaled with
the LSE marker show up after SOF but before SOS. For those, the pixel format
chosen by get_format() in SOF is GRAY8, and then replaced by PAL8 in LSE.
This has not been an issue given both pixel formats allocate the second data
plane for the palette, but after the upcoming soname bump, GRAY8 will no longer
do that. This will result in segfauls when ff_jpegls_decode_lse() attempts to
write the palette on a buffer originally allocated as a GRAY8 one.
Work around this by calling ff_get_buffer() after the actual pixel format is
known.
Signed-off-by: James Almer <jamrial@gmail.com>
Fixes a crash when decoding VQA files.
Regression since c012f9b265e172de9c240c9dfab8665936fa3e83.
Reported-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Signed-off-by: Zane van Iperen <zane@zanevaniperen.com>
This commit adds a pure x86 assembly SIMD version of the FFT in libavutil/tx.
The design of this pure assembly FFT is pretty unconventional.
On the lowest level, instead of splitting the complex numbers into
real and imaginary parts, we keep complex numbers together but split
them in terms of parity. This saves a number of shuffles in each transform,
but more importantly, it splits each transform into two independent
paths, which we process using separate registers in parallel.
This allows us to keep all units saturated and lets us use all available
registers to avoid dependencies.
Moreover, it allows us to double the granularity of our per-load permutation,
skipping many expensive lookups and allowing us to use just 4 loads per register,
rather than 8, or in case FMA3 (and by extension, AVX2), use the vgatherdpd
instruction, which is at least as fast as 4 separate loads on old hardware,
and quite a bit faster on modern CPUs).
Higher up, we go for a bottom-up construction of large transforms, foregoing
the traditional per-transform call-return recursion chains. Instead, we always
start at the bottom-most basis transform (in this case, a 32-point transform),
and continue constructing larger and larger transforms until we return to the
top-most transform.
This way, we only touch the stack 3 times per a complete target transform:
once for the 1/2 length transform and two times for the 1/4 length transform.
The combination algorithm we use is a standard Split-Radix algorithm,
as used in our C code. Although a version with less operations exists
(Steven G. Johnson and Matteo Frigo's "A modified split-radix FFT with fewer
arithmetic operations", IEEE Trans. Signal Process. 55 (1), 111–119 (2007),
which is the one FFTW uses), it only has 2% less operations and requires at least 4x
the binary code (due to it needing 4 different paths to do a single transform).
That version also has other issues which prevent it from being implemented
with SIMD code as efficiently, which makes it lose the marginal gains it offered,
and cannot be performed bottom-up, requiring many recursive call-return chains,
whose overhead adds up.
We go through a lot of effort to minimize load/stores by keeping as much in
registers in between construcring transforms. This saves us around 32 cycles,
on paper, but in reality a lot more due to load/store aliasing (a load from a
memory location cannot be issued while there's a store pending, and there are
only so many (2 for Zen 3) load/store units in a CPU).
Also, we interleave coefficients during the last stage to save on a store+load
per register.
Each of the smallest, basis transforms (4, 8 and 16-point in our case)
has been extremely optimized. Our 8-point transform is barely 20 instructions
in total, beating our old implementation 8-point transform by 1 instruction.
Our 2x8-point transform is 23 instructions, beating our old implementation by
6 instruction and needing 50% less cycles. Our 16-point transform's combination
code takes slightly more instructions than our old implementation, but makes up
for it by requiring a lot less arithmetic operations.
Overall, the transform was optimized for the timings of Zen 3, which at the
time of writing has the most IPC from all documented CPUs. Shuffles were
preferred over arithmetic operations due to their 1/0.5 latency/throughput.
On average, this code is 30% faster than our old libavcodec implementation.
It's able to trade blows with the previously-untouchable FFTW on small transforms,
and due to its tiny size and better prediction, outdoes FFTW on larger transforms
by 11% on the largest currently supported size.
This sadly required making changes to the code itself,
due to the same context needing to be reused for both versions.
The lookup table had to be duplicated for both versions.