Previously, the LC3 encoder only accepted planar float (AV_SAMPLE_FMT_FLTP).
This change extends support to packed float (AV_SAMPLE_FMT_FLT) by properly
handling channel layout and sample stride.
The PCM data pointer and stride are now derived from the sample
format: for planar, frame->data[ch] is used; for packed, frame->data[0]
plus a per-channel offset. The stride is 1 for planar layouts and the
number of channels for packed layouts.
This enables encoding from common packed audio sources without requiring
a prior planar conversion, improving usability and efficiency.
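A minimal sketch of the selection logic described above (the helper name
and surrounding context are illustrative, not the literal patch):

static void get_pcm(const AVCodecContext *avctx, const AVFrame *frame,
                    int ch, const float **pcm, int *stride)
{
    if (av_sample_fmt_is_planar(avctx->sample_fmt)) {
        *pcm    = (const float *)frame->data[ch];     /* one plane per channel */
        *stride = 1;
    } else {
        *pcm    = (const float *)frame->data[0] + ch; /* interleaved samples */
        *stride = avctx->ch_layout.nb_channels;
    }
}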
Signed-off-by: cenzhanquan1 <cenzhanquan1@xiaomi.com>
1. Adds support for respecting the requested sample format. Previously,
the decoder always used AV_SAMPLE_FMT_FLTP. Now it checks if the
caller requested a specific format via avctx->request_sample_fmt and
honors that request when supported.
2. Improves planar/interleaved audio buffer handling. The decoding
logic now properly handles both planar and interleaved sample
formats by calculating the correct stride and buffer pointers based
on the actual sample format.
The changes include:
- Added a format mapping between AVSampleFormat and lc3_pcm_format (see the sketch below).
- Implemented format selection logic in initialization.
- Updated buffer pointer calculation for planar/interleaved data.
- Maintained backward compatibility with existing behavior.
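A hedged sketch of such a mapping (the lc3_pcm_format values are taken
from the liblc3 headers; the helper name and the exact set of supported
formats are assumptions):

static int map_sample_fmt(enum AVSampleFormat fmt, enum lc3_pcm_format *out)
{
    switch (fmt) {
    case AV_SAMPLE_FMT_S16:  *out = LC3_PCM_FORMAT_S16;   return 0;
    case AV_SAMPLE_FMT_FLT:
    case AV_SAMPLE_FMT_FLTP: *out = LC3_PCM_FORMAT_FLOAT; return 0;
    default:                 return AVERROR(EINVAL);
    }
}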
Signed-off-by: cenzhanquan1 <cenzhanquan1@xiaomi.com>
Add a dtls_active flag to specify the DTLS role, and properly set the
send key and recv key depending on that role:
As the DTLS server, the recv key is the client master key plus salt,
and the send key is the server master key plus salt.
As the DTLS client, the recv key is the server master key plus salt,
and the send key is the client master key plus salt.
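In pseudo-C, the role-dependent selection looks like this (variable
names are illustrative; "active" is the DTLS client role):

if (dtls_active) { /* we are the DTLS client */
    send_key = client_master_key_plus_salt;
    recv_key = server_master_key_plus_salt;
} else {           /* we are the DTLS server */
    send_key = server_master_key_plus_salt;
    recv_key = client_master_key_plus_salt;
}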
Signed-off-by: Jack Lau <jacklau1222@qq.com>
If using the delay_moov flag in combination with hybrid_fragment
(which is otherwise a potentially problematic combination - the
ftyp box ends up hidden at the end), we need to flush twice to get
both the moov box and the first fragment, if the file is finished
before the first fragment is completed.
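Conceptually, the double flush amounts to this (an illustrative sketch,
not the literal patch; mov_flush_fragment() is the existing helper in
movenc.c):

ret = mov_flush_fragment(s, 1); /* first flush emits the delayed moov */
if (ret < 0)
    return ret;
ret = mov_flush_fragment(s, 1); /* second flush emits the first fragment */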
If samples were available when the moov was written, chunking
for those samples has been done already, which has to be reset
here.
This is the case when not using empty_moov, when the moov box
describes the first fragment - this case was accounted for already.
But if using the delay_moov flag, then those samples also were
available when writing the moov, so chunking for them has already
been done in this case as well.
Therefore, always reset chunking here (doing so should be harmless
in all cases), and update the comment to clarify the cases involved.
When calculating the max size of an output PNG packet, we should
include the size of any eXIf chunk we may write.
This fixes a regression since d3190a64c3
as well as a bug that existed prior in the apng encoder since commit
4a580975d4.
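For reference, each PNG chunk carries 12 bytes of framing (4-byte
length, 4-byte type, 4-byte CRC) on top of its payload, so the
accounting is along these lines (variable names are illustrative):

if (exif_data)
    max_packet_size += 12 + exif_size; /* length + type + payload + CRC */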
Signed-off-by: Leo Izen <leo.izen@gmail.com>
When splitting a 5-line image into 2 slices, one slice will be 3
lines and thus needs more space.
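Illustrative arithmetic (not the literal buffer-size code): splitting
h lines into n slices yields slices of up to ceil(h / n) lines, so
per-slice buffers must be sized for that, not for h / n.

int max_slice_h = (h + n - 1) / n; /* e.g. (5 + 2 - 1) / 2 = 3 */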
Fixes: Assertion sc->slice_coding_mode == 0 failed at libavcodec/ffv1enc.c:1668
Fixes: 422811239/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_FFV1_fuzzer-4933405139861504
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
We do not support larger tiles, as we use a signed int.
Alternatively, we could check this in apv_decode_tile_component() or
init_get_bits*(), or support bitstreams above 2 GB in length.
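A hedged sketch of such a check (the exact variable names are
assumptions):

/* init_get_bits() takes the buffer size in bits as a signed int,
   so any tile whose size in bits would exceed INT_MAX must be
   rejected up front. */
if (tile_size > INT_MAX / 8)
    return AVERROR_INVALIDDATA;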
Fixes: init_get_bits() failure later
Fixes: 421817631/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_APV_fuzzer-4957386534354944
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This has the advantage of not violating the ABI by using
MMX registers without issuing emms; e.g., it allows removing
an emms_c from bink.c.
This commit uses GP registers on Unix64 (there are not
enough volatile registers to do likewise on Win64), which
reduces codesize and is faster on some CPUs.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Snow calls some of the me_cmp_funcs with insufficient alignment
for the first pointer (see get_block_rd() in snowenc.c);
therefore, SSE2 functions which really need this alignment
don't get set for Snow, and 542765ce3e consequently didn't remove
the MMXEXT functions that are overridden by these SSE2 functions
for normal codecs.
For reference, here is a command line which would segfault
if one simply used the ordinary SSE2 functions for Snow:
./ffmpeg -i mm-short.mpg -an -vcodec snow -t 0.2 -pix_fmt yuv444p \
-vstrict -2 -qscale 2 -flags +qpel -motion_est iter 444iter.avi
This commit adds unaligned SSE2 versions of these functions
and removes the MMXEXT ones. In particular, this implies that
sad 16x16 now never uses MMX, which allows removing an emms_c
from ac3enc.c.
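The aligned/unaligned distinction, sketched with intrinsics for clarity
(the actual functions are written in assembly; pix1/pix2 stand for the
usual me_cmp pointer arguments):

#include <emmintrin.h>
__m128i a = _mm_loadu_si128((const __m128i *)pix1); /* movdqu: no alignment required */
__m128i b = _mm_load_si128 ((const __m128i *)pix2); /* movdqa: needs 16-byte alignment */
__m128i d = _mm_sad_epu8(a, b);                     /* psadbw: sums of absolute differences */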
Benchmarks (u means unaligned version):
sad_0_c: 8.2 ( 1.00x)
sad_0_mmxext: 10.8 ( 0.76x)
sad_0_sse2: 6.2 ( 1.33x)
sad_0_sse2u: 6.7 ( 1.23x)
vsad_0_c: 44.7 ( 1.00x)
vsad_0_mmxext (approx): 12.2 ( 3.68x)
vsad_0_sse2 (approx): 7.8 ( 5.75x)
vsad_4_c: 88.4 ( 1.00x)
vsad_4_mmxext: 7.1 (12.46x)
vsad_4_sse2: 4.2 (21.15x)
vsad_4_sse2u: 5.5 (15.96x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The SSE2 functions overriding them are currently only set
if the SSE2SLOW flag is not set and the codec is not Snow.
The former affects only outdated processors (AMDs from
before Barcelona, i.e. before 2007) and is therefore irrelevant.
Snow does not use the pix_abs function pointers at all,
so that is no obstacle either.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The new functions are faster than the existing exact
functions, yet are beaten by the nonexact functions
(which can avoid unpacking to words and back).
The exact (slow) MMX functions have therefore been
removed, which was actually beneficial size-wise
(416B of new functions, 619B of functions removed).
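Sketch of why the approximate variants win, using intrinsics for
clarity (the real code is assembly; a..d stand for the four
neighboring pixel vectors of the half-pel interpolation):

#include <emmintrin.h>
/* approximate: chained byte averages, each rounding (x + y + 1) >> 1,
   instead of the exact (a + b + c + d + 2) >> 2, which requires
   unpacking bytes to words and repacking afterwards */
__m128i approx = _mm_avg_epu8(_mm_avg_epu8(a, b), _mm_avg_epu8(c, d));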
pix_abs_0_3_c: 216.8 ( 1.00x)
pix_abs_0_3_mmx: 71.8 ( 3.02x)
pix_abs_0_3_mmxext (approximative): 17.6 (12.34x)
pix_abs_0_3_sse2: 23.5 ( 9.23x)
pix_abs_0_3_sse2 (approximative): 9.9 (21.94x)
pix_abs_1_3_c: 98.4 ( 1.00x)
pix_abs_1_3_mmx: 36.9 ( 2.66x)
pix_abs_1_3_mmxext (approximative): 9.2 (10.73x)
pix_abs_1_3_sse2: 14.8 ( 6.63x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is a major rewrite of the existing nlmeans Vulkan code, with bug
fixes and major performance improvements.
Fix visual artifacts reported in tickets #10661 and #10733. Add OOB
checks for image loading and for the patch-sized area around the
border. Correct the chroma plane height, strength, and buffer barrier
index.
Improve parallelism with a per-component workgroup axis and more,
but smaller, workgroups. Split the weights pass into
vertical/horizontal (integral) and weights passes. Remove the h/v
order logic and always calculate the sum on the vertical pass.
Remove the atomic-float requirement, which caused high memory-locking
contention, at the cost of higher memory usage for the w/s buffer.
Use cache blocking in the horizontal pass to reduce memory bandwidth
usage.
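For reference, the vertical/horizontal split computes a summed-area
(integral) table in two separable passes; a scalar C sketch of the idea
(the real code is a Vulkan compute shader):

for (int y = 0; y < h; y++)          /* horizontal pass: row prefix sums */
    for (int x = 1; x < w; x++)
        ii[y * w + x] += ii[y * w + x - 1];
for (int y = 1; y < h; y++)          /* vertical pass: column accumulation */
    for (int x = 0; x < w; x++)
        ii[y * w + x] += ii[(y - 1) * w + x];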
Write the moov tag at the end first, before overwriting the mdat size
at the start of the file.
In case writing the final moov box fails (e.g. due to running out
of disk space), we haven't broken the initial moov box yet.
Thus if writing stops between these steps, we could end up with
a file with two moov boxes - which arguably is more feasible to
recover from than a file with no moov boxes at all.
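Conceptually, the new ordering is (an illustrative sketch, not the
literal patch; mov_write_moov_tag() and the mdat fields exist in
movenc, but the details here are simplified):

if ((ret = mov_write_moov_tag(pb, mov, s)) < 0)
    return ret;                        /* failing here leaves the mdat size intact */
avio_seek(pb, mov->mdat_pos, SEEK_SET);
avio_wb32(pb, mov->mdat_size + 8);     /* only now patch the mdat box size */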
The documentation states that this field is for enabling "extra" usage
flags. This conflicts with the implementation and with the rest of the
comment, though.
In resolving this ambiguity, I think it's better to lean towards the
first sentence and treat this field purely as specifying *extra* usage
flags to enable. Otherwise, this may break Vulkan encoding or
subsequent hwdownload if the upstream filter did not specifically
advertise this.
Change the default behavior and update the documentation slightly to more
clearly document the semantics.
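A hedged sketch of the new semantics (variable names are illustrative):

/* the user-provided flags now extend the defaults instead of
   replacing them */
VkImageUsageFlags usage = default_usage | extra_usage_from_user;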
This avoids having to fix up ABI violations via emms_c and
also leads to a 73% speedup for the line noise average version
here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
When using averaged noise with height > MAX_RES (i.e. 4096),
multiple threads would access the same prev_shift slot,
leading to races. Fix this by disabling slice threading
in such scenarios.
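Illustrative sketch of the fix (the exact condition and field names are
assumptions):

/* with height > MAX_RES, two slices can map to the same prev_shift
   slot, so fall back to single-threaded filtering */
if (inlink->h > MAX_RES)
    ctx->thread_type &= ~AVFILTER_THREAD_SLICE;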
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is not only UB, but also leads to races and nondeterministic
output, because the write one element past the end of the buffer
conflicts with accesses by the thread that actually owns that memory.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
"all" only exists to set options; it does not need the big arrays
contained in FilterParams.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This reverts commit 301141b576.
cluster[0].dts, pts and frag_info[0].time are already in the
presentation timeline, so they shouldn't be shifted by start_pts.
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
The diff and var functions benefit from psadbw; comb benefits from
wider registers, which allow avoiding reloads and reduce the number
of loads from 48 to 10. Performance increased by 117% (the loop
in compute_metric() has been timed); codesize decreased by 144B.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This allows removing an emms_c from the filter. It also gives a
25% speedup here (when timing the calls to store_slice using
START/STOP_TIMER).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Improves performance and no longer breaks the ABI (which the old
code did by forgetting to call emms).
Old benchmarks:
add_8x8basis_c: 43.6 ( 1.00x)
add_8x8basis_ssse3: 12.3 ( 3.55x)
New benchmarks:
add_8x8basis_c: 43.0 ( 1.00x)
add_8x8basis_ssse3: 6.3 ( 6.79x)
Notice that the output of try_8x8basis_ssse3 changes a bit:
before this commit, it computed certain values and added the values
for i, i+1, i+4 and i+5 before right-shifting them; now it adds
the values for i, i+1, i+8 and i+9. The second pair in these lists
could be avoided (by shifting xmm0 and xmm1 before adding both
together instead of only shifting xmm0 after adding them), but the
i, i+1 pairing is inherent in using pmaddwd. This is the reason that
this function is not bitexact.
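For reference, pmaddwd multiplies adjacent signed 16-bit pairs and sums
each pair into one 32-bit lane, which is why the i, i+1 terms are
combined before any right shift (intrinsics shown for clarity; the real
code is assembly):

#include <emmintrin.h>
__m128i sums = _mm_madd_epi16(v, basis); /* (v0*b0 + v1*b1), (v2*b2 + v3*b3), ... */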
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The only requirement of this code (and essentially of the pmulhrsw
instruction) is that the scaled scale fits into an int16_t.
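For reference, pmulhrsw computes, per signed 16-bit lane:

/* dst = (int16_t)((a * b + 0x4000) >> 15), i.e. a rounded high
   multiply, which only needs the operands to fit in int16_t */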
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>