The 8x4 and 4x4 use a needlessly large multiplier (unless/until we care
about embedded 64-bit-vector hardware). This is merely suboptimal.
The 8x4 case also uses an incorrect vector length, which leads to incorrect
behaviour on future/hypothetical hardware with 256-bit or larger vectors.
Pointed-out-by: Martin Storsjö <martin@martin.st>
We can't call ff_get_rv_vlenb() if we don't have RVV available
at all.
Acked-by: Rémi Denis-Courmont <remi@remlab.net>
Signed-off-by: Martin Storsjö <martin@martin.st>
The loop iterates over the length of the vector, not the order. This is
to avoid reloading the same data for each lag value. However this means
the loop only works if the maximum order is no larger than VLENB.
The loop is roughly equivalent to:
for (size_t j = 0; j < lag; j++)
autoc[j] = 1.;
while (len > lag) {
for (ptrdiff_t j = 0; j < lag; j++)
autoc[j] += data[j] * *data;
data++;
len--;
}
while (len > 0) {
for (ptrdiff_t j = 0; j < len; j++)
autoc[j] += data[j] * *data;
data++;
len--;
}
Since register pressure is only at 50%, it should be possible to implement
the same loop for order up to 2xVLENB. But this is left for future work.
Performance numbers are all over the place from ~1.25x to ~4x speedups,
but at least they are always noticeably better than nothing.
Fixes: out of array access
Fixes: 62603/clusterfuzz-testcase-minimized-ffmpeg_DEMUXER_fuzzer-5837632490569728
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Intended to replace https://patchwork.ffmpeg.org/project/ffmpeg/patch/20230802000135.26482-3-michael@niedermayer.cc/
with a more accurate block decoding magnitude bound.
Fixes: 62433/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_JPEG2000_fuzzer-5828618092937216
Fixes: 58299/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_JPEG2000_fuzzer-5828618092937216
Previous-version-reviewed-by: Tomas Härdin <git@haerdin.se>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This patch allows the JXL parser to parse sequences of extremely small
files concatenated together. (e.g. smaller than the parser buffer)
Signed-off-by: Leo Izen <leo.izen@gmail.com>
According to ISO/IEC 14996-12, size == 1 means a 64-bit extended-size
field occurs *after* the 32-bit box type, not before. This fix should
allow correct parsing of JXL files with extended-size boxes.
Signed-off-by: Leo Izen <leo.izen@gmail.com>
The type of qsv decoders is FF_CODEC_CB_TYPE_DECODE which must not
return AVERROR(EAGAIN). commit 42b20c9 added an assertion to check the
returned value.
Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
Building iOS platform with arm64, the compiler has a warning: "instruction movi.2d with immediate #0 may not function correctly on this CPU, converting to movi.16b"
Signed-off-by: xufuji456 <839789740@qq.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
This fixes corrupted audio for applications relying on ch_layout when
codec downmixing is active.
Signed-off-by: Geoffrey McRae <geoff@hostfission.com>
Signed-off-by: James Almer <jamrial@gmail.com>
Applications making use of this codec with the `downmix` option are
segfaulting unless the `ch_layout` is overridden after `avcodec_open2`
as can be seen in projects like MythTV[1]
This patch fixes this by overriding the ch_layout as done in other
decoders such as AC3.
1: af6f362a14/mythtv/libs/libmythtv/decoders/avformatdecoder.cpp (L4607)
Signed-off-by: Geoffrey McRae <geoff@hostfission.com>
Signed-off-by: James Almer <jamrial@gmail.com>
The input is laid out in 16 segments, of which 13 actually need to be
loaded. There are no really efficient ways to deal with this:
1) If we load 8 segments wit unit stride, then narrow to 16 segments with
right shifts, we can only get one half-size vector per segment, or just 2
elements per vector (EMUL=1/2) - at least with 128-bit vectors.
This ends up unsurprisingly about as fas as the C code.
2) The current approach is to load with strides. We keep that approach,
but improve it using three 4-segmented loads instead of 12 single-segment
loads. This divides the number of distinct loaded addresses by 4.
3) A potential third approach would be to avoid segmentation altogether
and splat the scalar coefficient into vectors. Then we can use a
unit-stride and maximum EMUL. But the downside then is that we have to
multiply the 3 (of 16) unused segments with zero as part of the
multiply-accumulate operations.
In addition, we also reuse vectors mid-loop so as to increase the EMUL
from 1 to 2, which also improves performance a little bit.
Oeverall the gains are quite small with the device under test, as it does
not deal with segmented loads very well. But at least the code is tidier,
and should enjoy bigger speed-ups on better hardware implementation.
Before:
ps_hybrid_analysis_c: 1819.2
ps_hybrid_analysis_rvv_f32: 1037.0 (before)
ps_hybrid_analysis_rvv_f32: 990.0 (after)
This stores the constant coefficients deinterleaved, so that they can be
loaded directly with NF=0. Unfortunately, we cannot optimise loading the
input, due to insufficient memory alignment (not 32-bit).
Before:
g722_apply_qmf_c: 82.5
g722_apply_qmf_rvv_i32: 78.2
After:
g722_apply_qmf_c: 82.5
g722_apply_qmf_rvv_i32: 65.2
Validate that a hw_frames_ctx is available before using it for
the AVHWAccel.free_frame_priv callback, and don't require it to
be present when the callback is not in use by the HWAccel.
v2: check for free_frame_priv (Hendrik)
v3: return EINVAL (Christoph Reiter)
v4: better commit message (Hendrik)
v5: fix typo with missed frames_ctx (Lynne)
See[1]: https://github.com/msys2/MINGW-packages/pull/19050
Fixes: be07145109 ("avcodec: add AVHWAccel.free_frame_priv callback")
CC: Lynne <dev@lynne.ee>
CC: Christoph Reiter <reiter.christoph@gmail.com>
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
For fate-h264_mp4toannexb_ticket5927 and
fate-h264_mp4toannexb_ticket5927_2, they work by accident
previously. The sample file has two 'avc1' entries, and video
samples use the second one. It means packets should be decoded with
new extradata in side data. Before this patch, only extradata was
kept in the output, new extradata has been dropped. The output can
be decoded because the two extradata are almost the same, except
level indication. This patch fixed the issue, and add another
fate test.
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
If there is a single group of SPS/PPS before an IDR frame, but no
SPS/PPS after that, we will miss the chance to reset
idr_sps_seen/idr_pps_seen. No SPS/PPS are inserted afterwards.
This patch saves in-band SPS/PPS and insert them before IDR frames
when necessary.
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>