qmat and bias always have a constant offset, so one can use one register
to address both of them. This allows to remove the check for HAVE_6REGS
(untested on a system where HAVE_6REGS is false).
Also avoid FF_REG_a while at it.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Both GCC and Clang completely unroll the unlikely loop at -O3,
leading to codesize bloat; their code is also suboptimal, as they
don't make use of pmulhrsw (even with -mssse3). This commit
therefore ports the whole function to external assembly. The new
function occupies 176B here vs 1406B for GCC.
Benchmarks for a testcase with huge qscale (notice that the C version
is unrolled just like the unlikely loop in the SSSE3 version):
add_8x8basis_c: 43.4 ( 1.00x)
add_8x8basis_ssse3 (old): 43.6 ( 1.00x)
add_8x8basis_ssse3 (new): 11.9 ( 3.63x)
Reviewed-by: Kieran Kunhya <kieran@kunhya.com>
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It is very simple to remove the MPVEncContext from it.
Notice that this also fixes a bug in x86/mpegvideoenc.c: It only
used the SSE2 version of denoise_dct when dct_algo was auto or mmx
(and it was therefore unused during FATE).
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
While cl.exe supports -guard:signret, armasm64 complains about
unknown flag. Note that -guard:ehcont is accepted by armasm64.
Fixes:
error A2029: unknown command-line argument or argument value -guard:signret
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
They are undefined behavior and UBSan warns about them
(in the checkasm test). Put the shifts in the constants
instead. This even gives a tiny speedup here:
Old benchmarks:
column_fidct_c: 3369.9 ( 1.00x)
column_fidct_sse2: 829.1 ( 4.06x)
New benchmarks:
column_fidct_c: 3304.2 ( 1.00x)
column_fidct_sse2: 827.9 ( 3.99x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Maybe uint64_t has been used as a poor man's alignment specifier?
Anyway, reading an uint64_t via an lvalue of type int16_t (as happens
in the C versions of the dsp functions) is undefined behavior.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It gains a lot because it has to operate on eight words;
it also saves 608B of .text here.
Old benchmarks:
column_fidct_c: 3365.7 ( 1.00x)
column_fidct_mmx: 1784.6 ( 1.89x)
New benchmarks:
column_fidct_c: 3361.5 ( 1.00x)
column_fidct_sse2: 801.1 ( 4.20x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This avoids some shift instructions and also gives us more headroom
in the registers. In fact, I have proven to myself that everything
that is supposed to fit into 16bits now actually does so.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It currently is not, because the shortcut mode uses different rounding
than the C code (as well as the non-shortcut code).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The x86 assembly uses the following pattern to zero all
the values with abs<threshold:
x -= threshold;
x satu+= threshold (unsigned saturated addition)
x += threshold
x satu-= threshold (unsigned saturated subtraction)
The reference C code meanwhile zeroed everything
with abs <= threshold. This commit makes the C code behave
like the x86 assembly to reduce discrepancies between the two.
An alternative would be to require SSSE3, so that
one can use pabsw, pcmpgtw for abs>threshold, followed by
a pand with the original data. Or one could modify the thresholds
to make both equal.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It is possible because the requirements are fulfilled;
it is also beneficial performance and code-size wise.
For GCC 14 (with -O3), this reduced codesize by 26750B
here; for Clang 20, it was 432B.
Old benchmarks:
mul_thrmat_c: 4.3 ( 1.00x)
mul_thrmat_sse2: 4.3 ( 1.00x)
store_slice_c: 2810.8 ( 1.00x)
store_slice_sse2: 542.5 ( 5.18x)
store_slice2_c: 3817.0 ( 1.00x)
store_slice2_sse2: 410.4 ( 9.30x)
New benchmarks:
mul_thrmat_c: 4.3 ( 1.00x)
mul_thrmat_sse2: 4.3 ( 1.00x)
store_slice_c: 1510.1 ( 1.00x)
store_slice_sse2: 545.2 ( 2.77x)
store_slice2_c: 1763.5 ( 1.00x)
store_slice2_sse2: 408.3 ( 4.32x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is obviously what is intended and what the MMX code does;
yet I cannot rule out that it changes the output for some inputs:
I have observed individual src values which would lead to temp
values just above 512 if they came in pairs (i.e. if both inputs
were simultaneously huge).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This fixes an ABI violation, as mul_thrmat did not issue emms.
It seems that this ABI violation could reach the user, namely
if ff_get_video_buffer() fails. Notice that ff_get_video_buffer()
itself could fail because of this, namely if the allocator uses
floating point registers.
On x64 (where GCC already used SSE2 in the C version)
mul_thrmat_c: 4.4 ( 1.00x)
mul_thrmat_mmx: 8.6 ( 0.52x)
mul_thrmat_sse2: 4.4 ( 1.00x)
On 32bit (where SSE2 is not known to be available):
mul_thrmat_c: 56.0 ( 1.00x)
mul_thrmat_sse2: 6.0 ( 9.40x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is in preparation for adding checkasm tests; without it,
checkasm would pull all of libavfilter in.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
I found that the default value is not set for onfail option. I see that there is an attempt to set this value by default inside parse_slave_failure_policy_option. But look at the CONSUME_OPTION macro. If av_dict_get cannot find this option, then this function is not even called.
The code loads 32bit so we can at maximum use 32bit
the return type is also changed to uint16_t (was requested in review),
no path is known where a return value above 32767 is produced, but that was not exhaustively checked
Fixes: runtime error: shift exponent -9 is negative
Fixes: 439483046/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_PRORES_RAW_DEC_fuzzer-6649466540326912
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
the values contain 3 4 bit values, thus using hex is more natural
and shows more information
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This flag does nothing since the deactivation of
the dsp_mask field of AVCodecContext in
commits 9ae6ba2883 and
9ae6ba2883 (it has
been superseded with better ways to override the CPU flags).
So deprecate it.
Reviewed-by: Lynne <dev@lynne.ee>
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This fixes building after commit
1ce88d29d0.
That commit caused the following errors:
src/doc/fate.texi:234: @anchor expected braces.
src/doc/fate.texi:245: @item found outside of an insertion block.
src/doc/fate.texi:249: @item found outside of an insertion block.
src/doc/fate.texi:261: @item found outside of an insertion block.
src/doc/fate.texi:265: @item found outside of an insertion block.
src/doc/fate.texi:268: @item found outside of an insertion block.
src/doc/fate.texi:274: @item found outside of an insertion block.
src/doc/fate.texi:277: @item found outside of an insertion block.
src/doc/fate.texi:281: @item found outside of an insertion block.
src/doc/fate.texi:287: Unmatched `@end'.
./src/doc/fate.texi:65: Cross reference to nonexistent node `makefile variables' (perhaps incorrect sectioning?).