For GET_UTF8(val, GET_BYTE, ERROR), val has type of uint32_t,
GET_BYTE must return an unsigned integer, otherwise signed
extension happened due to val= (GET_BYTE), and GET_UTF8 went to
the error path.
This bug incidentally cancelled the bug where hb_buffer_add_utf8
was being called with incorrect argument, allowing drawtext to
function correctly on x86 and macOS ARM, which defined char as
signed. However, on Linux and Android ARM environments, because
char is unsigned by default, GET_UTF8 now returns the correct
return, which unexpectedly revealed issue #20906.
From the doc of HarfBuzz, what hb_buffer_add_utf8 needs is the
number of bytes, not Unicode character:
hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
Fix issue #20906.
av_frame_copy doesn't copy the input's PTS property, which resulted
in the frei0r filter always receiving the same static time.
Example that has a static distortion without patch:
ffmpeg -filter_complex "testsrc2=s=328x240:d=5,frei0r=distort0r" out.mp4
They are undefined behavior and UBSan warns about them
(in the checkasm test). Put the shifts in the constants
instead. This even gives a tiny speedup here:
Old benchmarks:
column_fidct_c: 3369.9 ( 1.00x)
column_fidct_sse2: 829.1 ( 4.06x)
New benchmarks:
column_fidct_c: 3304.2 ( 1.00x)
column_fidct_sse2: 827.9 ( 3.99x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Maybe uint64_t has been used as a poor man's alignment specifier?
Anyway, reading an uint64_t via an lvalue of type int16_t (as happens
in the C versions of the dsp functions) is undefined behavior.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It gains a lot because it has to operate on eight words;
it also saves 608B of .text here.
Old benchmarks:
column_fidct_c: 3365.7 ( 1.00x)
column_fidct_mmx: 1784.6 ( 1.89x)
New benchmarks:
column_fidct_c: 3361.5 ( 1.00x)
column_fidct_sse2: 801.1 ( 4.20x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This avoids some shift instructions and also gives us more headroom
in the registers. In fact, I have proven to myself that everything
that is supposed to fit into 16bits now actually does so.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It currently is not, because the shortcut mode uses different rounding
than the C code (as well as the non-shortcut code).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The x86 assembly uses the following pattern to zero all
the values with abs<threshold:
x -= threshold;
x satu+= threshold (unsigned saturated addition)
x += threshold
x satu-= threshold (unsigned saturated subtraction)
The reference C code meanwhile zeroed everything
with abs <= threshold. This commit makes the C code behave
like the x86 assembly to reduce discrepancies between the two.
An alternative would be to require SSSE3, so that
one can use pabsw, pcmpgtw for abs>threshold, followed by
a pand with the original data. Or one could modify the thresholds
to make both equal.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It is possible because the requirements are fulfilled;
it is also beneficial performance and code-size wise.
For GCC 14 (with -O3), this reduced codesize by 26750B
here; for Clang 20, it was 432B.
Old benchmarks:
mul_thrmat_c: 4.3 ( 1.00x)
mul_thrmat_sse2: 4.3 ( 1.00x)
store_slice_c: 2810.8 ( 1.00x)
store_slice_sse2: 542.5 ( 5.18x)
store_slice2_c: 3817.0 ( 1.00x)
store_slice2_sse2: 410.4 ( 9.30x)
New benchmarks:
mul_thrmat_c: 4.3 ( 1.00x)
mul_thrmat_sse2: 4.3 ( 1.00x)
store_slice_c: 1510.1 ( 1.00x)
store_slice_sse2: 545.2 ( 2.77x)
store_slice2_c: 1763.5 ( 1.00x)
store_slice2_sse2: 408.3 ( 4.32x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is obviously what is intended and what the MMX code does;
yet I cannot rule out that it changes the output for some inputs:
I have observed individual src values which would lead to temp
values just above 512 if they came in pairs (i.e. if both inputs
were simultaneously huge).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This fixes an ABI violation, as mul_thrmat did not issue emms.
It seems that this ABI violation could reach the user, namely
if ff_get_video_buffer() fails. Notice that ff_get_video_buffer()
itself could fail because of this, namely if the allocator uses
floating point registers.
On x64 (where GCC already used SSE2 in the C version)
mul_thrmat_c: 4.4 ( 1.00x)
mul_thrmat_mmx: 8.6 ( 0.52x)
mul_thrmat_sse2: 4.4 ( 1.00x)
On 32bit (where SSE2 is not known to be available):
mul_thrmat_c: 56.0 ( 1.00x)
mul_thrmat_sse2: 6.0 ( 9.40x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is in preparation for adding checkasm tests; without it,
checkasm would pull all of libavfilter in.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The frei0r API expects the time in seconds, but was given it in
milliseconds. The bug might exist since 41f1d3a (~14 years ago),
but plugins depending on the time are unwatchable without this
patch. For example:
ffmpeg -filter_complex "testsrc2=d=5,frei0r=distort0r" out.mp4
Signed-off-by: Stefan Breunig <stefan-ffmpeg-devel@breunig.xyz>
This is more informative than the current behavior, because when the first
MERGE() succeeds but the second fails, the original link already has
merged formats and thus the error message is confusing.
This is a regression introduced by the addition of the rotation option,
which overrode the existing rotation attribute that may have been set to
the image.
To fix it, add the rotation istead of setting it - however we have to do this
directly when mapping, so as to not add it multiple times.
Fixes: 4f623b4c59
The drawvg filter can draw vector graphics on top of a video, using libcairo. It
is enabled if FFmpeg is configured with `--enable-cairo`.
The language for drawvg scripts is documented in `doc/drawvg-reference.texi`.
There are two new tests:
- `fate-filter-drawvg-interpreter` launch a script with most commands, and
verify which libcairo functions are executed.
- `fate-filter-drawvg-video` render a very simple image, just to verify that
libcairo is working as expected.
Signed-off-by: Ayose <ayosec@gmail.com>
This is a major rewrite of the exising nlmeans vulkan code, with bug
fixes and major performance improvement.
Fix visual artifacts found in ticket #10661, #10733. Add OOB checks for
image loading and patch sized area around the border. Correct chroma
plane height, strength and buffer barrier index.
Improve parallelism with component workgroup axis and more but smaller
workgroups. Split weights pass into vertical/horizontal (integral) and
weights passes. Remove h/v order logic to always calculate sum on
vertical pass. Remove atomic float requirement, which causes high memory
locking contentions, at the cost of higher memory usage of w/s buffer.
Use cache blocking in h pass to reduce memory bandwidth usage.