This is slower than the Zbb version on real hardware due to register
strides. Proper support for vector byte-swap requires the Zvbb
extension, but it's much too early for me to worry about it.
Including winsock2.h or windows.h without WIN32_LEAN_AND_MEAN cause
bzlib.h to parse as nonsense, due to an instance of #define char small
in rpcndr.h.
See:
https://stackoverflow.com/a/27794577
Signed-off-by: L. E. Segovia <amy@amyspark.me>
Signed-off-by: Martin Storsjö <martin@martin.st>
The code was blindly assuming that Zbb or V implied Zba. While the
earlier is practically always true, the later broke some QEMU setups,
as V was introduced earlier than Zba.
Add missing operand which clang complains about but GCC assumes it to be
'm1' if not specified.
Works around build failure with Clang:
| src/libswscale/riscv/rgb2rgb_rvv.S:88:25: error: operand must be e[8|16|32|64|128|256|512|1024],m[1|2|4|8|f2|f4|f8],[ta|tu],[ma|mu]
| vsetvli t4, t3, e8, ta, ma
| ^
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
From x86inc:
> On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either
> a branch or a branch target. So switch to a 2-byte form of ret in that case.
> We can automatically detect "follows a branch", but not a branch target.
> (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)
x86inc can automatically determine whether to use REP_RET rather than
REP in most of these cases, so impact is minimal. Additionally, a few
REP_RETs were used unnecessary, despite the return being nowhere near a
branch.
The only CPUs affected were AMD K10s, made between 2007 and 2011, 16
years ago and 12 years ago, respectively.
In the future, everyone involved with x86inc should consider dropping
REP_RETs altogether.
Currently, it is done once per slice-thread, leading to
one warning per slice-thread in case a YUVJ pixel format
has been originally used.
This also fixes the anomaly that said parameter are only
updated for the user-facing context (whose values are retrievable
via av_opt_get()) if slice-threading is not in use.
Fixes ticket #9860.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Initializing slice threads currently uses the function
(sws_init_context()) that is also used for initializing
user-facing contexts with the only difference being that
nb_threads is set to one before initializing the slice contexts.
Yet sws_init_context() also initializes lots of stuff
that is not slice-dependent, i.e. (src|dst)Range. This
currently only works because the code sets these fields
to the same values for all slice contexts. This is not
nice; even worse, it entails that log messages are printed
once per slice context (and therefore fill the screen).
This commit lays the groundwork to fix this.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Same principle as previous commit, with sufficiently huge rgb2yuv table
values this produces wrong results and undefined behavior.
The unsigned produces the same incorrect results. That is probably
ok as these cases with huge values seem not to occur in any real
use case.
Fixes: signed integer overflow
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Large rgb2yuv tables and high pixel values cause the intermediate
int32_t of ru*r + gu*g + bu*b to exceed INT_MAX, which is undefined
behavior. This causes libswscale built with LLVM -fsanitize=undefined to
assert. Using unsigned integers instead has defined behavior and
produces identical results, and makes rgb64ToUV_c_template match
rgb64ToY_c_template.
Fixes: signed integer overflow
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Provide arm64 neon optimized implementations for hscale16To19 with
filter sizes 4, 8 and X4.
The tests and benchmarks run on AWS Graviton 2 instances.
The results from a checkasm tool are shown below.
hscale_16_to_19__fs_4_dstW_512_c: 6216.0
hscale_16_to_19__fs_4_dstW_512_neon: 2257.0
hscale_16_to_19__fs_8_dstW_512_c: 10417.7
hscale_16_to_19__fs_8_dstW_512_neon: 3112.5
hscale_16_to_19__fs_12_dstW_512_c: 14890.5
hscale_16_to_19__fs_12_dstW_512_neon: 3899.0
hscale_16_to_19__fs_16_dstW_512_c: 19006.5
hscale_16_to_19__fs_16_dstW_512_neon: 5341.2
hscale_16_to_19__fs_32_dstW_512_c: 36629.5
hscale_16_to_19__fs_32_dstW_512_neon: 9502.7
hscale_16_to_19__fs_40_dstW_512_c: 45477.5
hscale_16_to_19__fs_40_dstW_512_neon: 11552.0
(Note, the checkasm tests for these functions haven't been
merged since they fail on x86.)
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Add arm64 neon implementations for hscale 16 to 15 with filter
sizes 4, 8 and X4.
The tests and benchmarks run on AWS Graviton 2 instances.
The results from a checkasm tool are shown below.
hscale_16_to_15__fs_4_dstW_512_c: 6703.5
hscale_16_to_15__fs_4_dstW_512_neon: 2298.0
hscale_16_to_15__fs_8_dstW_512_c: 10983.0
hscale_16_to_15__fs_8_dstW_512_neon: 3216.5
hscale_16_to_15__fs_12_dstW_512_c: 15526.0
hscale_16_to_15__fs_12_dstW_512_neon: 3993.0
hscale_16_to_15__fs_16_dstW_512_c: 20183.5
hscale_16_to_15__fs_16_dstW_512_neon: 5369.7
hscale_16_to_15__fs_32_dstW_512_c: 39315.2
hscale_16_to_15__fs_32_dstW_512_neon: 9511.2
hscale_16_to_15__fs_40_dstW_512_c: 48995.7
hscale_16_to_15__fs_40_dstW_512_neon: 11570.0
(Note, the checkasm tests for these functions haven't been
merged since they fail on x86.)
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Add arm64 neon implementations for hscale 8 to 19 with filter
sizes 4, 4X and 8. Both implementations are based on very similar ones
dedicated to hscale 8 to 15. The major changes refer to saving
the data - instead of writing the result as int16_t it is done
with int32_t.
These functions are heavily inspired on patches provided by J. Swinney
and M. Storsjö for hscale8to15 which were slightly adapted for
hscale8to19.
The tests and benchmarks run on AWS Graviton 2 instances. The results
from a checkasm tool shown below.
hscale_8_to_19__fs_4_dstW_512_c: 5663.2
hscale_8_to_19__fs_4_dstW_512_neon: 1259.7
hscale_8_to_19__fs_8_dstW_512_c: 9306.0
hscale_8_to_19__fs_8_dstW_512_neon: 2020.2
hscale_8_to_19__fs_12_dstW_512_c: 12932.7
hscale_8_to_19__fs_12_dstW_512_neon: 2462.5
hscale_8_to_19__fs_16_dstW_512_c: 16844.2
hscale_8_to_19__fs_16_dstW_512_neon: 4671.2
hscale_8_to_19__fs_32_dstW_512_c: 32803.7
hscale_8_to_19__fs_32_dstW_512_neon: 5474.2
hscale_8_to_19__fs_40_dstW_512_c: 40948.0
hscale_8_to_19__fs_40_dstW_512_neon: 6669.7
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Treat the 32 bit stride registers as signed.
Alternatively, we could make the stride arguments ptrdiff_t instead
of int, and changing all of the assembly to operate on these
registers with their full 64 bit width, but that's a much larger
and more intrusive change (and risks missing some operation, which
would clamp the intermediates to 32 bit still).
Fixes: https://trac.ffmpeg.org/ticket/9985
Signed-off-by: Martin Storsjö <martin@martin.st>
The intention here was probably to document this as use of
conditionals does not make sense in a comment.
Fixes doxy warning:
warning: explicit link request to 'if' could not be resolved
Bayer sources are read in groups of 2 lines (e.g. for a
BGGR flavor, the first row contains only B and G samples,
while the second row contains only G and R samples). They
need to be read as a whole.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
This is currently 64-bit only because the stack spilling code would not
assemble on RV32I (and it would corrupt s0 and s1 on RV128I, in theory).
This could be added later in the unlikely that someone wants it.
Up until now, libswscale/output.c used a macro to write
an output pixel which involved a call to av_pix_fmt_desc_get()
to find out whether the input pixel format is BE or LE
despite this being known at compile-time (there are templates
per pixfmt). Even worse, these calls are made in a loop,
so that e.g. there are eight calls to av_pix_fmt_desc_get()
for every pixel processed in yuv2rgba64_X_c_template()
for 64bit RGB formats.
This commit modifies these macros to ensure that isBE()
is evaluated at compile-time. This saved 41184B of .text
for me (GCC 11.2, -O3). Of course, it also improved performance.
E.g. ffmpeg_g -f lavfi -i testsrc2,format=yuva420p -pix_fmt rgba64le \
-threads 1 -t 1:00 -f null - (which uses yuv2rgba64le_X_c,
which is an invocation of yuv2rgba64_X_c_template() mentioned above),
performance improved from 95589 to 41387 decicycles for one call
to yuv2packedX; for the be variant the numbers went down from
76087 to 43024 decicycles.
Reviewed-by: Anton Khirnov <anton@khirnov.net>
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Up until now, libswscale/input.c used a macro to read
an input pixel which involved a call to av_pix_fmt_desc_get()
to find out whether the input pixel format is BE or LE
despite this being known at compile-time (there are templates
per pixfmt). Even worse, these calls are made in a loop,
so that e.g. there are six calls to av_pix_fmt_desc_get()
for every pair of UV pixel processed in
rgb64ToUV_half_c_template().
This commit modifies these macros to ensure that isBE()
is evaluated at compile-time. This saved 9743B of .text
for me (GCC 11.2, -O3). For a simple RGB64LE->YUV420P
transformation like
ffmpeg -f lavfi -i haldclutsrc,format=rgba64le -pix_fmt yuv420p \
-threads 1 -t 1:00 -f null -
the amount of decicycles spent in rgb64LEToUV_half_c
(which is created via the template mentioned above)
decreases from 19751 to 5341; for RGBA64BE the number
went down from 11945 to 5393. For shared builds (where
the call to av_pix_fmt_desc_get() is indirect) the old numbers
are 15230 for RGBA64BE and 27502 for RGBA64LE, whereas
the numbers with this patch are indistinguishable from
the numbers from a static build.
Also make the macros that are touched conform to the
usual convention of using uppercase names while just at it.
Reviewed-by: Anton Khirnov <anton@khirnov.net>
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
These macros are definitions, not only declarations and therefore
should not contain a semicolon. Such a semicolon is actually
spec-incompliant, but compilers happen to accept them.
Reviewed-by: Philip Langdale <philipl@overt.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
As we already have support for VUYA, I figured I should do the small
amount of work to support VUYX as well. That means a little refactoring
to share code.
Fixes FATE-failures with the the filter-2xbr filter-3xbr filter-4xbr
filter-ep2x filter-ep3x filter-hq2x filter-hq3x filter-hq4x
filter-paletteuse-bayer filter-paletteuse-bayer0
filter-paletteuse-nodither and filter-paletteuse-sierra2_4a tests
when using 32bit x86 with CPUFLAGS ranging from "mmx+mmxext" to
"mmx+mmxext+sse+sse2+sse3" (the relevant function is only overwritten
when using SSSE3).
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is by no means perfect, since at least ddagrab will return scRGB
data with values outside of 0.0f to 1.0f for HDR values.
Its primary purpose is to be able to work with the format at all.
This commit adds new code paths for vscale when filterSize is 2, 4, or
8. By using specialized code with unrolling to match the filterSize we
can improve performance.
On AWS c7g (Graviton 3, Neoverse V1) instances:
before after
yuv2yuvX_2_0_512_accurate_neon: 558.8 268.9
yuv2yuvX_4_0_512_accurate_neon: 637.5 434.9
yuv2yuvX_8_0_512_accurate_neon: 1144.8 806.2
yuv2yuvX_16_0_512_accurate_neon: 2080.5 1853.7
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Use scalar times vector multiply accumlate instructions instead of
vector times vector to remove the need for replicating load instructions
which are slightly slower.
On AWS c7g (Graviton 3, Neoverse V1) instances:
yuv2yuvX_8_0_512_accurate_neon: 1144.8 987.4
yuv2yuvX_16_0_512_accurate_neon: 2080.5 1869.4
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Change the reference to exactly match the C reference in swscale,
instead of exactly matching the x86 SIMD implementations (which
differs slightly). Test with and without SWS_ACCURATE_RND - if this
flag isn't set, the output must match the C reference exactly,
otherwise it is allowed to be off by 2.
Mark a couple x86 functions as unavailable when SWS_ACCURATE_RND
is set - apparently this discrepancy hasn't been noticed in other
exact tests before.
Add a test for yuv2plane1.
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
This specialization handles the case where filtersize is 4 mod 8, e.g.
12, 20, etc. Aarch64 was previously using the c function for this case.
This implementation speeds up that case significantly.
hscale_8_to_15__fs_12_dstW_512_c: 6234.1
hscale_8_to_15__fs_12_dstW_512_neon: 1505.6
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
x64 always has MMX, MMXEXT, SSE and SSE2 and this means
that some functions for MMX, MMXEXT, SSE and 3dnow are always
overridden by other functions (unless one e.g. explicitly
disables SSE2). So given that the only systems that
benefit from these functions are truely ancient 32bit x86s
they are removed.
Moreover, some of the removed code was buggy/not bitexact
and lead to failures involving the f32le and f32be versions of
gray, gbrp and gbrap on x86-32 when SSE2 was not disabled.
See e.g.
https://fate.ffmpeg.org/report.cgi?time=20220609221253&slot=x86_32-debian-kfreebsd-gcc-4.4-cpuflags-mmx
Notice that yuv2yuvX_mmx is not removed, because it is used
by SSE3 and AVX2 as fallback in case of unaligned data and
also for tail processing. I don't know why yuv2yuvX_mmxext
isn't being used for this; an earlier version [1] of
554c2bc708 used it, but
the version that was eventually applied does not.
[1]: https://ffmpeg.org/pipermail/ffmpeg-devel/2020-November/272124.html
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
x64 always has MMX, MMXEXT, SSE and SSE2 and this means
that some functions for MMX, MMXEXT and 3dnow are always
overridden by other functions (unless one e.g. explicitly
disables SSE2) for x64. So given that the only systems that
benefit from these functions are truely ancient 32bit x86s
they are removed.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
x64 always has MMX, MMXEXT, SSE and SSE2 and this means
that some functions for MMX, MMXEXT and 3dnow are always
overridden by other functions (unless one e.g. explicitly
disables SSE2) for x64. So given that the only systems that
benefit from these functions are truely ancient 32bit x86s
they are removed.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is more spec-compliant because it does not rely
on dead-code elimination by the compiler. Especially
MSVC has problems with this, as can be seen in
https://ffmpeg.org/pipermail/ffmpeg-devel/2022-May/296373.html
or
https://ffmpeg.org/pipermail/ffmpeg-devel/2022-May/297022.html
This commit does not eliminate every instance where we rely
on dead code elimination: It only tackles branching to
the initialization of arch-specific dsp code, not e.g. all
uses of CONFIG_ and HAVE_ checks. But maybe it is already
enough to compile FFmpeg with MSVC with whole-programm-optimizations
enabled (if one does not disable too many components).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Y, U, V data is loaded at the end of the current iteration for the next
iteration.
It results in memory access past the frame data on the last iteration
(that data is never used after the loading).
So load data at the start of the iteration, so that only useful data is
loaded.
Signed-off-by: Vardan Margaryan <v.t.margaryan@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
This patch adds code to support specializations of the hscale function
and adds a specialization for filterSize == 4.
ff_hscale8to15_4_neon is a complete rewrite. Since the main bottleneck
here is loading the data from src, this data is loaded a whole block
ahead and stored back to the stack to be loaded again with ld4. This
arranges the data for most efficient use of the vector instructions and
removes the need for completion adds at the end. The number of
iterations of the C per iteration of the assembly is increased from 4 to
8, but because of the prefetching, there must be a special section
without prefetching when dstW < 16.
This improves speed on Graviton 2 (Neoverse N1) dramatically in the case
where previously fs=8 would have been required.
before: hscale_8_to_15__fs_8_dstW_512_neon: 1962.8
after : hscale_8_to_15__fs_4_dstW_512_neon: 1220.9
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
This avoids having to rebuild big files every time FFMPEG_VERSION
changes (which it does with every commit).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This avoids unnecessary churn and build breakage for users, by
making sure the whole version.h is included like it has been so far,
while keeping the benefit of not needing to rebuild most files in
the ffmpeg tree on minor/micro bumps.
Signed-off-by: Martin Storsjö <martin@martin.st>
Also bump the minor versions of all libraries, to signify the
API change of splitting the version.h headers and adding the
new version_major.h header.
Signed-off-by: Martin Storsjö <martin@martin.st>
The range parameters need to be set up before calling
sws_init_context (which selects which fastpaths can be used;
this gets called by sws_getContext); solely passing them via
sws_setColorspaceDetails isn't enough.
This fixes producing full range YUV range output when doing
YUV->YUV conversions between different YUV color spaces.
Signed-off-by: Martin Storsjö <martin@martin.st>
Some of these were made possible by moving several common macros to
libavutil/macros.h.
While just at it, also improve the other headers a bit.
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Inside a function an unnecessary ';' is just a null statement;
yet outside of it it is actually illegal (but compilers happen
to accept it without warning except when using -pedantic).
So modify the macros to always expect the user to add a ';'.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is done a second time for 5.0 because master was
merged into 5.0 so that it contains the recent DOVI additions.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
In case of shared builds, some object files containing tables
are currently duplicated into other libraries: log2_tab.c,
golomb.c, reverse.c. The check for whether this is duplicated
is simply whether CONFIG_SHARED is true. Yet this is crude:
E.g. libavdevice includes reverse.c for shared builds, but only
needs it for the decklink input device, which given that decklink
is not enabled by default will be unused in most libavdevice.so.
This commit changes this by making it more explicit about what
to duplicate from other libraries. To do this, two new Makefile
variables were added: SHLIBOBJS and STLIBOBJS. SHLIBOBJS contains
the objects that are duplicated from other libraries in case of
shared builds; STLIBOBJS contains stuff that a library has to
provide for other libraries in case of static builds. These new
variables provide a way to enable/disable with a finer granularity
than just whether shared builds are enabled or not. E.g. lavd's
Makefile now contains: SHLIBOBJS-$(CONFIG_DECKLINK_INDEV) += reverse.o
Another example is provided by the golomb tables. These are provided
by lavc for static builds, even if one uses a build configuration
that makes only lavf use them. Therefore lavc's Makefile contains
STLIBOBJS-$(CONFIG_MXF_MUXER) += golomb.o, whereas lavf's Makefile
has a corresponding SHLIBOBJS-$(CONFIG_MXF_MUXER) += golomb_tab.o.
E.g. in case the MXF muxer is the only component needing these tables
only libavformat.so will contain them for shared builds; currently
libavcodec.so does so, too.
(There is currently a CONFIG_EXTRA group for golomb. But actually
one would need two groups (golomb_avcodec and golomb_avformat) in
order to know when and where to include these tables. Therefore
this commit uses a Makefile-based approach for this and stops
using these groups for the users in libavformat.)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Fixes so that fate under 64 bit Windows passes.
These functions replace all ff_hscale8to15_*_ssse3 when avx2 is available.
Signed-off-by: James Almer <jamrial@gmail.com>
This is ment to be a cosmetic change
old timings:
42780 UNITS in grayf32le, 1 runs, 0 skips
56720 UNITS in grayf32le, 2 runs, 0 skips
67265 UNITS in grayf32le, 4 runs, 0 skips
58082 UNITS in grayf32le, 8 runs, 0 skips
63512 UNITS in grayf32le, 16 runs, 0 skips
52720 UNITS in grayf32le, 32 runs, 0 skips
46491 UNITS in grayf32le, 64 runs, 0 skips
68500 UNITS in grayf32be, 1 runs, 0 skips
66930 UNITS in grayf32be, 2 runs, 0 skips
62305 UNITS in grayf32be, 4 runs, 0 skips
55510 UNITS in grayf32be, 8 runs, 0 skips
50216 UNITS in grayf32be, 16 runs, 0 skips
44480 UNITS in grayf32be, 32 runs, 0 skips
42394 UNITS in grayf32be, 64 runs, 0 skips
new timings:
46660 UNITS in grayf32le, 1 runs, 0 skips
51830 UNITS in grayf32le, 2 runs, 0 skips
53390 UNITS in grayf32le, 4 runs, 0 skips
50910 UNITS in grayf32le, 8 runs, 0 skips
44968 UNITS in grayf32le, 16 runs, 0 skips
40349 UNITS in grayf32le, 32 runs, 0 skips
38330 UNITS in grayf32le, 64 runs, 0 skips
39980 UNITS in grayf32be, 1 runs, 0 skips
49630 UNITS in grayf32be, 2 runs, 0 skips
53540 UNITS in grayf32be, 4 runs, 0 skips
59767 UNITS in grayf32be, 8 runs, 0 skips
51206 UNITS in grayf32be, 16 runs, 0 skips
44743 UNITS in grayf32be, 32 runs, 0 skips
41468 UNITS in grayf32be, 64 runs, 0 skips
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This makes output consistent with a similar warning just few
lines above where this flag is checked in the same way.
Signed-off-by: softworkz <softworkz@hotmail.com>
Signed-off-by: Marton Balint <cus@passwd.hu>
Mixing unsigned and signed often leads to unexpected arithmetic results.
Fixes: out of array write
Found-by: Paul B Mahol <onemda@gmail.com>
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This resolves a problem where conversions from YUV to X2RGB10LE
would produce color values a factor 4 too small, because an 8-bit
value was placed in a 10-bit channel.
Signed-off-by: Manuel Stoeckl <code@mstoeckl.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
SSE2 is x86 specific, yet due to the call to av_get_cpu_flags()
compilers were unable to optimize the checks (and the call) away
on other arches.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
In this case the current code tries to warn once; to do so, it uses
ordinary static ints to store whether the warning has already been
emitted. This is both a data race (and therefore undefined behaviour)
as well as a race condition, because it is really possible for multiple
threads to be the one thread to emit the warning. This is actually
common since the introduction of the new multithreaded scaling API.
This commit fixes this by using atomic integers for the state;
furthermore, these are not static anymore, but rather contained
in the user-facing SwsContext (i.e. the parent SwsContext in case
of slice-threading).
Given that these atomic variables are not intended for synchronization
at all (but only for atomicity, i.e. only to output the warning once),
the atomic operations use memory_order_relaxed.
This affected the nv12, nv21, yuv420, yuv420p10, yuv422, yuv422p10 and
yuv444 filter-overlay FATE-tests.
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This allows to associate log messages from slice contexts to
the user-visible SwsContext.
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
These inclusions are not necessary, as cpu.h is already included
wherever it is needed (via direct inclusion or via the arch-specific
headers).
Also remove other unnecessary cpu.h inclusions from ordinary
non-headers.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Dither none is only implemented in full chroma interpolation for these rgb formats
Its also a obscure choice (producing less nice images) that implementing it in the
other code-paths makes no sense
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Call the scaler function directly rather than through a function
pointer. Drop the now-unused return value from ff_getSwsFunc() and
rename the function to reflect its new role.
This will be useful in the following commits, where it will become
important that the amount of output is different for scaled vs unscaled
case.
Call ff_sws_rgb2rgb_init via ff_thread_once instead of checking one of the
variables it updates.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Some files currently rely on libavutil/cpu.h to include it for them;
yet said file won't use include it any more after the currently
deprecated functions are removed, so include attributes.h directly.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The last user of g15Mask, r15Mask, g16Mask and r16Mask was disabled
in 77a416e8aa and finally removed in
36e8de07ed62609df45d064b56501e3084d25723; b15Mask and b16Mask were
apparently always unused (except for in_asm_used_var_warning_killer,
a function that only existed to make the compiler not optimize ASM
constants away).
w10 is unused since d604bab901, w02
since ef423a6618.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
mask24hh etc. are unused since f099fbf5f3,
mask32b and mask32r since 296609f859,
mask32g since b38d487466 and mask32 since
f8a138be52.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
Add inline function for vec_xl if VSX is not supported. vec_xl intrinsic
is only available on POWER 7 or higher.
Fixes ticket #8750.
Signed-off-by: Andriy Gelman <andriy.gelman@gmail.com>
Fixes fate-qtrle-32bit on big-endian.
The macro does a simple byte swap on uint8 array without any casts, so
it's valid on big-endian arches.
The mentioned test was failing because the byteswap function
shuffle_bytes_3210_c() is used in the pixel format conversion
(argb->bgra).
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andriy Gelman <andriy.gelman@gmail.com>
Fixes vf_scale outputting RGB AVFrames with limited range flagged
in case either input or output specifically sets the range.
This is the reverse of the logic utilized for RGB and PAL8 content
in sws_setColorspaceDetails.
These conversion appears to be exhibiting the same rounding error as the rgbf32 formats where.
I seperated the rounding value from the 16 and 128 offsets, I think it makes it a little more clear.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
changes since v1:
- made into fate test
- fixed c90 warnings
- tests more intermediate formats
- tested on BE mips too
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>