Should make strict compilers happy.
Also, make AV_COPY128 use integer operations while at it. Removing the
inclusion of immintrin.h ensures a lot less intrinsic related headers are
included as well, which fixes a clash of defines with some Clang versions.
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: James Almer <jamrial@gmail.com>
av_executor_execute run the task directly when thread is disabled.
The task can schedule a new task by call av_executor_execute. This
forms an implicit recursive call. This patch removed the recursive
call.
Removed by accident in the previous commits. This makes the code only run when
compiled with GCC and Clang like before. Support for other compilers like msvc
can be added later.
Signed-off-by: James Almer <jamrial@gmail.com>
This has the benefit of removing any SSE -> AVX penalty that may happen when
the compiler emits VEX encoded instructions.
Signed-off-by: James Almer <jamrial@gmail.com>
When called inside a loop, the inline asm version results in one pxor
unnecessarely emitted per iteration, as the contents of the __asm__() block are
opaque to the compiler's instruction scheduler.
This is not the case with intrinsics, where pxor will be emitted once with any
half decent compiler.
This also has the benefit of removing any SSE -> AVX penalty that may happen
when the compiler emits VEX encoded instructions.
Signed-off-by: James Almer <jamrial@gmail.com>
Fixes: CID1591930 Wrong sizeof argument
Sponsored-by: Sovereign Tech Fund
Reviewed-by: Steve Lhomme <robux4@ycbcr.xyz>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: CID1591944 Wrong sizeof argument
Sponsored-by: Sovereign Tech Fund
Reviewed-by: Steve Lhomme <robux4@ycbcr.xyz>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: CID1598558 Resource leak
Sponsored-by: Sovereign Tech Fund
Reviewed-by: Steve Lhomme <robux4@ycbcr.xyz>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: CID1591909 Wrong sizeof argument
Sponsored-by: Sovereign Tech Fund
Reviewed-by: Steve Lhomme <robux4@ycbcr.xyz>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
In addition to the other properties, try to obtain the right
CGColorSpace and set it as well, else it could lead to a CVBuffer
tagged as BT.2020 but with a CGColorSpace indicating BT.709.
Therefore it is essential for consistency to set a colorspace
according to the other values, or if none can be obtained (for example
because the other values are all unspecified) unset it as well.
Fix#10884
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
The documentation was not clear at all what specifically the
function does, so it was left unspecified if it will unset or
not touch attachments it could not map from the AVFrame.
The documentation of the return value was wrong as well.
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
When mapping AVFrame properties to the CVBuffer attachments, it is
necessary to properly delete undefined attachments, else we can
leave incorrect values in there guessed from VideoToolbox for
example, leading to inconsistent results where the AVFrame and
CVBuffer differ in metadata.
Ref #10884
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
Given that a video stream/frame may have only one view or both views coded with
the packing information being unavailable, this commit adds a new type value
AV_STEREO3D_UNSPEC for this purpose.
The most common case for this is container level signaling of Stereo3D video
where the specifics are defined at the bitstream level.
Signed-off-by: James Almer <jamrial@gmail.com>
Before the patch, disable threads support at configure/build time
was the only method to force zero thread in executor. However,
it's common practice for libavcodec to run on caller's thread when
user specify thread number to one. And for WASM environment, whether
threads are supported needs to be detected at runtime. So executor
should support zero thread at runtime.
A single thread executor can be useful, e.g., to handle network
protocol. So we can't take thread_count one as zero thread, which
disabled a valid usercase.
Other libraries take -threads 0 to mean auto. Executor as a low
level utils doesn't do cpu detect. So take thread_count zero as
zero thread, literally.
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
This avoids hardcoding any implementation-specific limitiations as
part of the API, and allows for future expandability.
This also allows API users to more conveniently convert the
values into floats without hardcoding specific conversion constants.
The API was committed a few days ago, so changing this field now
is within the realms of acceptable.
On m1, kpc_get_counter_count(KPC_MASK) return 8 in my test. The
exact value doesn't matter in our case, as long as we have a
sufficiently large array
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
The default timer register pmccntr_el0 usually requires enabling
access with e.g. a kernel module (while it is accessible by
default on Windows). On Linux, the default for checkasm benchmarks
is to use perf (if suitable headers are available) though.
On macOS, using cntvct_el0 gives measurements with the same
magnitude as mach_absolute_time (which is used currently), but
possibly with a little less overhead/noise.
Signed-off-by: Martin Storsjö <martin@martin.st>
The vendor has long since switched to Arm, with the last product
reaching their official end-of-life over 11 years ago. Linux support for
the ISA was dropped 7 years ago. More importantly, this architecture was
never supported by upstream GCC, and the vendor fork is stuck at version
4.2, which FFmpeg no longer supports (as per C11 requirement).
Presumably, this is still the case given the lack of vendor support.
Indeed all of the code being removed here consisted of inline assembler
scalar optimisations. A sane C compiler should be able to perform those
automatically nowadays (with the sole exception of fast CLZ detection),
but this is moot as this architecture is evidently dead.
C code or compiler built-ins are preferable over inline assembler for
byte-swaps as it allows for better optimisations (e.g. instruction
scheduling) which would otherwise be impossible.
As with f64c2e710f for x86 and Arm,
this removes the inline assembler on GCC (and Clang) since we now
require recent enough compiler versions. This indeed seems to work on
AArch64, SuperH and, if Zbb is enabled, RISC-V. (AVR32 was not tested
since it has no known working compilers at this time.)
Since the C11 support is required, those GCC versions can no longer be
supported anyhow. (Clang pretends to be GCC 4.4, but it looks like the
code was intended for old GCC specifically.)
Since the C11 support is required, those GCC versions can no longer be
supported anyhow. (Clang pretends to be GCC 4.4, but the removed code
does not seem to have been intended for Clang.)
Otherwise nothing is written into the destination when a write mapping
is requested.
For example, a vulkan frame mapped from a drm frame (which is wrapped as
a vaapi frame in the example) is used as the output of scale_vulkan
filter, it always gets a green screen without this patch.
ffmpeg -init_hw_device vaapi=va -init_hw_device vulkan=vulkan@va
-filter_hw_device vulkan -f lavfi -i testsrc=size=352x288,format=nv12
-vf
"hwupload,scale_vulkan,hwmap=derive_device=vaapi:reverse=1,format=vaapi,hwdownload,format=nv12"
-f nut - | ffplay -
Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
This adds runtime support to use Zbb REV8 for 32- and 64-bit byte-wise
swaps. The result is about five times slower than if targetting Zbb
statically, but still a lot faster than the default bespoke C code or a
call to GCC run-time functions.
For 16-bit swap, this is however unsurprisingly a lot worse, and so this
sticks to the baseline. In fact, even using REV8 statically does not
seem to be beneficial in that case.
Zbb static Zbb dynamic I baseline
bswap16: 0.668184765 3.340764069 0.668029012
bswap32: 0.668174014 3.340763319 9.353855435
bswap64: 0.668221765 3.340496313 14.698672283
(seconds for 1 billion iterations on a SiFive-U74 core)
Due to hysterical raisins, most RISC-V Linux distributions target a
RV64GC baseline excluding the Bit-manipulation ISA extensions, most
notably:
- Zba: address generation extension and
- Zbb: basic bit manipulation extension.
Most CPUs that would make sense to run FFmpeg on support Zba and Zbb
(including the current FATE runner), so it makes sense to optimise for
them. In fact a large chunk of existing assembler optimisations relies
on Zba and/or Zbb.
Since we cannot patch shared library code, the next best thing is to
carry a flag initialised at load-time and check it on need basis.
This results in 3 instructions overhead on isolated use, e.g.:
1: AUIPC rd, %pcrel_hi(ff_rv_zbb_supported)
LBU rd, %pcrel_lo(1b)(rd)
BEQZ rd, non_Zbb_fallback_code
// Zbb code here
The C compiler will typically load the flag ahead of time to reducing
latency, and can also keep it around if Zbb is used multiple times in a
single optimisation scope. For this to work, the flag symbol must be
hidden; otherwise the optimisation degrades with a GOT look-up to
support interposition:
1: AUIPC rd, GOT_OFFSET_HI
LD rd, GOT_OFFSET_LO(rd)
LBU rd, (rd)
BEQZ rd, non_Zbb_fallback_code
// Zbb code here
This patch adds code to provision the flag in libraries using bit
manipulation functions from libavutil: byte-swap, bit-weight and
counting leading or trailing zeroes.