Do this by putting an AVBuffer structure into BufferPoolEntry and
reuse it for all subsequent uses of said BufferPoolEntry.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Microsoft VideoProcessor requires texture with D3DUSAGE_RENDERTARGET flag as output.
There is no way to allocate array of textures with D3D11_BIND_RENDER_TARGET flag
and .ArraySize > 2 by ID3D11Device_CreateTexture2D due to the Microsoft limitation.
Adding AVD3D11FrameDescriptors array to store array of single textures
instead of texture with multiple slices resolves this.
Signed-off-by: Artem Galin <artem.galin@intel.com>
UPD: Rebase of last patch set over current master and use DX9 as default device type.
Makes selection of dxva2/DX9 device type by default as before with explicit d3d11va/DX11 usage to cover more HW configurations.
Added warning message to expect changing default device type in the future.
Fixes TGL / AV1 decode as requires DX11 with explicit DX11 type
selection.
Add headless/multi adapter support and fixes:
https://trac.ffmpeg.org/ticket/7511https://trac.ffmpeg.org/ticket/6827http://ffmpeg.org/pipermail/ffmpeg-trac/2017-November/041901.htmlhttps://trac.ffmpeg.org/ticket/7933338fbcd5bbhttps://github.com/jellyfin/jellyfin/issues/2626#issuecomment-602153952
Any other fixes are welcome including OpenCL interop patch since I don't have proper setup to validate this use case
Decoding, encoding, transcoding have been validated.
child_device_type option is responsible for d3d11va/dxva2 device selection
Usage examples:
DirectX 11:
-init_hw_device qsv:hw,child_device_type=d3d11va
-init_hw_device qsv:hw,child_device_type=d3d11va,child_device=0
OR
-init_hw_device d3d11va=dx -init_hw_device qsv@dx
DirectX 9 is still supported but requires explicit selection:
-init_hw_device qsv:hw,child_device_type=dxva2
OR
-init_hw_device dxva2=dx -init_hw_device qsv@dx
Signed-off-by: Artem Galin <artem.galin@intel.com>
This enables usage of non-powered/headless GPU, better HDR support.
Pool of resources is allocated as one texture with array of slices.
Signed-off-by: Artem Galin <artem.galin@intel.com>
EINVAL is the wrong error code here, since the arguments passed to the
function are valid. The error is that the function is not implemented in
the build, which corresponds to ENOSYS.
From SMPTE RDD 5-2006, the grain seed is to be computed from the
following definition of `pic_offset`:
> When decoding H.264 | MPEG-4 AVC bitstreams, pic_offset is defined as
> follows:
> - pic_offset = PicOrderCnt(CurrPic) + (PicOrderCnt_offset << 5)
> where:
> - PicOrderCnt(CurrPic) is the picture order count of the current frame,
> which shall be derived from [the video stream].
>
> - PicOrderCnt_offset is set to idr_pic_id on IDR frames. idr_pic_id
> shall be read from the slice header of [the video stream]. On non-IDR I
> frames, PicOrderCnt_offset is set to 0. A frame shall be classified as I
> frame when all its slices are I slices, which may be optionally
> designated by setting primary_pic_type to 0 in the access delimiter NAL
> unit. Otherwise, PicOrderCnt_offset it not changed. PicOrderCnt_offset is
> updated in decoding order.
Co-authored-by: James Almer <jamrial@gmail.com>
Signed-off-by: Niklas Haas <git@haasn.dev>
Signed-off-by: James Almer <jamrial@gmail.com>
In particular, document that av_opt_copy() always disentangles
allocated options even on error; this guarantee is needed to e.g.
properly free duplicated thread contexts in libavcodec on error.
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The reason why the generic av_image_copy_uc_from() doesn't really
fit in the case for Vulkan is because some planes may be copied via
other methods (such as mapping GPU memory), and if they don't satisfy
the strict alignment requirements, a gpu image->gpu buffer->cpu ram
copy is performed.
We need this for hwcontext_vulkan, and I think this will also be
useful to API users like libplacebo who would rather not write
a custom SIMD memcpy.
Since 580e168a94, av_size_mult() is no
longer inlined; on systems where interposing is a thing, this also
inhibits the compiler from inlining said function into the internal
callers of said function, although inlining such a small function is
typically beneficial: With GCC 10.3 on Ubuntu x64 and -O3 this decreases
the size of av_realloc_array from 91B to 23B, from 129B to 81B for
av_realloc_f and from 77B to 23B for each of av_malloc_array,
av_mallocz_array and av_calloc.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It is also used by libavfilter and it is only natural to define it
alongside ff_dlog().
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Up until now, including error.h alone does not make the AVERROR_* defines
usable, because they just expand to something involving MKTAG, but
without the header providing MKTAG. So include macros.h, the header
providing MKTAG.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
common.h currently contains several things: Math macros, UTF-8 macros,
other fundamental macros; furthermore it also contains miscellaneous
math functions and it (directly and indirectly) includes lots of other
headers.
This commit moves the "other fundamental macros" to macros.h which is
a more fitting place.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Some function had exceed 30 inline assembly register oprands limiation
when using LOONGSON2 version of MMI macros. We can avoid that by take
$at, which is register reserved for assembler, as temporary register.
As none of instructions used in these macros is pseudo, it is safe to
utilize $at here.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Reviewed-by: Shiyou Yin <yinshiyou-hf@loongson.cn>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
{SAVE,RECOVER}_REG will be available for Loongson2 again,
also comment about the magic.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Reviewed-by: Shiyou Yin <yinshiyou-hf@loongson.cn>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
These have mostly been added because of FF_API_*; yet when these were
removed, removing the header has been forgotten.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
These inclusions are not necessary, as cpu.h is already included
wherever it is needed (via direct inclusion or via the arch-specific
headers).
Also remove other unnecessary cpu.h inclusions from ordinary
non-headers.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It is not used here at all; instead, add it where it is used without
including it or any of the arch-specific CPU headers.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Teach AV_HWDEVICE_TYPE_VIDEOTOOLBOX to be able to create AVFrames of type
AV_PIX_FMT_VIDEOTOOLBOX. This can be used to hwupload a regular AVFrame
into its CVPixelBuffer equivalent.
ffmpeg -init_hw_device videotoolbox -f lavfi -i color=black:640x480 -vf hwupload -c:v h264_videotoolbox -f null -y /dev/null
Signed-off-by: Aman Karmani <aman@tmm1.net>
The field is a standard field, yet we were loading it as if it was
a quadword. This worked for forward transforms by chance, but broke
when the transform was inverse.
checkasm couldn't catch that because we only test forward transforms,
which are identical to inverse transforms but with a different revtab.
Fixes: left shift of negative value -1
Fixes: 33736/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_SIREN_fuzzer-6657785795313664
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Clang is more strict on the type of asm operands, float or double
type variable should use constraint 'f', integer variable should
use constraint 'r'.
Signed-off-by: Jin Bo <jinbo@loongson.cn>
Reviewed-by: yinshiyou-hf@loongson.cn
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This mostly reverts 785bfb1d7b.
But I also added some clarifications so that nobody mixes primaries
with matrix again. SMPTE 240 and 170 primaires are the same, while
matrix coeff. are different, because 240 is derived from 170's new
primaries and white point while 170 uses BT.601 derived from BT.470
System M (yes, with Illuminant C) a.k.a. NTSC 1953. Some nits too.
Reviewed-by: Reto Kromer <lists@reto.ch>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
MSA2 optimizations are attached to MSA macros in generic_macros_msa.h.
It's difficult to do runtime check for them. Remove this part of code
can make it more robust. H264 1080p decoding: 5.13x==>5.12x.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
While Vulkan itself went more or less the way it was expected to go,
libvulkan didn't quite solve all of the opengl loader issues. It's multi-vendor,
yes, but unfortunately, the code is Google/Khronos QUALITY, so suffers from
big static linking issues (static linking on anything but OSX is unsupported),
has bugs, and due to the prefix system used, there are 3 or so ways to type out
functions.
Just solve all of those problems by dlopening it. We even have nice emulation
for it on Windows.
VkPhysicalDeviceLimits.optimalBufferCopyRowPitchAlignment and
VkPhysicalDeviceExternalMemoryHostPropertiesEXT.minImportedHostPointerAlignment are of type
VkDeviceSize (a typedef uint64_t).
VkPhysicalDeviceLimits.minMemoryMapAlignment is of type size_t.
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: Lynne <dev@lynne.ee>
Originally deprecated in 1296b1f6c0631ab79464e22d48a6a1548450b943;
scheduled again for removal in a991526832.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
This commit adds a pure x86 assembly SIMD version of the FFT in libavutil/tx.
The design of this pure assembly FFT is pretty unconventional.
On the lowest level, instead of splitting the complex numbers into
real and imaginary parts, we keep complex numbers together but split
them in terms of parity. This saves a number of shuffles in each transform,
but more importantly, it splits each transform into two independent
paths, which we process using separate registers in parallel.
This allows us to keep all units saturated and lets us use all available
registers to avoid dependencies.
Moreover, it allows us to double the granularity of our per-load permutation,
skipping many expensive lookups and allowing us to use just 4 loads per register,
rather than 8, or in case FMA3 (and by extension, AVX2), use the vgatherdpd
instruction, which is at least as fast as 4 separate loads on old hardware,
and quite a bit faster on modern CPUs).
Higher up, we go for a bottom-up construction of large transforms, foregoing
the traditional per-transform call-return recursion chains. Instead, we always
start at the bottom-most basis transform (in this case, a 32-point transform),
and continue constructing larger and larger transforms until we return to the
top-most transform.
This way, we only touch the stack 3 times per a complete target transform:
once for the 1/2 length transform and two times for the 1/4 length transform.
The combination algorithm we use is a standard Split-Radix algorithm,
as used in our C code. Although a version with less operations exists
(Steven G. Johnson and Matteo Frigo's "A modified split-radix FFT with fewer
arithmetic operations", IEEE Trans. Signal Process. 55 (1), 111–119 (2007),
which is the one FFTW uses), it only has 2% less operations and requires at least 4x
the binary code (due to it needing 4 different paths to do a single transform).
That version also has other issues which prevent it from being implemented
with SIMD code as efficiently, which makes it lose the marginal gains it offered,
and cannot be performed bottom-up, requiring many recursive call-return chains,
whose overhead adds up.
We go through a lot of effort to minimize load/stores by keeping as much in
registers in between construcring transforms. This saves us around 32 cycles,
on paper, but in reality a lot more due to load/store aliasing (a load from a
memory location cannot be issued while there's a store pending, and there are
only so many (2 for Zen 3) load/store units in a CPU).
Also, we interleave coefficients during the last stage to save on a store+load
per register.
Each of the smallest, basis transforms (4, 8 and 16-point in our case)
has been extremely optimized. Our 8-point transform is barely 20 instructions
in total, beating our old implementation 8-point transform by 1 instruction.
Our 2x8-point transform is 23 instructions, beating our old implementation by
6 instruction and needing 50% less cycles. Our 16-point transform's combination
code takes slightly more instructions than our old implementation, but makes up
for it by requiring a lot less arithmetic operations.
Overall, the transform was optimized for the timings of Zen 3, which at the
time of writing has the most IPC from all documented CPUs. Shuffles were
preferred over arithmetic operations due to their 1/0.5 latency/throughput.
On average, this code is 30% faster than our old libavcodec implementation.
It's able to trade blows with the previously-untouchable FFTW on small transforms,
and due to its tiny size and better prediction, outdoes FFTW on larger transforms
by 11% on the largest currently supported size.
This sadly required making changes to the code itself,
due to the same context needing to be reused for both versions.
The lookup table had to be duplicated for both versions.
This commit refactors the power-of-two FFT, making it faster and
halving the size of all tables, making the code much smaller on
all systems.
This removes the big/small pass split, because on modern systems
the "big" pass is always faster, and even on older machines there
is no measurable speed difference.
av_set_cpu_flags_mask() has been deprecated in the commit which merged
it: 6df42f98746be06c883ce683563e07c9a2af983f; av_parse_cpu_flags() has
been deprecated in 4b529edff8.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Some files currently rely on libavutil/cpu.h to include it for them;
yet said file won't use include it any more after the currently
deprecated functions are removed, so include attributes.h directly.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
av_frame_copy() is allowed to return values >= 0 on success, whereas
the documentation of av_frame_ref() states that the return value is 0 on
success. Ergo the latter must not just return the former's return value.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
av_cpu_count() intends to emit a debug message containing the number of
logical cores when called the first time. The check currently works with
a static volatile int; yet this does not help at all in case of
concurrent accesses by multiple threads. So replace this with an
atomic_int.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>