We currently don't have any cases where this is needed, but include
it for completeness and clarity.
These macros for BTI were added in
08b4716a9e.
A later comment in this file, added in
248986a0db, referenced the macro
AARCH64_VALID_JUMP_CALL_TARGET which never was added here before.
Unit test covering av_video_enc_params_alloc,
av_video_enc_params_block, and
av_video_enc_params_create_side_data.
Tests allocation for all three codec types (VP9, H264, MPEG2) and
the NONE type, with 0 and 4 blocks, with and without size output.
Verifies block getter indexing by writing and reading back
coordinates, dimensions, and delta_qp values. Tests frame-level qp
and delta_qp fields, and side data creation with frame attachment.
Coverage for libavutil/video_enc_params.c: 0.00% -> 86.21%
(remaining uncovered lines are OOM error paths)
Signed-off-by: marcos ashton <marcosashiglesias@gmail.com>
Unit test covering av_detection_bbox_alloc, av_get_detection_bbox,
and av_detection_bbox_create_side_data.
Tests allocation with 0, 1, and 4 bounding boxes, with and without
size output. Verifies bbox getter indexing by writing and reading
back coordinates, labels, and confidence values. Tests classify
fields (labels and confidences), the header source field, and
side data creation with frame attachment.
Coverage for libavutil/detection_bbox.c: 0.00% -> 86.67%
(remaining uncovered lines are OOM error paths)
Signed-off-by: marcos ashton <marcosashiglesias@gmail.com>
Unit test covering all 4 public API functions in libavutil/spherical.c:
av_spherical_alloc, av_spherical_projection_name, av_spherical_from_name,
and av_spherical_tile_bounds.
Tests allocation with and without size output, all 7 projection type
name lookups, projection name round-trip verification, out-of-range
handling, and tile bounds computation for full-frame, quarter-tile,
and centered-tile configurations.
Coverage for libavutil/spherical.c: 0.00% -> 100.00%
Signed-off-by: marcos ashton <marcosashiglesias@gmail.com>
It is only needed in the unlikely codepath. The ordinary one
only uses six xmm registers.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Only the process functions are entered via an indirect _call_ from C.
The kernel functions and process_return are dispatched to by indirect
_branches_ instead (continuation-passing style design).
Make use of the recently added "jumpable" parameter to the function
macro in libavutil/aarch64/asm.S to fix these functions when BTI is
enabled.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The function macro emits AARCH64_VALID_CALL_TARGET for exported symbols,
marking them as valid destinations for indirect _calls_. Functions that
are reached by indirect _branches_ (i.e. tail-call dispatch chains
where the link register is not set) require AARCH64_VALID_JUMP_TARGET
instead.
This commit adds a "jumpable" parameter to the function macro that, when
set, emits AARCH64_VALID_JUMP_TARGET instead of AARCH64_VALID_CALL_TARGET.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
Using AMF interfaces in C can be cumbersome and visually difficult to process in some cases: i.e.: object->function(object, args). To improve code readability, a new macro is added. This commit is instrumental for future AMF integration refactoring.
-vf_vpp_amf.c: Remove unused variables.
-vf_amf_common.c: Fix hdrmeta_buffer memory leak.
-hwcontext_amf.c: Fix av_amf_extract_hdr_metadata not picking up light metadata if display mastering metadata is not set.
-doc/filters.texi: Remove irrelevant example with HDR metadata for vpp_amf.
The use of code section (.text) was forced by the unreleased NASM
3.02rc3 which made the issue worse, but preventing assambling anything
without code section, including when only data was present.
This works fine for the most part, but using code (.text) section with
IMAGE_COMDAT_SELECT_ANY causes issues with lib.exe after stripping such
object:
fatal error LNK1143: invalid or corrupt file: no symbol for COMDAT section 0x2
Esentially it makes our workaround not work in all cases, and while
string could be disabled like it already is for MSVC/ICL builds, it used
to work so let's preserve that state.
This make it not compatible with NASM 3.02rc3 when CV debug info is
generated, but hopefully the upstream fix will be merged before release,
to avoid this regression:
https://github.com/netwide-assembler/nasm/pull/221
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
Add NEON-optimized implementation for HEVC intra Planar prediction at
8-bit depth, supporting all block sizes (4x4 to 32x32).
Planar prediction implements bilinear interpolation using an incremental
base update: base_{y+1}[x] = base_y[x] - (top[x] - left[N]), reducing
per-row computation from 4 multiply-adds to 1 subtract + 1 multiply.
Uses rshrn for rounded narrowing shifts, eliminating manual rounding
bias. All left[y] values are broadcast in the NEON domain, avoiding
GP-to-NEON transfers.
4x4 interleaves row computations across 4 rows to break dependencies.
16x16 uses v19-v22 for persistent base/decrement vectors, avoiding
callee-saved register spills. 32x32 processes 8 rows per loop iteration
(4 iterations total) to reduce code size while maintaining full NEON
utilization.
Speedup over C on Apple M4 (checkasm --bench):
4x4: 2.25x 8x8: 6.40x 16x16: 9.72x 32x32: 3.21x
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
Add NEON-optimized implementation for HEVC intra DC prediction at 8-bit
depth, supporting all block sizes (4x4 to 32x32).
DC prediction computes the average of top and left reference samples
using uaddlv, with urshr for rounded division. For luma blocks smaller
than 32x32, edge smoothing is applied: the first row and column are
blended toward the reference using (ref[i] + 3*dc + 2) >> 2 computed
entirely in the NEON domain. Fill stores use pre-computed address
patterns to break dependency chains.
Also adds the aarch64 initialization framework (Makefile, pred.c/pred.h
hooks, hevcpred_init_aarch64.c).
Speedup over C on Apple M4 (checkasm --bench):
4x4: 2.28x 8x8: 3.14x 16x16: 3.29x 32x32: 3.02x
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
Add checkasm test for HEVC intra prediction covering DC, planar, and
angular modes at all block sizes (4x4 to 32x32) for 8-bit and 10-bit
depth.
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
In case of >8bpp, there is already a zero register available
(for clipping); in case of Unix64, one can simply use an
unused register. Doing so reduces codesize.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Avoids push+pop on Win64; in any case, using registers m0-m7
more often saves codesize.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Avoids push+pop on Win64; in any case, using registers m0-m7
more often saves codesize.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Also use a register in the 0-7 range as clobber reg,
as this reduces codesize (by 51B).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The height 8 and 16 cases differ from the second BDOF mini block onwards,
but even the beginning of said mini block is the same and can therefore
be deduplicated. This saves 821B here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
m8 here (corresponding to a mix of sgx2 and sgy2 in derive_bdof_vx_vy
in the C version) is always nonnegative, so the psignd boils down to
a check for m8 being zero. But if an entry of m8 is zero, then
the corresponding entry of m9 is automatically zero, too, as sgx2
being zero implies sgxdi being zero and sgy2 implies sgxgy, sgydi
being zero.* So just remove these redundant instructions.
*: In other words, one could remove the sgx2,sgy2>0 checks from
the end of derive_bdof_vx_vy() as long as av_log2(0) is defined.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This commit pieces together the previous few commits to implement the
NEON backend for sws_ops.
In essence, a tool which runs on the target (sws_ops_aarch64) is used
to enumerate all the functions that the backend needs to implement. The
list it generates is stored in the repository (ops_entries.c).
The list from above is used at build time by a code generator tool
(ops_asmgen) to implement all the sws_ops functions the NEON backend
supports, and generate a lookup function in C to retrieve the assembly
function pointers.
At runtime, the NEON backend fetches the function pointers to the
assembly functions and chains them together in a continuation-passing
style design, similar to the x86 backend.
The following speedup is observed from legacy swscale to NEON:
A520: Overall speedup=3.780x faster, min=0.137x max=91.928x
A720: Overall speedup=4.129x faster, min=0.234x max=92.424x
And the following from the C sws_ops implementation to NEON:
A520: Overall speedup=5.513x faster, min=0.927x max=14.169x
A720: Overall speedup=4.786x faster, min=0.585x max=20.157x
The slowdowns from legacy to NEON are the same for C/x86. Mostly low
bit-depth conversions that did not perform dithering in legacy.
The 0.585x outlier from C to NEON is gbrpf32le -> gbrapf32le, which is
mostly memcpy with the C implementation. All other conversions are
better.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The NEON sws_ops backend follows the same continuation-passing style
design as the x86 backend.
Unlike the C and x86 backends, which implement the various operation
functions through the use of templates and preprocessor macros, the
NEON backend uses a build-time code generator, which is introduced by
this commit.
This code generator has two modes of operation:
-ops:
Generates an assembly file in GNU assembler syntax targeting AArch64,
which implements all the sws_ops functions the NEON backend supports.
-lookup:
Generates a C function with a hierarchical condition chain that
returns the pointer to one of the functions generated above, based on
a given set of parameters derived from SwsOp.
This is the core of the NEON sws_ops backend.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The runtime assembler interface provides an instruction-level IR and
builder API tailored to the needs of the swscale dynamic pipeline.
It is not meant to be a general purpose assembler interface.
Currently only a static file backend, which emits GNU assembler text,
has been implemented. In the future, this interface will be used to
write functions dynamically at runtime.
This code will be compiled both for runtime usage to generate optimized
functions and for build-time usage to generate static assembly files.
Therefore, it must not depend on internal FFmpeg libraries.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The NEON sws_ops backend will use a build-time code generator for the
various operation functions it needs to implement. This build time code
generator (ops_asmgen) will need a list of the operations that must be
implemented. This commit adds a tool (sws_ops_aarch64) that generates
such a list (ops_entries.c).
The list is generated by iterating over all possible conversion
combinations and collecting the parameters for each NEON assembly
function that has to be implemented, defined by an unique set of
parameters derived from SwsOp. Whenever swscale evolves, with improved
optimization passes, new pixel formats, or improvements to the backend
itself, this file (ops_entries.c) should be regenerated by running:
$ make sws_ops_entries_aarch64
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
This is needed to cover the case when assembled source doesn't have
.text section. NASM documentation suggest to add $ suffix to section
name for COMDAT in .text, but this actually requires the main .text
section to exist also. And use less generic suffix for our dummy
sub-section.
Third time's the charm.
Fixes: 80cd067715
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
The existing fate-lavf-yuv420p.y4m covers only the default format.
Add four entries that pass -pix_fmt explicitly to the lavf_video
macro: yuv422p, yuv444p, yuv411p, and gray.
These exercise the branches in yuv4mpegpipe_write_header() that write
the "C422", "C444", "C411", and "Cmono" chroma descriptor strings in
the stream header. All four are gated on ENCDEC(RAWVIDEO,YUV4MPEGPIPE)
and added to FATE_LAVF_VIDEO_SCALE so they inherit the requirement for
CONFIG_SCALE_FILTER that lavf_video's -auto_conversion_filters needs.
Reference files were generated from the actual encoder output and
follow the md5+size+CRC format used by the other lavf references.
Signed-off-by: Soham Kute <officialsohamkute@gmail.com>
Add tests/api/api-enc-parser-test.c, a generic encoder+parser round-trip
test that takes codec_name, width, and height on the command line
(defaults: h261 176 144).
Three cases are tested:
garbage - a single av_parser_parse2() call on 8 bytes with no Picture
Start Code; verifies out_size == 0 so the parser emits no spurious data.
bulk - encodes 2 frames, concatenates the raw packets, feeds the whole
buffer to a fresh parser in one call, then flushes. Verifies that
exactly 2 non-empty frames come out and that the parser found the PSC
boundary between them.
split - the same buffer fed in two halves (chunk boundary falls inside
frame 0). Verifies the parser still emits exactly 2 frames when input
arrives incrementally, and that the collected bytes are identical to
the bulk output (checked with memcmp).
Implementation notes: avcodec_get_supported_config() selects the pixel
format; chroma height uses AV_CEIL_RSHIFT with log2_chroma_h from
AVPixFmtDescriptor; data[1] and data[2] are checked independently so
semi-planar formats work; the encoded buffer is given
AV_INPUT_BUFFER_PADDING_SIZE zero bytes at the end; parse_stream()
skips the fed chunk if consumed==0 to prevent an infinite loop.
Two FATE entries in tests/fate/api.mak: QCIF (176x144) and CIF
(352x288), both standard H.261 resolutions.
Signed-off-by: Soham Kute <officialsohamkute@gmail.com>
The original test only mapped the source file and printed its content,
exercising none of the error branches in av_file_map().
Replace it with a test that maps a real file (path via argv[1] for
out-of-tree builds) and verifies it is non-empty, then calls
av_file_map() on a nonexistent file twice: once with log_offset=0 to
confirm the error is logged at AV_LOG_ERROR, and once with log_offset=1
to confirm the level is raised by one, covering the
log_level_offset_offset path in av_vlog(). A custom av_log callback
captures the emitted level independently of the global log level.
The two error cases share a single for() loop to avoid duplication.
Add a FATE entry in tests/fate/libavutil.mak with CMP=null since
there is no fixed stdout to compare.
Signed-off-by: Soham Kute <officialsohamkute@gmail.com>
This is consistent pattern with other files. Also is needed for next
commit to always include x86util.asm
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
Fix the default value of mpegts_original_network_id from 0x0001 to
0xff01 to match the actual code (DVB_PRIVATE_NETWORK_START).
Add the missing hevc_digital_hdtv service type to the
mpegts_service_type option list.
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>