The default timer register pmccntr_el0 usually requires enabling
access with e.g. a kernel module (while it is accessible by
default on Windows). On Linux, the default for checkasm benchmarks
is to use perf (if suitable headers are available) though.
On macOS, using cntvct_el0 gives measurements with the same
magnitude as mach_absolute_time (which is used currently), but
possibly with a little less overhead/noise.
Signed-off-by: Martin Storsjö <martin@martin.st>
C code or compiler built-ins are preferable over inline assembler for
byte-swaps as it allows for better optimisations (e.g. instruction
scheduling) which would otherwise be impossible.
As with f64c2e710f for x86 and Arm,
this removes the inline assembler on GCC (and Clang) since we now
require recent enough compiler versions. This indeed seems to work on
AArch64, SuperH and, if Zbb is enabled, RISC-V. (AVR32 was not tested
since it has no known working compilers at this time.)
It will fallback to mach_absolute_time inside libavutil/timer.h
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
This makes the code much simpler (especially for adding support
for other instruction set extensions), avoids needing inline
assembly for this feature, and generally is more of the canonical
way to do this.
The CPU feature detection was added in
493fcde50a, using HWCAP_CPUID.
The argument for using that, was that HWCAP_CPUID was added much
earlier in the kernel (in Linux v4.11), while the HWCAP flags for
individual features always come later. This allows detecting support
for new CPU extensions before the kernel exposes information about
them via hwcap flags.
However in practice, there's probably quite little advantage in this.
E.g. HWCAP2_I8MM was added in Linux v5.10 - long after HWCAP_CPUID,
but there's probably very little practical cases where one would
run a kernel older than that on a CPU that supports those instructions.
Additionally, we provide our own definitions of the flag values to
check (as they are fixed constants anyway), with names not conflicting
with the ones from system headers. This reduces the number of ifdefs
needed, and allows detecting those features even if building with
userland headers that are lacking the definitions of those flags.
Also, slightly older versions of QEMU, e.g. 6.2 in Ubuntu 22.04,
do expose support for these features via HWCAP flags, but the
emulated cpuid registers are missing the bits for exposing e.g. I8MM.
(This issue is fixed in later versions of QEMU though.)
Signed-off-by: Martin Storsjö <martin@martin.st>
This eases actual development of the assembly functions, by only
allowing extension instructions within the sections that explicitly
enable them, instead of having all extensions enabled everywhere.
Signed-off-by: Martin Storsjö <martin@martin.st>
Favour left aligned columns over right aligned columns.
In principle either style should be ok, but some of the cases
easily lead to incorrect indentation in the surrounding code (see
a couple of cases fixed up in the preceding patch), and show up in
automatic indentation correction attempts.
Signed-off-by: Martin Storsjö <martin@martin.st>
Some functions have slightly different indentation styles; try
to match the surrounding code.
libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally
uses a layered indentation style to visually show how different
unrolled/interleaved phases fit together.
Signed-off-by: Martin Storsjö <martin@martin.st>
This is not actually used for anything. The configure check causes the
CPU feature flag to be set, but nothing consumes it at all.
While AArch64 does have VFP, it is only used for the scalar C code.
Conversely, it is still possible to disable VFP, by changing the
C compiler flags as before (though that only makes sense for an
hypothetical non-standard Armv8 platform without VFP).
Note that this retains the "vfp" option flag, for backward
compatibility and on the very remote but theoretically possible chance
that FFmpeg actually makes use of it in the future.
AV_CPU_FLAG_VFP is retained as it is actually used by AArch32.
For now, there's not much value in this since Clang don't support
enabling the dotprod or i8mm features with either .arch_extension
or .arch (it has to be enabled by the base arch flags passed to
the compiler). But it may be supported in the future.
Signed-off-by: Martin Storsjö <martin@martin.st>
These are available since ARMv8.4-a and ARMv8.6-a respectively,
but can also be available optionally since ARMv8.2-a.
Check if ".arch armv8.2-a" and ".arch_extension {dotprod,i8mm}" are
supported, and check if the instructions can be assembled.
Current clang versions fail to support the dotprod and i8mm
features in the .arch_extension directive, but do support them
if enabled with -march=armv8.4-a on the command line. (Curiously,
lowering the arch level with ".arch armv8.2-a" doesn't make the
extensions unavailable if they were enabled with -march; if that
changes, Clang should also learn to support these extensions via
.arch_extension for them to remain usable here.)
Signed-off-by: Martin Storsjö <martin@martin.st>
Currently it is done in several different ways, which
might cause needless dependencies or in case of
tx_float_neon.S is incorrect.
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
There are no particular reasons to force the compiler to use the same
register as output and input operand. This forces an extra MOV
instruction if the input value needs to be reused after the swap.
In most cases, this makes no differences, as the compiler will seleect
the same register for both operands either way.
Signed-off-by: Martin Storsjö <martin@martin.st>
The fastest fast Fourier transform in not just the west, but the world,
now for the most popular toy ISA.
On a high level, it follows the design of the AVX2 version closely,
with the exception that the input is slightly less permuted as we don't have
to do lane switching with the input on double 4pt and 8pt.
On a low level, the lack of subadd/addsub instructions REALLY penalizes
any attempt at writing an FFT. That single register matters a lot,
and reloading it simply takes unacceptably long.
In x86 land, vendors would've noticed developers need this.
In ARM land, you get a badly designed complex multiplication instruction
we cannot use, that's not present on 95% of devices. Because only
compilers matter, right?
Future optimization options are very few, perhaps better register
management to use more ld1/st1s.
All timings below are in cycles:
A53:
Length | C | New (lavu) | Old (lavc) | FFTW
------ |-------------|-------------|-------------|-----
4 | 842 | 420 | 1210 | 1460
8 | 1538 | 1020 | 1850 | 2520
16 | 3717 | 1900 | 3700 | 3990
32 | 9156 | 4070 | 8289 | 8860
64 | 21160 | 9931 | 18600 | 19625
128 | 49180 | 23278 | 41922 | 41922
256 | 112073 | 53876 | 93202 | 101092
512 | 252864 | 122884 | 205897 | 207868
1024 | 560512 | 278322 | 458071 | 453053
2048 | 1295402 | 775835 | 1038205 | 1020265
4096 | 3281263 | 2021221 | 2409718 | 2577554
8192 | 8577845 | 4780526 | 5673041 | 6802722
Apple M1
New - Total for len 512 reps 2097152 = 1.459141 s
Old - Total for len 512 reps 2097152 = 2.251344 s
FFTW - Total for len 512 reps 2097152 = 1.868429 s
New - Total for len 1024 reps 4194304 = 6.490080 s
Old - Total for len 1024 reps 4194304 = 9.604949 s
FFTW - Total for len 1024 reps 4194304 = 7.889281 s
New - Total for len 16384 reps 262144 = 10.374001 s
Old - Total for len 16384 reps 262144 = 15.266713 s
FFTW - Total for len 16384 reps 262144 = 12.341745 s
New - Total for len 65536 reps 8192 = 1.769812 s
Old - Total for len 65536 reps 8192 = 4.209413 s
FFTW - Total for len 65536 reps 8192 = 3.012365 s
New - Total for len 131072 reps 4096 = 1.942836 s
Old - Segfaults
FFTW - Total for len 131072 reps 4096 = 3.713713 s
Thanks to wbs for some simplifications, assembler fixes and a review
and to jannau for giving it a look.
This avoids build errors if such features are enabled while targeting
another binary format. (Using such features on other platforms
might require some other form of signaling/setup though, but
the ELF specific .note section isn't applicable at least.)
Signed-off-by: Martin Storsjö <martin@martin.st>
This patch adds optional support for Arm Pointer Authentication Codes.
PAC support is turned on or off at compile time using additional
compiler flags. Unless any of these is enabled explicitly, no additional
code will be emitted at all.
Signed-off-by: André Kempe <andre.kempe@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Add Branch Target Identifiers (BTIs) to all functions defined in
AArch64 assembly files. Most of the BTI landing pads are added
automatically by the 'function' macro.
BTI support is turned on or off at compile time based on the presence
of the __ARM_FEATURE_BTI_DEFAULT feature macro.
A binary compiled with BTI support can be executed on an Armv8-A
processor without BTI support because the instructions are defined in
NOP space.
Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Elijah Ahmad <elijah.ahmad@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
This is much less precise than the cycle counter register, but
the cycle counter register is not available on apple platforms
(and on linux, it requires a kernel module for allowing user mode
access).
Signed-off-by: Martin Storsjö <martin@martin.st>
On windows and darwin (and modern android), the x18 register is reserved
and shouldn't be modified by user code, while it is freely available on
linux. Strictly avoid it, to keep the assembly code portable.
This would have helped catch the issue fixed in 872790b1f9
immediately.
Signed-off-by: Martin Storsjö <martin@martin.st>
As of LLVM r368102, Clang will set a pointer tag in bits 56-63 of the
address of a global when compiling with -fsanitize=hwaddress. This requires
an adjustment to assembly code that takes the address of such globals: the
code cannot use the regular R_AARCH64_ADR_PREL_PG_HI21 relocation to refer
to the global, since the tag would take the address out of range. Instead,
the code must use the non-checking (_NC) variant of the relocation (the
link-time check is substituted by a runtime check).
This change makes the necessary adjustment in the movrel macro, where it is
needed when compiling with -fsanitize=hwaddress.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Martin Storsjö
Reviewed-by: Janne Grunau
As .rodata isn't one of the default created sections for COFF, it was
created as a read-write data section. By using the default .rdata
section name for COFF, it automatically becomes a read-only data section.
The existing ".section .rodata" works as intended for ELF though.
This is based on an original patch and diagnose by Tom Tan
<Tom.Tan@microsoft.com>.
Signed-off-by: Martin Storsjö <martin@martin.st>
The extra space got included as part of the expansion of ELF, which
later interfered with gas-preprocessor which earlier only stripped out
leftover lines starting with '#' if the line started with that char.
Signed-off-by: Martin Storsjö <martin@martin.st>
On windows, the offset for the relocation doesn't get stored in
the relocation itself, but as an unsigned immediate in the opcode.
Therefore, negative offsets has to be handled via a separate sub
instruction, just as on MachO.
Signed-off-by: Martin Storsjö <martin@martin.st>
This fixes building with clang for linux with PIC enabled.
This is cherrypicked from libav commit
8847eeaa14.
Signed-off-by: Martin Storsjö <martin@martin.st>
With apple tools, the linker fails with errors like these, if the
offset is negative:
ld: in section __TEXT,__text reloc 8: symbol index out of range for architecture arm64
This is cherry-picked from libav commit
c44a8a3eab.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
With apple tools, the linker fails with errors like these, if the
offset is negative:
ld: in section __TEXT,__text reloc 8: symbol index out of range for architecture arm64
Signed-off-by: Martin Storsjö <martin@martin.st>
The ISB (instruction synchronization barrier) might be too heavy for
START/STOPTIMER use but should be more accurate in checkasm where the
timing overhead is subtracted.
* commit '780cd20b00a69e26bbfffbb8eec16fbe999ea793':
aarch64: Use .data.rel.ro for const data with relocations
Merged-by: Michael Niedermayer <michaelni@gmx.at>