1
0
mirror of https://github.com/FFmpeg/FFmpeg.git synced 2025-01-13 21:28:01 +02:00
Commit Graph

50 Commits

Author SHA1 Message Date
Lynne
f932b89ea3
lavu/tx: implement aarch64 NEON SIMD FFT
The fastest fast Fourier transform in not just the west, but the world,
now for the most popular toy ISA.

On a high level, it follows the design of the AVX2 version closely,
with the exception that the input is slightly less permuted as we don't have
to do lane switching with the input on double 4pt and 8pt.

On a low level, the lack of subadd/addsub instructions REALLY penalizes
any attempt at writing an FFT. That single register matters a lot,
and reloading it simply takes unacceptably long.
In x86 land, vendors would've noticed developers need this.
In ARM land, you get a badly designed complex multiplication instruction
we cannot use, that's not present on 95% of devices. Because only
compilers matter, right?

Future optimization options are very few, perhaps better register
management to use more ld1/st1s.

All timings below are in cycles:
A53:
Length | C           | New (lavu)  | Old (lavc)  | FFTW
------ |-------------|-------------|-------------|-----
4      |         842 | 420         | 1210        | 1460
8      |        1538 | 1020        | 1850        | 2520
16     |        3717 | 1900        | 3700        | 3990
32     |        9156 | 4070        | 8289        | 8860
64     |       21160 | 9931        | 18600       | 19625
128    |       49180 | 23278       | 41922       | 41922
256    |      112073 | 53876       | 93202       | 101092
512    |      252864 | 122884      | 205897      | 207868
1024   |      560512 | 278322      | 458071      | 453053
2048   |     1295402 | 775835      | 1038205     | 1020265
4096   |     3281263 | 2021221     | 2409718     | 2577554
8192   |     8577845 | 4780526     | 5673041     | 6802722

Apple M1
New  - Total for len 512 reps 2097152 = 1.459141 s
Old  - Total for len 512 reps 2097152 = 2.251344 s
FFTW - Total for len 512 reps 2097152 = 1.868429 s

New  - Total for len 1024 reps 4194304 = 6.490080 s
Old  - Total for len 1024 reps 4194304 = 9.604949 s
FFTW - Total for len 1024 reps 4194304 = 7.889281 s

New  - Total for len 16384 reps 262144 = 10.374001 s
Old  - Total for len 16384 reps 262144 = 15.266713 s
FFTW - Total for len 16384 reps 262144 = 12.341745 s

New  - Total for len 65536 reps 8192 = 1.769812 s
Old  - Total for len 65536 reps 8192 = 4.209413 s
FFTW - Total for len 65536 reps 8192 = 3.012365 s

New  - Total for len 131072 reps 4096 = 1.942836 s
Old  - Segfaults
FFTW - Total for len 131072 reps 4096 = 3.713713 s

Thanks to wbs for some simplifications, assembler fixes and a review
and to jannau for giving it a look.
2022-08-25 17:40:28 +02:00
Martin Storsjö
c3fea6d83b aarch64: Only emit the PAC/BTI note section when targeting ELF
This avoids build errors if such features are enabled while targeting
another binary format. (Using such features on other platforms
might require some other form of signaling/setup though, but
the ELF specific .note section isn't applicable at least.)

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-03-15 00:44:28 +02:00
Andre Kempe
248986a0db arm64: Add Armv8.3-A PAC support to assembly files
This patch adds optional support for Arm Pointer Authentication Codes.

PAC support is turned on or off at compile time using additional
compiler flags. Unless any of these is enabled explicitly, no additional
code will be emitted at all.

Signed-off-by: André Kempe <andre.kempe@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-03-09 15:04:25 +02:00
Jonathan Wright
08b4716a9e aarch64: Add Armv8.5-A BTI support
Add Branch Target Identifiers (BTIs) to all functions defined in
AArch64 assembly files. Most of the BTI landing pads are added
automatically by the 'function' macro.

BTI support is turned on or off at compile time based on the presence
of the __ARM_FEATURE_BTI_DEFAULT feature macro.

A binary compiled with BTI support can be executed on an Armv8-A
processor without BTI support because the instructions are defined in
NOP space.

Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Elijah Ahmad <elijah.ahmad@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-11-16 13:43:56 +02:00
Martin Storsjö
ee040a7fc2 arm/aarch64: Use mach_absolute_time as timer on apple platforms
This is much less precise than the cycle counter register, but
the cycle counter register is not available on apple platforms
(and on linux, it requires a kernel module for allowing user mode
access).

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-02-21 22:41:34 +02:00
Martin Storsjö
07948f3d38 aarch64: Explicitly forbid using the x18 register
On windows and darwin (and modern android), the x18 register is reserved
and shouldn't be modified by user code, while it is freely available on
linux. Strictly avoid it, to keep the assembly code portable.

This would have helped catch the issue fixed in 872790b1f9
immediately.

Signed-off-by: Martin Storsjö <martin@martin.st>
2020-05-15 21:22:22 +03:00
Peter Collingbourne
9bcb1cb6ed Add assembly support for -fsanitize=hwaddress tagged globals.
As of LLVM r368102, Clang will set a pointer tag in bits 56-63 of the
address of a global when compiling with -fsanitize=hwaddress. This requires
an adjustment to assembly code that takes the address of such globals: the
code cannot use the regular R_AARCH64_ADR_PREL_PG_HI21 relocation to refer
to the global, since the tag would take the address out of range. Instead,
the code must use the non-checking (_NC) variant of the relocation (the
link-time check is substituted by a runtime check).

This change makes the necessary adjustment in the movrel macro, where it is
needed when compiling with -fsanitize=hwaddress.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Martin Storsjö
Reviewed-by: Janne Grunau
2019-08-22 11:22:07 +02:00
James Almer
ebdc5c419a Merge commit '41cf3e3b1ca375962951fde1b90a03b16197d205'
* commit '41cf3e3b1ca375962951fde1b90a03b16197d205':
  arm: Create proper .rdata sections for COFF

Merged-by: James Almer <jamrial@gmail.com>
2019-02-20 14:48:58 -03:00
Martin Storsjö
41cf3e3b1c arm: Create proper .rdata sections for COFF
As .rodata isn't one of the default created sections for COFF, it was
created as a read-write data section. By using the default .rdata
section name for COFF, it automatically becomes a read-only data section.
The existing ".section .rodata" works as intended for ELF though.

This is based on an original patch and diagnose by Tom Tan
<Tom.Tan@microsoft.com>.

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-01-25 23:53:37 +02:00
James Almer
35347e7e9b Merge commit '4cf84e254ae75b524e1cacae499a97d7cc9e5906'
* commit '4cf84e254ae75b524e1cacae499a97d7cc9e5906':
  Drop some unnecessary config.h #includes

Merged-by: James Almer <jamrial@gmail.com>
2018-02-11 23:08:48 -03:00
Diego Biurrun
4cf84e254a Drop some unnecessary config.h #includes 2018-02-06 10:03:15 +01:00
James Almer
529de4f1b6 Merge commit '69ac24e556c6fbc7138be5a60d0b90d2a5676c3d'
* commit '69ac24e556c6fbc7138be5a60d0b90d2a5676c3d':
  aarch64: Get rid of a stray double space

Merged-by: James Almer <jamrial@gmail.com>
2017-11-11 17:46:48 -03:00
James Almer
28bb96c408 Merge commit '7b7760ad6efb7b96122aa7133ad21e22653ae222'
* commit '7b7760ad6efb7b96122aa7133ad21e22653ae222':
  aarch64: Fix negative movrel offsets for windows

Merged-by: James Almer <jamrial@gmail.com>
2017-11-11 10:02:43 -03:00
Martin Storsjö
69ac24e556 aarch64: Get rid of a stray double space
The extra space got included as part of the expansion of ELF, which
later interfered with gas-preprocessor which earlier only stripped out
leftover lines starting with '#' if the line started with that char.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-10-18 10:49:28 +03:00
James Almer
3d828c9fd5 cpu: split flag checks per arch in av_cpu_max_align()
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
2017-10-09 11:48:24 +02:00
James Almer
3b345d389b avutil/cpu: split flag checks per arch in av_cpu_max_align()
Signed-off-by: James Almer <jamrial@gmail.com>
2017-09-27 23:10:09 -03:00
Martin Storsjö
7b7760ad6e aarch64: Fix negative movrel offsets for windows
On windows, the offset for the relocation doesn't get stored in
the relocation itself, but as an unsigned immediate in the opcode.
Therefore, negative offsets has to be handled via a separate sub
instruction, just as on MachO.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-08-22 13:41:08 +03:00
Martin Storsjö
dda45c087b aarch64: Add parentheses around the offset parameter in movrel
This fixes building with clang for linux with PIC enabled.

This is cherrypicked from libav commit
8847eeaa14.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:50 +02:00
Martin Storsjö
8847eeaa14 aarch64: Add parentheses around the offset parameter in movrel
This fixes building with clang for linux with PIC enabled.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-16 09:56:11 +02:00
Martin Storsjö
7fe898dbb9 aarch64: Add an offset parameter to the movrel macro
With apple tools, the linker fails with errors like these, if the
offset is negative:

ld: in section __TEXT,__text reloc 8: symbol index out of range for architecture arm64

This is cherry-picked from libav commit
c44a8a3eab.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2016-11-15 15:10:03 -05:00
Martin Storsjö
c44a8a3eab aarch64: Add an offset parameter to the movrel macro
With apple tools, the linker fails with errors like these, if the
offset is negative:

ld: in section __TEXT,__text reloc 8: symbol index out of range for architecture arm64

Signed-off-by: Martin Storsjö <martin@martin.st>
2016-11-10 11:06:08 +02:00
Timothy Gu
44304ae322 all: Add missing header guards 2016-01-28 19:49:48 -08:00
Hendrik Leppkes
af2e6f3215 Merge commit '64034849dad8410bedbe1def4c533490fb85cc4a'
* commit '64034849dad8410bedbe1def4c533490fb85cc4a':
  arm64: add cycle counter support

Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
2016-01-02 10:26:42 +01:00
Janne Grunau
64034849da arm64: add cycle counter support
The ISB (instruction synchronization barrier) might be too heavy for
START/STOPTIMER use but should be more accurate in checkasm where the
timing overhead is subtracted.
2015-12-14 16:42:35 +01:00
Michael Niedermayer
92d47e2aa3 Merge commit '780cd20b00a69e26bbfffbb8eec16fbe999ea793'
* commit '780cd20b00a69e26bbfffbb8eec16fbe999ea793':
  aarch64: Use .data.rel.ro for const data with relocations

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-12-09 12:08:29 +01:00
Martin Storsjö
780cd20b00 aarch64: Use .data.rel.ro for const data with relocations
This reverts commit c00365b46d
in addition to using a different section.

Signed-off-by: Martin Storsjö <martin@martin.st>
2014-12-09 11:43:31 +02:00
Michael Niedermayer
66eacd5580 Merge commit 'a238b83b13640e3192d7d4aaad2242f13a9a84a1'
* commit 'a238b83b13640e3192d7d4aaad2242f13a9a84a1':
  aarch64: use MACH-O const data asm directive in const macro

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-08-04 04:06:34 +02:00
Janne Grunau
a238b83b13 aarch64: use MACH-O const data asm directive in const macro 2014-08-04 00:17:21 +02:00
Michael Niedermayer
a40c338a00 Merge commit 'd5a55981986ac5d1a31aef3a8d16eaff8534a412'
* commit 'd5a55981986ac5d1a31aef3a8d16eaff8534a412':
  build: check if AS supports the '.func' directive

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-04 12:45:35 +02:00
Michael Niedermayer
bcbd7dbce5 Merge commit '68a06b3a639ee21c78532ee4c50c3366bf890ff7'
* commit '68a06b3a639ee21c78532ee4c50c3366bf890ff7':
  aarch64: use '#' for whole line asm comments

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-03 18:13:43 +02:00
Michael Niedermayer
40285d2659 Merge commit '6a0fa4d86f2b3e9060a1153b39fa3bfe36f1b617'
* commit '6a0fa4d86f2b3e9060a1153b39fa3bfe36f1b617':
  aarch64: remove optional :pg_hi21: for adrp instruction

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-03 18:03:19 +02:00
Michael Niedermayer
ab9afcdf04 Merge commit 'fd2981ea92d9a776fcb1a13377dce1c8a7db7b5e'
* commit 'fd2981ea92d9a776fcb1a13377dce1c8a7db7b5e':
  aarch64: add darwin style PAGE/PAGEOFF relocations

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-03 18:02:42 +02:00
Janne Grunau
d5a5598198 build: check if AS supports the '.func' directive
Not supported by Clang's integrated assembler. Since it just adds
debug information it can safely omitted.
2014-06-03 14:23:03 +02:00
Janne Grunau
68a06b3a63 aarch64: use '#' for whole line asm comments
Both gnu as and clang treat lines starting with '#' as comments if they
aren't consumed by the C-style preprocessor.
Using '//' does not work with clang since comments are removed before
macro expansion.
2014-06-03 14:23:02 +02:00
Janne Grunau
6a0fa4d86f aarch64: remove optional :pg_hi21: for adrp instruction
Clang's integrated assembler does not support it.
2014-06-03 14:23:02 +02:00
Janne Grunau
fd2981ea92 aarch64: add darwin style PAGE/PAGEOFF relocations 2014-06-03 14:23:01 +02:00
Michael Niedermayer
232959f184 Merge commit '08cd92144e73195eecc28ed0348e66e255516b82'
* commit '08cd92144e73195eecc28ed0348e66e255516b82':
  aarch64: Use the correct syntax for relocations

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-29 18:16:15 +02:00
Martin Storsjö
08cd92144e aarch64: Use the correct syntax for relocations
This fixes building in PIC mode with gas. The examples in the gas
manual showed using a # here even though gas itself actually didn't
support that syntax (and the gas test suite only tests it without
the extra hash sign).

CC: libav-stable@libav.org
Signed-off-by: Martin Storsjö <martin@martin.st>
2014-05-29 14:47:25 +03:00
Michael Niedermayer
4c57c6a765 Merge commit '8675bcb0addb1c7fb0b04682d1f3f95d5b8dae14'
* commit '8675bcb0addb1c7fb0b04682d1f3f95d5b8dae14':
  aarch64: add armv8 CPU flag

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-07 02:15:18 +02:00
Janne Grunau
8675bcb0ad aarch64: add armv8 CPU flag 2014-04-06 21:18:49 +02:00
Michael Niedermayer
ce9d3daa8a Merge remote-tracking branch 'qatar/master'
* qatar/master:
  aarch64: float_dsp NEON assembler

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-19 02:52:08 +01:00
Janne Grunau
dbd12523a4 aarch64: float_dsp NEON assembler
Ported from arm NEON and added vector_dmul_scalar.

Functions between 1.5 and 5 times faster than the C implementations
using Apple's clang-503.0.19 on A7.
2014-03-18 22:56:07 +01:00
Michael Niedermayer
490215cbd7 Merge commit '9c029f67ca82147ddfa83a1546ee1e109e11fbd4'
* commit '9c029f67ca82147ddfa83a1546ee1e109e11fbd4':
  aarch64: use EXTERN_ASM consistently for exported symbols

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-20 23:13:13 +01:00
Janne Grunau
9c029f67ca aarch64: use EXTERN_ASM consistently for exported symbols
Based on e3fec3f095 for arm.
2014-02-20 15:24:35 +01:00
Michael Niedermayer
949adce125 Merge remote-tracking branch 'qatar/master'
* qatar/master:
  aarch64: port neon clobber test from arm

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-15 15:49:22 +01:00
Michael Niedermayer
d5560f1fec Merge commit 'b7b17ed66e199afc7246e642bf3b35c3f8eca217'
* commit 'b7b17ed66e199afc7246e642bf3b35c3f8eca217':
  aarch64: add cpuflags support for NEON and VFP

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-15 14:42:49 +01:00
Janne Grunau
fe96769bed aarch64: port neon clobber test from arm 2014-01-15 12:31:07 +01:00
Janne Grunau
b7b17ed66e aarch64: add cpuflags support for NEON and VFP
NEON and VFP are currently mandatory for all ARMv8 profiles. Both are
handled as extensions as far as cpuflags are concerned. This is
consistent with handling x86_64 which always has SSE2, but still
handles it as an extension.
2014-01-15 12:05:09 +01:00
Michael Niedermayer
53e6977c07 Merge remote-tracking branch 'qatar/master'
* qatar/master:
  aarch64: bswap inline assembly

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-15 05:23:03 +01:00
Janne Grunau
d0cd2a8c46 aarch64: bswap inline assembly
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2014-01-14 22:19:38 +01:00