This adds runtime support to use Zbb REV8 for 32- and 64-bit byte-wise
swaps. The result is about five times slower than if targeting Zbb
statically, but still a lot faster than the default bespoke C code or a
call to GCC run-time functions.
For the 16-bit swap, however, this is unsurprisingly a lot worse, so this
sticks to the baseline. In fact, even using REV8 statically does not
seem to be beneficial in that case.
              Zbb static    Zbb dynamic     I baseline
    bswap16: 0.668184765    3.340764069    0.668029012
    bswap32: 0.668174014    3.340763319    9.353855435
    bswap64: 0.668221765    3.340496313   14.698672283
(seconds for 1 billion iterations on a SiFive-U74 core)
Due to hysterical raisins, most RISC-V Linux distributions target an
RV64GC baseline excluding the Bit-manipulation ISA extensions, most
notably:
- Zba: address generation extension and
- Zbb: basic bit manipulation extension.
Most CPUs that would make sense to run FFmpeg on support Zba and Zbb
(including the current FATE runner), so it makes sense to optimise for
them. In fact a large chunk of existing assembler optimisations relies
on Zba and/or Zbb.
Since we cannot patch shared library code, the next best thing is to
carry a flag initialised at load time and check it on an as-needed basis.
This results in a three-instruction overhead for isolated use, e.g.:
    1:  AUIPC rd, %pcrel_hi(ff_rv_zbb_supported)
        LBU   rd, %pcrel_lo(1b)(rd)
        BEQZ  rd, non_Zbb_fallback_code
        // Zbb code here
The C compiler will typically load the flag ahead of time to reduce
latency, and can also keep it around if Zbb is used multiple times in a
single optimisation scope. For this to work, the flag symbol must be
hidden; otherwise the optimisation degrades with a GOT look-up to
support interposition:
    1:  AUIPC rd, GOT_OFFSET_HI
        LD    rd, GOT_OFFSET_LO(rd)
        LBU   rd, (rd)
        BEQZ  rd, non_Zbb_fallback_code
        // Zbb code here
This patch adds code to provision the flag in libraries using bit
manipulation functions from libavutil: byte-swap, bit-weight and
counting leading or trailing zeroes.
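For illustration, a flag-guarded byte-swap can look roughly like this in C
(a sketch only: the flag name ff_rv_zbb_supported is the one mentioned above,
but the actual FFmpeg code differs, and the .option arch directive needs a
reasonably recent assembler):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hidden visibility so the compiler can skip the GOT look-up. */
    extern __attribute__((visibility("hidden"))) bool ff_rv_zbb_supported;

    static inline uint32_t bswap32_rv(uint32_t x)
    {
        if (__builtin_expect(ff_rv_zbb_supported, true)) {
            uintptr_t y;

            __asm__ (".option push\n\t"
                     ".option arch, +zbb\n\t"
                     "rev8 %0, %1\n\t"          /* byte-reverse the whole register */
                     ".option pop"
                     : "=r" (y) : "r" ((uintptr_t)x));
            return y >> (__riscv_xlen - 32);    /* keep the 32 relevant bits */
        }
        /* Baseline fallback for CPUs without Zbb. */
        return (x >> 24) | ((x >> 8) & 0xff00) | ((x << 8) & 0xff0000) | (x << 24);
    }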
vtype_vli computes the VTYPE value with the optimal LMUL for a given
element width, tail and mask policies and a run-time vector length.
vtype_ivli does the same, but with a compile-time constant vector length.
vwtypei and vntypei can be used to widen or narrow a VTYPE value for
use in mixed-width vector-optimised functions.
The B Bit-manipulation extension has not been defined to this day, and
probably never will be. Instead, it was broken down into Zba, Zbb, Zbc and
Zbs, with no particular blessed set making up B.
This removes the bogus field test. Linux never set this bit, nor
(AFAICT) did FreeBSD or any other OS. We can always add it back in the
unlikely event that it gets taken into use.
Not all C run-times support this, and even then, it will be a while
before distributions provide recent enough versions thereof.
Since this is a trivial system call wrapper, we might just as well call
the corresponding kernel system call directly where the C run-time lacks
support but the kernel headers are new enough (as is the case on Debian
Unstable at the time of writing). In doing so, we need to add a few more
guards as the first suitable kernel (headers) release did not expose the
V, Zba and Zbb extensions.
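A hedged sketch of that raw system call path, using the constants from the
Linux uapi headers (the guards and names in the actual patch may differ):

    #include <stdbool.h>
    #include <stddef.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <asm/hwprobe.h>    /* struct riscv_hwprobe, RISCV_HWPROBE_* */

    static bool cpu_has_zbb(void)
    {
    #if defined(__NR_riscv_hwprobe) && defined(RISCV_HWPROBE_EXT_ZBB)
        struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_IMA_EXT_0 };

        /* Query the common extension bitmask across all harts. */
        if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) == 0)
            return pair.value & RISCV_HWPROBE_EXT_ZBB;
    #endif
        return false;
    }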
This adds the Linux-specific function call to detect CPU features. Unlike
the more portable auxiliary vector, this supports extensions other than
single-lettered ones. At this point, FFmpeg already needs this to detect
Zba and Zbb at run-time, and probably will need it for Zvbb in the near
future.
Support will be available in glibc 2.40 onward.
Gathers are (unsurprisingly) a notable exception to the rule that R-V V
gets faster with larger group multipliers. So roll the function to speed
it up.
Before:
vector_fmul_reverse_fixed_c: 2840.7
vector_fmul_reverse_fixed_rvv_i32: 2430.2
After:
vector_fmul_reverse_fixed_c: 2841.0
vector_fmul_reverse_fixed_rvv_i32: 962.2
It might be possible to further optimise the function by moving the
reverse-subtract out of the loop and adding ad-hoc tail handling.
The gather index vector is only used at double length (due to register
pressure), so there is no need to initialise it for quad length. Basically
this matches the multiplier in the prologue to the multiplier in the loop.
This revectors the inner loop to reverse the elements within vectors,
thus eliminating the negative register stride. Note that RVV does not
have a vector reverse instruction, so this uses a gather.
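For reference, here is what the gather-based reverse amounts to, written as a
standalone C intrinsics sketch (the actual code is hand-written assembly and
also keeps the index vector at double length as noted above):

    #include <stddef.h>
    #include <riscv_vector.h>

    /* Reverse the first vl elements of v: idx[i] = vl - 1 - i, then gather. */
    static vfloat32m4_t reverse_f32m4(vfloat32m4_t v, size_t vl)
    {
        vuint32m4_t idx = __riscv_vid_v_u32m4(vl);        /* 0, 1, 2, ...       */
        idx = __riscv_vrsub_vx_u32m4(idx, vl - 1, vl);    /* vl-1, vl-2, ..., 0 */
        return __riscv_vrgather_vv_f32m4(v, idx, vl);
    }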
It does not make much sense to me, but GCC somehow optimises the
inline assembler even though the output is very obviously used and
has observable side effects.
This reverts commit 09731fbfc3.
So far, AV_READ_TIME would return the cycle counter. This posed two
problems:
1) On recent systems, it would just raise an illegal instruction
exception. Indeed RDCYCLE is blocked in user space to ward off some
side channel attacks. In particular, this would cause the random
number generator to crash.
2) It does not match the x86 behaviour and the apparent original intent
of AV_READ_TIME in the functional code base (outside test cases).
So this replaces the cycle counter with the time counter. The unit is
a platform-dependent constant fraction of a second, and the value should
be stable across harts (RISC-V lingo for a physical CPU thread).
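For illustration, on RV64 reading the time counter from C boils down to a
single RDTIME instruction (a sketch of the idea, not necessarily the literal
AV_READ_TIME definition):

    #include <stdint.h>

    /* The time CSR ticks at a constant, platform-defined frequency and is
     * meant to be consistent across harts, unlike the cycle counter. */
    static inline uint64_t read_time_counter(void)
    {
        uint64_t t;

        __asm__ volatile ("rdtime %0" : "=r" (t));
        return t;
    }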
1) Take the reduction sum out of the loop,
leaving a regular vector addition in the loop.
2) Merge the addition and the multiplication.
3) Unroll.
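A C intrinsics sketch of steps 1 and 2 (the unrolling of step 3 is omitted,
and the function shape is assumed; the real code is hand-written assembly):

    #include <stddef.h>
    #include <riscv_vector.h>

    static float scalarproduct_sketch(const float *a, const float *b, size_t n)
    {
        size_t vlmax = __riscv_vsetvlmax_e32m8();
        vfloat32m8_t acc = __riscv_vfmv_v_f_f32m8(0.f, vlmax);

        for (size_t vl; n > 0; n -= vl, a += vl, b += vl) {
            vl = __riscv_vsetvl_e32m8(n);
            vfloat32m8_t va = __riscv_vle32_v_f32m8(a, vl);
            vfloat32m8_t vb = __riscv_vle32_v_f32m8(b, vl);
            /* Step 2: multiply and add merged into one VFMACC; tail-undisturbed
             * so partial sums kept in tail elements are not clobbered. */
            acc = __riscv_vfmacc_vv_f32m8_tu(acc, va, vb, vl);
        }
        /* Step 1: a single reduction sum after the loop. */
        vfloat32m1_t zero = __riscv_vfmv_s_f_f32m1(0.f, 1);
        vfloat32m1_t sum  = __riscv_vfredusum_vs_f32m8_f32m1(acc, zero, vlmax);
        return __riscv_vfmv_f_s_f32m1_f32(sum);
    }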
Before:
scalarproduct_float_rvv_f32: 832.5
After:
scalarproduct_float_rvv_f32: 275.2
The code was blindly assuming that Zbb or V implied Zba. While the
former is practically always true, the latter broke some QEMU setups,
as V was introduced earlier than Zba.
As with the earlier bswap change, all versions of GCC and Clang that
support RISC-V support the popcount built-ins, so we can just use them
instead of inline assembler.
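For reference, that simply means using the built-ins, which the compiler can
lower to CPOP/CPOPW when Zbb is enabled (trivial sketch):

    #include <stdint.h>

    static inline int popcount32(uint32_t x) { return __builtin_popcount(x); }
    static inline int popcount64(uint64_t x) { return __builtin_popcountll(x); }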
av_bswapXX() are used in contexts that expect exact-size types, notably
variable arguments to av_log(). On Linux RV64, uint_fast32_t is an
unsigned long, so the current inline assembler does not work properly.
Since GCC and Clang gained their byte-swap built-ins before they
supported RISC-V, we can simply defer to them. As an added bonus, the
compiler can do instruction scheduling, which it couldn't with the Zbb
inline assembler.
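A sketch of that exact-width, built-in-based approach (not necessarily the
literal av_bswapXX definitions):

    #include <stdint.h>

    /* Exact-width return types matter when the result goes through varargs. */
    static inline uint32_t bswap32_sketch(uint32_t x)
    {
        return __builtin_bswap32(x);    /* REV8 + SRLI with Zbb on RV64 */
    }

    static inline uint64_t bswap64_sketch(uint64_t x)
    {
        return __builtin_bswap64(x);    /* a single REV8 with Zbb on RV64 */
    }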
'VSETVLI xd, x0, ...' has rather nonobvious semantics:
- If xd is x0, then it preserves the current vector length.
- If xd is not x0, it sets the vector length to the supported maximum.
Also somewhat confusingly, while VMV.X.S always does its thing
regardless of the selected vector length, VMV.S.X does _nothing_ if the
selected vector length is zero.
So the current code fails to initialise the accumulator if we are
unlucky enough to have a selected vector length of zero on entry. Fix it
by forcing the vector length to one.
In most cases, the vector type (VTYPE) for the RISC-V Vector extension
is supplied as an immediate value, with either of the VSETVLI or
VSETIVLI instructions. There is however a third instruction VSETVL
which takes the vector type from a general purpose register. That is so
the type can be selected at run-time.
This introduces a macro to load a (valid) vector type into a register.
The syntax follows that of VSETVLI and VSETIVLI, with element size,
group multiplier, then tail and mask policies.
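To make the register-based form concrete, here is a small C sketch of the
VTYPE value such a macro has to assemble (bit layout per the RVV 1.0
specification; the macro itself and its exact arguments are not shown here,
and real code does this in hand-written assembly where the compiler's vector
state is not a concern):

    #include <stddef.h>
    #include <stdint.h>

    /* VTYPE bit layout (RVV 1.0): vlmul[2:0], vsew[5:3], vta[6], vma[7]. */
    static inline uintptr_t vtype_encode(unsigned vsew, unsigned vlmul,
                                         unsigned ta, unsigned ma)
    {
        return (uintptr_t)ma << 7 | (uintptr_t)ta << 6 | vsew << 3 | vlmul;
    }

    /* Example: e32 (vsew = 2), m4 (vlmul = 2), tail- and mask-agnostic
     * encode to 0xd2, which VSETVL then takes from a scalar register: */
    static inline size_t set_vl(size_t avl, uintptr_t vtype)
    {
        size_t vl;

        __asm__ volatile ("vsetvl %0, %1, %2"
                          : "=r" (vl) : "r" (avl), "r" (vtype));
        return vl;
    }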
Unfortunately, it is common, and will remain so, that the Bit-manipulation
extensions are not enabled at compilation time. It is official policy for
Debian ports in general (though they do not officially support RISC-V as of
yet) to stick to the minimal target baseline, which does not include the B
extension or even its Zbb subset.
For inline helpers (CPOP, REV8), compiler builtins (CTZ, CLZ) or
even plain C code (MIN, MAX, MINU, MAXU), run-time detection seems
impractical. But at least it can work for the byte-swap DSP functions.