mirror of https://github.com/FFmpeg/FFmpeg.git synced 2025-03-03 14:32:16 +02:00

56 Commits

Author SHA1 Message Date
Rémi Denis-Courmont
3c6516330f lavc/exrdsp: R-V V reorder_pixels 2023-10-09 19:52:51 +03:00
Rémi Denis-Courmont
89c10d8d20 lavc/ac3: add R-V Zbb extract_exponents 2023-10-05 18:13:00 +03:00
Rémi Denis-Courmont
cec48e3b32 riscv: factor out the bswap32 assembler 2023-10-02 22:28:21 +03:00
Rémi Denis-Courmont
b36f3d5330 lavc/fmtconvert: unroll R-V V int32_to_float_fmul_scalar 2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont
f3dfd4ccf2 lavc/aacpsdsp: unroll RISC-V V hybrid_synthesis_deint 2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont
0f1336b285 lavc/aacpsdsp: unroll RISC-V V hybrid_analysis_ileave 2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont
69d7486e59 lavc/aacpsdsp: unroll RISC-V V mul_pair_single 2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont
c270928cc0 lavc/aacpsdsp: unroll R-V V stereo interpolate 2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont
27d74fc1ef lavc/aacpsdsp: simplify R-V V stereo interpolate
Remove some useless vector splat.
2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont
3575ee2ea3 lavc/audiodsp: unroll RISC-V clip functions
audiodsp.vector_clip_int32_c: 17500.7
audiodsp.vector_clip_int32_rvv_i32: 8404.7  (m1)
audiodsp.vector_clip_int32_rvv_i32: 2689.9  (m8)

audiodsp.vector_clipf_c: 33679.7
audiodsp.vector_clipf_rvf: 7019.7
audiodsp.vector_clipf_rvv_f32: 8328.0       (m1)
audiodsp.vector_clipf_rvv_f32: 2209.4       (m8)
2023-10-02 18:07:54 +03:00
Rémi Denis-Courmont
9bc5676e40 lavc/g722dsp: add RISC-V V DSP function 2023-08-24 21:07:18 +03:00
Arnie Chang
8d1316e515 lavc/h264chroma: RISC-V V add motion compensation for 4xH and 2xH chroma blocks
Optimize the put and avg filtering for 4xH and 2xH blocks

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-07-25 19:10:40 +03:00
Rémi Denis-Courmont
44cac1def0 lavc/audiodsp: rework RISC-V V scalar product
Take vector reduction out of the loop and unroll.

Before:
audiodsp.scalarproduct_int16_c: 12321.0
audiodsp.scalarproduct_int16_rvv_i32: 4175.7

After:
audiodsp.scalarproduct_int16_c: 12320.5
audiodsp.scalarproduct_int16_rvv_i32: 1230.2
2023-07-20 22:54:34 +03:00
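A scalar C sketch of what "take vector reduction out of the loop" means. This is not the RISC-V V assembler from the commit: `TOY_LANES` and both function names are hypothetical, and the per-lane array merely stands in for a vector accumulator. Inputs are assumed small enough that the 32-bit sums do not overflow.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_LANES 8 /* hypothetical lane count standing in for the vector length */

/* Reduces the lane products to a scalar inside every loop iteration. */
int32_t scalarproduct_reduce_in_loop(const int16_t *v1, const int16_t *v2,
                                     size_t n)
{
    int32_t sum = 0;

    for (size_t i = 0; i < n; i += TOY_LANES) {
        int32_t partial = 0;

        for (size_t j = 0; j < TOY_LANES && i + j < n; j++)
            partial += v1[i + j] * v2[i + j];
        sum += partial; /* stands in for a per-iteration vector reduction */
    }
    return sum;
}

/* Keeps one accumulator per lane and reduces only once, after the loop. */
int32_t scalarproduct_reduce_once(const int16_t *v1, const int16_t *v2,
                                  size_t n)
{
    int32_t lane[TOY_LANES] = { 0 };
    int32_t sum = 0;

    for (size_t i = 0; i < n; i += TOY_LANES)
        for (size_t j = 0; j < TOY_LANES && i + j < n; j++)
            lane[j] += v1[i + j] * v2[i + j];

    for (size_t j = 0; j < TOY_LANES; j++)
        sum += lane[j]; /* the single reduction */
    return sum;
}
```

Both functions return the same value; only the second avoids a reduction in every pass through the loop.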
Rémi Denis-Courmont
61e5ca4ded lavc/bswapdsp: purge RISC-V V bswap32
This cannot beat the Zbb implementation, and it is unlikely that a real
meaningful CPU design would support V and not Zbb. The best loop rewrite
that I could come up with (4 shifts, 2 ands, 3 ors) is still ~40% slower
than Zbb.

A proper faster vector implementation should be feasible with the
cryptographic vector extensions, but that is a story for another time.
2023-07-19 19:29:35 +03:00
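For reference, one scalar C formulation that matches the operation count quoted above (4 shifts, 2 ands, 3 ors). It is only an illustration of that count, not necessarily the sequence that was benchmarked; the function name is hypothetical.

```c
#include <stdint.h>

/* Byte-swaps a 32-bit word with 4 shifts, 2 ands and 3 ors. */
uint32_t bswap32_shift_mask(uint32_t x)
{
    return (x << 24)
         | ((x << 8) & 0x00ff0000u)
         | ((x >> 8) & 0x0000ff00u)
         | (x >> 24);
}
```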
Rémi Denis-Courmont
5de1db5370 lavc/bswapdsp: rewrite RISC-V V bswap16
This favours bit-wise logic over slow strided stores.
2023-07-19 19:29:35 +03:00
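A scalar C sketch of the bit-wise approach, assuming it boils down to two shifts and an or per 16-bit element. The function name is hypothetical and the actual vector code may be arranged differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Swaps the two bytes of every 16-bit element using shifts and an or. */
void bswap16_buf_bitwise(uint16_t *dst, const uint16_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint16_t)((src[i] << 8) | (src[i] >> 8));
}
```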
Rémi Denis-Courmont
b6585eb04c lavu: add/use flag for RISC-V Zba extension
The code was blindly assuming that Zbb or V implied Zba. While the
former is practically always true, the latter broke some QEMU setups,
as V was introduced earlier than Zba.
2023-07-19 19:29:35 +03:00
Rémi Denis-Courmont
2eb55157aa lavc/aacpsdsp: unroll RISC-V V add_squares
This slightly improves performance with the Device Under Test.
2023-07-19 19:29:35 +03:00
Rémi Denis-Courmont
c541ecf0dc lavc/alacdsp: unroll RISC-V V loops
This increases the group multiplier as per T-Head C910 benchmarks:

alac_append_extra_bits_mono_c: 803.0
alac_append_extra_bits_stereo_c: 1604.2
alac_decorrelate_stereo_c: 1077.5

LMUL=1
alac_append_extra_bits_mono_rvv_i32: 418.2
alac_append_extra_bits_stereo_rvv_i32: 693.2
alac_decorrelate_stereo_rvv_i32: 673.5

LMUL=2
alac_append_extra_bits_mono_rvv_i32: 382.2
alac_append_extra_bits_stereo_rvv_i32: 648.2
alac_decorrelate_stereo_rvv_i32: 542.7

LMUL=4
alac_append_extra_bits_mono_rvv_i32: 241.5
alac_append_extra_bits_stereo_rvv_i32: 512.7
alac_decorrelate_stereo_rvv_i32: 364.2

LMUL=8
alac_append_extra_bits_mono_rvv_i32: 239.7
alac_append_extra_bits_stereo_rvv_i32: 497.2
alac_decorrelate_stereo_rvv_i32: 426.7
2023-07-16 23:24:00 +03:00
Rémi Denis-Courmont
a28aa0475d lavc/vorbisdsp: unroll RISC-V V inverse_coupling
This increases the group multiplier as per T-Head C910 benchmarks:

inverse_coupling_c: 4597.0
inverse_coupling_rvv_i32: 1312.7 (m1)
inverse_coupling_rvv_i32: 1116.7 (m2)
inverse_coupling_rvv_i32: 732.2  (m4)
inverse_coupling_rvv_i32: 898.0  (m8)
2023-07-16 23:24:00 +03:00
Arnie Chang
c5508f60c2 lavc/h264chroma: RISC-V V add motion compensation for 8x8 chroma blocks
Optimize the put and avg filtering for 8x8 chroma blocks

Signed-off-by: Arnie Chang <arnie.chang@sifive.com>
2023-05-30 17:15:05 +02:00
Rémi Denis-Courmont
4d66e8c12e lavc/audiodsp: fix RISC-V V scalar product (again)
The loop uses a 32-bit accumulator. The current code would only zero
the lower 16 bits thereof.
2022-10-17 06:39:00 +02:00
Rémi Denis-Courmont
96a83ceea4 riscv: fix scalar product initialisation
'VSETVLI xd, x0, ...' has rather nonobvious semantics:
- If xd is x0, then it preserves the current vector length.
- If xd is not x0, it sets the vector length to the supported maximum.

Also somewhat confusingly, while VMV.X.S always does its thing
regardless of the selected vector length, VMV.S.X does _nothing_ if the
selected vector length is zero.

So the current code fails to initialise the accumulator if we
are unlucky enough to have a selected vector length of zero on entry.
Fix it by forcing the vector length to one.
2022-10-13 10:17:38 +02:00
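The failure mode can be mimicked with a toy C model of the rules described above. This is not an emulator: the struct, every `toy_` name, and the assumption that the broken code kept the vector length via the 'xd is x0' form are all illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of the state involved. */
struct toy_vstate {
    size_t  vl;    /* currently selected vector length */
    size_t  vlmax; /* supported maximum for the current vector type */
    int32_t elem0; /* element 0 of the accumulator register */
};

/* Models 'VSETVLI xd, x0, ...'. */
void toy_vsetvli_x0(struct toy_vstate *s, int xd_is_x0)
{
    if (!xd_is_x0)
        s->vl = s->vlmax; /* xd is not x0: select the supported maximum */
    /* xd is x0: the current vector length is preserved */
}

/* Models VMV.S.X: does nothing if the selected vector length is zero. */
void toy_vmv_s_x(struct toy_vstate *s, int32_t x)
{
    if (s->vl > 0)
        s->elem0 = x;
}

/* Assumed shape of the broken initialisation: vl is left as it was. */
void toy_init_acc_broken(struct toy_vstate *s)
{
    toy_vsetvli_x0(s, 1);
    toy_vmv_s_x(s, 0);
}

/* The fix: force the vector length to one before writing element 0. */
void toy_init_acc_fixed(struct toy_vstate *s)
{
    s->vl = 1;
    toy_vmv_s_x(s, 0);
}
```

With `vl` equal to zero on entry, the broken variant leaves whatever was in element 0, whereas the fixed variant always zeroes it.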
Rémi Denis-Courmont
105921251a lavc/aacpsdsp: fix clobber on RISC-V LP64D/ILP32D
Although the DSP function only uses single precision from RISC-V F, the
caller may leave double precision values in the spilled registers if the
calling convention supports double precision hardware floats. Then, we
need to save and restore FS registers as double precision.

Conversely, we do not need to save anything at all if an integer calling
convention is in use. However we can assume that single precision floats
are supported, since the Zve32f extension implies the F extension.
So for the sake of simplicity, we always save at least single precision
values.

In theory, we should even save quadruple precision values if the LP64Q
ABI is in use. I have yet to see a compiler that supports it though.
2022-10-10 02:23:18 +02:00
Rémi Denis-Courmont
bfc69297c5 lavc/opusdsp: RISC-V V (512-bit) postfilter
This adds a variant of the postfilter for use with 512-bit vectors.
Half a vector is enough to perform the scalar product. Normally a whole
vector would be used anyhow. Indeed fractional multipliers are no faster
than the unit multiplier.

But in this particular function, a full vector makes up 16 samples,
which would be loaded at each iteration of the outer loop. The minimum
guaranteed CELT postfilter period is only 15. Accounting for the edges,
we can only safely preload up to 13 samples.

The fractional multiplier is thus used to cap the selected vector length
to a safe value of 8 elements or 256 bits.

Likewise, we have the 1024-bit variant with the quarter multiplier. In
theory, a 2048-bit one would be possible with the eighth multiplier, but
that length is not even defined in the specifications as of yet, nor is
it supported by any emulator - forget actual hardware.
2022-10-10 02:23:17 +02:00
Rémi Denis-Courmont
97d34befea lavc/opusdsp: RISC-V V (256-bit) postfilter
This adds a variant of the postfilter for use with 256-bit vectors.
As a single vector is then large enough to perform the scalar product,
the group multiplier is reduced to just one at run-time.

The different vector type is passed via register. Unfortunately,
there is no VSETIVL instruction, so the constant vector size (5) also
needs to be passed via a register.
2022-10-10 02:22:39 +02:00
Rémi Denis-Courmont
8009581912 lavc/opusdsp: RISC-V V (128-bit) postfilter
This is implemented for a vector size of 128-bit. Since the scalar
product in the inner loop covers 5 samples or 160 bits, we need a group
multiplier of 2.

To avoid reconfiguring the vector type, the outer loop, which loads
multiple input samples, sticks to the same multiplier. Consequently, the
outer loop loads 8 samples per iteration. This is safe since the minimum
period of the CELT codec is 15 samples.

The same code would also work, albeit needlessly inefficiently, with a
vector length of 256 bits. A proper implementation will follow instead.
2022-10-10 02:22:10 +02:00
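The arithmetic behind the three postfilter variants above can be checked with a small C helper. It assumes 32-bit samples, which is what "5 samples or 160 bits" implies, and expresses the group multiplier in eighths so that fractional values stay integral. The function name and that encoding are hypothetical.

```c
/*
 * Number of elements held by one register group.
 * lmul_eighths encodes the group multiplier in eighths:
 * 2 = 1/4, 4 = 1/2, 8 = 1, 16 = 2.
 */
unsigned toy_group_elems(unsigned vlen_bits, unsigned sew_bits,
                         unsigned lmul_eighths)
{
    return vlen_bits * lmul_eighths / (8 * sew_bits);
}
```

Each variant ends up with 8 elements per group: 128 bits with a multiplier of 2, 256 bits with 1, 512 bits with 1/2 and 1024 bits with 1/4. That stays below the 13 samples that can safely be preloaded, whereas a full 512-bit vector would hold 16.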
Rémi Denis-Courmont
2abafd7307 lavc/bswapdsp: RISC-V V bswap16_buf 2022-10-05 08:26:19 +02:00
Rémi Denis-Courmont
d7528af4df lavc/bswapdsp: RISC-V V bswap_buf 2022-10-05 08:26:19 +02:00
Rémi Denis-Courmont
f0ef11ea83 lavc/bswapdsp: RISC-V B bswap_buf
Simply taking the Zbb REV8 instruction into use in a simple loop gives
some significant savings:

bswap_buf_c: 1081.0
bswap_buf_rvb_b: 771.0

But we can also use the 64-bit REV8 as a pseudo-SIMD instruction with
just one additional shift, and one fewer load, effectively doubling the
bandwidth. Consequently, this patch is useful even if the compile-time
target has Zbb enabled for C code:

bswap_buf_c: 1081.0
bswap_buf_rvb_b: 341.0  (this patch)

On the other hand, this approach fails miserably for bswap16_buf as the
ratio of shifts and stores becomes unfavorable compared to naïve C:

bswap16_buf_c: 1542.0
bswap16_buf_rvb_b: 1803.7

Unrolling to process 128 bits (4 samples) at a time actually worsens
performance ever so slightly:

bswap_buf_c: 1081.0
bswap_buf_rvb_b: 408.5
2022-10-05 08:26:19 +02:00
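A portable C sketch of the pseudo-SIMD trick, assuming the assembler loads two 32-bit words at once, reverses all 8 bytes, then uses one shift to separate the halves. `toy_rev8_64` is a hypothetical stand-in for the 64-bit REV8 instruction, and the 64-bit value is built by hand to model a little-endian load.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for the 64-bit REV8 instruction. */
uint64_t toy_rev8_64(uint64_t x)
{
    uint64_t r = 0;

    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (x & 0xff);
        x >>= 8;
    }
    return r;
}

/* Byte-swaps 32-bit words two at a time through one 64-bit reversal. */
void bswap_buf_pairs(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        /* Models a single little-endian 64-bit load of two words. */
        uint64_t v = (uint64_t)src[i] | ((uint64_t)src[i + 1] << 32);
        uint64_t r = toy_rev8_64(v);

        /* The reversal also swaps the word order, hence the one shift. */
        dst[i]     = (uint32_t)(r >> 32);
        dst[i + 1] = (uint32_t)r;
    }
    if (i < n) /* odd trailing word */
        dst[i] = (uint32_t)(toy_rev8_64(src[i]) >> 32);
}
```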
Lynne
b25c6a5704 riscv/alacdsp: drop config.h include 2022-10-05 06:59:43 +02:00
Rémi Denis-Courmont
3ba5579e55 riscv: remove unnecessary #include's
Pointed out by Andreas Rheinhardt.
2022-10-05 06:54:56 +02:00
Rémi Denis-Courmont
f0d1637c11 lavc/alacdsp: RISC-V V append_extra_bits[1] 2022-10-05 06:51:11 +02:00
Rémi Denis-Courmont
55bde97f29 lavc/alacdsp: RISC-V V append_extra_bits[0] 2022-10-05 06:51:11 +02:00
Rémi Denis-Courmont
64ab577954 lavc/alacdsp: RISC-V V decorrelate_stereo
To avoid data dependencies, this does the following unroll, which
requires one extra but probably free addition:

    coeff = (b * left_weight) >> decorr_shift;
    b += a;
    a -= coeff;
    b -= coeff;
    swap(a, b);
2022-10-05 06:51:11 +02:00
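A C sketch checking that the unrolled sequence above gives the same result as a dependent formulation. The dependent version is an assumption, derived from the unrolled one by algebra, in which `b` has to wait for the updated `a`. Both function names are hypothetical, and operands are assumed small enough not to overflow.

```c
#include <stdint.h>

/* Dependent form: b can only be updated once the new a is known. */
void decorrelate_dependent(int32_t *a, int32_t *b,
                           int left_weight, int decorr_shift)
{
    int32_t x = *a, y = *b;

    x -= (y * left_weight) >> decorr_shift;
    y += x;
    *a = y;
    *b = x;
}

/* Unrolled form from the commit message: one extra addition, but the
 * updates of a and b no longer depend on each other. */
void decorrelate_unrolled(int32_t *a, int32_t *b,
                          int left_weight, int decorr_shift)
{
    int32_t x = *a, y = *b;
    int32_t coeff = (y * left_weight) >> decorr_shift;

    y += x;
    x -= coeff;
    y -= coeff;
    *a = y; /* swap(a, b) */
    *b = x;
}
```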
Martin Storsjö
6059ea2a14 riscv: Fix linking without RVV; change #ifdef into #if
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-29 10:28:37 +03:00
Rémi Denis-Courmont
d31013166a lavc/pixblockdsp: RISC-V diff_pixels & diff_pixels_unaligned 2022-09-28 11:46:11 +02:00
Rémi Denis-Courmont
ebee25855a lavc/pixblockdsp: RISC-V V 16-bit get_pixels & get_pixels_unaligned 2022-09-28 11:46:11 +02:00
Rémi Denis-Courmont
676b08cb70 lavc/pixblockdsp: RISC-V V 8-bit get_pixels & get_pixels_unaligned 2022-09-28 11:46:11 +02:00
Rémi Denis-Courmont
2746329ce2 lavc/idctdsp: RISC-V V put_signed_pixels_clamped function 2022-09-28 11:46:11 +02:00
Rémi Denis-Courmont
fa983b5656 lavc/idctdsp: RISC-V V add_pixels_clamped function 2022-09-28 11:46:11 +02:00
Rémi Denis-Courmont
b29ee63a1b lavc/idctdsp: RISC-V V put_pixels_clamped function 2022-09-28 11:46:11 +02:00
Martin Storsjö
dd2e524ffa riscv: Use the correct path for including asm.S
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-28 11:02:46 +03:00
Rémi Denis-Courmont
c03f9654c9 lavc/aacpsdsp: RISC-V V stereo_interpolate[0] 2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont
a15edb0bc0 lavc/aacpsdsp: RISC-V V hybrid_synthesis_deint 2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont
09f907999f lavc/aacpsdsp: RISC-V V hybrid_analysis_ileave 2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont
15c3a0bd6e lavc/aacpsdsp: RISC-V V hybrid_analysis
This starts with one-time initialisation of the 26 constant factors
like 08edacc248bce3f8946d75e97188d189c74a6de6. That is done with
the scalar instruction set. While the formula can readily be vectorised,
the gains would (probably) be more than lost in transferring the results
back to FP registers (or suitably reshuffling them into vector
registers).

Note that the main loop could likely be scheduled slightly better by
expanding the filter macro and interleaving loads with arithmetic.
It is not clear yet if that would be relevant for vector processing (as
opposed to traditional SIMD).

We could also use fewer vectors, but there is not much point in sparing
them (they are *all* callee-clobbered).
2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont
e180326a0b lavc/aacpsdsp: RISC-V V mul_pair_single 2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont
b0cacf4c3f lavc/aacpsdsp: RISC-V V add_squares 2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont
453aba71e6 lavc/vorbisdsp: RISC-V V inverse_coupling
This uses the following vectorisation:

    for (i = 0; i < blocksize; i++) {
        ang[i] = mag[i] - copysignf(fmaxf(ang[i], 0.f), mag[i]);
        mag[i] = mag[i] - copysignf(fminf(ang[i], 0.f), mag[i]);
    }
2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont
220dfd0945 lavc/fmtconvert: RISC-V V int32_to_float_fmul_array8 2022-09-27 13:19:52 +02:00