1
0
mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-02 03:06:28 +02:00
Commit Graph

123 Commits

Author SHA1 Message Date
Andreas Rheinhardt
88b3b09afa avcodec/aacenc: Move initializing DSP out of aacenc.c
Otherwise aacenc.o gets pulled in by the aacencdsp checkasm
test and it in turn pulls the rest of lavc in.
Besides being bad size-wise this also has the downside that
it pulls in avpriv_(cga|vga16)_font from libavutil which are
marked as being imported from another library when building
libavcodec as a DLL and this breaks checkasm because it links
both lavc and lavu statically.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2024-03-02 02:54:11 +01:00
sunyuechi
a7ad76fbbf lavc/me_cmp: R-V V nsse
C908:
nsse_0_c: 1990.0
nsse_0_rvv_i32: 572.0
nsse_1_c: 910.0
nsse_1_rvv_i32: 456.0

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-27 20:31:30 +02:00
sunyuechi
9b90d0d36a lavc/me_cmp: R-V V vsse vsad intra
C908:
vsad_4_c: 681.0
vsad_4_rvv_i32: 182.2
vsad_5_c: 278.0
vsad_5_rvv_i32: 145.2
vsse_4_c: 595.0
vsse_4_rvv_i32: 125.2
vsse_5_c: 281.0
vsse_5_rvv_i32: 101.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-25 11:05:25 +02:00
sunyuechi
925b55a5e8 lavc/me_cmp: R-V V vsse vsad
C908:
vsad_0_c: 936.0
vsad_0_rvv_i32: 236.2
vsad_1_c: 424.0
vsad_1_rvv_i32: 190.2
vsse_0_c: 877.0
vsse_0_rvv_i32: 204.2
vsse_1_c: 439.0
vsse_1_rvv_i32: 140.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-25 11:05:25 +02:00
sunyuechi
9cb8f262f2 lavc/me_cmp: R-V V sse
C908:
sse_0_c: 614.7
sse_0_rvv_i32: 138.2
sse_1_c: 302.7
sse_1_rvv_i32: 107.2
sse_2_c: 175.7
sse_2_rvv_i32: 104.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-21 20:08:55 +02:00
sunyuechi
37463d7979 lavc/me_cmp: R-V V pix_abs_y2
C908:
pix_abs_0_2_c: 904.0
pix_abs_0_2_rvv_i32: 172.2
pix_abs_1_2_c: 460.0
pix_abs_1_2_rvv_i32: 168.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-21 20:08:25 +02:00
sunyuechi
f1ec475f66 lavc/me_cmp: R-V V pix_abs_x2
C908:
pix_abs_0_1_c: 767.0
pix_abs_0_1_rvv_i32: 196.2
pix_abs_1_1_c: 388.0
pix_abs_1_1_rvv_i32: 185.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-21 20:08:25 +02:00
sunyuechi
b41e115dde lavc/me_cmp: R-V V pix_abs
C908:
pix_abs_0_0_c: 534.0
pix_abs_0_0_rvv_i32: 136.2
pix_abs_1_0_c: 287.7
pix_abs_1_0_rvv_i32: 125.2
sad_0_c: 534.0
sad_0_rvv_i32: 136.2
sad_1_c: 287.7
sad_1_rvv_i32: 125.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-21 20:08:25 +02:00
sunyuechi
d897bbb48d lavc/vp8dsp: R-V V vp8_idct_dc_add4uv
c908:
vp8_idct_dc_add4uv_c: 387.7
vp8_idct_dc_add4uv_rvv_i32: 134.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:45:49 +02:00
sunyuechi
e74e18cae4 lavc/vp8dsp: R-V V vp8_idct_dc_add4y
c908:
vp8_idct_dc_add4y_c: 368.5
vp8_idct_dc_add4y_rvv_i32: 134.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:45:49 +02:00
sunyuechi
c12053cefc lavc/vp8dsp: R-V V vp8_idct_dc_add
c908:
vp8_idct_dc_add_c: 102.2
vp8_idct_dc_add_rvv_i32: 42.0

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:45:49 +02:00
sunyuechi
89189dd9e7 lavc/rv34dsp: R-V V rv34_idct_dc_add
C908:
rv34_idct_dc_add_c: 134.7
rv34_idct_dc_add_rvv_i32: 45.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:33:35 +02:00
sunyuechi
ee08974f90 lavc/rv34dsp: R-V V rv34_inv_transform_dc
C908:
rv34_inv_transform_dc_c: 35.5
rv34_inv_transform_dc_rvv_i32: 27.0

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:33:35 +02:00
sunyuechi
fdebde817c lavc/blockdsp: R-V V clear_blocks
C908:
blockdsp.clear_blocks_c: 128.2
blockdsp.clear_blocks_rvv_i64: 102.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-13 21:29:46 +02:00
sunyuechi
0748d2bbc7 lavc/blockdsp: R-V V clear_block
C908:
blockdsp.clear_block_c: 47.2
blockdsp.clear_block_rvv_i64: 28.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-12 22:00:03 +02:00
sunyuechi
8e23ebe6f9 lavc/svq1enc: R-V V ssd_int8_vs_int16
C908
ssd_int8_vs_int16_c: 207.7
ssd_int8_vs_int16_rvv_i32: 14.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-01-17 17:49:54 +02:00
Rémi Denis-Courmont
278b4b60d6 lavc/takdsp: R-V V decorrelate_sf
decorrelate_sf_c:      259.2
decorrelate_sf_rvv_i32: 45.5
2024-01-15 19:00:25 +02:00
sunyuechi
3d39b8d4e7 lavc/takdsp: R-V V decorrelate_sm
C908:
decorrelate_sm_c: 130.0
decorrelate_sm_rvv_i32: 43.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
(with minor changes)
2023-12-22 17:40:00 +02:00
James Almer
46775e64f8 avcodec/takdsp: fix const correctness
Signed-off-by: James Almer <jamrial@gmail.com>
2023-12-22 09:28:04 -03:00
sunyuechi
c933ff2779 lavc/takdsp: R-V V decorrelate_sr
C908:
decorrelate_sr_c: 95.5
decorrelate_sr_rvv_i32: 28.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-21 22:42:34 +02:00
sunyuechi
864174dd00 lavc/takdsp: R-V V decorrelate_ls
C908:
decorrelate_ls_c: 69.7
decorrelate_ls_rvv_i32: 27.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-21 22:42:34 +02:00
Rémi Denis-Courmont
cdd38a2ffe lavc/aacpsdsp: fix R-V V stereo interpolate
The penultimate loop iteration could pick any vl such that:
 vlenb/4 < vl <= vlenb/2
Thus if the total length is not a multiple of vlenb/2, the vfadd.vf
on the penultimate iteration would yield corrupt values for the last
iteration.

To avoid this, force vl = vlen/2 until the last iteration. Unfortunately
this latent bug is not reproducible with either hardware or QEMU as of now.
2023-12-21 17:54:23 +02:00
Rémi Denis-Courmont
db32f75c63 lavc/opusdsp: simplify R-V V postfilter
This skips the round-trip to scalar register for the sliding 'x'
coefficients, improving performance by about 5%. The trick here is that
the vector slide-up instruction preserves elements in destination vector
until the slide offset.

The switch from vfslide1up.vf to vslideup.vi also allows the elimination
of data dependencies on consecutive slides. Since the specifications
recommend sticking to power of two offsets, we could slide as follows:

        vslideup.vi v8, v0, 2
        vslideup.vi v4, v0, 1
        vslideup.vi v12, v8, 1
        vslideup.vi v16, v8, 2

However in the device under test, this seems to make performance slightly
worse, so this is left for (in)validation with future better hardware.
2023-12-21 17:54:08 +02:00
Rémi Denis-Courmont
419145c11b lavc/vc1dsp: fix R-V V vector lengths
The 8x4 and 4x4 use a needlessly large multiplier (unless/until we care
about embedded 64-bit-vector hardware). This is merely suboptimal.

The 8x4 case also uses an incorrect vector length, which leads to incorrect
behaviour on future/hypothetical hardware with 256-bit or larger vectors.

Pointed-out-by: Martin Storsjö <martin@martin.st>
2023-12-17 09:27:52 +02:00
Martin Storsjö
b51d9eb58e riscv: vc1dsp: Don't check vlenb before checking the CPU flags
We can't call ff_get_rv_vlenb() if we don't have RVV available
at all.

Acked-by: Rémi Denis-Courmont <remi@remlab.net>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-12-16 22:30:26 +02:00
Rémi Denis-Courmont
918b3ed2d5 lavc/lpc: R-V V compute_autocorr
The loop iterates over the length of the vector, not the order. This is
to avoid reloading the same data for each lag value. However this means
the loop only works if the maximum order is no larger than VLENB.

The loop is roughly equivalent to:

    for (size_t j = 0; j < lag; j++)
        autoc[j] = 1.;

    while (len > lag) {
        for (ptrdiff_t j = 0; j < lag; j++)
            autoc[j] += data[j] * *data;
        data++;
        len--;
    }

    while (len > 0) {
        for (ptrdiff_t j = 0; j < len; j++)
            autoc[j] += data[j] * *data;
        data++;
        len--;
    }

Since register pressure is only at 50%, it should be possible to implement
the same loop for order up to 2xVLENB. But this is left for future work.

Performance numbers are all over the place from ~1.25x to ~4x speedups,
but at least they are always noticeably better than nothing.
2023-12-16 11:18:01 +02:00
sunyuechi
98596f90f4 lavc/aacencdsp: R-V V abs_pow34
C908:
abs_pow34_c: 535.5
abs_pow34_rvv_f32: 337.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-11 18:42:07 +02:00
Rémi Denis-Courmont
272d0c164d lavc/lpc: R-V V apply_welch_window
apply_welch_window_even_c:       617.5
apply_welch_window_even_rvv_f64: 235.0
apply_welch_window_odd_c:        709.0
apply_welch_window_odd_rvv_f64:  256.5
2023-12-11 18:17:43 +02:00
Rémi Denis-Courmont
b3825bbe45 riscv: test for assembler support
This should fix the build on LLVM 16 and earlier, at the cost of turning
all non-RVV optimisations off.
2023-12-08 17:21:09 +02:00
sunyuechi
0b9d009b4a lavc/vc1dsp: R-V V inv_trans
C908:
vc1dsp.vc1_inv_trans_4x4_dc_c:      125.7
vc1dsp.vc1_inv_trans_4x4_dc_rvv_i32: 53.5
vc1dsp.vc1_inv_trans_4x8_dc_c:      230.7
vc1dsp.vc1_inv_trans_4x8_dc_rvv_i32: 65.5
vc1dsp.vc1_inv_trans_8x4_dc_c:      228.7
vc1dsp.vc1_inv_trans_8x4_dc_rvv_i64: 64.5
vc1dsp.vc1_inv_trans_8x8_dc_c:      476.5
vc1dsp.vc1_inv_trans_8x8_dc_rvv_i64: 80.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-08 17:20:48 +02:00
sunyuechi
8bdb663062 lavc/ac3dsp: R-V V float_to_fixed24
c910
    float_to_fixed24_c: 2207.2
    float_to_fixed24_rvv_f32: 696.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-06 16:04:22 +02:00
Rémi Denis-Courmont
0fa421c8f1 lavc/llvidencdsp: add R-V V diff_bytes
diff_bytes_c:      163.0
diff_bytes_rvv_i32: 52.7
2023-11-23 18:57:18 +02:00
Rémi Denis-Courmont
0183c2c830 lavc/aacpsdsp: use LMUL=2 and amortise strides
The input is laid out in 16 segments, of which 13 actually need to be
loaded. There are no really efficient ways to deal with this:
1) If we load 8 segments wit unit stride, then narrow to 16 segments with
   right shifts, we can only get one half-size vector per segment, or just 2
   elements per vector (EMUL=1/2) - at least with 128-bit vectors.
   This ends up unsurprisingly about as fas as the C code.
2) The current approach is to load with strides. We keep that approach,
   but improve it using three 4-segmented loads instead of 12 single-segment
   loads. This divides the number of distinct loaded addresses by 4.
3) A potential third approach would be to avoid segmentation altogether
   and splat the scalar coefficient into vectors. Then we can use a
   unit-stride and maximum EMUL. But the downside then is that we have to
   multiply the 3 (of 16) unused segments with zero as part of the
   multiply-accumulate operations.

In addition, we also reuse vectors mid-loop so as to increase the EMUL
from 1 to 2, which also improves performance a little bit.

Oeverall the gains are quite small with the device under test, as it does
not deal with segmented loads very well. But at least the code is tidier,
and should enjoy bigger speed-ups on better hardware implementation.

Before:
ps_hybrid_analysis_c:       1819.2
ps_hybrid_analysis_rvv_f32: 1037.0 (before)
ps_hybrid_analysis_rvv_f32:  990.0 (after)
2023-11-23 18:57:18 +02:00
Rémi Denis-Courmont
b88d4058f9 lavc/g722dsp: optimise R-V V apply_qmf
This stores the constant coefficients deinterleaved, so that they can be
loaded directly with NF=0. Unfortunately, we cannot optimise loading the
input, due to insufficient memory alignment (not 32-bit).

Before:
g722_apply_qmf_c:       82.5
g722_apply_qmf_rvv_i32: 78.2

After:
g722_apply_qmf_c:       82.5
g722_apply_qmf_rvv_i32: 65.2
2023-11-23 18:57:18 +02:00
Rémi Denis-Courmont
fbc7adba67 lavc/llviddsp: R-V V add_bytes
add_bytes_c:      2077.2
add_bytes_rvv_i32: 105.0
2023-11-18 22:07:14 +02:00
Rémi Denis-Courmont
ca664f2254 lavc/flacdsp: R-V V LPC16 function
In this case, the inner loop computing the scalar product can be reduced
to just one multiplication and one sum even with 128-bit vectors. The
result is a lot simpler, but also brings more modest performance gains:

flac_lpc_16_13_c:       15241.0
flac_lpc_16_13_rvv_i32: 11230.0
flac_lpc_16_16_c:       17884.0
flac_lpc_16_16_rvv_i32: 12125.7
flac_lpc_16_29_c:       27847.7
flac_lpc_16_29_rvv_i32: 10494.0
flac_lpc_16_32_c:       30051.5
flac_lpc_16_32_rvv_i32: 10355.0
2023-11-18 22:06:57 +02:00
Rémi Denis-Courmont
295092b46d lavc/flacdsp: R-V V LPC32
The entire set of 32 coefficients and corresponding past 32 samples can
fit in a single vector (with LMUL=8) exactly, but... since widening
double the needed vector sizes, we still end up too short with 128-bit
vectors. This adds a very simple version for future 256+-bit hardware,
and for pred_orders values up to 16, and a bit more involved loop for
for 128-bit hardware with pred_orders between 17 and 32.

With 128-bit hardware, the benchmarks look like this:
flac_lpc_32_13_c:       30152.0
flac_lpc_32_13_rvv_i32: 10244.7
flac_lpc_32_16_c:       37314.2
flac_lpc_32_16_rvv_i32: 10126.2
flac_lpc_32_29_c:       61910.0
flac_lpc_32_29_rvv_i32: 14495.2
flac_lpc_32_32_c:       68204.0
flac_lpc_32_32_rvv_i32: 13273.7
2023-11-18 22:05:43 +02:00
Rémi Denis-Courmont
07c303b708 lavc/flacdsp: R-V V decorrelate_indep 16-bit packed
flac_decorrelate_indep2_16_c:        981.7
flac_decorrelate_indep2_16_rvv_i32:  199.2
flac_decorrelate_indep4_16_c:       1749.7
flac_decorrelate_indep4_16_rvv_i32:  401.2
flac_decorrelate_indep6_16_c:       2517.7
flac_decorrelate_indep6_16_rvv_i32:  858.0
flac_decorrelate_indep8_16_c:       3285.7
flac_decorrelate_indep8_16_rvv_i32: 1123.5
2023-11-17 23:59:56 +02:00
Rémi Denis-Courmont
fb0295e5fd lavc/flacdsp: R-V V decorrelate_indep 32-bit packed
flac_decorrelate_indep2_32_c:       981.7
flac_decorrelate_indep2_32_rvv_i32: 183.7
flac_decorrelate_indep4_32_c:      1749.7
flac_decorrelate_indep4_32_rvv_i32: 362.5
flac_decorrelate_indep6_32_c:      2517.7
flac_decorrelate_indep6_32_rvv_i32: 715.2
flac_decorrelate_indep8_32_c:      3285.7
flac_decorrelate_indep8_32_rvv_i32: 909.0
2023-11-17 23:59:56 +02:00
Rémi Denis-Courmont
6183a69c0b lavc/flacdsp: R-V V decorrelate_ms packed
flac_decorrelate_ms_16_c:       585.5
flac_decorrelate_ms_16_rvv_i32: 263.0
flac_decorrelate_ms_32_c:       584.7
flac_decorrelate_ms_32_rvv_i32: 250.0
2023-11-17 23:59:23 +02:00
Rémi Denis-Courmont
636ae0e0bc lavc/flacdsp: R-V V packed decorrelate_{l,r}s
flac_decorrelate_ms_16_c:       457.2
flac_decorrelate_ms_16_rvv_i32: 203.0
flac_decorrelate_ms_32_c:       457.2
flac_decorrelate_ms_32_rvv_i32: 203.5
flac_decorrelate_rs_16_c:       456.2
flac_decorrelate_rs_16_rvv_i32: 207.0
flac_decorrelate_rs_32_c:       456.2
flac_decorrelate_rs_32_rvv_i32: 210.5
2023-11-17 23:59:22 +02:00
Rémi Denis-Courmont
d076517056 lavc/llauddsp: R-V V scalarproduct_and_madd_int32
scalarproduct_and_madd_int32_c:      10899.7
scalarproduct_and_madd_int32_rvv_i32: 1749.0
2023-11-16 16:53:44 +02:00
Rémi Denis-Courmont
45d0eb3f70 lavc/llauddsp: R-V V scalarproduct_and_madd_int16
scalarproduct_and_madd_int16_c:      10355.7
scalarproduct_and_madd_int16_rvv_i32: 1480.0
2023-11-16 16:53:44 +02:00
Rémi Denis-Courmont
90a779bed6 lavc/huffyuvdsp: basic R-V V add_hfyu_left_pred_bgr32
Better performance can probably be achieved with a more intricate
unrolled loop, but this is a start:

add_hfyu_left_pred_bgr32_c: 15084.0
add_hfyu_left_pred_bgr32_rvv_i32: 10280.2

This would actually be cleaner with the RISC-V P extension, but that is
not ratified yet (I think?) and usually not supported if V is supported.
2023-11-15 16:51:07 +02:00
Rémi Denis-Courmont
c536e92207 lavc/sbrdsp: R-V V hf_apply_noise functions
This is restricted to 128-bit vectors as larger vector sizes could read
past the end of the noise array. Support for future hardware with larger
vector sizes is left for some other time.

hf_apply_noise_0_c:       2319.7
hf_apply_noise_0_rvv_f32: 1229.0
hf_apply_noise_1_c:       2539.0
hf_apply_noise_1_rvv_f32: 1244.7
hf_apply_noise_2_c:       2319.7
hf_apply_noise_2_rvv_f32: 1232.7
hf_apply_noise_3_c:       2541.2
hf_apply_noise_3_rvv_f32: 1244.2
2023-11-13 18:34:29 +02:00
Rémi Denis-Courmont
5b33104fca lavc/sbrdsp: R-V V hf_gen
hf_gen_c:      2922.7
hf_gen_rvv_f32: 731.5
2023-11-13 18:33:02 +02:00
Rémi Denis-Courmont
cd7b352c53 lavc/sbrdsp: R-V V autocorrelate
With 5 accumulator vectors and 6 inputs, this can only use LMUL=2.
Also the number of vector loop iterations is small, just 5 on 128-bit
vector hardware.

The vector loop is somewhat unusual in that it processes data in
descending memory order, in order to save on vector slides:
in descending order, we can extract elements to carry over to the next
iteration from the bottom of the vectors directly. With ascending order
(see in the Opus postfilter function), there are no ways to get the top
elements directly. On the downside, this requires the use of separate
shift and sub (the would-be SH3SUB instruction does not exist), with
a small pipeline stall on the vector load address.

The edge cases in scalar are done in scalar as this saves on loads
and remains significantly faster than C.

autocorrelate_c: 669.2
autocorrelate_rvv_f32: 421.0
2023-11-12 14:03:09 +02:00
Rémi Denis-Courmont
f576a0835b lavc/aacpsdsp: rework R-V V hybrid_synthesis_deint
Given the size of the data set, strided memory accesses cannot be avoided.
We can still do better than the current code.

ps_hybrid_synthesis_deint_c:       12065.5
ps_hybrid_synthesis_deint_rvv_i32: 13650.2 (before)
ps_hybrid_synthesis_deint_rvv_i64:  8181.0 (after)
2023-11-12 14:03:09 +02:00
Rémi Denis-Courmont
eb508702a8 lavc/aacpsdsp: rework R-V V add_squares
Segmented loads may be slower than not. So this advantageously uses a
unit-strided load and narrowing shifts instead.

Before:
ps_add_squares_c: 60757.7
ps_add_squares_rvv_f32: 22242.5

After:
ps_add_squares_c: 60516.0
ps_add_squares_rvv_i64: 17067.7
2023-11-12 14:03:09 +02:00
Rémi Denis-Courmont
adc87a5f7c lavc/opusdsp: rewrite R-V V postfilter
This uses a more traditional approach allowing up processing of up to
period minus two elements per iteration. This also allows the algorithm
to work for all and any vector length.

As the T-Head C908 device under test can load 16 elements loop, there is
unsurprisingly a little performance drop when the period is minimal and
the parallelism is capped at 13 elements:

Before:
postfilter_15_c:         21222.2
postfilter_15_rvv_f32:   22007.7
postfilter_512_c:        20189.7
postfilter_512_rvv_f32:  22004.2
postfilter_1022_c:       20189.7
postfilter_1022_rvv_f32: 22004.2

After:
postfilter_15_c:         20189.5
postfilter_15_rvv_f32:    7057.2
postfilter_512_c:        20189.5
postfilter_512_rvv_f32:   5667.2
postfilter_1022_c:       20192.7
postfilter_1022_rvv_f32:  5667.2
2023-11-06 22:09:30 +02:00