mirror of https://github.com/FFmpeg/FFmpeg.git synced 2025-01-03 05:10:03 +02:00
FFmpeg/libavcodec/riscv
Rémi Denis-Courmont 0183c2c830 lavc/aacpsdsp: use LMUL=2 and amortise strides
The input is laid out in 16 segments, of which 13 actually need to be
loaded. There are no really efficient ways to deal with this:
1) If we load 8 segments with unit stride, then narrow to 16 segments with
   right shifts, we can only get one half-size vector per segment, or just 2
   elements per vector (EMUL=1/2) - at least with 128-bit vectors.
   This ends up unsurprisingly about as fast as the C code.
2) The current approach is to load with strides. We keep that approach,
   but improve it using three 4-segmented loads instead of 12 single-segment
   loads. This divides the number of distinct loaded addresses by 4.
3) A potential third approach would be to avoid segmentation altogether
   and splat the scalar coefficient into vectors. Then we can use a
   unit-stride and maximum EMUL. But the downside then is that we have to
   multiply the 3 (of 16) unused segments with zero as part of the
   multiply-accumulate operations.
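
As a rough illustration of approach 2, the following scalar C sketch models
strided loads and counts the addresses they generate. It is not the RVV
assembly from this commit: the helper names, the 16-float row size and the
element-based stride are assumptions made for the example, and it only covers
the 12 segments that the three 4-segmented loads replace.

```c
#include <assert.h>
#include <stddef.h>

#define ROW_FLOATS  16 /* assumed: each filter row holds 16 floats */
#define USED_FIELDS 12 /* the 12 fields covered by the strided loads */

/* Scalar model of a single-field strided load: dst[i] = base[i * stride].
 * The stride is in elements here (real RVV strides are in bytes).
 * Returns the number of addresses generated, one per element. */
static size_t strided_load(float *dst, const float *base,
                           size_t stride, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = base[i * stride];
    return vl;
}

/* Scalar model of a strided segment load with nf fields:
 * dst[f * vl + i] = base[i * stride + f]. Each element still needs only
 * one address, from which nf contiguous values are read. */
static size_t strided_seg_load(float *dst, const float *base,
                               size_t stride, size_t nf, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        for (size_t f = 0; f < nf; f++)
            dst[f * vl + i] = base[i * stride + f];
    return vl;
}

/* Hypothetical helper: twelve single-field strided loads. */
static size_t gather_single(float *dst, const float *table, size_t vl)
{
    size_t addrs = 0;

    for (size_t f = 0; f < USED_FIELDS; f++)
        addrs += strided_load(dst + f * vl, table + f, ROW_FLOATS, vl);
    return addrs;
}

/* Hypothetical helper: three 4-field strided segment loads. */
static size_t gather_seg4(float *dst, const float *table, size_t vl)
{
    size_t addrs = 0;

    assert(USED_FIELDS % 4 == 0);
    for (size_t f = 0; f < USED_FIELDS; f += 4)
        addrs += strided_seg_load(dst + f * vl, table + f, ROW_FLOATS, 4, vl);
    return addrs;
}
```

Both helpers fill dst with the same values, but for a vector length vl the
first generates 12 * vl addresses and the second 3 * vl, which is the factor
of 4 mentioned above.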

In addition, we reuse vectors mid-loop so as to increase the EMUL
from 1 to 2, which also improves performance a little.

Overall, the gains are quite small with the device under test, as it does
not deal with segmented loads very well. But at least the code is tidier,
and should enjoy bigger speed-ups on better hardware implementations.

Benchmark results:
ps_hybrid_analysis_c:       1819.2
ps_hybrid_analysis_rvv_f32: 1037.0 (before)
ps_hybrid_analysis_rvv_f32:  990.0 (after)
2023-11-23 18:57:18 +02:00
..
aacpsdsp_init.c
aacpsdsp_rvv.S lavc/aacpsdsp: use LMUL=2 and amortise strides 2023-11-23 18:57:18 +02:00
ac3dsp_init.c
ac3dsp_rvb.S
alacdsp_init.c
alacdsp_rvv.S
audiodsp_init.c
audiodsp_rvf.S
audiodsp_rvv.S
bswapdsp_init.c
bswapdsp_rvb.S
bswapdsp_rvv.S
exrdsp_init.c
exrdsp_rvv.S
flacdsp_init.c lavc/flacdsp: R-V V LPC16 function 2023-11-18 22:06:57 +02:00
flacdsp_rvv.S lavc/flacdsp: R-V V LPC16 function 2023-11-18 22:06:57 +02:00
fmtconvert_init.c
fmtconvert_rvv.S
g722dsp_init.c
g722dsp_rvv.S lavc/g722dsp: optimise R-V V apply_qmf 2023-11-23 18:57:18 +02:00
h264_chroma_init_riscv.c
h264_mc_chroma.S
huffyuvdsp_init.c
huffyuvdsp_rvv.S
idctdsp_init.c
idctdsp_rvv.S
jpeg2000dsp_init.c
jpeg2000dsp_rvv.S
llauddsp_init.c
llauddsp_rvv.S
llviddsp_init.c lavc/llviddsp: R-V V add_bytes 2023-11-18 22:07:14 +02:00
llviddsp_rvv.S lavc/llviddsp: R-V V add_bytes 2023-11-18 22:07:14 +02:00
Makefile lavc/llviddsp: R-V V add_bytes 2023-11-18 22:07:14 +02:00
opusdsp_init.c
opusdsp_rvv.S
pixblockdsp_init.c
pixblockdsp_rvi.S
pixblockdsp_rvv.S
sbrdsp_init.c
sbrdsp_rvv.S
utvideodsp_init.c
utvideodsp_rvv.S
vorbisdsp_init.c
vorbisdsp_rvv.S