FFmpeg

virtualenv/FFmpeg

Fork 0

mirror of https://github.com/FFmpeg/FFmpeg.git synced 2025-03-23 04:24:35 +02:00

Commit Graph

Author	SHA1	Message	Date
Rémi Denis-Courmont	2abafd7307	lavc/bswapdsp: RISC-V V bswap16_buf	2022-10-05 08:26:19 +02:00
Rémi Denis-Courmont	d7528af4df	lavc/bswapdsp: RISC-V V bswap_buf	2022-10-05 08:26:19 +02:00
Rémi Denis-Courmont	f0ef11ea83	lavc/bswapdsp: RISC-V B bswap_buf Simply taking the Zbb REV8 instruction into use in a simple loop gives some significant savings: bswap_buf_c: 1081.0 bswap_buf_rvb_b: 771.0 But we can also use the 64-bit REV8 as a pseudo-SIMD instruction with just one additional shift, and one fewer load, effectively doubling the bandwidth. Consequently, this patch is useful even if the compile-time target has Zbb enabled for C code: bswap_buf_c: 1081.0 bswap_buf_rvb_b: 341.0 (this patch) On the other hand, this approach fails miserably for bswap16_buf as the ratio of shifts and stores becomes unfavorable compared to naïve C: bswap16_buf_c: 1542.0 bswap16_buf_rvb_b: 1803.7 Unrolling to process 128 bits (4 samples) at a time actually worsens performance ever so slightly: bswap_buf_c: 1081.0 bswap_buf_rvb_b: 408.5	2022-10-05 08:26:19 +02:00

Author

SHA1

Message

Date

Rémi Denis-Courmont

2abafd7307

lavc/bswapdsp: RISC-V V bswap16_buf

2022-10-05 08:26:19 +02:00

Rémi Denis-Courmont

d7528af4df

lavc/bswapdsp: RISC-V V bswap_buf

2022-10-05 08:26:19 +02:00

Rémi Denis-Courmont

f0ef11ea83

lavc/bswapdsp: RISC-V B bswap_buf

Simply taking the Zbb REV8 instruction into use in a simple loop gives
some significant savings:

bswap_buf_c: 1081.0
bswap_buf_rvb_b: 771.0

But we can also use the 64-bit REV8 as a pseudo-SIMD instruction with
just one additional shift, and one fewer load, effectively doubling the
bandwidth. Consequently, this patch is useful even if the compile-time
target has Zbb enabled for C code:

bswap_buf_c: 1081.0
bswap_buf_rvb_b: 341.0  (this patch)

On the other hand, this approach fails miserably for bswap16_buf as the
ratio of shifts and stores becomes unfavorable compared to naïve C:

bswap16_buf_c: 1542.0
bswap16_buf_rvb_b: 1803.7

Unrolling to process 128 bits (4 samples) at a time actually worsens
performance ever so slightly:

bswap_buf_c: 1081.0
bswap_buf_rvb_b: 408.5

2022-10-05 08:26:19 +02:00

3 Commits