mirror of
https://github.com/FFmpeg/FFmpeg.git
synced 2025-11-23 21:54:53 +02:00
The key idea is to pass the pre-generated tables to the TBL instruction
and churn through the data 16 bytes at a time. The remaining 4 elements
are handled with a specialized block located at the end of the routine.
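
As a rough illustration of the idea above, here is a portable scalar sketch in C. It is not the NEON routine from this commit: tbl16_model, build_table and shuffle_bytes_model are hypothetical names, and the sketch assumes that a variant such as 2103 means dst[0] = src[2], dst[1] = src[1], dst[2] = src[0], dst[3] = src[3] for every 4-byte element.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a single-register TBL lookup: out[i] = src[idx[i]].
 * Indices outside 0..15 yield 0, which is how TBL treats them. */
static void tbl16_model(uint8_t out[16], const uint8_t src[16],
                        const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = idx[i] < 16 ? src[idx[i]] : 0;
}

/* Hypothetical helper: expand a 4-byte pattern such as {2,1,0,3} into a
 * 16-entry index table that covers four consecutive 4-byte elements. */
static void build_table(uint8_t table[16], const uint8_t pattern[4])
{
    for (int i = 0; i < 16; i++)
        table[i] = (uint8_t)((i & ~3) + pattern[i & 3]);
}

/* Hypothetical driver: full 16-byte blocks go through the table lookup,
 * leftover 4-byte elements are shuffled one at a time at the end.
 * src and dst are assumed not to overlap. */
static void shuffle_bytes_model(const uint8_t *src, uint8_t *dst,
                                size_t size, const uint8_t pattern[4])
{
    uint8_t table[16];
    size_t i = 0;

    build_table(table, pattern);

    for (; i + 16 <= size; i += 16)
        tbl16_model(dst + i, src + i, table);

    for (; i + 4 <= size; i += 4)
        for (int j = 0; j < 4; j++)
            dst[i + j] = src[i + pattern[j]];
}
```

In the real routine the table would be generated ahead of time and kept in a vector register, so the main loop is just a load, one TBL and a store per 16 bytes.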
The 3210 variant can be implemented using rev32. Surprisingly, that is
slower than the generic TBL on A78, yet much faster on A72.
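
To see why rev32 is a drop-in alternative for the 3210 variant, the scalar C sketch below compares a byte reversal inside each 32-bit element with a table lookup. Both functions and the table layout are illustrative assumptions, not code from this commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of REV32 on byte lanes: reverse the byte order inside
 * each 32-bit element. */
static void rev32_model(const uint8_t *src, uint8_t *dst, size_t size)
{
    for (size_t i = 0; i + 4 <= size; i += 4) {
        dst[i + 0] = src[i + 3];
        dst[i + 1] = src[i + 2];
        dst[i + 2] = src[i + 1];
        dst[i + 3] = src[i + 0];
    }
}

/* Index table a TBL-based 3210 shuffle could use for one 16-byte block
 * (assumed layout: pattern {3,2,1,0} repeated with offsets 0, 4, 8, 12). */
static const uint8_t table_3210[16] = {
     3,  2,  1,  0,
     7,  6,  5,  4,
    11, 10,  9,  8,
    15, 14, 13, 12,
};

/* Scalar model of the table lookup for one 16-byte block. */
static void tbl_3210_model(const uint8_t src[16], uint8_t dst[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = src[table_3210[i]];
}
```

Both models produce identical output for any 16-byte block, so the choice between them is purely a matter of per-core throughput, which is what the numbers below measure.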
There may be some room for improvement. Instead of handling the last 8
and then the last 4 bytes separately, we could possibly load those 4
bytes into {v0.s}[2] and process them along with the last 8 bytes.
Speeds measured with checkasm --test=sw_rgb --bench --runs=10 | grep shuf:
- A78
shuffle_bytes_0321_c: 75.5 ( 1.00x)
shuffle_bytes_0321_neon: 26.5 ( 2.85x)
shuffle_bytes_1203_c: 136.2 ( 1.00x)
shuffle_bytes_1203_neon: 27.2 ( 5.00x)
shuffle_bytes_1230_c: 135.5 ( 1.00x)
shuffle_bytes_1230_neon: 28.0 ( 4.84x)
shuffle_bytes_2013_c: 138.8 ( 1.00x)
shuffle_bytes_2013_neon: 22.0 ( 6.31x)
shuffle_bytes_2103_c: 76.5 ( 1.00x)
shuffle_bytes_2103_neon: 20.5 ( 3.73x)
shuffle_bytes_2130_c: 137.5 ( 1.00x)
shuffle_bytes_2130_neon: 28.0 ( 4.91x)
shuffle_bytes_3012_c: 138.2 ( 1.00x)
shuffle_bytes_3012_neon: 21.5 ( 6.43x)
shuffle_bytes_3102_c: 138.2 ( 1.00x)
shuffle_bytes_3102_neon: 27.2 ( 5.07x)
shuffle_bytes_3210_c: 138.0 ( 1.00x)
shuffle_bytes_3210_neon: 22.0 ( 6.27x)
shuf3210 using rev32
shuffle_bytes_3210_c: 139.0 ( 1.00x)
shuffle_bytes_3210_neon: 28.5 ( 4.88x)
- A72
shuffle_bytes_0321_c: 120.0 ( 1.00x)
shuffle_bytes_0321_neon: 36.0 ( 3.33x)
shuffle_bytes_1203_c: 188.2 ( 1.00x)
shuffle_bytes_1203_neon: 37.8 ( 4.99x)
shuffle_bytes_1230_c: 195.0 ( 1.00x)
shuffle_bytes_1230_neon: 36.0 ( 5.42x)
shuffle_bytes_2013_c: 195.8 ( 1.00x)
shuffle_bytes_2013_neon: 43.5 ( 4.50x)
shuffle_bytes_2103_c: 117.2 ( 1.00x)
shuffle_bytes_2103_neon: 53.5 ( 2.19x)
shuffle_bytes_2130_c: 203.2 ( 1.00x)
shuffle_bytes_2130_neon: 37.8 ( 5.38x)
shuffle_bytes_3012_c: 183.8 ( 1.00x)
shuffle_bytes_3012_neon: 46.8 ( 3.93x)
shuffle_bytes_3102_c: 180.8 ( 1.00x)
shuffle_bytes_3102_neon: 37.8 ( 4.79x)
shuffle_bytes_3210_c: 195.8 ( 1.00x)
shuffle_bytes_3210_neon: 37.8 ( 5.19x)
shuf3210 using rev32
shuffle_bytes_3210_c: 194.8 ( 1.00x)
shuffle_bytes_3210_neon: 30.8 ( 6.33x)
- x13s
shuffle_bytes_0321_c: 49.4 ( 1.00x)
shuffle_bytes_0321_neon: 18.1 ( 2.72x)
shuffle_bytes_1203_c: 98.4 ( 1.00x)
shuffle_bytes_1203_neon: 18.4 ( 5.35x)
shuffle_bytes_1230_c: 97.4 ( 1.00x)
shuffle_bytes_1230_neon: 19.1 ( 5.09x)
shuffle_bytes_2013_c: 101.4 ( 1.00x)
shuffle_bytes_2013_neon: 16.9 ( 6.01x)
shuffle_bytes_2103_c: 53.9 ( 1.00x)
shuffle_bytes_2103_neon: 13.9 ( 3.88x)
shuffle_bytes_2130_c: 100.9 ( 1.00x)
shuffle_bytes_2130_neon: 19.1 ( 5.27x)
shuffle_bytes_3012_c: 97.4 ( 1.00x)
shuffle_bytes_3012_neon: 17.1 ( 5.69x)
shuffle_bytes_3102_c: 100.9 ( 1.00x)
shuffle_bytes_3102_neon: 19.1 ( 5.27x)
shuffle_bytes_3210_c: 100.6 ( 1.00x)
shuffle_bytes_3210_neon: 16.9 ( 5.96x)
shuf3210 using rev32
shuffle_bytes_3210_c: 100.6 ( 1.00x)
shuffle_bytes_3210_neon: 18.6 ( 5.40x)
Signed-off-by: Martin Storsjö <martin@martin.st>