This work is sponsored by, and copyright, Google.
This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.
This gives a pretty substantial speedup for the smaller subpartitions.
The code size increases from 21512 bytes to 31400 bytes.
The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.
Before:
vp9_inv_dct_dct_16x16_sub1_add_10_neon: 284.6
vp9_inv_dct_dct_16x16_sub2_add_10_neon: 1902.7
vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1903.0
vp9_inv_dct_dct_16x16_sub8_add_10_neon: 2201.1
vp9_inv_dct_dct_16x16_sub12_add_10_neon: 2510.0
vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2821.3
vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1011.6
vp9_inv_dct_dct_32x32_sub2_add_10_neon: 9716.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9704.9
vp9_inv_dct_dct_32x32_sub8_add_10_neon: 10641.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon: 11555.7
vp9_inv_dct_dct_32x32_sub16_add_10_neon: 12499.8
vp9_inv_dct_dct_32x32_sub20_add_10_neon: 13403.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon: 14335.8
vp9_inv_dct_dct_32x32_sub28_add_10_neon: 15253.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16179.5
After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon: 282.8
vp9_inv_dct_dct_16x16_sub2_add_10_neon: 1142.4
vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1139.0
vp9_inv_dct_dct_16x16_sub8_add_10_neon: 1772.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon: 2515.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2823.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1012.7
vp9_inv_dct_dct_32x32_sub2_add_10_neon: 6944.4
vp9_inv_dct_dct_32x32_sub4_add_10_neon: 6944.2
vp9_inv_dct_dct_32x32_sub8_add_10_neon: 7609.8
vp9_inv_dct_dct_32x32_sub12_add_10_neon: 9953.4
vp9_inv_dct_dct_32x32_sub16_add_10_neon: 10770.1
vp9_inv_dct_dct_32x32_sub20_add_10_neon: 13418.8
vp9_inv_dct_dct_32x32_sub24_add_10_neon: 14330.7
vp9_inv_dct_dct_32x32_sub28_add_10_neon: 15257.1
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16190.6
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
This reduces the code size of libavcodec/aarch64/vp9itxfm_16bpp_neon.o from
26288 to 21512 bytes.
This gives a small slowdown of a couple of tens of cycles, but makes
it more feasible to add more optimized versions of these transforms.
Before:
vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1887.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2801.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9691.4
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16154.9
After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1899.5
vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2827.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9714.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16175.9
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
Compared to the arm version, on aarch64 we can keep the full 8x8
transform in registers, and for 16x16 and 32x32, we can process
it in slices of 4 pixels instead of 2.
Examples of runtimes vs the 32 bit version, on a Cortex A53:
ARM AArch64
vp9_inv_adst_adst_4x4_sub4_add_10_neon: 111.0 109.7
vp9_inv_adst_adst_8x8_sub8_add_10_neon: 914.0 733.5
vp9_inv_adst_adst_16x16_sub16_add_10_neon: 5184.0 3745.7
vp9_inv_dct_dct_4x4_sub1_add_10_neon: 65.0 65.7
vp9_inv_dct_dct_4x4_sub4_add_10_neon: 100.0 96.7
vp9_inv_dct_dct_8x8_sub1_add_10_neon: 111.0 119.7
vp9_inv_dct_dct_8x8_sub8_add_10_neon: 618.0 494.7
vp9_inv_dct_dct_16x16_sub1_add_10_neon: 295.1 284.6
vp9_inv_dct_dct_16x16_sub2_add_10_neon: 2303.2 1883.9
vp9_inv_dct_dct_16x16_sub8_add_10_neon: 2984.8 2189.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon: 3890.0 2799.4
vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1044.4 1012.7
vp9_inv_dct_dct_32x32_sub2_add_10_neon: 13333.7 9695.1
vp9_inv_dct_dct_32x32_sub16_add_10_neon: 18531.3 12459.8
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 24470.7 16160.2
vp9_inv_wht_wht_4x4_sub4_add_10_neon: 83.0 79.7
The larger transforms are significantly faster than the corresponding
ARM versions.
The speedup vs C code is smaller than in 32 bit mode, probably
because the 64 bit intermediates in the C code can be expressed
more efficiently in aarch64.
Signed-off-by: Martin Storsjö <martin@martin.st>