Using absolute-difference-accumulate requires twice as many
absolute-difference instructions, but avoids the need for the
uaddl and add instructions, reducing the total number of
instructions by 3.

The accumulate instructions can be interleaved with the rest of the
calculation, avoiding tight dependencies at the end. Unfortunately,
this is marginally slower on Cortex A53, but faster on A72 and A73;
Graviton 3 is unchanged.
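
As a rough illustration of why the two instruction sequences are
interchangeable, here is a scalar C model of one 16-pixel row. It is
only a sketch, not the actual assembly: the function names and the
8-lane accumulator layout are assumptions made for this example, and
the instruction names in the comments indicate the style of operation
being modelled rather than the exact instructions used.

```c
#include <stdint.h>
#include <stdlib.h>

/* Variant A (hypothetical model): compute all 16 absolute differences
 * in 8 bits with one operation, then widen-add the two halves and add
 * the result into the 16-bit accumulator as separate steps. */
static void sad_row_abd_then_add(uint16_t acc[8],
                                 const uint8_t *a, const uint8_t *b)
{
    uint8_t d[16];
    for (int i = 0; i < 16; i++)
        d[i] = (uint8_t)abs(a[i] - b[i]);            /* absolute difference */
    for (int i = 0; i < 8; i++) {
        uint16_t sum = (uint16_t)(d[i] + d[i + 8]);  /* widening add (uaddl-style) */
        acc[i] = (uint16_t)(acc[i] + sum);           /* separate add */
    }
}

/* Variant B (hypothetical model): absolute-difference-accumulate.
 * Each half needs its own absolute-difference step, so there are twice
 * as many of them, but no separate widening add or add is needed. */
static void sad_row_abal(uint16_t acc[8],
                         const uint8_t *a, const uint8_t *b)
{
    for (int i = 0; i < 8; i++)                      /* low half */
        acc[i] = (uint16_t)(acc[i] + abs(a[i] - b[i]));
    for (int i = 0; i < 8; i++)                      /* high half */
        acc[i] = (uint16_t)(acc[i] + abs(a[i + 8] - b[i + 8]));
}

/* Final horizontal reduction of the eight 16-bit lanes. */
static uint32_t sum_lanes(const uint16_t acc[8])
{
    uint32_t s = 0;
    for (int i = 0; i < 8; i++)
        s += acc[i];
    return s;
}
```

Both variants leave the same values in acc for any input row. In
variant B the two accumulate steps are independent of any later add,
which is what allows them to be scheduled among the other
instructions instead of forming a dependency chain at the end.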

Before:            Cortex A53     A72     A73   Graviton 3
pix_abs_0_3_neon:       175.7   109.2    92.0         41.2
After:
pix_abs_0_3_neon:       179.7    96.7    87.5         41.2

Signed-off-by: Martin Storsjö <martin@martin.st>