This avoids one redundant load per row; pix3 from the previous
iteration can be used as pix2 in the next one.
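
As a rough scalar illustration of the idea (not the NEON code itself), the
sketch below assumes the function sums absolute differences between pix1 and
the rounded average of two vertically adjacent rows, pix2 and
pix3 = pix2 + stride, over a 16 pixel wide block. The names sad_y2_reload,
sad_y2_reuse, avg2 and BLOCK_W are hypothetical and chosen for this example
only; the local arrays stand in for vector registers.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_W 16 /* assumed block width, for illustration only */

/* Rounded average of two pixel values. */
static inline int avg2(int a, int b)
{
    return (a + b + 1) >> 1;
}

/* Baseline: both the pix2 row and the pix3 row are read on every iteration. */
static int sad_y2_reload(const uint8_t *pix1, const uint8_t *pix2,
                         ptrdiff_t stride, int h)
{
    const uint8_t *pix3 = pix2 + stride;
    int sum = 0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < BLOCK_W; x++)
            sum += abs(pix1[x] - avg2(pix2[x], pix3[x]));
        pix1 += stride;
        pix2 += stride;
        pix3 += stride;
    }
    return sum;
}

/* Reuse: the pix3 row of one iteration is kept and serves as the pix2 row
 * of the next one, so only one new reference row is loaded per iteration. */
static int sad_y2_reuse(const uint8_t *pix1, const uint8_t *pix2,
                        ptrdiff_t stride, int h)
{
    const uint8_t *pix3 = pix2 + stride;
    uint8_t prev[BLOCK_W], cur[BLOCK_W];
    int sum = 0;

    memcpy(prev, pix2, BLOCK_W);        /* first pix2 row, loaded once */
    for (int y = 0; y < h; y++) {
        memcpy(cur, pix3, BLOCK_W);     /* the only reference load per row */
        for (int x = 0; x < BLOCK_W; x++)
            sum += abs(pix1[x] - avg2(prev[x], cur[x]));
        memcpy(prev, cur, BLOCK_W);     /* this pix3 becomes the next pix2 */
        pix1 += stride;
        pix3 += stride;
    }
    return sum;
}
```

Both variants read h + 1 reference rows in total and return the same sum;
only the number of loads inside the loop differs.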

Before:              Cortex A53     A72     A73
pix_abs_0_2_neon:         138.0    59.7    48.0
After:
pix_abs_0_2_neon:         109.7    50.2    39.5

Signed-off-by: Martin Storsjö <martin@martin.st>