Unrolling the main loop to process, instead of 4 elements:
- 8: minor gain of 2 cycles (not worth the extra object size)
- 2: loss of 8 cycles.
Assigning STEP to a register is a loss. Output address (Y) is almost always
unaligned.
Timings:
- C (32/64 bits): 117/109 cycles
- SSE: 57 cycles
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>