By replacing memcpy with an unrolled loop using the alignment knowledge
it has, some speedup can be obtained.
Before (gcc 4.6.1): ~400 cycles
After: ~370 cycles
Overall, around 2% speed increase when decoding a 2400s mp3 to f32le.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>