This fixes some out-of-bounds reads of global arrays and incorrect clipping.
No speed difference is measurable with clang on an i5. Also, all
important code paths on all important platforms should use SIMD.
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>