The horizontal pass get ~2x performance with the patch
under single thread.
Tested overall performance using the command(avx2 enabled):
./ffmpeg -i 1080p.mp4 -vf gblur -f null /dev/null
./ffmpeg -i 1080p.mp4 -vf gblur=threads=1 -f null /dev/null
For single thread, the fps improves from 43 to 60, about 40%.
For multi-thread, the fps improves from 110 to 130, about 20%.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>