8x8 is about 5x faster than 16x16 on 1080p input. Since a block size of
8x8 makes the filter almost usable (time wise) and it's not obvious if
8x8 or 16x16 is better from a quality PoV (it really depends on the
input and parameters), the filter now defaults to 8x8, and as a result
libavfilter is micro bumped.
This removes the avcodec dependency and make the code almost twice as
fast. More to come.
The DCT factorization is based on "Fast and numerically stable
algorithms for discrete cosine transforms" from Gerlind Plonkaa &
Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015).
Make code slightly faster, simpler, clearer.
The filter is still slow as hell, and that change won't cause any
visible performance improvement (it still takes more than one minute to
process a single 1080p frame on a Core 2 here).