It is well known that fabs and fabsf are at least as fast and sometimes
faster than the FFABS macro, at least on the gcc+glibc combination.
For instance, see the reference:
http://patchwork.sourceware.org/patch/6735/.
This was a patch to glibc in order to remove their usages of a macro.
The reason essentially boils down to fabs using the __builtin_fabs of
the compiler, while FFABS needs to infer to not use a branch and to
simply change the sign bit. Usually the inference works, but sometimes
it does not. This may be easily checked by looking at the asm.
This also has the added benefit of reducing macro usage, which has
problems with side-effects.
Note that avcodec is not handled here, as it is huge and
most things there are integer arithmetic anyway.
Tested with FATE.
Reviewed-by: Clément Bœsch <u@pkh.me>
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
ceilf() can only work if the reminder of the division is not 0.
This fixes memory errors with for instance:
ffmpeg -f lavfi -i testsrc=s=800x500 -threads 3 -vf dctdnoiz -frames:v 1 -f null -
8x8 is about 5x faster than 16x16 on 1080p input. Since a block size of
8x8 makes the filter almost usable (time wise) and it's not obvious if
8x8 or 16x16 is better from a quality PoV (it really depends on the
input and parameters), the filter now defaults to 8x8, and as a result
libavfilter is micro bumped.
This removes the avcodec dependency and make the code almost twice as
fast. More to come.
The DCT factorization is based on "Fast and numerically stable
algorithms for discrete cosine transforms" from Gerlind Plonkaa &
Manfred Tasche (DOI: 10.1016/j.laa.2004.07.015).
Make code slightly faster, simpler, clearer.
The filter is still slow as hell, and that change won't cause any
visible performance improvement (it still takes more than one minute to
process a single 1080p frame on a Core 2 here).