pow is a very wasteful function for this purpose. A low hanging fruit
would be simply to replace with exp2f, and that does yield some speedup.
However, there are 2 drawbacks of this:
1. It does not exploit the integer nature of the argument.
2. (minor) Some platforms lack a proper exp2f routine, making benefits available
only to non broken libm.
3. exp2f does not solve the same issue that plagues pow, namely terrible
worst case performance. This is a fundamental issue known as the
"table-maker's dilemma" recognized by Prof. Kahan himself and
subsequently elaborated and researched by many others. All this is clear from benchmarks below.
This exploits the IEEE-754 format to get very good performance even in
the worst case for integer powers of 2. This solves all the issues noted
above. Function tested with clang usan over [-1000, 1000] (beyond range of
relevance for this, which is [-255, 255]), patch itself with FATE.
Benchmarks obtained on x86-64, Haswell, GNU-Linux via 10^5 iterations of
the pow call, START/STOP, and command ffplay ~/samples/jpeg2000/chiens_dcinema2K.mxf.
Low number of runs also given to prove the point about worst case:
pow:
216270 decicycles in pow, 1 runs, 0 skips
110175 decicycles in pow, 2 runs, 0 skips
56085 decicycles in pow, 4 runs, 0 skips
29013 decicycles in pow, 8 runs, 0 skips
15472 decicycles in pow, 16 runs, 0 skips
8689 decicycles in pow, 32 runs, 0 skips
5295 decicycles in pow, 64 runs, 0 skips
3599 decicycles in pow, 128 runs, 0 skips
2748 decicycles in pow, 256 runs, 0 skips
2304 decicycles in pow, 511 runs, 1 skips
2072 decicycles in pow, 1022 runs, 2 skips
1963 decicycles in pow, 2044 runs, 4 skips
1894 decicycles in pow, 4091 runs, 5 skips
1860 decicycles in pow, 8184 runs, 8 skips
exp2f:
134140 decicycles in pow, 1 runs, 0 skips
68110 decicycles in pow, 2 runs, 0 skips
34530 decicycles in pow, 4 runs, 0 skips
17677 decicycles in pow, 8 runs, 0 skips
9175 decicycles in pow, 16 runs, 0 skips
4931 decicycles in pow, 32 runs, 0 skips
2808 decicycles in pow, 64 runs, 0 skips
1747 decicycles in pow, 128 runs, 0 skips
1208 decicycles in pow, 256 runs, 0 skips
952 decicycles in pow, 512 runs, 0 skips
822 decicycles in pow, 1024 runs, 0 skips
765 decicycles in pow, 2047 runs, 1 skips
722 decicycles in pow, 4094 runs, 2 skips
693 decicycles in pow, 8190 runs, 2 skips
exp2fi:
2740 decicycles in pow, 1 runs, 0 skips
1530 decicycles in pow, 2 runs, 0 skips
955 decicycles in pow, 4 runs, 0 skips
622 decicycles in pow, 8 runs, 0 skips
477 decicycles in pow, 16 runs, 0 skips
368 decicycles in pow, 32 runs, 0 skips
317 decicycles in pow, 64 runs, 0 skips
291 decicycles in pow, 128 runs, 0 skips
277 decicycles in pow, 256 runs, 0 skips
268 decicycles in pow, 512 runs, 0 skips
265 decicycles in pow, 1024 runs, 0 skips
263 decicycles in pow, 2048 runs, 0 skips
263 decicycles in pow, 4095 runs, 1 skips
260 decicycles in pow, 8191 runs, 1 skips
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
Fixes: out of array read
Fixes: 36b8096fefab16c4c9326a508053e95c/signal_sigsegv_1d9ce18_3233_1a55196b018106dfabeace071a432d9e.r3d
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes assertion failure
Fixes: 03e0abe721b1174856d41a1eb5d6a896/signal_sigabrt_7ffff6ae7cc9_3813_e71bf3541abed3ccba031cd5ba0269a4.avi
This fix is choosen to be simple to backport, better solution
for master is planed
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
* commit '95a41311ac3a44773cc4dc407408aca35b1f8e26':
jpeg2000: Factor out band stepsize initialization
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
This also reduces the amount of memory needed
Fixes Ticket4672
The new code seems slightly faster as well, probably due to better cache usage
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This reduces the number of operations
Its not done for 9/7i as that would overflow thanks to JPEG2000 allowing
32 decomposition levels
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
thats how the specification defines it, this also improves numerical
accuracy of the integer wavelet implementation. It otherwise should
be equivalent, in case of overflows this can be reverted.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
These 2 are highly related so they are in the same commit
Fixes part of Ticket4605
Fixes p0_04.j2k
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'a2448cfe167a4cd4eb631318550d4eef38fca24a':
jpeg2000: do not compute the same value twice
Conflicts:
libavcodec/jpeg2000.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'fe4d5fe9361162f9033ff1bd84bfc1b2091ba785':
jpeg2000: Mark static data init functions as av_cold
Conflicts:
libavcodec/jpeg2000.c
libavcodec/jpeg2000dec.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '5bf208f659703895df7926238dcfa8a8175de36b':
jpeg2000: Use separate fields for int and float codepaths
jpeg2000: Split int/float codepaths depending on the DWT
Conflicts:
libavcodec/jpeg2000.c
libavcodec/jpeg2000dec.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'ef35d6dbc6c3b7ba6b13ac13fc8e797cc1268c8f':
jpeg2000: Propagate error code from get_cox()
jpeg2000: Check that nreslevels2decode has been initialized before use
Conflicts:
libavcodec/jpeg2000dec.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>