Some of these were made possible by moving several common macros to
libavutil/macros.h.
While just at it, also improve the other headers a bit.
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This version is able to output multiple coefficients at a time and
is able to altogether remove actual golomb code parsing.
Its also able to partially recover the last coefficient in case
the packet is incomplete.
Total decoder performance gain for 8bit 420 1080p lossless: 40%.
Total decoder performance gain for 10bit 420 1080p lossless: 40%.
clang was able to vectorize the loop much better than
my handwritten assembly, but gcc was very naive and didn't.
Lookup table is a rewritten version of vc2hqdecode.
Still much left to optimize, but it provides a significant performance
improvement - 10% for 300Mbps (1080p30), 25% for 1.5Gbps (4k 60fps) in
comparison with the default implementation.
Signed-off-by: Rostislav Pehlivanov <rpehlivanov@obe.tv>