This work is sponsored by, and copyright, Google.
The filter coefficients are signed values, where the product of the
multiplication with one individual filter coefficient doesn't
overflow a 16 bit signed value (the largest filter coefficient is
127). But when the products are accumulated, the resulting sum can
overflow the 16 bit signed range. Instead of accumulating in 32 bit,
we accumulate the largest product (either index 3 or 4) last with a
saturated addition.
(The VP8 MC asm does something similar, but slightly simpler, by
accumulating each half of the filter separately. In the VP9 MC
filters, each half of the filter can also overflow though, so the
largest component has to be handled individually.)
Examples of relative speedup compared to the C version, from checkasm:
Cortex A7 A8 A9 A53
vp9_avg4_neon: 1.71 1.15 1.42 1.49
vp9_avg8_neon: 2.51 3.63 3.14 2.58
vp9_avg16_neon: 2.95 6.76 3.01 2.84
vp9_avg32_neon: 3.29 6.64 2.85 3.00
vp9_avg64_neon: 3.47 6.67 3.14 2.80
vp9_avg_8tap_smooth_4h_neon: 3.22 4.73 2.76 4.67
vp9_avg_8tap_smooth_4hv_neon: 3.67 4.76 3.28 4.71
vp9_avg_8tap_smooth_4v_neon: 5.52 7.60 4.60 6.31
vp9_avg_8tap_smooth_8h_neon: 6.22 9.04 5.12 9.32
vp9_avg_8tap_smooth_8hv_neon: 6.38 8.21 5.72 8.17
vp9_avg_8tap_smooth_8v_neon: 9.22 12.66 8.15 11.10
vp9_avg_8tap_smooth_64h_neon: 7.02 10.23 5.54 11.58
vp9_avg_8tap_smooth_64hv_neon: 6.76 9.46 5.93 9.40
vp9_avg_8tap_smooth_64v_neon: 10.76 14.13 9.46 13.37
vp9_put4_neon: 1.11 1.47 1.00 1.21
vp9_put8_neon: 1.23 2.17 1.94 1.48
vp9_put16_neon: 1.63 4.02 1.73 1.97
vp9_put32_neon: 1.56 4.92 2.00 1.96
vp9_put64_neon: 2.10 5.28 2.03 2.35
vp9_put_8tap_smooth_4h_neon: 3.11 4.35 2.63 4.35
vp9_put_8tap_smooth_4hv_neon: 3.67 4.69 3.25 4.71
vp9_put_8tap_smooth_4v_neon: 5.45 7.27 4.49 6.52
vp9_put_8tap_smooth_8h_neon: 5.97 8.18 4.81 8.56
vp9_put_8tap_smooth_8hv_neon: 6.39 7.90 5.64 8.15
vp9_put_8tap_smooth_8v_neon: 9.03 11.84 8.07 11.51
vp9_put_8tap_smooth_64h_neon: 6.78 9.48 4.88 10.89
vp9_put_8tap_smooth_64hv_neon: 6.99 8.87 5.94 9.56
vp9_put_8tap_smooth_64v_neon: 10.69 13.30 9.43 14.34
For the larger 8tap filters, the speedup vs C code is around 5-14x.
This is significantly faster than libvpx's implementation of the same
functions, at least when comparing the put_8tap_smooth_64 functions
(compared to vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon from
libvpx).
Absolute runtimes from checkasm:
Cortex A7 A8 A9 A53
vp9_put_8tap_smooth_64h_neon: 20150.3 14489.4 19733.6 10863.7
libvpx vpx_convolve8_horiz_neon: 52623.3 19736.4 21907.7 25027.7
vp9_put_8tap_smooth_64v_neon: 14455.0 12303.9 13746.4 9628.9
libvpx vpx_convolve8_vert_neon: 42090.0 17706.2 17659.9 16941.2
Thus, on the A9, the horizontal filter is only marginally faster than
libvpx, while our version is significantly faster on the other cores,
and the vertical filter is significantly faster on all cores. The
difference is especially large on the A7.
The libvpx implementation does the accumulation in 32 bit, which
probably explains most of the differences.
Signed-off-by: Martin Storsjö <martin@martin.st>
This makes it match the pattern already used for VP8 MC functions.
This also makes the signature match ffmpeg's version of these
functions, easing porting of code in both directions.
Signed-off-by: Martin Storsjö <martin@martin.st>
Also adds a new flag to mark filters which are aware of hwframes and
will perform this task themselves, and marks all appropriate filters
with this flag.
This is required to allow software-mapped hardware frames to work,
because we need to have the frames context available for any later
mapping operation in the filter graph.
The output from the filter graph should only propagate further to an
encoder if the hardware format actually matches the visible format
(mapped frames are valid here and have an hw_frames_ctx, but this
should not be given to the encoder as its hardware context).
This matrix needs to be applied after all others have (currently only
display matrix from trak), but cannot be handled in movie box, since
streams are not allocated yet. So store it in main context, and apply
it when appropriate, that is after parsing the tkhd one.
Fate tests are updated accordingly.
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
This is needed for improved fate testing and it is modeled after
-show_format_entry. The main behavioral difference is that when a print
function is called with an empty key, rather than discarding it, the
closes key in the hierarchy is used instead.
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
The use of TLSv1_*_method() disallows newer protocol versions; instead
use SSLv23_*_method() and then explicitly disable the deprecated
protocol versions which should not be supported.
libavcodec/hapenc.c:121:20: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘size_t {aka unsigned int}’ [-Wformat=]
libavcodec/hapenc.c:121:20: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 5 has type ‘size_t {aka unsigned int}’ [-Wformat=]
Since avversion.h is a generated header it must be created before
dependencies can be determined as a side effect of compilation.
Otherwise Make stops and restarts the build process to generate
avversion.h and produces related error messages.
When the macro is expanded with a semicolon following it and the
macro itself contains a semicolon, we ended up in double semicolons,
which is treated as a statement that disallows further declarations.
This avoids errors about mixed declarations and statements on gcc,
after ee05079766.
Signed-off-by: Martin Storsjö <martin@martin.st>
The buffer map/unmap code was in an early version of this before it
was committed, but the unmap was never removed. While wrong, this
was harmless (and therefore unnoticed) because the buffers can't be
mapped at this point - all drivers just did nothing with the call.
When decoding interlaced pictures, the structure is reused to render
to the same surface twice. The parameter buffers were not being
cleared, which caused the i965 driver to error out.