* commit '8b7a9729aa162e2bbd571933f1aa40767f1ff47b':
avconv_qsv: use the actual pixel format provided by lavc
This commit is a noop, see 03cef34aa66
Merged-by: Clément Bœsch <u@pkh.me>
* commit '6f40181cad8ac04adff7bd10e1e1ab65f22bc1f0':
avconv_qsv: align the surface size to 32
This commit is a noop, see 03cef34aa66
Merged-by: Clément Bœsch <u@pkh.me>
Fixes: 732/clusterfuzz-testcase-4872990070145024
See: [FFmpeg-devel] [PATCH 2/6] avcodec/dca_xll: Fix runtime error: signed integer overflow: 2147286116 + 6298923 cannot be represented in type 'int'
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Provides a way to change bandwidth parameter inside DASH manifest after a non-CBR H.264 encoding.
Caller now is able to compute the bitrate by itself, after all packets have been written, and then set that value in AVFormatContext->streams->codecpar->bit_rate before calling av_write_trailer. As a result that value will be set in DASH manifest.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This matches the order they are in the 16 bpp version.
There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.
This makes the 8 bpp version match the 16 bpp version better.
This is cherrypicked from libav commit
b8f66c0838b4c645227f23a35b4d54373da4c60a.
Signed-off-by: Martin Storsjö <martin@martin.st>
This matches the order they are in the 16 bpp version.
There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.
This makes the 8 bpp version match the 16 bpp version better.
This is cherrypicked from libav commit
08074c092d8c97d71c5986e5325e97ffc956119d.
Signed-off-by: Martin Storsjö <martin@martin.st>
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.
This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.
This is cherrypicked from libav commit
09eb88a12e008d10a3f7a6be75d18ad98b368e68.
Signed-off-by: Martin Storsjö <martin@martin.st>
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.
This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.
This is cherrypicked from libav commit
de06bdfe6c8abd8266d5c6f5c68e4df0060b61fc.
Signed-off-by: Martin Storsjö <martin@martin.st>
The idct32x32 function actually pushed d8-d15 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.
After this, we still can skip pushing d12-d15.
Before:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3
This is cherrypicked from libav commit
65aa002d54433154a6924dc13e498bec98451ad0.
Signed-off-by: Martin Storsjö <martin@martin.st>
The idct32x32 function actually pushed q4-q7 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.
Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
in the idct16 function), and the lanewise vmul needs a register in
the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
while doing idct16.
While keeping these coefficients in registers, we still can skip pushing
q7.
Before: Cortex A7 A8 A9 A53
vp9_inv_dct_dct_32x32_sub32_add_neon: 18553.8 17182.7 14303.3 12089.7
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 18470.3 16717.7 14173.6 11860.8
This is cherrypicked from libav commit
402546a17233a8815307df9e14ff88cd70424537.
Signed-off-by: Martin Storsjö <martin@martin.st>
For this case, with 8 inputs but only changing 4 of them, we can fit
all 16 input pixels into a q register, and still have enough temporary
registers for doing the loop filter.
The wd=8 filters would require too many temporary registers for
processing all 16 pixels at once though.
Before: Cortex A7 A8 A9 A53
vp9_loop_filter_mix2_v_44_16_neon: 289.7 256.2 237.5 181.2
After:
vp9_loop_filter_mix2_v_44_16_neon: 221.2 150.5 177.7 138.0
This is cherrypicked from libav commit
575e31e931e4178e9f1e24407503c9b4ec0ef9ba.
Signed-off-by: Martin Storsjö <martin@martin.st>
This is one cycle faster in total, and three instructions fewer.
Before:
vp9_loop_filter_mix2_v_44_16_neon: 123.2
After:
vp9_loop_filter_mix2_v_44_16_neon: 122.2
This is cherrypicked from libav commit
3bf9c48320f25f3d5557485b0202f22ae60748b0.
Signed-off-by: Martin Storsjö <martin@martin.st>
This fixes building with clang for linux with PIC enabled.
This is cherrypicked from libav commit
8847eeaa141898850381400000fb2b8a7adc7100.
Signed-off-by: Martin Storsjö <martin@martin.st>
This adds lots of extra .ifs, but speeds it up by a couple cycles,
by avoiding stalls.
This is cherrypicked from libav commit
b0806088d3b27044145b20421da8d39089ae0c6a.
Signed-off-by: Martin Storsjö <martin@martin.st>
This adds lots of extra .ifs, but speeds it up by a couple cycles,
by avoiding stalls.
This is cherrypicked from libav commit
e18c39005ad1dbb178b336f691da1de91afd434e.
Signed-off-by: Martin Storsjö <martin@martin.st>
Previously we first calculated hev, and then negated it.
Since we were able to schedule the negation in the middle
of another calculation, we don't see any gain in all cases.
Before: Cortex A7 A8 A9 A53 A53/AArch64
vp9_loop_filter_v_4_8_neon: 147.0 129.0 115.8 89.0 88.7
vp9_loop_filter_v_8_8_neon: 242.0 198.5 174.7 140.0 136.7
vp9_loop_filter_v_16_8_neon: 500.0 419.5 382.7 293.0 275.7
vp9_loop_filter_v_16_16_neon: 971.2 825.5 731.5 579.0 453.0
After:
vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7
vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7
vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7
vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0
This is cherrypicked from libav commit
e1f9de86f454861b69b199ad801adc2ec6c3b220.
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
Before: Cortex A53
vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3
vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1
After:
vp9_inv_dct_dct_16x16_sub1_add_neon: 180.2
vp9_inv_dct_dct_32x32_sub1_add_neon: 475.3
This is cherrypicked from libav commit
3fcf788fbbccc4130868e7abe58a88990290f7c1.
Signed-off-by: Martin Storsjö <martin@martin.st>
No measured speedup on a Cortex A53, but other cores might benefit.
This is cherrypicked from libav commit
388e0d2515bc6bbc9d0c9af1d230bd16cf945fe7.
Signed-off-by: Martin Storsjö <martin@martin.st>
Fold the field lengths into the macro.
This makes the macro invocations much more readable, when the
lines are shorter.
This also makes it easier to use only half the registers within
the macro.
This is cherrypicked from libav commit
5e0c2158fbc774f87d3ce4b7b950ba4d42c4a7b8.
Signed-off-by: Martin Storsjö <martin@martin.st>
The ld1r is a leftover from the arm version, where this trick is
beneficial on some cores.
Use a single-lane load where we don't need the semantics of ld1r.
This is cherrypicked from libav commit
ed8d293306e12c9b79022d37d39f48825ce7f2fa.
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.
This gives a pretty substantial speedup for the smaller subpartitions.
The code size increases from 14740 bytes to 24292 bytes.
The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.
Before:
vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7
vp9_inv_dct_dct_16x16_sub2_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1387.4
vp9_inv_dct_dct_16x16_sub16_add_neon: 1387.6
vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1
vp9_inv_dct_dct_32x32_sub2_add_neon: 5198.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 5198.6
vp9_inv_dct_dct_32x32_sub8_add_neon: 5196.3
vp9_inv_dct_dct_32x32_sub12_add_neon: 6183.4
vp9_inv_dct_dct_32x32_sub16_add_neon: 6174.3
vp9_inv_dct_dct_32x32_sub20_add_neon: 7151.4
vp9_inv_dct_dct_32x32_sub24_add_neon: 7145.3
vp9_inv_dct_dct_32x32_sub28_add_neon: 8119.3
vp9_inv_dct_dct_32x32_sub32_add_neon: 8118.7
After:
vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7
vp9_inv_dct_dct_16x16_sub2_add_neon: 640.8
vp9_inv_dct_dct_16x16_sub4_add_neon: 639.0
vp9_inv_dct_dct_16x16_sub8_add_neon: 842.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1388.3
vp9_inv_dct_dct_16x16_sub16_add_neon: 1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1
vp9_inv_dct_dct_32x32_sub2_add_neon: 3685.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 3685.1
vp9_inv_dct_dct_32x32_sub8_add_neon: 3684.4
vp9_inv_dct_dct_32x32_sub12_add_neon: 5312.2
vp9_inv_dct_dct_32x32_sub16_add_neon: 5315.4
vp9_inv_dct_dct_32x32_sub20_add_neon: 7154.9
vp9_inv_dct_dct_32x32_sub24_add_neon: 7154.5
vp9_inv_dct_dct_32x32_sub28_add_neon: 8126.6
vp9_inv_dct_dct_32x32_sub32_add_neon: 8127.2
This is cherrypicked from libav commit
a63da4511d0fee66695ff4afd264ba1dbf1e812d.
Signed-off-by: Martin Storsjö <martin@martin.st>
This allows reusing the macro for a separate implementation of the
pass2 function.
This is cherrypicked from libav commit
79d332ebbde8c0a3e9da094dcfd10abd33ba7378.
Signed-off-by: Martin Storsjö <martin@martin.st>
This allows reusing the macro for a separate implementation of the
pass2 function.
This is cherrypicked from libav commit
47b3c2c18d1897f3c753ba0cec4b2d7aa24526af.
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from
19496 to 14740 bytes.
This gives a small slowdown of a couple of tens of cycles, but makes
it more feasible to add more optimized versions of these transforms.
Before:
vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7
vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.2
vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0
vp9_inv_dct_dct_32x32_sub32_add_neon: 8095.7
After:
vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub16_add_neon: 1390.1
vp9_inv_dct_dct_32x32_sub4_add_neon: 5199.9
vp9_inv_dct_dct_32x32_sub32_add_neon: 8125.8
This is cherrypicked from libav commit
115476018d2c97df7e9b4445fe8f6cc7420ab91f.
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from
15324 to 12388 bytes.
This gives a small slowdown of a couple tens of cycles, up to around
150 cycles for the full case of the largest transform, but makes
it more feasible to add more optimized versions of these transforms.
Before: Cortex A7 A8 A9 A53
vp9_inv_dct_dct_16x16_sub4_add_neon: 2063.4 1516.0 1719.5 1245.1
vp9_inv_dct_dct_16x16_sub16_add_neon: 3279.3 2454.5 2525.2 1982.3
vp9_inv_dct_dct_32x32_sub4_add_neon: 10750.0 7955.4 8525.6 6754.2
vp9_inv_dct_dct_32x32_sub32_add_neon: 18574.0 17108.4 14216.7 12010.2
After:
vp9_inv_dct_dct_16x16_sub4_add_neon: 2060.8 1608.5 1735.7 1262.0
vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.2 2443.5 2546.1 1999.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 10682.0 8043.8 8581.3 6810.1
vp9_inv_dct_dct_32x32_sub32_add_neon: 18522.4 17277.4 14286.7 12087.9
This is cherrypicked from libav commit
0331c3f5e8cb6e6b53fab7893e91d1be1bfa979c.
Signed-off-by: Martin Storsjö <martin@martin.st>
This avoids concatenation, which can't be used if the whole macro
is wrapped within another macro.
This is also arguably more readable.
This is cherrypicked from libav commit
58d87e0f49bcbbc6f426328f53b657bae7430cd2.
Signed-off-by: Martin Storsjö <martin@martin.st>
This makes it more readable.
This is cherrypicked from libav commit
3bc5b28d5a191864c54bba60646933a63da31656.
Signed-off-by: Martin Storsjö <martin@martin.st>
The 'sqrt' and 'cbrt' scalers were added in commit
80262d8c86e94ff9a4bb3a9e3c2d734e04ccb399, but their symbolic option values
only made available to the showwaves filter, not showwavespic, despite
the scalers working properly by their numerical option values.
Signed-off-by: Moritz Barsnick <barsnick@gmx.net>