The ld1r is a leftover from the arm version, where this trick is
beneficial on some cores.
Use a single-lane load where we don't need the semantics of ld1r.
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.
This gives a pretty substantial speedup for the smaller subpartitions.
The code size increases from 14740 bytes to 24292 bytes.
The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.
Before:
vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7
vp9_inv_dct_dct_16x16_sub2_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1387.4
vp9_inv_dct_dct_16x16_sub16_add_neon: 1387.6
vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1
vp9_inv_dct_dct_32x32_sub2_add_neon: 5198.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 5198.6
vp9_inv_dct_dct_32x32_sub8_add_neon: 5196.3
vp9_inv_dct_dct_32x32_sub12_add_neon: 6183.4
vp9_inv_dct_dct_32x32_sub16_add_neon: 6174.3
vp9_inv_dct_dct_32x32_sub20_add_neon: 7151.4
vp9_inv_dct_dct_32x32_sub24_add_neon: 7145.3
vp9_inv_dct_dct_32x32_sub28_add_neon: 8119.3
vp9_inv_dct_dct_32x32_sub32_add_neon: 8118.7
After:
vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7
vp9_inv_dct_dct_16x16_sub2_add_neon: 640.8
vp9_inv_dct_dct_16x16_sub4_add_neon: 639.0
vp9_inv_dct_dct_16x16_sub8_add_neon: 842.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1388.3
vp9_inv_dct_dct_16x16_sub16_add_neon: 1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1
vp9_inv_dct_dct_32x32_sub2_add_neon: 3685.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 3685.1
vp9_inv_dct_dct_32x32_sub8_add_neon: 3684.4
vp9_inv_dct_dct_32x32_sub12_add_neon: 5312.2
vp9_inv_dct_dct_32x32_sub16_add_neon: 5315.4
vp9_inv_dct_dct_32x32_sub20_add_neon: 7154.9
vp9_inv_dct_dct_32x32_sub24_add_neon: 7154.5
vp9_inv_dct_dct_32x32_sub28_add_neon: 8126.6
vp9_inv_dct_dct_32x32_sub32_add_neon: 8127.2
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from
19496 to 14740 bytes.
This gives a small slowdown of a couple of tens of cycles, but makes
it more feasible to add more optimized versions of these transforms.
Before:
vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7
vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.2
vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0
vp9_inv_dct_dct_32x32_sub32_add_neon: 8095.7
After:
vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub16_add_neon: 1390.1
vp9_inv_dct_dct_32x32_sub4_add_neon: 5199.9
vp9_inv_dct_dct_32x32_sub32_add_neon: 8125.8
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from
15324 to 12388 bytes.
This gives a small slowdown of a couple tens of cycles, up to around
150 cycles for the full case of the largest transform, but makes
it more feasible to add more optimized versions of these transforms.
Before: Cortex A7 A8 A9 A53
vp9_inv_dct_dct_16x16_sub4_add_neon: 2063.4 1516.0 1719.5 1245.1
vp9_inv_dct_dct_16x16_sub16_add_neon: 3279.3 2454.5 2525.2 1982.3
vp9_inv_dct_dct_32x32_sub4_add_neon: 10750.0 7955.4 8525.6 6754.2
vp9_inv_dct_dct_32x32_sub32_add_neon: 18574.0 17108.4 14216.7 12010.2
After:
vp9_inv_dct_dct_16x16_sub4_add_neon: 2060.8 1608.5 1735.7 1262.0
vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.2 2443.5 2546.1 1999.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 10682.0 8043.8 8581.3 6810.1
vp9_inv_dct_dct_32x32_sub32_add_neon: 18522.4 17277.4 14286.7 12087.9
Signed-off-by: Martin Storsjö <martin@martin.st>
This avoids having to count the number of frames sent to the codec
and the number of output packets received; instead just wait until
the encoder returns a buffer with the EOS flag set.
Signed-off-by: Martin Storsjö <martin@martin.st>
This avoids concatenation, which can't be used if the whole macro
is wrapped within another macro.
This is also arguably more readable.
Signed-off-by: Martin Storsjö <martin@martin.st>
No deprecation guards, because the old decode API (for which this field
is needed) doesn't have any either.
This field should be removed together with the old decode calls.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
The code relies on their validity and otherwise can try to access a NULL
object->rle pointer, causing segmentation faults.
Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com>
Signed-off-by: Diego Biurrun <diego@biurrun.de>
Only do this when building for a recent VAAPI version - initial
driver implementations were confused about the interpretation of the
framerate field, but hopefully this will be consistent everywhere
once 0.40.0 is released.
Default to using VBR when a target bitrate is set, unless the max rate
is also set and matches the target. Changes to the Intel driver mean
that min_qp is also respected in this case, so set a codec default to
unset the value rather than using the current default inherited from
the MPEG-4 part 2 encoder.
This includes a backward-compatibility hack to choose CBR anyway on
old drivers which have no CBR support, so that existing programs will
continue to work their options now map to VBR.