mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-11-26 19:01:44 +02:00
21412 Commits

Author SHA1 Message Date
Martin Storsjö
0c0b87f12d aarch64: vp9itxfm: Fix incorrect vertical alignment
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-09 23:57:06 +02:00
Martin Storsjö
8476eb0d3a aarch64: vp9itxfm: Update a comment to refer to a register with a different name
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-09 23:57:02 +02:00
Martin Storsjö
3dd7827258 aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readability
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-09 23:56:59 +02:00
Martin Storsjö
ed8d293306 aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible
The ld1r is a leftover from the arm version, where this trick is
beneficial on some cores.

Use a single-lane load where we don't need the semantics of ld1r.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-09 23:56:54 +02:00
Martin Storsjö
4da4b2b87f aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-09 23:56:50 +02:00
Martin Storsjö
3933b86bb9 arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-09 23:56:44 +02:00
Martin Storsjö
a63da4511d aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 14740 bytes to 24292 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:
vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
vp9_inv_dct_dct_16x16_sub2_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1387.4
vp9_inv_dct_dct_16x16_sub16_add_neon:   1387.6
vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
vp9_inv_dct_dct_32x32_sub2_add_neon:    5198.5
vp9_inv_dct_dct_32x32_sub4_add_neon:    5198.6
vp9_inv_dct_dct_32x32_sub8_add_neon:    5196.3
vp9_inv_dct_dct_32x32_sub12_add_neon:   6183.4
vp9_inv_dct_dct_32x32_sub16_add_neon:   6174.3
vp9_inv_dct_dct_32x32_sub20_add_neon:   7151.4
vp9_inv_dct_dct_32x32_sub24_add_neon:   7145.3
vp9_inv_dct_dct_32x32_sub28_add_neon:   8119.3
vp9_inv_dct_dct_32x32_sub32_add_neon:   8118.7

After:
vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
vp9_inv_dct_dct_16x16_sub2_add_neon:     640.8
vp9_inv_dct_dct_16x16_sub4_add_neon:     639.0
vp9_inv_dct_dct_16x16_sub8_add_neon:     842.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1388.3
vp9_inv_dct_dct_16x16_sub16_add_neon:   1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
vp9_inv_dct_dct_32x32_sub2_add_neon:    3685.5
vp9_inv_dct_dct_32x32_sub4_add_neon:    3685.1
vp9_inv_dct_dct_32x32_sub8_add_neon:    3684.4
vp9_inv_dct_dct_32x32_sub12_add_neon:   5312.2
vp9_inv_dct_dct_32x32_sub16_add_neon:   5315.4
vp9_inv_dct_dct_32x32_sub20_add_neon:   7154.9
vp9_inv_dct_dct_32x32_sub24_add_neon:   7154.5
vp9_inv_dct_dct_32x32_sub28_add_neon:   8126.6
vp9_inv_dct_dct_32x32_sub32_add_neon:   8127.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-09 12:32:03 +02:00
Martin Storsjö
5eb5aec475 arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 12388 bytes to 19784 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    212.0    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    2102.1   1521.7   1736.2   1265.8
vp9_inv_dct_dct_16x16_sub4_add_neon:    2104.5   1533.0   1736.6   1265.5
vp9_inv_dct_dct_16x16_sub8_add_neon:    2484.8   1828.7   2014.4   1506.5
vp9_inv_dct_dct_16x16_sub12_add_neon:   2851.2   2117.8   2294.8   1753.2
vp9_inv_dct_dct_16x16_sub16_add_neon:   3239.4   2408.3   2543.5   1994.9
vp9_inv_dct_dct_32x32_sub1_add_neon:     758.3    456.7    864.5    553.9
vp9_inv_dct_dct_32x32_sub2_add_neon:   10776.7   7949.8   8567.7   6819.7
vp9_inv_dct_dct_32x32_sub4_add_neon:   10865.6   8131.5   8589.6   6816.3
vp9_inv_dct_dct_32x32_sub8_add_neon:   12053.9   9271.3   9387.7   7564.0
vp9_inv_dct_dct_32x32_sub12_add_neon:  13328.3  10463.2  10217.0   8321.3
vp9_inv_dct_dct_32x32_sub16_add_neon:  14176.4  11509.5  11018.7   9062.3
vp9_inv_dct_dct_32x32_sub20_add_neon:  15301.5  12999.9  11855.1   9828.2
vp9_inv_dct_dct_32x32_sub24_add_neon:  16482.7  14931.5  12650.1  10575.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17589.5  15811.9  13482.8  11333.4
vp9_inv_dct_dct_32x32_sub32_add_neon:  18696.2  17049.2  14355.6  12089.7

After:
vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    211.7    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    1203.5    998.2   1035.3    763.0
vp9_inv_dct_dct_16x16_sub4_add_neon:    1203.5    998.1   1035.5    760.8
vp9_inv_dct_dct_16x16_sub8_add_neon:    1926.1   1610.6   1722.1   1271.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   2873.2   2129.7   2285.1   1757.3
vp9_inv_dct_dct_16x16_sub16_add_neon:   3221.4   2520.3   2557.6   2002.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     753.0    457.5    866.6    554.6
vp9_inv_dct_dct_32x32_sub2_add_neon:    7554.6   5652.4   6048.4   4920.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    7549.9   5685.0   6046.9   4925.7
vp9_inv_dct_dct_32x32_sub8_add_neon:    8336.9   6704.5   6604.0   5478.0
vp9_inv_dct_dct_32x32_sub12_add_neon:  10914.0   9777.2   9240.4   7416.9
vp9_inv_dct_dct_32x32_sub16_add_neon:  11859.2  11223.3   9966.3   8095.1
vp9_inv_dct_dct_32x32_sub20_add_neon:  15237.1  13029.4  11838.3   9829.4
vp9_inv_dct_dct_32x32_sub24_add_neon:  16293.2  14379.8  12644.9  10572.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17424.3  15734.7  13473.0  11326.9
vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.3  17457.0  14298.6  12080.0

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-09 12:32:00 +02:00
Martin Storsjö
79d332ebbd aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function
This allows reusing the macro for a separate implementation of the
pass2 function.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-09 12:31:56 +02:00
Martin Storsjö
47b3c2c18d arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function
This allows reusing the macro for a separate implementation of the
pass2 function.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-09 12:31:53 +02:00
Martin Storsjö
115476018d aarch64: vp9itxfm: Make the larger core transforms standalone functions
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from
19496 to 14740 bytes.

This gives a small slowdown of a couple of tens of cycles, but makes
it more feasible to add more optimized versions of these transforms.

Before:
vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
vp9_inv_dct_dct_32x32_sub32_add_neon:   8095.7

After:
vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub16_add_neon:   1390.1
vp9_inv_dct_dct_32x32_sub4_add_neon:    5199.9
vp9_inv_dct_dct_32x32_sub32_add_neon:   8125.8

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-09 12:31:45 +02:00
Martin Storsjö
0331c3f5e8 arm: vp9itxfm: Make the larger core transforms standalone functions
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from
15324 to 12388 bytes.

This gives a small slowdown of a couple of tens of cycles, up to around
150 cycles for the full case of the largest transform, but makes
it more feasible to add more optimized versions of these transforms.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub4_add_neon:    2063.4   1516.0   1719.5   1245.1
vp9_inv_dct_dct_16x16_sub16_add_neon:   3279.3   2454.5   2525.2   1982.3
vp9_inv_dct_dct_32x32_sub4_add_neon:   10750.0   7955.4   8525.6   6754.2
vp9_inv_dct_dct_32x32_sub32_add_neon:  18574.0  17108.4  14216.7  12010.2

After:
vp9_inv_dct_dct_16x16_sub4_add_neon:    2060.8   1608.5   1735.7   1262.0
vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.2   2443.5   2546.1   1999.5
vp9_inv_dct_dct_32x32_sub4_add_neon:   10682.0   8043.8   8581.3   6810.1
vp9_inv_dct_dct_32x32_sub32_add_neon:  18522.4  17277.4  14286.7  12087.9

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-09 12:31:40 +02:00
Martin Storsjö
57ec83e424 omx: Use the EOS flag to handle flushing at the end
This avoids having to count the number of frames sent to the codec
and the number of output packets received; instead just wait until
the encoder returns a buffer with the EOS flag set.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-08 11:50:57 +02:00
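
A minimal sketch of the EOS-based draining idea described in the omx commit above, not the code from that commit. The names SketchBuffer, SKETCH_FLAG_EOS and sketch_drain_until_eos are hypothetical stand-ins and do not come from OpenMAX IL or libavcodec. The point is only that the loop ends when a buffer carries the end-of-stream flag, so no counters of frames sent or packets received are needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag and buffer type, used for illustration only. */
#define SKETCH_FLAG_EOS 0x1u

typedef struct SketchBuffer {
    unsigned flags; /* bit flags set by the (simulated) encoder */
    int      size;  /* payload size in bytes */
} SketchBuffer;

/*
 * Consume output buffers in order until one has the EOS flag set.
 * Returns the number of buffers consumed, including the EOS one,
 * or -1 if the array ends before any EOS buffer is seen.
 */
int sketch_drain_until_eos(const SketchBuffer *out, size_t n)
{
    assert(out != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* A real wrapper would turn out[i] into an output packet here. */
        if (out[i].flags & SKETCH_FLAG_EOS)
            return (int)(i + 1);
    }
    return -1;
}
```

In this sketch the caller would keep asking for more output while the function reports -1, and stop flushing once it returns a positive count.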
Diego Biurrun
a25dac976a Use bitstream_init8() where appropriate 2017-02-07 18:27:21 +01:00
Alexandra Hájková
f7ec7f546f wma: Convert to the new bitstream reader 2017-02-06 15:13:34 +01:00
Martin Storsjö
58d87e0f49 aarch64: vp9itxfm: Restructure the idct32 store macros
This avoids concatenation, which can't be used if the whole macro
is wrapped within another macro.

This is also arguably more readable.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-05 13:05:32 +02:00
Martin Storsjö
3bc5b28d5a arm: vp9itxfm: Avoid .irp when it doesn't save any lines
This makes it more readable.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-05 12:59:19 +02:00
Diego Biurrun
7abdd026df asm: Consistently uppercase SECTION markers 2017-02-03 11:37:53 +01:00
Alexandra Hájková
c29da01ac9 svq3: Convert to the new bitstream reader 2017-02-02 17:06:17 +01:00
wm4
577326d430 lavc: deprecate refcounted_frames field
No deprecation guards, because the old decode API (for which this field
is needed) doesn't have any either.

This field should be removed together with the old decode calls.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2017-02-01 10:47:46 +01:00
Anton Khirnov
fd9212f2ed Mark some arrays that never change as const. 2017-02-01 10:42:59 +01:00
Alexandra Hájková
ab2539bd37 ffv1: Convert to the new bitstream reader 2017-01-31 17:54:11 +01:00
Alexandra Hájková
2d72219554 h261dec: Convert to the new bitstream reader 2017-01-31 17:54:11 +01:00
Alexandra Hájková
2b94ed12de shorten: Convert to the new bitstream reader 2017-01-31 17:54:11 +01:00
Alexandra Hájková
5a6da49dd0 ralf: Convert to the new bitstream reader 2017-01-31 17:54:11 +01:00
Alexandra Hájková
d85b37a955 loco: Convert to the new bitstream reader 2017-01-31 17:54:10 +01:00
Alexandra Hájková
0f94de8a09 fic: Convert to the new bitstream reader 2017-01-31 17:54:10 +01:00
Alexandra Hájková
6b1f559f9a dirac: Convert to the new bitstream reader 2017-01-31 17:54:10 +01:00
Alexandra Hájková
ffc00df0a6 cavs: Convert to the new bitstream reader 2017-01-31 17:54:10 +01:00
Alexandra Hájková
0c89ff82e9 aic: Convert to the new bitstream reader 2017-01-31 17:54:10 +01:00
Diego Biurrun
d4c2103bd3 golomb: Convert to the new bitstream reader 2017-01-31 17:46:19 +01:00
Andreas Cadhalpun
612cc07128 pgssubdec: reset rle_data_len/rle_remaining_len on allocation error
The code relies on their validity and otherwise can try to access a NULL
object->rle pointer, causing segmentation faults.

Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com>
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2017-01-31 09:35:54 +01:00
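
The pgssubdec fix above can be illustrated with a simplified sketch. It assumes a struct that holds the rle pointer together with the rle_data_len and rle_remaining_len fields named in the commit message; the type SketchObject, the function sketch_alloc_rle and its simulate_failure parameter are hypothetical and only exist to make the failure path testable. The real decoder computes the lengths differently, so treat this as an outline of the pattern, not the actual patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced version of the object state. */
typedef struct SketchObject {
    uint8_t *rle;
    unsigned rle_data_len;
    unsigned rle_remaining_len;
} SketchObject;

/*
 * (Re)allocate the RLE buffer. On failure, the length fields are reset
 * to zero so that later code which trusts them never dereferences the
 * NULL rle pointer. Returns 0 on success and -1 on allocation failure.
 */
int sketch_alloc_rle(SketchObject *obj, unsigned data_len, int simulate_failure)
{
    uint8_t *buf;

    assert(obj != NULL);
    buf = simulate_failure ? NULL : malloc(data_len ? data_len : 1);

    free(obj->rle);
    obj->rle = buf;

    if (!buf) {
        /* Keep the lengths consistent with the missing buffer. */
        obj->rle_data_len      = 0;
        obj->rle_remaining_len = 0;
        return -1;
    }

    obj->rle_data_len      = data_len;
    obj->rle_remaining_len = data_len;
    return 0;
}
```

The design choice is to keep the invariant "lengths are zero whenever rle is NULL" inside the allocation path, instead of adding NULL checks at every later use.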
Mark Thompson
ca62236a89 vaapi_encode: Add VP8 support 2017-01-30 23:03:46 +00:00
Mark Thompson
ff35aa8ca4 vaapi_encode: Pass framerate parameters to driver
Only do this when building for a recent VAAPI version - initial
driver implementations were confused about the interpretation of the
framerate field, but hopefully this will be consistent everywhere
once 0.40.0 is released.
2017-01-30 22:52:54 +00:00
Mark Thompson
eddfb57210 vaapi_h264: Enable VBR mode
Default to using VBR when a target bitrate is set, unless the max rate
is also set and matches the target.  Changes to the Intel driver mean
that min_qp is also respected in this case, so set a codec default to
unset the value rather than using the current default inherited from
the MPEG-4 part 2 encoder.
2017-01-30 22:52:54 +00:00
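
The mode selection rule stated in the vaapi_h264 commit above could look roughly like the following. This is a sketch under assumptions: the enum SketchRCMode and the function sketch_pick_rc_mode are hypothetical, a value of 0 is assumed to mean "not set", and the commit message does not say what happens when no target bitrate is given, so that case is reported as a generic SKETCH_RC_OTHER.

```c
#include <assert.h>

/* Hypothetical rate control modes for illustration. */
typedef enum SketchRCMode {
    SKETCH_RC_OTHER, /* no target bitrate set; behaviour not covered here */
    SKETCH_RC_CBR,
    SKETCH_RC_VBR
} SketchRCMode;

/*
 * Pick a rate control mode from the target bitrate and the maximum rate.
 * VBR is the default once a target bitrate is set; CBR is chosen only
 * when the maximum rate is also set and equals the target.
 */
SketchRCMode sketch_pick_rc_mode(long long bit_rate, long long rc_max_rate)
{
    /* Negative rates are not meaningful in this sketch. */
    assert(bit_rate >= 0 && rc_max_rate >= 0);

    if (bit_rate == 0)
        return SKETCH_RC_OTHER;
    if (rc_max_rate > 0 && rc_max_rate == bit_rate)
        return SKETCH_RC_CBR;
    return SKETCH_RC_VBR;
}
```

For example, a 2 Mb/s target with no maximum rate maps to VBR here, while a 2 Mb/s target with a 2 Mb/s maximum maps to CBR.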
Mark Thompson
f033ba470f vaapi_encode: Support VBR mode
This includes a backward-compatibility hack to choose CBR anyway on
old drivers which have no VBR support, so that existing programs will
continue to work even though their options now map to VBR.
2017-01-30 22:52:54 +00:00
Mark Thompson
ca6ae3b77a vaapi_encode: Add MPEG-2 support 2017-01-29 13:28:31 +00:00
Alexandra Hájková
381a4e31a6 tak: Convert to the new bitstream reader 2017-01-25 11:06:58 +01:00
Diego Biurrun
2e0e150144 magicyuv: Convert to the new bitstream reader 2017-01-25 10:38:43 +01:00
Diego Biurrun
b061f298f7 truemotion2rt: Convert to the new bitstream reader 2017-01-25 09:55:36 +01:00
Alexandra Hájková
e7f24c9ffc wavpack: Convert to the new bitstream reader 2017-01-25 09:55:35 +01:00
Alexandra Hájková
6668bc80b5 mpc: Convert to the new bitstream reader 2017-01-25 09:55:33 +01:00
Alexandra Hájková
fd8de7f2d8 dxtory: Convert to the new bitstream reader 2017-01-20 10:18:32 +01:00
Alexandra Hájková
4d49a4c550 apedec: Convert to the new bitstream reader 2017-01-20 10:18:32 +01:00
Anton Khirnov
b4a911c189 mpegvideoenc: make a table const 2017-01-19 09:52:21 +01:00
Anton Khirnov
296eff4d9d zmbvenc: get rid of a global table 2017-01-19 09:52:10 +01:00
Derek Buitenhuis
00b775dda2 hevc: Mark as having threadsafe init
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2017-01-19 09:51:15 +01:00
Alexandra Hájková
54dcd22885 als: Convert to the new bitstream reader 2017-01-17 09:52:11 +01:00
Luca Barbato
fb59f87ce7 nvenc: Explicitly push the cuda context on encoding
Make sure that NVENC does not misbehave if other cuda usages happen
in the application.
2017-01-17 07:37:12 +01:00
Alexandra Hájková
4795e4f61f alac: Convert to the new bitstream reader 2017-01-13 10:27:03 +01:00