1
0
mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-23 12:43:46 +02:00
Commit Graph

83867 Commits

Author SHA1 Message Date
Mark Thompson
b9acc7fbd9 Merge commit 'ad71d3276fef0ee7e791e62bbfe9c4e540047417'
* commit 'ad71d3276fef0ee7e791e62bbfe9c4e540047417':
  lavfi: add a QSV deinterlacing filter

Minor fixup for lavfi differences.

Merged-by: Mark Thompson <sw@jkqxz.net>
2017-03-12 17:00:42 +00:00
Mark Thompson
a7434ef195 Merge commit '8e07c22e508b349d145b9f142aa3ee8b3ce1d3a4'
* commit '8e07c22e508b349d145b9f142aa3ee8b3ce1d3a4':
  qsvenc: print warnings from encode/init

Merged-by: Mark Thompson <sw@jkqxz.net>
2017-03-12 15:21:41 +00:00
Mark Thompson
80fa5a0bcc Merge commit '0956fd460681e8ccbdae19f135f0d3970bf95c2f'
* commit '0956fd460681e8ccbdae19f135f0d3970bf95c2f':
  qsvenc: do not re-execute encoding on all positive status codes

Noop, see fb240a6276.

Merged-by: Mark Thompson <sw@jkqxz.net>
2017-03-12 15:19:52 +00:00
Mark Thompson
15887a410c Merge commit '95414eb2dc63a6f934275b4ed33dedd4369f2c49'
* commit '95414eb2dc63a6f934275b4ed33dedd4369f2c49':
  qsv: print more complete error messages

Merged-by: Mark Thompson <sw@jkqxz.net>
2017-03-12 15:19:05 +00:00
Mark Thompson
723a542d6c Merge commit 'd9ec3c60143babe1bb77c268e1d5547d15acd69b'
* commit 'd9ec3c60143babe1bb77c268e1d5547d15acd69b':
  qsvenc: take only the allocated dimensions from the frames context

Merged-by: Mark Thompson <sw@jkqxz.net>
2017-03-12 15:06:07 +00:00
Mark Thompson
562f386c77 Merge commit '37a9015ee84c15fec5247ba8f6577351a25fa8d2'
* commit '37a9015ee84c15fec5247ba8f6577351a25fa8d2':
  qsvenc: add support for p010

Merged-by: Mark Thompson <sw@jkqxz.net>
2017-03-12 15:04:45 +00:00
Anton Khirnov
807a3b30d2 lavfi: add a QSV scaling filter
This merges libav commit ac7bfd6967,
which was previously skipped.

(cherry picked from commit ac7bfd6967)
Signed-off-by: Mark Thompson <sw@jkqxz.net>
2017-03-12 15:02:33 +00:00
Mark Thompson
210dd7bbb2 Merge commit '21962261c74aed4df00ae8348a5e2d1ecb67c52d'
* commit '21962261c74aed4df00ae8348a5e2d1ecb67c52d':
  qsv: handle the semi-packed formats in map_fourcc as well

Merged-by: Mark Thompson <sw@jkqxz.net>
2017-03-12 14:21:37 +00:00
Clément Bœsch
5e193daaa2 Merge commit 'f65285aba0df7d46298abe0c945dfee05cbc6028'
* commit 'f65285aba0df7d46298abe0c945dfee05cbc6028':
  lavc: set sw_pix_fmt for hwaccel encoding

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-12 13:21:01 +01:00
Clément Bœsch
8d2d817098 Merge commit 'd59641abfd25a1007bdf4723d952887b1e3619c6'
* commit 'd59641abfd25a1007bdf4723d952887b1e3619c6':
  lavc: initialize AVCodecContext.sw_pix_fmt properly

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-12 13:20:57 +01:00
Clément Bœsch
15f6e5f2a9 Merge commit '8b7a9729aa162e2bbd571933f1aa40767f1ff47b'
* commit '8b7a9729aa162e2bbd571933f1aa40767f1ff47b':
  avconv_qsv: use the actual pixel format provided by lavc

This commit is a noop, see 03cef34aa6

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-12 13:13:55 +01:00
Clément Bœsch
e514309a91 Merge commit '6f40181cad8ac04adff7bd10e1e1ab65f22bc1f0'
* commit '6f40181cad8ac04adff7bd10e1e1ab65f22bc1f0':
  avconv_qsv: align the surface size to 32

This commit is a noop, see 03cef34aa6

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-12 13:13:05 +01:00
Clément Bœsch
993a9a3d72 Merge commit 'b0f36a0043d76436cc7ab8ff92ab99c94595d3c0'
* commit 'b0f36a0043d76436cc7ab8ff92ab99c94595d3c0':
  avconv: stop using setpts for input framerate forced with -r

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-12 13:08:04 +01:00
Paul B Mahol
807d5dcde9 avcodec/scpr: use correct linesize for prev frame
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-03-12 12:34:55 +01:00
Michael Niedermayer
ce010655a6 avcodec/dca_xll: Fix runtime error: signed integer overflow: 2147286116 + 6298923 cannot be represented in type 'int'
Fixes: 732/clusterfuzz-testcase-4872990070145024

See: [FFmpeg-devel] [PATCH 2/6] avcodec/dca_xll: Fix runtime error: signed integer overflow: 2147286116 + 6298923 cannot be represented in type 'int'
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-03-12 04:38:14 +01:00
Michael Niedermayer
44e2105189 avcodec/amrwbdec: Fix runtime error: left shift of negative value -1
Fixes: 763/clusterfuzz-testcase-6007567320875008

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-03-12 04:38:14 +01:00
Michael Niedermayer
f4c2302ee2 avcodec/dca_xll: Fix runtime error: signed integer overflow: 1762028192 + 698372290 cannot be represented in type 'int'
Fixes: 762/clusterfuzz-testcase-5927683747741696

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-03-12 04:38:14 +01:00
Michael Niedermayer
47cc9c1d77 avcodec/wavpack: Fix runtime error: signed integer overflow: -2147483648 + -83886075 cannot be represented in type 'int'
Fixes: 761/clusterfuzz-testcase-5442222252097536

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-03-12 04:38:14 +01:00
Muhammad Faiz
0bab78f7e7 avfilter/af_firequalizer: add av_restrict on convolution func
slightly improved speed

Reviewed-by: wm4 <nfxjfg@googlemail.com>
Signed-off-by: Muhammad Faiz <mfcc64@gmail.com>
2017-03-12 03:21:55 +07:00
Przemysław Sobala
89c0fda5f4 lavf/dashenc: update bitrates on dash_write_trailer
Provides a way to change bandwidth parameter inside DASH manifest after a non-CBR H.264 encoding.
Caller now is able to compute the bitrate by itself, after all packets have been written, and then set that value in AVFormatContext->streams->codecpar->bit_rate before calling av_write_trailer. As a result that value will be set in DASH manifest.

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-03-11 16:43:43 +01:00
Steven Liu
70a9407b50 doc/muxers: move hls_flags temp_file to after SECOND LEVEL hls example
the temp_file hls_flags describe text offset is wrong, now move it after example

Signed-off-by: Steven Liu <lq@chinaffmpeg.org>
2017-03-11 21:11:38 +08:00
Martin Storsjö
26ee83acc4 aarch64: vp9itxfm: Reorder iadst16 coeffs
This matches the order they are in the 16 bpp version.

There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.

This is cherrypicked from libav commit
b8f66c0838.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:52 +02:00
Martin Storsjö
b2e20d8984 arm: vp9itxfm: Reorder iadst16 coeffs
This matches the order they are in the 16 bpp version.

There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.

This is cherrypicked from libav commit
08074c092d.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:52 +02:00
Martin Storsjö
f952273019 aarch64: vp9itxfm: Reorder the idct coefficients for better pairing
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.

This is cherrypicked from libav commit
09eb88a12e.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:52 +02:00
Martin Storsjö
4f693b56bd arm: vp9itxfm: Reorder the idct coefficients for better pairing
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.

This is cherrypicked from libav commit
de06bdfe6c.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:51 +02:00
Martin Storsjö
2905657b90 aarch64: vp9itxfm: Avoid reloading the idct32 coefficients
The idct32x32 function actually pushed d8-d15 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

After this, we still can skip pushing d12-d15.

Before:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3

This is cherrypicked from libav commit
65aa002d54.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:51 +02:00
Martin Storsjö
600f4c9b03 arm: vp9itxfm: Avoid reloading the idct32 coefficients
The idct32x32 function actually pushed q4-q7 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
in the idct16 function), and the lanewise vmul needs a register in
the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
while doing idct16.

While keeping these coefficients in registers, we still can skip pushing
q7.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
After:
vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8

This is cherrypicked from libav commit
402546a172.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:51 +02:00
Martin Storsjö
a88db8b9a0 arm: vp9lpf: Implement the mix2_44 function with one single filter pass
For this case, with 8 inputs but only changing 4 of them, we can fit
all 16 input pixels into a q register, and still have enough temporary
registers for doing the loop filter.

The wd=8 filters would require too many temporary registers for
processing all 16 pixels at once though.

Before:                          Cortex A7      A8     A9     A53
vp9_loop_filter_mix2_v_44_16_neon:   289.7   256.2  237.5   181.2
After:
vp9_loop_filter_mix2_v_44_16_neon:   221.2   150.5  177.7   138.0

This is cherrypicked from libav commit
575e31e931.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:51 +02:00
Martin Storsjö
f32690a298 aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1
This is one cycle faster in total, and three instructions fewer.

Before:
vp9_loop_filter_mix2_v_44_16_neon: 123.2
After:
vp9_loop_filter_mix2_v_44_16_neon: 122.2

This is cherrypicked from libav commit
3bf9c48320.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:50 +02:00
Martin Storsjö
3fbbad2984 arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit
The theoretical maximum value of E is 193, so we can just
saturate the addition to 255.

Before:                     Cortex A7      A8      A9     A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
After:
vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7

This is cherrypicked from libav commit
c582cb8537.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:50 +02:00
Martin Storsjö
dda45c087b aarch64: Add parentheses around the offset parameter in movrel
This fixes building with clang for linux with PIC enabled.

This is cherrypicked from libav commit
8847eeaa14.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:50 +02:00
Martin Storsjö
c8d6eec85d aarch64: vp9lpf: Fix broken indentation/vertical alignment
This is cherrypicked from libav commit
07b5136c48.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:49 +02:00
Martin Storsjö
9f3a886364 aarch64: vp9lpf: Interleave the start of flat8in into the calculation above
This adds lots of extra .ifs, but speeds it up by a couple cycles,
by avoiding stalls.

This is cherrypicked from libav commit
b0806088d3.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:49 +02:00
Martin Storsjö
83399cf569 arm: vp9lpf: Interleave the start of flat8in into the calculation above
This adds lots of extra .ifs, but speeds it up by a couple cycles,
by avoiding stalls.

This is cherrypicked from libav commit
e18c39005a.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:49 +02:00
Martin Storsjö
92ab8374b1 arm: vp9lpf: Use orrs instead of orr+cmp
This is cherrypicked from libav commit
435cd7bc99.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:49 +02:00
Martin Storsjö
f0ecbb13cf arm/aarch64: vp9lpf: Calculate !hev directly
Previously we first calculated hev, and then negated it.

Since we were able to schedule the negation in the middle
of another calculation, we don't see any gain in all cases.

Before:                     Cortex A7      A8      A9     A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:     147.0   129.0   115.8    89.0         88.7
vp9_loop_filter_v_8_8_neon:     242.0   198.5   174.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    500.0   419.5   382.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   971.2   825.5   731.5   579.0        453.0
After:
vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0

This is cherrypicked from libav commit
e1f9de86f4.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:48 +02:00
Martin Storsjö
148cc0bb89 aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling
This work is sponsored by, and copyright, Google.

Before:                           Cortex A53
vp9_inv_dct_dct_16x16_sub1_add_neon:   235.3
vp9_inv_dct_dct_32x32_sub1_add_neon:   555.1
After:
vp9_inv_dct_dct_16x16_sub1_add_neon:   180.2
vp9_inv_dct_dct_32x32_sub1_add_neon:   475.3

This is cherrypicked from libav commit
3fcf788fbb.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:48 +02:00
Martin Storsjö
758302e4bc arm: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling
This work is sponsored by, and copyright, Google.

Before:                            Cortex A7      A8      A9     A53
vp9_inv_dct_dct_16x16_sub1_add_neon:   273.0   189.5   211.7   235.8
vp9_inv_dct_dct_32x32_sub1_add_neon:   752.0   459.2   862.2   553.9
After:
vp9_inv_dct_dct_16x16_sub1_add_neon:   226.5   145.0   225.1   171.8
vp9_inv_dct_dct_32x32_sub1_add_neon:   721.2   415.7   727.6   475.0

This is cherrypicked from libav commit
a76bf8cf12.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:48 +02:00
Martin Storsjö
045e33ae3f aarch64: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter
No measured speedup on a Cortex A53, but other cores might benefit.

This is cherrypicked from libav commit
388e0d2515.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:48 +02:00
Martin Storsjö
bff0771590 arm: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter
Before:                    Cortex A7      A8     A9     A53
vp9_put_8tap_smooth_4h_neon:   378.1   273.2  340.7   229.5
After:
vp9_put_8tap_smooth_4h_neon:   352.1   222.2  290.5   229.5

This is cherrypicked from libav commit
fea92a4b57.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:47 +02:00
Martin Storsjö
ac6cb8ae5b aarch64: vp9mc: Simplify the extmla macro parameters
Fold the field lengths into the macro.

This makes the macro invocations much more readable, when the
lines are shorter.

This also makes it easier to use only half the registers within
the macro.

This is cherrypicked from libav commit
5e0c2158fb.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:47 +02:00
Martin Storsjö
16ef000799 aarch64: vp9itxfm: Fix incorrect vertical alignment
This is cherrypicked from libav commit
0c0b87f12d.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:47 +02:00
Martin Storsjö
d0fbf7f34e aarch64: vp9itxfm: Update a comment to refer to a register with a different name
This is cherrypicked from libav commit
8476eb0d3a.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:46 +02:00
Martin Storsjö
6752318c73 aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readability
This is cherrypicked from libav commit
3dd7827258.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:46 +02:00
Martin Storsjö
19a0f9529c aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible
The ld1r is a leftover from the arm version, where this trick is
beneficial on some cores.

Use a single-lane load where we don't need the semantics of ld1r.

This is cherrypicked from libav commit
ed8d293306.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:46 +02:00
Martin Storsjö
3006e5253a aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function
This is cherrypicked from libav commit
4da4b2b87f.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:27 +02:00
Martin Storsjö
1d8ab576a7 arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function
This is cherrypicked from libav commit
3933b86bb9.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:26 +02:00
Martin Storsjö
9532a7d4d0 aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 14740 bytes to 24292 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:
vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
vp9_inv_dct_dct_16x16_sub2_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1387.4
vp9_inv_dct_dct_16x16_sub16_add_neon:   1387.6
vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
vp9_inv_dct_dct_32x32_sub2_add_neon:    5198.5
vp9_inv_dct_dct_32x32_sub4_add_neon:    5198.6
vp9_inv_dct_dct_32x32_sub8_add_neon:    5196.3
vp9_inv_dct_dct_32x32_sub12_add_neon:   6183.4
vp9_inv_dct_dct_32x32_sub16_add_neon:   6174.3
vp9_inv_dct_dct_32x32_sub20_add_neon:   7151.4
vp9_inv_dct_dct_32x32_sub24_add_neon:   7145.3
vp9_inv_dct_dct_32x32_sub28_add_neon:   8119.3
vp9_inv_dct_dct_32x32_sub32_add_neon:   8118.7

After:
vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
vp9_inv_dct_dct_16x16_sub2_add_neon:     640.8
vp9_inv_dct_dct_16x16_sub4_add_neon:     639.0
vp9_inv_dct_dct_16x16_sub8_add_neon:     842.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1388.3
vp9_inv_dct_dct_16x16_sub16_add_neon:   1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
vp9_inv_dct_dct_32x32_sub2_add_neon:    3685.5
vp9_inv_dct_dct_32x32_sub4_add_neon:    3685.1
vp9_inv_dct_dct_32x32_sub8_add_neon:    3684.4
vp9_inv_dct_dct_32x32_sub12_add_neon:   5312.2
vp9_inv_dct_dct_32x32_sub16_add_neon:   5315.4
vp9_inv_dct_dct_32x32_sub20_add_neon:   7154.9
vp9_inv_dct_dct_32x32_sub24_add_neon:   7154.5
vp9_inv_dct_dct_32x32_sub28_add_neon:   8126.6
vp9_inv_dct_dct_32x32_sub32_add_neon:   8127.2

This is cherrypicked from libav commit
a63da4511d.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:25 +02:00
Martin Storsjö
824589556c arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 12388 bytes to 19784 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    212.0    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    2102.1   1521.7   1736.2   1265.8
vp9_inv_dct_dct_16x16_sub4_add_neon:    2104.5   1533.0   1736.6   1265.5
vp9_inv_dct_dct_16x16_sub8_add_neon:    2484.8   1828.7   2014.4   1506.5
vp9_inv_dct_dct_16x16_sub12_add_neon:   2851.2   2117.8   2294.8   1753.2
vp9_inv_dct_dct_16x16_sub16_add_neon:   3239.4   2408.3   2543.5   1994.9
vp9_inv_dct_dct_32x32_sub1_add_neon:     758.3    456.7    864.5    553.9
vp9_inv_dct_dct_32x32_sub2_add_neon:   10776.7   7949.8   8567.7   6819.7
vp9_inv_dct_dct_32x32_sub4_add_neon:   10865.6   8131.5   8589.6   6816.3
vp9_inv_dct_dct_32x32_sub8_add_neon:   12053.9   9271.3   9387.7   7564.0
vp9_inv_dct_dct_32x32_sub12_add_neon:  13328.3  10463.2  10217.0   8321.3
vp9_inv_dct_dct_32x32_sub16_add_neon:  14176.4  11509.5  11018.7   9062.3
vp9_inv_dct_dct_32x32_sub20_add_neon:  15301.5  12999.9  11855.1   9828.2
vp9_inv_dct_dct_32x32_sub24_add_neon:  16482.7  14931.5  12650.1  10575.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17589.5  15811.9  13482.8  11333.4
vp9_inv_dct_dct_32x32_sub32_add_neon:  18696.2  17049.2  14355.6  12089.7

After:
vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    211.7    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    1203.5    998.2   1035.3    763.0
vp9_inv_dct_dct_16x16_sub4_add_neon:    1203.5    998.1   1035.5    760.8
vp9_inv_dct_dct_16x16_sub8_add_neon:    1926.1   1610.6   1722.1   1271.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   2873.2   2129.7   2285.1   1757.3
vp9_inv_dct_dct_16x16_sub16_add_neon:   3221.4   2520.3   2557.6   2002.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     753.0    457.5    866.6    554.6
vp9_inv_dct_dct_32x32_sub2_add_neon:    7554.6   5652.4   6048.4   4920.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    7549.9   5685.0   6046.9   4925.7
vp9_inv_dct_dct_32x32_sub8_add_neon:    8336.9   6704.5   6604.0   5478.0
vp9_inv_dct_dct_32x32_sub12_add_neon:  10914.0   9777.2   9240.4   7416.9
vp9_inv_dct_dct_32x32_sub16_add_neon:  11859.2  11223.3   9966.3   8095.1
vp9_inv_dct_dct_32x32_sub20_add_neon:  15237.1  13029.4  11838.3   9829.4
vp9_inv_dct_dct_32x32_sub24_add_neon:  16293.2  14379.8  12644.9  10572.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17424.3  15734.7  13473.0  11326.9
vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.3  17457.0  14298.6  12080.0

This is cherrypicked from libav commit
5eb5aec475.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:25 +02:00
Martin Storsjö
a681c793a3 aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function
This allows reusing the macro for a separate implementation of the
pass2 function.

This is cherrypicked from libav commit
79d332ebbd.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:24 +02:00