1
0
mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-11-26 19:01:44 +02:00
Commit Graph

83278 Commits

Author SHA1 Message Date
Muhammad Faiz
c4a3526b57 avfilter/showcqt: make minimum timeclamp option lower
high basefreq does not require high timeclamp

Signed-off-by: Muhammad Faiz <mfcc64@gmail.com>
2017-01-30 05:41:49 +07:00
Matthieu Bouron
2ae8278832 lavc/mjpegdec: consume SOS data even if the frame is discarded
Speeds up next marker search when a SOS marker is found but the frame is
discarded (which happens in avformat_find_stream_info).
2017-01-29 21:54:16 +01:00
Nicolas George
383057f8e7 lavfi: make ff_framequeue_skip_samples() more useful.
Instead of just updating statistics and leaving the work to the
call site, have it actually do the work.

Also: skip the samples by updating the frame data pointers
instead of moving the samples. More efficient and avoid writing
into shared frames.
Found-By: Muhammad Faiz <mfcc64@gmail.com>
2017-01-29 18:53:11 +01:00
Rostislav Pehlivanov
e05d2dd86a doc/examples/decoder_targeted: move to tools/target_dec_fuzzer.c
Name and purpose are more appropriate there since the code isn't
an ideal example.

Reviewed-by: wm4 <nfxjfg@googlemail.com>
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
2017-01-29 16:14:18 +00:00
Michael Niedermayer
bbd4d92304 doc/examples/decoder_targeted: Disable error concealment after 20 frames
This allows testing EC and non EC. Avoids spending most time in EC on
high res samples and reduces the likelyhood of hitting timeouts

Fixes: Timeout in 467/fuzz-2-ffmpeg_VIDEO_AV_CODEC_ID_H263_fuzzer

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-29 16:09:55 +01:00
Paul B Mahol
c6f7f33eec avfilter/vf_remap: add . at end of long description
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-01-29 13:29:33 +01:00
Andreas Cadhalpun
9ec8790ac4 boadec: prevent overflow during block alignment calculation
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com>
2017-01-29 01:20:52 +01:00
Andreas Cadhalpun
169c1cfa92 pvfdec: prevent overflow during block alignment calculation
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com>
2017-01-29 01:20:52 +01:00
Andreas Cadhalpun
8812d047bc electronicarts: prevent overflow during block alignment calculation
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com>
2017-01-29 01:20:52 +01:00
Andreas Cadhalpun
e3f13d3a87 4xm: prevent overflow during block alignment calculation
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com>
2017-01-29 01:20:48 +01:00
Marijn Meijles
227d602bb3 avformat/ac3dec: Fix to prevent runaway ac3 detection by looking at the actual frame rather than the first detected frame.
When detecting a swapped AC3 marker the data of the frame is swapped. However, in subsequent frames the data swapped is taken from the first frame rather than the current frame.

Signed-off-by: Marijn Meijles <marijn@bitpit.net>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-28 23:08:42 +01:00
Paul Arzelier
65862f57ad avformat: Ignore ID3v2 tags if other tags are present e.g. vorbis
Reviewed-by: wm4 <nfxjfg@googlemail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-28 23:08:42 +01:00
James Almer
dce863421b avformat/matroskaenc: don't reserve more bytes than needed for the Colour master size
Found-by: Aaron Colwell <acolwell@google.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2017-01-28 13:46:26 -03:00
Paul B Mahol
4cfa1f80a9 avformat/sccdec: attempt to fix valgrind issue
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-01-28 17:23:31 +01:00
Chris Moeller
ecd360041e avformat: fix ID3v2 parser for v2.2 comment frames
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-28 13:52:09 +01:00
Aaron Colwell
b9f2f93261 mov: Fix spherical metadata_source parsing
Signed-off-by: James Almer <jamrial@gmail.com>
2017-01-27 22:52:33 -03:00
Michael Niedermayer
6294247730 avfilter/vf_gblur: Increase supported pixel count from 31bit to 32bit in filter_postscale()
Fixes CID1396252

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-27 22:16:37 +01:00
Sasi Inguva
03e42a4fec ffmpeg.c: Add output file index and stream index to vstats file.
Signed-off-by: Sasi Inguva <isasi@google.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-27 22:16:37 +01:00
Sasi Inguva
e4a1d87ef8 lavf/matroskaenc.c: Free dyn bufs in mkv_free. Fixes memory leaks when muxing fails.
Signed-off-by: Sasi Inguva <isasi@google.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-27 22:16:37 +01:00
Paul B Mahol
d1df72a702 fate: add SCC test
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-01-27 17:06:42 +01:00
Paul B Mahol
836c8750b3 avfilter/avf_showspectrum: fix 2 possible crashes
Make sure no division by zero is done.
Make sure there are actually samples available.

Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-01-27 13:37:00 +01:00
Paul B Mahol
29fbce8f81 doc/filters: mention recently added option
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-01-27 12:13:42 +01:00
Carl Eugen Hoyos
fca3083282 lavf/img2dec: Reduce the probe score for incomplete jpgs.
Ensures that probing doesn't finish prematurely for small files.
2017-01-27 08:31:07 +01:00
Michael Niedermayer
f28299da8d avcodec/h264dec: Clear ref_count on slice header processing failure
Fixes using freed memory
Introduced in 7448019890
Fixes: 471/fuzz-1-ffmpeg_VIDEO_AV_CODEC_ID_H264_fuzzer

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-27 00:52:49 +01:00
James Almer
1ae39429e4 avformat/matroskadec: ProjectionPrivate is optional on Equirectangular projections
This reflects a recent change to the spec draft.

Signed-off-by: James Almer <jamrial@gmail.com>
2017-01-26 19:28:19 -03:00
Joel Cunningham
f3778108d3 tcp: set socket buffer sizes before listen/connect/accept
From e24d95c0e06a878d401ee34fd6742fcaddeeb95f Mon Sep 17 00:00:00 2001
From: Joel Cunningham <joel.cunningham@me.com>
Date: Mon, 9 Jan 2017 13:37:51 -0600
Subject: [PATCH] tcp: set socket buffer sizes before listen/connect/accept

Attempting to set SO_RCVBUF and SO_SNDBUF on TCP sockets after connection
establishment is incorrect and some stacks ignore the set call on the socket at
this point.  This has been observed on MacOS/iOS.  Windows 7 has some peculiar
behavior where setting SO_RCVBUF after applies only if the buffer is increasing
from the default while decreases are ignored.  This is possibly how the incorrect
usage has gone unnoticed

Unix Network Programming Vol. 1: The Sockets Networking API (3rd edition, seciton 7.5):

"When setting the size of the TCP socket receive buffer, the ordering of the
function calls is important.  This is because of TCP's window scale option,
which is exchanged with the peer on SYN segments when the connection is
established. For a client, this means the SO_RCVBUF socket option must be
set before calling connect.  For a server, this means the socket option must
be set for the listening socket before calling listen.  Setting this option
for the connected socket will have no effect whatsoever on the possible window
scale option because accept does not return with the connected socket until
TCP's three-way handshake is complete.  This is why the option must be set on
the listening socket. (The sizes of the socket buffers are always inherited from
the listening socket by the newly created connected socket)"

Signed-off-by: Joel Cunningham <joel.cunningham@me.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-26 20:19:18 +01:00
Paul B Mahol
ee8e00b703 avfilter: add abitscope multimedia filter
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-01-26 16:21:25 +01:00
Frank Liberato
95bde49982 avformat/flacdec: Check avio_read result when reading flac block header.
Return AVERROR_INVALIDDATA if all four bytes aren't present.

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-25 22:08:28 +01:00
Sasi Inguva
f227fc4c2a ffmpeg_opt.c: Introduce a -vstats_version option and document the existing -vstats format.
Signed-off-by: Sasi Inguva <isasi@google.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-25 22:03:10 +01:00
Paul B Mahol
b4a13d442a avcodecc/ccaption_dec: remove extra word from long codec description
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-01-25 12:00:02 +01:00
Paul B Mahol
45ff6ef50e avformat: add Scenarist Closed Captions demuxer
Fixes #4767.

Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-01-25 12:00:02 +01:00
Paul B Mahol
b953aec3c4 avformat: add Sample Dump eXchange demuxer
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-01-25 12:00:02 +01:00
Carl Eugen Hoyos
a135b017de lavf/mov: Unscramble dref debug output. 2017-01-25 11:49:04 +01:00
compn
5316ed899f isom: map xalg and avlg to h264, fixes ticket #6099 2017-01-24 23:46:38 -05:00
Michael Niedermayer
2080bc3371 avcodec/utils: correct align value for interplay
Fixes out of array access
Fixes: 452/fuzz-1-ffmpeg_VIDEO_AV_CODEC_ID_INTERPLAY_VIDEO_fuzzer

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-25 00:56:37 +01:00
Carl Eugen Hoyos
0b607228bf Cosmetics: Reindent after last commit. 2017-01-25 00:55:36 +01:00
Carl Eugen Hoyos
9d5141d1fb lavd/v4l2: Avoid setting frame_size to a negative value. 2017-01-25 00:54:10 +01:00
Carl Eugen Hoyos
75bd4ea024 lavf/rtmpproto: Make bytes_read variables 64bit.
When bytes_read overflowed, last_bytes_read did not yet overflow
and no bytes-read report was created leading to a timeout.

Analyzed-by: Thomas Bernhard

Fixes ticket #5836.
2017-01-25 00:39:13 +01:00
Marton Balint
977fd88419 avfilter/formats: do not allow unknown layouts in ff_parse_channel_layout if nret is not set
Current code returned the number of channels as channel layout in that case,
and if nret is not set then unknown layouts are typically not supported.

Also use the common parsing code. Use a temporary workaround to parse an
unknown channel layout such as '13c', after a 1 year grace period only '13C'
will work.

Signed-off-by: Marton Balint <cus@passwd.hu>
2017-01-24 23:51:36 +01:00
Marton Balint
c4618f842a avutil/channel_layout: add av_get_extended_channel_layout
Return a channel layout and the number of channels based on the specified name.

This function is similar to av_get_channel_layout(), but can also parse unknown
channel layout specifications.

Unknown channel layout specifications are a decimal number and a capital 'C'
suffix, in order to not break compatibility with the lowercase 'c' suffix,
which is used for a guessed channel layout with the specified number of
channels.

Signed-off-by: Marton Balint <cus@passwd.hu>
2017-01-24 23:51:36 +01:00
Marton Balint
5049f05f27 avutil/channel_layout: fix remains of old syntax in docs and comments
Signed-off-by: Marton Balint <cus@passwd.hu>
2017-01-24 23:51:36 +01:00
Carl Eugen Hoyos
6d6faa2a2d lavc/svq3: Fail for media key encryption.
Tested-by: ami_stuff

Fixes a part of ticket #6094.
2017-01-24 23:40:13 +01:00
Michael Niedermayer
9e6a242755 avcodec/vp56: Check for the bitstream end, pass error codes on
Fixes timeout
Fixes: 446/fuzz-3-ffmpeg_VIDEO_AV_CODEC_ID_VP6_fuzzer

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-24 23:01:04 +01:00
Martin Storsjö
9f10cff610 aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter
This work is sponsored by, and copyright, Google.

This is similar to the arm version, but due to the larger registers
on aarch64, we can do 8 pixels at a time for all filter sizes.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                             ARM AArch64
vp9_loop_filter_h_4_8_10bpp_neon:          213.2   172.6
vp9_loop_filter_h_8_8_10bpp_neon:          281.2   244.2
vp9_loop_filter_h_16_8_10bpp_neon:         657.0   444.5
vp9_loop_filter_h_16_16_10bpp_neon:       1280.4   877.7
vp9_loop_filter_mix2_h_44_16_10bpp_neon:   397.7   358.0
vp9_loop_filter_mix2_h_48_16_10bpp_neon:   465.7   429.0
vp9_loop_filter_mix2_h_84_16_10bpp_neon:   465.7   428.0
vp9_loop_filter_mix2_h_88_16_10bpp_neon:   533.7   499.0
vp9_loop_filter_mix2_v_44_16_10bpp_neon:   271.5   244.0
vp9_loop_filter_mix2_v_48_16_10bpp_neon:   330.0   305.0
vp9_loop_filter_mix2_v_84_16_10bpp_neon:   329.0   306.0
vp9_loop_filter_mix2_v_88_16_10bpp_neon:   386.0   365.0
vp9_loop_filter_v_4_8_10bpp_neon:          150.0   115.2
vp9_loop_filter_v_8_8_10bpp_neon:          209.0   175.5
vp9_loop_filter_v_16_8_10bpp_neon:         492.7   345.2
vp9_loop_filter_v_16_16_10bpp_neon:        951.0   682.7

This is significantly faster than the ARM version in almost
all cases except for the mix2 functions.

Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 2-3x.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:36:11 +02:00
Martin Storsjö
ceb36b8178 aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfm
This work is sponsored by, and copyright, Google.

Compared to the arm version, on aarch64 we can keep the full 8x8
transform in registers, and for 16x16 and 32x32, we can process
it in slices of 4 pixels instead of 2.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                                ARM  AArch64
vp9_inv_adst_adst_4x4_sub4_add_10_neon:       111.0    109.7
vp9_inv_adst_adst_8x8_sub8_add_10_neon:       914.0    733.5
vp9_inv_adst_adst_16x16_sub16_add_10_neon:   5184.0   3745.7
vp9_inv_dct_dct_4x4_sub1_add_10_neon:          65.0     65.7
vp9_inv_dct_dct_4x4_sub4_add_10_neon:         100.0     96.7
vp9_inv_dct_dct_8x8_sub1_add_10_neon:         111.0    119.7
vp9_inv_dct_dct_8x8_sub8_add_10_neon:         618.0    494.7
vp9_inv_dct_dct_16x16_sub1_add_10_neon:       295.1    284.6
vp9_inv_dct_dct_16x16_sub2_add_10_neon:      2303.2   1883.9
vp9_inv_dct_dct_16x16_sub8_add_10_neon:      2984.8   2189.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     3890.0   2799.4
vp9_inv_dct_dct_32x32_sub1_add_10_neon:      1044.4   1012.7
vp9_inv_dct_dct_32x32_sub2_add_10_neon:     13333.7   9695.1
vp9_inv_dct_dct_32x32_sub16_add_10_neon:    18531.3  12459.8
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    24470.7  16160.2
vp9_inv_wht_wht_4x4_sub4_add_10_neon:          83.0     79.7

The larger transforms are significantly faster than the corresponding
ARM versions.

The speedup vs C code is smaller than in 32 bit mode, probably
because the 64 bit intermediates in the C code can be expressed
more efficiently in aarch64.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:36:08 +02:00
Martin Storsjö
638eceed47 aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC
This work is sponsored by, and copyright, Google.

This has mostly got the same differences to the 8 bit version as
in the arm version. For the horizontal filters, we do 16 pixels
in parallel as well. For the 8 pixel wide vertical filters, we can
accumulate 4 rows before storing, just as in the 8 bit version.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                           ARM   AArch64
vp9_avg4_10bpp_neon:                      35.7      30.7
vp9_avg8_10bpp_neon:                      93.5      84.7
vp9_avg16_10bpp_neon:                    324.4     296.6
vp9_avg32_10bpp_neon:                   1236.5    1148.2
vp9_avg64_10bpp_neon:                   4639.6    4571.1
vp9_avg_8tap_smooth_4h_10bpp_neon:       130.0     128.0
vp9_avg_8tap_smooth_4hv_10bpp_neon:      440.0     440.5
vp9_avg_8tap_smooth_4v_10bpp_neon:       114.0     105.5
vp9_avg_8tap_smooth_8h_10bpp_neon:       327.0     314.0
vp9_avg_8tap_smooth_8hv_10bpp_neon:      918.7     865.4
vp9_avg_8tap_smooth_8v_10bpp_neon:       330.0     300.2
vp9_avg_8tap_smooth_16h_10bpp_neon:     1187.5    1155.5
vp9_avg_8tap_smooth_16hv_10bpp_neon:    2663.1    2591.0
vp9_avg_8tap_smooth_16v_10bpp_neon:     1107.4    1078.3
vp9_avg_8tap_smooth_64h_10bpp_neon:    17754.6   17454.7
vp9_avg_8tap_smooth_64hv_10bpp_neon:   33285.2   33001.5
vp9_avg_8tap_smooth_64v_10bpp_neon:    16066.9   16048.6
vp9_put4_10bpp_neon:                      25.5      21.7
vp9_put8_10bpp_neon:                      56.0      52.0
vp9_put16_10bpp_neon/armv8:              183.0     163.1
vp9_put32_10bpp_neon/armv8:              678.6     563.1
vp9_put64_10bpp_neon/armv8:             2679.9    2195.8
vp9_put_8tap_smooth_4h_10bpp_neon:       120.0     118.0
vp9_put_8tap_smooth_4hv_10bpp_neon:      435.2     435.0
vp9_put_8tap_smooth_4v_10bpp_neon:       107.0      98.2
vp9_put_8tap_smooth_8h_10bpp_neon:       303.0     290.0
vp9_put_8tap_smooth_8hv_10bpp_neon:      893.7     828.7
vp9_put_8tap_smooth_8v_10bpp_neon:       305.5     263.5
vp9_put_8tap_smooth_16h_10bpp_neon:     1089.1    1059.2
vp9_put_8tap_smooth_16hv_10bpp_neon:    2578.8    2452.4
vp9_put_8tap_smooth_16v_10bpp_neon:     1009.5     933.5
vp9_put_8tap_smooth_64h_10bpp_neon:    16223.4   15918.6
vp9_put_8tap_smooth_64hv_10bpp_neon:   32153.0   31016.2
vp9_put_8tap_smooth_64v_10bpp_neon:    14516.5   13748.1

These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.

The speedup vs C code is around 4-9x.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:36:05 +02:00
Martin Storsjö
48ad3fe1be aarch64: vp9dsp: Restructure the bpp checks
This work is sponsored by, and copyright, Google.

This is more in line with how it will be extended for more bitdepths.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:36:02 +02:00
Martin Storsjö
1e5d87eec3 arm: Add NEON optimizations for 10 and 12 bit vp9 loop filter
This work is sponsored by, and copyright, Google.

This is pretty much similar to the 8 bpp version, but in some senses
simpler. All input pixels are 16 bits, and all intermediates also fit
in 16 bits, so there's no lengthening/narrowing in the filter at all.

For the full 16 pixel wide filter, we can only process 4 pixels at a time
(using an implementation very much similar to the one for 8 bpp),
but we can do 8 pixels at a time for the 4 and 8 pixel wide filters with
a different implementation of the core filter.

Examples of relative speedup compared to the C version, from checkasm:
                                   Cortex    A7     A8     A9    A53
vp9_loop_filter_h_4_8_10bpp_neon:          1.83   2.16   1.40   2.09
vp9_loop_filter_h_8_8_10bpp_neon:          1.39   1.67   1.24   1.70
vp9_loop_filter_h_16_8_10bpp_neon:         1.56   1.47   1.10   1.81
vp9_loop_filter_h_16_16_10bpp_neon:        1.94   1.69   1.33   2.24
vp9_loop_filter_mix2_h_44_16_10bpp_neon:   2.01   2.27   1.67   2.39
vp9_loop_filter_mix2_h_48_16_10bpp_neon:   1.84   2.06   1.45   2.19
vp9_loop_filter_mix2_h_84_16_10bpp_neon:   1.89   2.20   1.47   2.29
vp9_loop_filter_mix2_h_88_16_10bpp_neon:   1.69   2.12   1.47   2.08
vp9_loop_filter_mix2_v_44_16_10bpp_neon:   3.16   3.98   2.50   4.05
vp9_loop_filter_mix2_v_48_16_10bpp_neon:   2.84   3.64   2.25   3.77
vp9_loop_filter_mix2_v_84_16_10bpp_neon:   2.65   3.45   2.16   3.54
vp9_loop_filter_mix2_v_88_16_10bpp_neon:   2.55   3.30   2.16   3.55
vp9_loop_filter_v_4_8_10bpp_neon:          2.85   3.97   2.24   3.68
vp9_loop_filter_v_8_8_10bpp_neon:          2.27   3.19   1.96   3.08
vp9_loop_filter_v_16_8_10bpp_neon:         3.42   2.74   2.26   4.40
vp9_loop_filter_v_16_16_10bpp_neon:        2.86   2.44   1.93   3.88

The speedup vs C code measured in checkasm is around 1.1-4x.
These numbers are quite inconclusive though, since the checkasm test
runs multiple filterings on top of each other, so later rounds might
end up with different codepaths (different decisions on which filter
to apply, based on input pixel differences).

Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 2-4x.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:35:59 +02:00
Martin Storsjö
2ed67eba96 arm: Add NEON optimizations for 10 and 12 bit vp9 itxfm
This work is sponsored by, and copyright, Google.

This is structured similarly to the 8 bit version. In the 8 bit
version, the coefficients are 16 bits, and intermediates are 32 bits.

Here, the coefficients are 32 bit. For the 4x4 transforms for 10 bit
content, the intermediates also fit in 32 bits, but for all other
transforms (4x4 for 12 bit content, and 8x8 and larger for both 10
and 12 bit) the intermediates are 64 bit.

For the existing 8 bit case, the 8x8 transform fit all coefficients in
registers; for 10/12 bit, when the coefficients are 32 bit, the 8x8
transform also has to be done in slices of 4 pixels (just as 16x16 and
32x32 for 8 bit).

The slice width also shrinks from 4 elements to 2 elements in parallel
for the 16x16 and 32x32 cases.

The 16 bit coefficients from idct_coeffs and similar tables also need
to be lenghtened to 32 bit in order to be used in multiplication with
vectors with 32 bit elements. This leads to the fixed coefficient
vectors needing more space, leading to more cases where they have to
be reloaded within the transform (in iadst16).

This technically would need testing in checkasm for subpartitions
in increments of 2, but that slows down normal checkasm runs
excessively.

Examples of relative speedup compared to the C version, from checkasm:
                                     Cortex    A7     A8     A9    A53
vp9_inv_adst_adst_4x4_sub4_add_10_neon:      4.83  11.36   5.22   6.77
vp9_inv_adst_adst_8x8_sub8_add_10_neon:      4.12   7.60   4.06   4.84
vp9_inv_adst_adst_16x16_sub16_add_10_neon:   3.93   8.16   4.52   5.35
vp9_inv_dct_dct_4x4_sub1_add_10_neon:        1.36   2.57   1.41   1.61
vp9_inv_dct_dct_4x4_sub4_add_10_neon:        4.24   8.66   5.06   5.81
vp9_inv_dct_dct_8x8_sub1_add_10_neon:        2.63   4.18   1.68   2.87
vp9_inv_dct_dct_8x8_sub4_add_10_neon:        4.52   9.47   4.24   5.39
vp9_inv_dct_dct_8x8_sub8_add_10_neon:        3.45   7.34   3.45   4.30
vp9_inv_dct_dct_16x16_sub1_add_10_neon:      3.56   6.21   2.47   4.32
vp9_inv_dct_dct_16x16_sub2_add_10_neon:      5.68  12.73   5.28   7.07
vp9_inv_dct_dct_16x16_sub8_add_10_neon:      4.42   9.28   4.24   5.45
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     3.41   7.29   3.35   4.19
vp9_inv_dct_dct_32x32_sub1_add_10_neon:      4.52   8.35   3.83   6.40
vp9_inv_dct_dct_32x32_sub2_add_10_neon:      5.86  13.19   6.14   7.04
vp9_inv_dct_dct_32x32_sub16_add_10_neon:     4.29   8.11   4.59   5.06
vp9_inv_dct_dct_32x32_sub32_add_10_neon:     3.31   5.70   3.56   3.84
vp9_inv_wht_wht_4x4_sub4_add_10_neon:        1.89   2.80   1.82   1.97

The speedup compared to the C functions is around 1.3 to 7x for the
full transforms, even higher for the smaller subpartitions.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:35:56 +02:00
Martin Storsjö
a4d4bad75c arm: Add NEON optimizations for 10 and 12 bit vp9 MC
This work is sponsored by, and copyright, Google.

The plain pixel put/copy functions are used from the 8 bit version,
for the double size (e.g. put16 uses ff_vp9_copy32_neon), and a new
copy128 is added.

Compared with the 8 bit version, the filters can no longer use the
trick to accumulate in 16 bit with only saturation at the end, but now
the accumulators need to be 32 bit. This avoids the need to keep track
of which filter index is the largest though, reducing the size of the
executable code for these filters.

For the horizontal filters, we only do 4 or 8 pixels wide in parallel
(while doing two rows at a time), since we don't have enough register
space to filter 16 pixels wide.

For the vertical filters, we still do 4 and 8 pixels in parallel just
as in the 8 bit case, but we need to store the output after every 2
rows instead of after every 4 rows.

Examples of relative speedup compared to the C version, from checkasm:
                               Cortex    A7     A8     A9    A53
vp9_avg4_10bpp_neon:                   2.25   2.44   3.05   2.16
vp9_avg8_10bpp_neon:                   3.66   8.48   3.86   3.50
vp9_avg16_10bpp_neon:                  3.39   8.26   3.37   2.72
vp9_avg32_10bpp_neon:                  4.03  10.20   4.07   3.42
vp9_avg64_10bpp_neon:                  4.15  10.01   4.13   3.70
vp9_avg_8tap_smooth_4h_10bpp_neon:     3.38   6.22   3.41   4.75
vp9_avg_8tap_smooth_4hv_10bpp_neon:    3.89   6.39   4.30   5.32
vp9_avg_8tap_smooth_4v_10bpp_neon:     5.32   9.73   6.34   7.31
vp9_avg_8tap_smooth_8h_10bpp_neon:     4.45   9.40   4.68   6.87
vp9_avg_8tap_smooth_8hv_10bpp_neon:    4.64   8.91   5.44   6.47
vp9_avg_8tap_smooth_8v_10bpp_neon:     6.44  13.42   8.68   8.79
vp9_avg_8tap_smooth_64h_10bpp_neon:    4.66   9.02   4.84   7.71
vp9_avg_8tap_smooth_64hv_10bpp_neon:   4.61   9.14   4.92   7.10
vp9_avg_8tap_smooth_64v_10bpp_neon:    6.90  14.13   9.57  10.41
vp9_put4_10bpp_neon:                   1.33   1.46   2.09   1.33
vp9_put8_10bpp_neon:                   1.57   3.42   1.83   1.84
vp9_put16_10bpp_neon:                  1.55   4.78   2.17   1.89
vp9_put32_10bpp_neon:                  2.06   5.35   2.14   2.30
vp9_put64_10bpp_neon:                  3.00   2.41   1.95   1.66
vp9_put_8tap_smooth_4h_10bpp_neon:     3.19   5.81   3.31   4.63
vp9_put_8tap_smooth_4hv_10bpp_neon:    3.86   6.22   4.32   5.21
vp9_put_8tap_smooth_4v_10bpp_neon:     5.40   9.77   6.08   7.21
vp9_put_8tap_smooth_8h_10bpp_neon:     4.22   8.41   4.46   6.63
vp9_put_8tap_smooth_8hv_10bpp_neon:    4.56   8.51   5.39   6.25
vp9_put_8tap_smooth_8v_10bpp_neon:     6.60  12.43   8.17   8.89
vp9_put_8tap_smooth_64h_10bpp_neon:    4.41   8.59   4.54   7.49
vp9_put_8tap_smooth_64hv_10bpp_neon:   4.43   8.58   5.34   6.63
vp9_put_8tap_smooth_64v_10bpp_neon:    7.26  13.92   9.27  10.92

For the larger 8tap filters, the speedup vs C code is around 4-14x.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:35:50 +02:00