The previous implementation of the parser made four passes over each input
buffer (reduced to two if the container format already guaranteed the input
buffer corresponded to frames, such as with MKV). But these buffers are
often 200K in size, certainly enough to flush the data out of L1 cache, and
for many CPUs, all the way out to main memory. The passes were:
1) locate frame boundaries (not needed for MKV etc)
2) copy the data into a contiguous block (not needed for MKV etc)
3) locate the start codes within each frame
4) unescape the data between start codes
After this, the unescaped data was parsed to extract certain header fields,
but because the unescape operation was so large, this was usually also
effectively operating on uncached memory. Most of the unescaped data was
simply thrown away and never processed further. Only step 2 - because it
used memcpy - was using prefetch, making things even worse.
This patch reorganises these steps so that, aside from the copying, the
operations are performed in parallel, maximising cache utilisation. No more
than the worst-case number of bytes needed for header parsing is unescaped.
Most of the data is, in practice, only read in order to search for a start
code, for which optimised implementations already existed in the H264 codec
(notably the ARM version uses prefetch, so we end up doing both remaining
passes at maximum speed). For MKV files, we know when we've found the last
start code of interest in a given frame, so we are able to avoid doing even
that one remaining pass for most of the buffer.
In some use-cases (such as the Raspberry Pi) video decode is handled by the
GPU, but the entire elementary stream is still fed through the parser to
pick out certain elements of the header which are necessary to manage the
decode process. As you might expect, in these cases, the performance of the
parser is significant.
To measure parser performance, I used the same VC-1 elementary stream in
either an MPEG-2 transport stream or a MKV file, and fed it through avconv
with -c:v copy -c:a copy -f null. These are the gperftools counts for
those streams, both filtered to only include vc1_parse() and its callees,
and unfiltered (to include the whole binary). Lower numbers are better:
Before After
File Filtered Mean StdDev Mean StdDev Confidence Change
M2TS No 861.7 8.2 650.5 8.1 100.0% +32.5%
MKV No 868.9 7.4 731.7 9.0 100.0% +18.8%
M2TS Yes 250.0 11.2 27.2 3.4 100.0% +817.9%
MKV Yes 149.0 12.8 1.7 0.8 100.0% +8526.3%
Yes, that last case shows vc1_parse() running 86 times faster! The M2TS
case does show a larger absolute improvement though, since it was worse
to begin with.
This patch has been tested with the FATE suite (albeit on x86 for speed).
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
The previous implementation of the parser made four passes over each input
buffer (reduced to two if the container format already guaranteed the input
buffer corresponded to frames, such as with MKV). But these buffers are
often 200K in size, certainly enough to flush the data out of L1 cache, and
for many CPUs, all the way out to main memory. The passes were:
1) locate frame boundaries (not needed for MKV etc)
2) copy the data into a contiguous block (not needed for MKV etc)
3) locate the start codes within each frame
4) unescape the data between start codes
After this, the unescaped data was parsed to extract certain header fields,
but because the unescape operation was so large, this was usually also
effectively operating on uncached memory. Most of the unescaped data was
simply thrown away and never processed further. Only step 2 - because it
used memcpy - was using prefetch, making things even worse.
This patch reorganises these steps so that, aside from the copying, the
operations are performed in parallel, maximising cache utilisation. No more
than the worst-case number of bytes needed for header parsing is unescaped.
Most of the data is, in practice, only read in order to search for a start
code, for which optimised implementations already existed in the H264 codec
(notably the ARM version uses prefetch, so we end up doing both remaining
passes at maximum speed). For MKV files, we know when we've found the last
start code of interest in a given frame, so we are able to avoid doing even
that one remaining pass for most of the buffer.
In some use-cases (such as the Raspberry Pi) video decode is handled by the
GPU, but the entire elementary stream is still fed through the parser to
pick out certain elements of the header which are necessary to manage the
decode process. As you might expect, in these cases, the performance of the
parser is significant.
To measure parser performance, I used the same VC-1 elementary stream in
either an MPEG-2 transport stream or a MKV file, and fed it through ffmpeg
with -c:v copy -c:a copy -f null. These are the gperftools counts for
those streams, both filtered to only include vc1_parse() and its callees,
and unfiltered (to include the whole binary). Lower numbers are better:
Before After
File Filtered Mean StdDev Mean StdDev Confidence Change
M2TS No 861.7 8.2 650.5 8.1 100.0% +32.5%
MKV No 868.9 7.4 731.7 9.0 100.0% +18.8%
M2TS Yes 250.0 11.2 27.2 3.4 100.0% +817.9%
MKV Yes 149.0 12.8 1.7 0.8 100.0% +8526.3%
Yes, that last case shows vc1_parse() running 86 times faster! The M2TS
case does show a larger absolute improvement though, since it was worse
to begin with.
This patch has been tested with the FATE suite (albeit on x86 for speed).
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
first_pic_header_flag needs to be set to allow the parsing code to change
some stream parameters, and not error out.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'accde1bd8756936e1757b42fc9bad2eb5d192f8a':
vc1_parser: Set field_order.
mpegvideo_parser: Set field_order.
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
mov: Use defines for sample flags in fragments
mov: Use defines for trun flags
mov: Use defines for tfhd flags
proresenc: force bitrate not to exceed given limit
vc1parse: call vc1_init_common().
wma: don't return 0 on invalid packets.
asf: prevent packet_size_left from going negative if hdrlen > pktlen.
mjpegb: don't return 0 at the end of frame decoding.
rtpdec: Identify incorrectly signalled H263
vp8dsp: split long line.
aiff: don't skip block_align==0 check on COMM-after-SSND files.
dpcm: ignore extra unpaired bytes in stereo streams.
mp3on4: require a minimum framesize.
mpc7: assign an error level + context to av_log() msg.
huffyuv: error out on bit overrun.
dct-test: Add the missing ff_ prefix to the altivec functions
dct-test: Remove a stray declaration of a nonexistent function
movenc: Write the unknown duration as 64 bit fields in ismv
movenc: Write track durations with all bits set if duration is unknown
Conflicts:
libavcodec/dct-test.c
libavcodec/wmadec.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
The parser uses VLC tables initialized in vc1_common_init(), therefore
we should call this function on parser init also.
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
CC: libav-stable@libav.org
* qatar/master: (38 commits)
v210enc: remove redundant check for pix_fmt
wavpack: allow user to disable CRC checking
v210enc: Use Bytestream2 functions
cafdec: Check return value of avio_seek and avoid modifying state if it fails
yop: Check return value of avio_seek and avoid modifying state if it fails
tta: Check return value of avio_seek and avoid modifying state if it fails
tmv: Check return value of avio_seek and avoid modifying state if it fails
r3d: Check return value of avio_seek and avoid modifying state if it fails
nsvdec: Check return value of avio_seek and avoid modifying state if it fails
mpc8: Check return value of avio_seek and avoid modifying state if it fails
jvdec: Check return value of avio_seek and avoid modifying state if it fails
filmstripdec: Check return value of avio_seek and avoid modifying state if it fails
ffmdec: Check return value of avio_seek and avoid modifying state if it fails
dv: Check return value of avio_seek and avoid modifying state if it fails
bink: Check return value of avio_seek and avoid modifying state if it fails
Check AVCodec.pix_fmts in avcodec_open2()
svq3: Prevent illegal reads while parsing extradata.
remove ParseContext1
vc1: use ff_parse_close
mpegvideo parser: move specific fields into private context
...
Conflicts:
libavcodec/4xm.c
libavcodec/aacdec.c
libavcodec/h264.c
libavcodec/h264.h
libavcodec/h264_cabac.c
libavcodec/h264_cavlc.c
libavcodec/mpeg4video_parser.c
libavcodec/svq3.c
libavcodec/v210enc.c
libavformat/cafdec.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
flicvideo: fix invalid reads
vorbis: Avoid some out-of-bounds reads
vqf: add more known extensions
cabac: remove unused function renorm_cabac_decoder
h264: Only use symbols from the SVQ3 decoder under proper conditionals.
add bytestream2_tell() and bytestream2_seek() functions
parsers: initialize MpegEncContext.slice_context_count to 1
spdifenc: use special alignment for DTS-HD length_code
Conflicts:
libavcodec/flicvideo.c
libavcodec/h264.c
libavcodec/mpeg4video_parser.c
libavcodec/vorbis.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
The mpeg4 video, H264 and VC-1 parser hold (directly or indirectly)
a MpegEncContext in their private context. Since they do not call the
common mpegvideo init function slice_context_count has explicitly set
to 1.
Prevents a null pointer dereference in the h264 parser and fixes
bug 193.
* qatar/master: (44 commits)
replacement Indeo 3 decoder
gsm demuxer: do not allocate packet twice.
flvenc: use first packet delay as global delay.
ac3enc: doxygen update.
imc: return error codes instead of 0 for error conditions.
imc: return meaningful error codes instead of -1
imc: do not set channel layout for stereo
imc: validate channel count
imc: check for ff_fft_init() failure
imc: check output buffer size before decoding
imc: use DSPContext.bswap16_buf() to byte-swap packet data
rtsp: add allowed_media_types option
libgsm: add flush function to reset the decoder state when seeking
libgsm: simplify decoding by using a loop
gsm: log error message when packet is too small
libgsmdec: do not needlessly set *data_size to 0
gsmdec: do not needlessly set *data_size to 0
gsmdec: add flush function to reset the decoder state when seeking
libgsmdec: check output buffer size before decoding
gsmdec: log error message when output buffer is too small.
...
Conflicts:
Changelog
ffplay.c
libavcodec/indeo3.c
libavcodec/mjpeg_parser.c
libavcodec/vp3.c
libavformat/cutils.c
libavformat/id3v2.c
libavutil/parseutils.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
vp6: partially propagate huffman tree building errors during coeff model parsing and fix misspelling
mpeg12: propagate chunk decode errors and fix conditional indentation
vc1: fix VC-1 Pulldown handling.
VC1: Fix first/last row checks with slices
mp4: Handle non-trivial ES Descriptors.
vc1: properly zero coded_block[] edges on new slice entry.
Conflicts:
libavcodec/vc1dec.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Pulldown flags are being set incorrectly and AVFrame->repeat_pict is not
being set. Also, skipped frames exit header parsing too early and do not
set pulldown flags appropriately. Ticks_per_frame needs to be set and
time_base adjusted so player can extend frame duration by a field time.
This fixes problems encountered when attempting to transcode HD-DVD EVOB
files with HandBrake. Also makes these files play smoothly in avplay.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
None of these symbols should be accessed directly, so declare them as
hidden.
Signed-off-by: Mans Rullgard <mans@mansr.com>
(cherry picked from commit d36beb3f69)
Passing an explicit filename to this command is only necessary if the
documentation in the @file block refers to a file different from the
one the block resides in.
Originally committed as revision 22921 to svn://svn.ffmpeg.org/ffmpeg/trunk
Otherwise doxygen complains about ambiguous filenames when files exist
under the same name in different subdirectories.
Originally committed as revision 16912 to svn://svn.ffmpeg.org/ffmpeg/trunk