Based on the aarch64 asm. CPU cycle counts on Cortex-A9, compared to
code generated by gcc 4.8.2:
before: 475 decicycles in get_cabac_noinline, 67106035 runs, 2829 skips
after: 393 decicycles in get_cabac_noinline, 67106474 runs, 2390 skips
Overall speedup is above 2%. Code generated by clang 3.4 is slower on
the same hardware and the relative change is a little larger.
The overread avoidance fix in cbddee1cca broke the computation for the
last row, since it prevented the safe read of the (height+1)-th row.
CC: libav-stable@libav.org
The vector dequantization has a test inside a loop, which prevents an
effective SIMD implementation. Moving the test out of the loop allows
the loop to be DSPized. Modify the current DSP implementation
accordingly; in particular, it no longer has to handle null loop sizes.
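As a rough illustration of the change described above (not the actual decoder code), the sketch below hoists a loop-invariant test out of a dequantization loop. All names (`dequant_test_inside`, `dequant_kernel`, `dequant_test_hoisted`, the `enabled` flag) are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before: the loop-invariant test sits inside the loop body,
 * which makes the body awkward to replace with a SIMD kernel. */
static void dequant_test_inside(float *dst, const int8_t *vq, float scale,
                                size_t n, int enabled)
{
    for (size_t i = 0; i < n; i++) {
        if (enabled)
            dst[i] = vq[i] * scale;
    }
}

/* Branch-free kernel: this is the part a DSP/SIMD routine could replace.
 * Assumption for this sketch: the caller guarantees n > 0. */
static void dequant_kernel(float *dst, const int8_t *vq, float scale, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        dst[i] = vq[i] * scale;
}

/* After: the test is evaluated once, outside the loop, and empty
 * loops are filtered out before the kernel is called. */
static void dequant_test_hoisted(float *dst, const int8_t *vq, float scale,
                                 size_t n, int enabled)
{
    if (!enabled || n == 0)
        return;
    dequant_kernel(dst, vq, scale, n);
}
```

Because the caller rejects `n == 0`, the kernel can assume at least one iteration, which is the property the message refers to as no longer handling null loop sizes.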
The decode_hf implementations have the following timings:
For x86 Arrandale:
         C     SSE   SSE2  SSE4
win32:   260   162   119   104
win64:   242   N/A   89    72
The arm NEON optimizations follow in a later patch as external asm. The
now unused check for the y modifier in arm inline asm is removed from
configure.
The scaling factor is constant so it is faster to scale the
FIR coefficients in the tables during compilation.
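A minimal sketch of the idea, assuming a hypothetical scale factor and made-up coefficients (neither is taken from the real tables): the scaling is folded into the table initializer, so it is evaluated by the compiler and the filter loop does no extra multiply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant scale factor; the real value is not shown here. */
#define FIR_SCALE (1.0f / 32768.0f)
/* Applied to every table entry at compile time. */
#define SCALED(x) ((x) * FIR_SCALE)

/* Made-up coefficients, pre-scaled in the table itself. */
static const float fir_scaled[4] = {
    SCALED(16384.0f), SCALED(-8192.0f), SCALED(4096.0f), SCALED(32768.0f),
};

/* The filter loop uses the table directly, with no per-sample scaling. */
static float fir_apply(const float *in, size_t n)
{
    float acc = 0.0f;

    assert(n <= 4);
    for (size_t i = 0; i < n; i++)
        acc += in[i] * fir_scaled[i];
    return acc;
}
```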
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
* commit '5c1c6e82261b856214499b9fef3a08bf3ff6e0ae':
dca: include dcadsp.h in {arm,x86}/dca.h for checkheaders
Merged-by: Michael Niedermayer <michaelni@gmx.at>
The x86 runs short on registers because numerous elements are not static.
In addition, splitting them allows more optimized code, at least for x86.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
It is currently declared as a macro that is set to one of several
inlinable functions, among them a NEON and a default C implementation.
Add a DSP parameter to each inline function, unused except by the
default C implementation which calls a function from the DSP context.
On an Arrandale CPU, gain for an inlined SSE2 function vs. a call:
- Win32: 29 to 26 cycles
- Win64: 25 to 23 cycles
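The pattern described above might look roughly like the following sketch. The context type, the `dot4` operation, and the `EXAMPLE_HAVE_INLINE_SIMD` switch are all hypothetical stand-ins, not names from the actual code.

```c
#include <assert.h>

/* Hypothetical DSP context holding a function pointer. */
typedef struct ExampleDSPContext {
    int (*dot4)(const int *a, const int *b);
} ExampleDSPContext;

/* Plain C implementation, reachable through the context. */
static int dot4_c(const int *a, const int *b)
{
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += a[i] * b[i];
    return sum;
}

#ifdef EXAMPLE_HAVE_INLINE_SIMD
/* Inlinable arch-specific variant: the dsp parameter is accepted so that
 * all variants share one signature, but it is unused here. The unrolled
 * body is only a stand-in for real SIMD code. */
static inline int example_dot4(const ExampleDSPContext *dsp,
                               const int *a, const int *b)
{
    (void)dsp;
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
#else
/* Default variant: forwards to the function stored in the DSP context. */
static inline int example_dot4(const ExampleDSPContext *dsp,
                               const int *a, const int *b)
{
    assert(dsp && dsp->dot4);
    return dsp->dot4(a, b);
}
#endif
```

Callers always pass the context, so the same call site compiles to either an inlined body or an indirect call depending on the build.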
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '5bcbb516f2ff45290ef7995b081762e668693672':
arm: Add X() around all references to extern symbols
Merged-by: Michael Niedermayer <michaelni@gmx.at>
The x86 runs short on registers because numerous elements are not static.
In addition, splitting them allows more optimized code, at least for x86.
Arm asm changes by Janne Grunau.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
It is currently declared as a macro that is set to one of several
inlinable functions, among them a NEON and a default C implementation.
Add a DSP parameter to each inline function, unused except by the
default C implementation which calls a function from the DSP context.
On an Arrandale CPU, gain for an inlined SSE2 function vs. a call:
- Win32: 29 to 26 cycles
- Win64: 25 to 23 cycles
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
* qatar/master:
vp8: Use 2 registers for dst_stride and src_stride in neon bilin filter
Conflicts:
libavcodec/arm/vp8dsp_neon.S
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '7151c5d04aed3b496c21f713dcb603e2cbdb9c49':
arm: Use full filenames as multiple inclusion guards
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
arm: Add an option for making sure NEON registers aren't clobbered
Conflicts:
configure
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '5dae4872357613a0b51120b54a4c5221e0ec3f69':
arm: Allow overriding the alignment set in the function macro
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'b7b932f5e3602bd34c3cc634b71c8bbbc0fb8dc0':
arm: Remove a leftover define for the pld instruction
Merged-by: Michael Niedermayer <michaelni@gmx.at>
The function macro always sets .align 2 before declaring the
function label (since 5c5e1ea3) and always sets the section to
.text (since 278caa6a).
The .align 5 directives before certain functions, added in fc252eba,
predate the .text and .align in the function macro and thus became
useless once the function macro gained them.
This restores the original intention, to align the loop entry
points.
Signed-off-by: Martin Storsjö <martin@martin.st>
This file no longer uses the pld instruction at all; all such uses
have been split out into hpeldsp_arm.S.
Signed-off-by: Martin Storsjö <martin@martin.st>
* commit 'a03a642d5ceb5f2f7c6ebbf56ff365dfbcdb65eb':
h264: do not use 422 functions for monochrome
See: 07abf13da4
Merged-by: Michael Niedermayer <michaelni@gmx.at>
For:
ff_vc1_inv_trans_{8,4}x{8,4}_{dc_,}neon
ff_put_pixels8x8_neon
ff_put_vc1_mspel_mc{0,1,2,3}{0,1,2,3}_neon (except for 00)
Based on ARM assembly code in libavcodec/arm by Rob Clark and Mans
Rullgard.
Signed-off-by: Martin Storsjö <martin@martin.st>
                 Before            After
                 Mean     StdDev   Mean     StdDev   Change
This function    508.8    23.4     185.4    9.0      +174.4%
Overall          3068.5   31.7     2752.1   29.4     +11.5%
In combination with the preceding patch:
                 Before            After
                 Mean     StdDev   Mean     StdDev   Change
Overall          2925.6   26.2     2752.1   29.4     +6.3%
Signed-off-by: Martin Storsjö <martin@martin.st>
When building for iOS in thumb mode, gas-preprocessor.pl doesn't
mark unused labels as thumb functions (as it does for other
local labels, where it can figure out that they are functions
due to being referenced in branch instructions). This leads to
linker warnings for some of those local labels, such as:
ld: warning: ARM function not 4-byte aligned: __a_evaluation from
libavcodec/libavcodec.a(simple_idct_arm.o)
Therefore, comment them out, since they serve no functional purpose.
They still have value as documentation of key points in the
assembly source, though.
Signed-off-by: Martin Storsjö <martin@martin.st>
* commit 'd6e4f5fef0d811e180fd7541941e07dca9e11dc0':
arm: Add VFP-accelerated version of int32_to_float_fmul_array8
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'ce9ed10ac27b9cf32a6257e083ea2f052692d971':
arm: Add VFP-accelerated version of int32_to_float_fmul_scalar
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '41ef1d360bac65032aa32f6b43ae137666507ae5':
arm: Add VFP-accelerated version of synth_filter_float
Merged-by: Michael Niedermayer <michaelni@gmx.at>
                 Before            After
                 Mean     StdDev   Mean     StdDev   Change
This function    1323.0   98.0     746.2    60.6     +77.3%
Overall          15400.0  336.4    14147.5  288.4    +8.9%
Signed-off-by: Martin Storsjö <martin@martin.st>
                 Before            After
                 Mean     StdDev   Mean     StdDev   Change
This function    1389.3   4.2      967.8    35.1     +43.6%
Overall          15577.5  83.2     15400.0  336.4    +1.2%
Signed-off-by: Martin Storsjö <martin@martin.st>
                 Before            After
                 Mean     StdDev   Mean     StdDev   Change
This function    868.2    33.5     436.0    27.0     +99.1%
Overall          15973.0  223.2    15577.5  83.2     +2.5%
Signed-off-by: Martin Storsjö <martin@martin.st>
                 Before            After
                 Mean     StdDev   Mean     StdDev   Change
This function    2653.0   28.5     1108.8   51.4     +139.3%
Overall          17049.5  408.2    15973.0  223.2    +6.7%
Signed-off-by: Martin Storsjö <martin@martin.st>
                 Before            After
                 Mean     StdDev   Mean     StdDev   Change
This function    366.2    18.3     277.8    13.7     +31.9%
Overall          18420.5  489.1    17049.5  408.2    +8.0%
Signed-off-by: Martin Storsjö <martin@martin.st>
                 Before            After
                 Mean     StdDev   Mean     StdDev   Change
This function    1175.0   4.4      366.2    18.3     +220.8%
Overall          19285.5  292.0    18420.5  489.1    +4.7%
Signed-off-by: Martin Storsjö <martin@martin.st>
                 Before            After
                 Mean     StdDev   Mean     StdDev   Change
This function    9295.0   114.9    4853.2   83.5     +91.5%
Overall          23699.8  397.6    19285.5  292.0    +22.9%
Signed-off-by: Martin Storsjö <martin@martin.st>
* commit '36a7df8cf1115aa37a1b0d42324ecde5ab6c2304':
arm: Only build the FFT init files if FFT is enabled
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '9b9b2e9f3036abfd42916bcf734af14b4cb686aa':
build: arm: cosmetics: Place all OBJS declarations in alphabetical order
Merged-by: Michael Niedermayer <michaelni@gmx.at>
A few of the h264qpel neon functions are shared with other
hpeldsp functions in this file.
This fixes standalone compilation of the h264 decoder on arm.
Signed-off-by: Martin Storsjö <martin@martin.st>
It was previously declared as int.
Does not change fate results for x86.
Conflicts:
libavcodec/ppc/fmtconvert_altivec.c
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '7f75f2f2bd692857c1c1ca7f414eb30ece3de93d':
ppc: Drop unnecessary ff_ name prefixes from static functions
x86: Drop unnecessary ff_ name prefixes from static functions
arm: Drop unnecessary ff_ name prefixes from static functions
Merged-by: Michael Niedermayer <michaelni@gmx.at>
This way, the special IDCT permutations are no longer needed. This
is similar to how H264 does it, and removes the dsputil dependency
imposed by the scantable code.
Also remove the unused type == 0 cases from the plain C version
of the idct.
Signed-off-by: Martin Storsjö <martin@martin.st>
With the current code, it fails because it runs out of registers.
So encode the store offsets manually in the assembly instead.
Passes "make fate-dts".
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
The non-intra-pcm branch in hl_decode_mb (simple, 8bpp) goes from 700
to 672 cycles, and the complete loop of decode_mb_cabac and hl_decode_mb
(in the decode_slice loop) goes from 1759 to 1733 cycles on the clip
tested (cathedral), i.e. almost 30 cycles per mb faster.
Signed-off-by: Martin Storsjö <martin@martin.st>
This way, the special IDCT permutations are no longer needed. Bfin code
is disabled until someone updates it. This is similar to how H264 does
it, and removes the dsputil dependency imposed by the scantable code.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '76b19a3984359b3be44d4f7e4e69b7b86729a622':
Fix a number of incorrect intmath.h #includes.
avconv: remove an unused variable
Merged-by: Michael Niedermayer <michaelni@gmx.at>
The non-intra-pcm branch in hl_decode_mb (simple, 8bpp) goes from 700
to 672 cycles, and the complete loop of decode_mb_cabac and hl_decode_mb
(in the decode_slice loop) goes from 1759 to 1733 cycles on the clip
tested (cathedral), i.e. almost 30 cycles per mb faster.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'a846dccb29d2bb0798af1d47d06100eda9ca87cc':
h264chroma: x86: Fix building with yasm disabled
rv34: Drop now unnecessary dsputil dependencies
Conflicts:
libavcodec/x86/Makefile
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '620289a20e022b9c16c10d546ef86cc0bb77cc84':
sh4: Fix silly type vs. variable name search and replace typo
configure: Group all hwaccels together in a separate variable
Add av_cold attributes to arch-specific init functions
Conflicts:
configure
libavcodec/arm/mpegvideo_armv5te.c
libavcodec/x86/mlpdsp.c
libavcodec/x86/motion_est.c
libavcodec/x86/mpegvideoenc.c
libavcodec/x86/videodsp_init.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '25841dfe806a13de526ae09c11149ab1f83555a8':
Use ptrdiff_t instead of int for {avg, put}_pixels line_size parameter.
Conflicts:
libavcodec/alpha/dsputil_alpha.c
libavcodec/dsputil_template.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
Use proper "" quotes for local header #includes
ppc: fmtconvert: Drop two unused variables.
bink demuxer: set framerate.
Conflicts:
libavcodec/kbdwin.c
libavcodec/ppc/fmtconvert_altivec.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
This makes the plain-armv6 version use the same registers as the
armv6t2 version above.
This fixes fate-vp8 on plain-armv6 devices.
Signed-off-by: Martin Storsjö <martin@martin.st>
* commit '6bdb841b46d170d58488deaed720729b79223b1d':
arm: h264qpel: use neon h264 qpel functions only if supported
* bug was fixed previously (in merge of buggy code):
h264: copy h264qpel dsp context to slice thread copies
Merged-by: Michael Niedermayer <michaelni@gmx.at>
The sh4 optimizations are removed, because the code is
100% identical to the C code, so it is unlikely to
provide any real practical benefit.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Now, nellymoserenc and aacenc no longer depend on dsputil. Independent
of this patch, wmaprodec also does not depend on dsputil, so I removed
the dependency from there as well.
* commit 'ce378f0dd0c4e5350b3280e6b3e8d6b46fe4b0a3':
fate: Use wmv2 IDCT for wmv2 tests
vorbisdsp: change block_size type from int to intptr_t.
Conflicts:
tests/fate-run.sh
tests/fate/vcodec.mak
Merged-by: Michael Niedermayer <michaelni@gmx.at>
libavutil/arm/asm.S sets '.arch' depending on HAVE_ARMV5TE so that
assembling armv5te code will always succeed even if the default -march
flag does not support it. HAVE_ARMV5TE_EXTERNAL tests assembling code
with the default arch.
Fixes the missing symbol ff_prefetch_arm with --cpu= not including
armv5te.
CC: libav-stable@libav.org
* commit 'aeaf268e52fc11c1f64914a319e0edddf1346d6a':
vp3: integrate clear_blocks with idct of previous block.
mpegvideo: fix loop condition in draw_line()
dvdsubdec: parse the size from the extradata
Conflicts:
libavcodec/dvdsubdec.c
libavcodec/mpegvideo.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
This is identical to what e.g. vp8 does, and prevents the function call
overhead (plus dependency on dsputil for this particular function).
Arm asm updated by Janne Grunau <janne-libav@jannau.net>.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
* qatar/master:
lavc: Move vector_fmul_window to AVFloatDSPContext
rtpdec_mpeg4: Check the remaining amount of data before reading
Conflicts:
libavcodec/dsputil.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Move some functions from dsputil. The idea is that videodsp contains
functions that are useful for a large and varied set of video decoders.
Currently, it contains emulated_edge_mc() and prefetch().
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
* commit '9ebd45c2d58ad9241ad09718679f0cf7fb57da52':
configure: do not bypass cpuflags section if --cpu not given
dct-test: arm: indicate required cpu features for optimised funcs
snow: fix build after 594d4d5df3
arm: fix use of uninitialised value in ff_fft_fixed_init_arm()
avpicture: Don't assume a valid pix fmt in avpicture_get_size
Conflicts:
libavcodec/avpicture.c
libavcodec/snow.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
This is consistent with usual ARM nomenclature as well as with the
VFPV3 and NEON symbols which both lack the ARM prefix.
Signed-off-by: Mans Rullgard <mans@mansr.com>
When initialising an FFTContext for a plain FFT, mdct_bits is not set
and can contain a garbage value. Since nbits is always valid and, for
MDCT operation, equals mdct_bits - 2, checking nbits instead avoids
using an uninitialised value while having the same effect.
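The reasoning can be sketched as follows. The struct, the init helper, and the `EXAMPLE_MAX_NBITS` threshold are hypothetical and only mirror the relation stated above (nbits equals mdct_bits - 2 for MDCT); they are not the real FFTContext or its init code.

```c
#include <assert.h>

/* Hypothetical size limit for the optimised path. */
#define EXAMPLE_MAX_NBITS 16

/* Reduced stand-in for the real context. */
typedef struct ExampleFFTContext {
    int nbits;      /* always set by init */
    int mdct_bits;  /* only set when initialising for MDCT */
} ExampleFFTContext;

static void example_init(ExampleFFTContext *s, int nbits, int is_mdct)
{
    assert(nbits > 0);
    s->nbits = nbits;
    if (is_mdct)
        s->mdct_bits = nbits + 2;
    /* For a plain FFT, mdct_bits is deliberately left untouched,
     * so it must not be read afterwards. */
}

/* Tests nbits only. For MDCT this is equivalent to testing
 * mdct_bits < EXAMPLE_MAX_NBITS + 2, but it is also safe for plain FFT. */
static int example_can_accelerate(const ExampleFFTContext *s)
{
    return s->nbits < EXAMPLE_MAX_NBITS;
}
```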
Signed-off-by: Mans Rullgard <mans@mansr.com>
* commit '284ea790d89441fa1e6b2d72d3c1ed6d61972f0b':
dsputil: move vector_fmul_scalar() to AVFloatDSPContext in libavutil
aacenc: use the correct output buffer
aacdec: fix signed overflows in lcg_random()
base64: fix signed overflow in shift
Conflicts:
libavcodec/dsputil.c
libavutil/base64.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
pixfmt: support more yuva formats
swscale: support gray to 9bit and 10bit formats
configure: rewrite print_config() function using awk
FATE: fix (AD)PCM test dependencies broken in e519990
Use ptrdiff_t instead of int for intra pred "stride" function parameter.
x86: use PRED4x4/8x8/8x8L/16x16 macros to declare intrapred prototypes.
Conflicts:
libavcodec/h264pred.c
libavcodec/h264pred_template.c
libavutil/pixfmt.h
libswscale/swscale_unscaled.c
tests/ref/lavfi/pixdesc
tests/ref/lavfi/pixfmts_copy
tests/ref/lavfi/pixfmts_null
tests/ref/lavfi/pixfmts_scale
tests/ref/lavfi/pixfmts_vflip
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'c9ef43215c7d68c2cdcdbe02287aa114f27a32ed':
fate-vc1: add dependencies
ARM: fix overreads in neon h264 chroma mc
rtsp: Make sure the ret variable is initialized in ff_rtsp_fetch_packet
gitignore: ignore files created by msvc
fate: Add proper dependencies for the tests in video.mak
configure: Disable Snow decoder and encoder by default
lzo: Drop obsolete fast_memcpy reference
build: Drop OBJS declaration for non-existing PCM_DVD encoder
mpeg4videodec: Disable frame multithreading for GMC, its not implemented at all
Conflicts:
libavcodec/mpegvideo.c
libavformat/rtsp.c
tests/fate/microsoft.mak
tests/fate/video.mak
Merged-by: Michael Niedermayer <michaelni@gmx.at>
The loops were reading ahead one line, which could end up outside the
buffer for reference blocks at the edge of the picture. Removing
this readahead has no measurable performance impact.
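To make the access pattern concrete, here is a C sketch (the real change is in NEON assembly) of a vertical two-tap filter that loads each source row only in the iteration that consumes it. The function name, the weights summing to 8, and the returned row index are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Vertical two-tap filter over h output rows:
 *   dst[y][x] = (a * src[y][x] + b * src[y + 1][x] + 4) >> 3
 * Row y + 1 is loaded inside iteration y, so only rows 0..h are read.
 * A read-ahead variant would fetch row y + 2 at the end of iteration y
 * and therefore touch row h + 1, which may lie outside the buffer.
 * Returns the index of the last source row read. */
static int vfilter_no_readahead(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride, int w, int h, int a, int b)
{
    const uint8_t *cur = src;
    int last_row_read = 0;

    assert(a + b == 8);
    for (int y = 0; y < h; y++) {
        const uint8_t *next = cur + stride; /* row y + 1, needed right now */
        for (int x = 0; x < w; x++)
            dst[x] = (uint8_t)((a * cur[x] + b * next[x] + 4) >> 3);
        last_row_read = y + 1;
        dst += stride;
        cur  = next;
    }
    return last_row_read;
}
```

For h output rows the function reads exactly h + 1 source rows, which is the minimum a two-tap vertical filter needs.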
Signed-off-by: Mans Rullgard <mans@mansr.com>
* commit '9734b8ba56d05e970c353dfd5baafa43fdb08024':
Move avutil tables only used in libavcodec to libavcodec.
Conflicts:
libavcodec/mathtables.c
libavutil/intmath.h
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'b522000e9b2ca36fe5b2751096b9a5f5ed8f87e6':
avio: introduce avio_closep
mpegtsenc: set muxing type notification to verbose
vc1dec: Use correct spelling of "opposite"
a64multienc: change mc_frame_counter to unsigned
arm: call arm-specific rv34dsp init functions under if (ARCH_ARM)
svq1: Drop a bunch of useless parentheses
parseutils-test: do not print numerical error codes
svq1: K&R formatting cosmetics
Conflicts:
doc/APIchanges
libavcodec/svq1dec.c
libavcodec/svq1enc.c
libavformat/version.h
libavutil/parseutils.c
tests/ref/fate/parseutils
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'cbcd497f384f0f8ef3f76f85b29b644b900d6b9f':
adxdec: use planar sample format
adpcmdec: use planar sample format for adpcm_thp
adpcmdec: use planar sample format for adpcm_ea_xas
adpcmdec: use planar sample format for adpcm_ea_r1/r2/r3
adpcmdec: use planar sample format for adpcm_xa
adpcmdec: use planar sample format for adpcm_ima_ws for vqa version 3
adpcmdec: use planar sample format for adpcm_4xm
adpcmdec: use planar sample format for adpcm_ima_wav
adpcmdec: use planar sample format for adpcm_ima_qt
pcmdec: use planar sample format for pcm_lxf
mace: use planar sample format
atrac1: use planar sample format
build: non-x86: Only compile mpegvideo optimizations when necessary
rtpdec_mpeg4: au_headers is a single array, simple av_free is enough
avcodec: free extended_data instead address of it
fate: Add tests of the ff_make_absolute_url function
url: Handle relative urls starting with two slashes
url: Handle relative urls being just a new query string
url: Don't treat slashes in query parameters as directory separators
Conflicts:
libavcodec/adxdec.c
libavcodec/mips/Makefile
libavcodec/pcm.c
libavcodec/utils.c
libavformat/Makefile
libavformat/utils.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
ARM: use numeric ID for Tag_ABI_align_preserved
segment: Pass the interrupt callback on to the chained AVFormatContext, too
ARM: bswap: drop armcc version of av_bswap16()
ARM: set Tag_ABI_align_preserved in all asm files
Merged-by: Michael Niedermayer <michaelni@gmx.at>
All our ARM asm preserves alignment so setting this attribute
in a common location is simpler. This removes numerous warnings
when linking with armcc.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This reverts commit d25f87f517.
This breaks decoding of some h264 files.
I tested the original patch with fate but by mistake forgot to
specify the fate samples, so testing was limited to the internal
regression tests.
* qatar/master:
libx264: add forgotten ;
matroskadec: fix a sanity check.
matroskadec: only return corrupt packets that actually contain data
lavf: zero data/size of the packet passed to read_packet().
ARM: use 2-operand syntax for ADD Rd, PC in Apple PIC code
ARM: align PIC offset pools to 4 bytes
ARM: swap source operands in some add instructions
configure: update tms470 detection for latest version
lavf probe: prevent codec probe with no data at all seen
motion_est: fix use of inline on extern functions
Conflicts:
libavcodec/motion_est_template.c
libavformat/matroskadec.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
mpegvideo: drop unnecessary arguments to hpel_motion()
mpegvideo: drop 'inline' from some functions
nellymoserdec: drop support for s16 output.
bmpdec: only initialize palette for pal8.
build: Properly remove object files while cleaning
flacdsp: arm optimised lpc filter
compat/vsnprintf: return number of bytes required on truncation.
Merged-by: Michael Niedermayer <michaelni@gmx.at>