Change the size specifiers to match the actual element sizes
of the data. This makes no practical difference with strict
alignment checking disabled (the default) other than somewhat
documenting the code. With strict alignment checking on, it
avoids trapping the unaligned loads.
Signed-off-by: Mans Rullgard <mans@mansr.com>
The vertically interpolating variants of these functions read
ahead one line to optimise the loop. On the last line processed,
this might be outside the buffer. Fix these invalid reads by
processing the last line outside the loop.
Signed-off-by: Mans Rullgard <mans@mansr.com>
The assembler may fail to place literal pools close enough to
instructions referencing them. An explicit .ltorg directive
fixes this.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This allows masking CPU features with the -cpuflags avconv option
which is useful for testing different optimisations without rebuilding.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Quite often, the original weights are multiple of 512. By prescaling them
by 1/512 when they are computed (once per frame), no intermediate shifting
is needed, and no prescaling on each call either.
The x86 code already used that trick.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
The were broken since August of 2010 without anyone noticing until
three weeks ago. Nobody cares about it anymore and hopefully Marvell
will support NEON like in the PXA978 from now on.
There is only one caller, which does not need the shifting. Other use cases
are situations where different roundings would be needed.
The x86 and neon versions are modified accordingly.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
This prevents having to sign-extend on 64-bit systems with 32-bit ints,
such as x86-64. Also fixes crashes on systems where we don't do it and
arguments are not in registers, such as Win64 for all weight functions.
Overall almost 4% faster, idct_add down from 350 to 85 cycles, idct_dc_add
down from 83 to 30 cycles.
squash: rv34 idct rearrange partial register loads
The alignment directive must obviously precede the label.
This was never noticed in ARM mode since the location is
already aligned there.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Due to apprent bugs in the GNU assembler and/or linker, relocations
can be incorrectly processed if the alignment of a Thumb instruction
is changed in the output file compared to the input object.
This fixes crashes in h264 decoding with Thumb enabled. No effect in
ARM mode since everything is 4-byte aligned there.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Perform dequantization while decoding coefficients instead of performing it
on the entire coefficients buffer.
Since quantized coefficients are very sparse, this usually causes a small
speedup. Speedup of around 1% on Panda board compared to the removed here
neon code. Global speedup is probably around 3%.
Signed-off-by: Kostya Shishkov <kostya.shishkov@gmail.com>
This is a hand-tuned version of the code with impossible parts of
the FASTDIV function ommitted.
2-5% faster overall on Cortex-A8.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This prevents build errors when compiler and assembler default
targets differ. Ideally each file would declare the highest
level it requires. This is however not easily possible as it
complicates assembling pre-armv6t2 code in Thumb-2 mode.
HAVE_NEON is used as indicator for ARMv7-A since no other
symbol exists for this and NEON is only available in this
variant.
Signed-off-by: Mans Rullgard <mans@mansr.com>
The inline asm added in bf5d46d uses the 'y' modifier which
is only supported from gcc 4.5. This check allows building
with older compilers.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Some versions of the GNU assembler do not handle 64-bit
immediate operands containing arithmetic. Writing the
value out in full works correctly.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This function is called with only 8-byte alignment from
imdct for size 16. The fft4 function is not called for
the larger FFT or MDCT sizes, so this has no impact on
typical uses.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Current GCC versions know how to generate these instructions
properly and avoiding inline asm gives better code. The MULH
function for ARMv5 uses the same instruction and is also not
needed any more.
The MLS64 macro remains since negating an input would normally
not be allowed as it would fail for INT_MIN. In our uses, the
inputs never have this value and thus negating is safe.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Prior to ARMv6, the destination registers of the SMULL instruction
must be distinct from the first source register. Marking the
output early-clobber ensures it is allocated unique registers.
This restriction is dropped in ARMv6 and later, so allowing overlap
between input and output registers there might give better code.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This enables UAL syntax for all asm files instead of only those
which happen to be incompatible with the old, deprecated syntax.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This function uses old-style vector operations deprecated in VFPv3.
Some implementations, e.g. Cortex-A9, support them only through
slow software emulation. Cortex-A8 does have this functionality
in hardware, but as it also has NEON, this function is not used
there regardless.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Surround memset and ff_vp8_dct_cat_prob by X() in order to fix iOS build
Includes patch by Luca Barbato <lu_zero@gentoo.org>.
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
This patch lets e.g. dsputil_init chose dsp functions with respect to
the bit depth to decode. The naming scheme of bit depth dependent
functions is <base name>_<bit depth>[_<prefix>] (i.e. the old
clear_blocks_c is now named clear_blocks_8_c).
Note: Some of the functions for high bit depth is not dependent on the
bit depth, but only on the pixel size. This leaves some room for
optimizing binary size.
Preparatory patch for high bit depth h264 decoding support.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>