The sh4 optimizations are removed, because the code is
100% identical to the C code, so it is unlikely to
provide any real practical benefit.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Now, nellymoserenc and aacenc no longer depends on dsputil. Independent
of this patch, wmaprodec also does not depend on dsputil, so I removed
it from there also.
This is identical to what e.g. vp8 does, and prevents the function call
overhead (plus dependency on dsputil for this particular function).
Arm asm updated by Janne Grunau <janne-libav@jannau.net>.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
Move some functions from dsputil. The idea is that videodsp contains
functions that are useful for a large and varied set of video decoders.
Currently, it contains emulated_edge_mc() and prefetch().
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Use this in VP8/H264-8bit loopfilter functions so they can be used if
there is no aligned stack (e.g. MSVC 32bit or ICC 10.x).
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Start and end index are multiple of 2, therefore guaranteeing aligned access.
Also, this allows to generate 4 floats per loop, keeping the alignment all
along.
Timing:
- 32 bits: 326c -> 172c
- 64 bits: 323c -> 156c
Signed-off-by: Diego Biurrun <diego@biurrun.de>
The qpel functions referenced here are not related to h264 and should
thus never have been under CONFIG_H264QPEL.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Signed-off-by: Diego Biurrun <diego@biurrun.de>
The only CPUs that have 3dnow and don't have mmxext are 12 years old.
Moreover, AMD has dropped 3dnow extensions from newer CPUs.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
Replace INIT_AVX by INIT_XMM avx. Port the whole file to use cpuflag
based function declarations. Remove (now unused) cputype argument in
function declaration macros. Change function prototypes to have mmx2
instead of mmxext as suffix, since that's required by cpuflags.
Instead of using an evil VLA, fall back to C version when edge
emulation is needed. MPEG4 GMC is a rarely used fringe feature
so the speed loss is an acceptable cost for safer code.
Signed-off-by: Mans Rullgard <mans@mansr.com>
For some reason add_hfyu_median_prediction_cmov is only selected
on 3Dnow-capable CPUs, even though it uses no 3Dnow instructions.
This patch allows it to be selected on any cpu with cmov with the
possibility of being overridden by the mmxext version.
Signed-off-by: Mans Rullgard <mans@mansr.com>
The optimized dct_quantize template functions reference optimized
fdct symbols, so these functions must only be enabled if the relevant
optimizations have been enabled by configure.
This fixes two issues preventing suncc from building this code.
The undocumented 'a' operand modifier, causing gcc to omit a $ in
front of immediate operands (as required in addresses), is not
supported by suncc. Luckily, the also undocumented 'c' modifer
has the same effect and is supported.
On some asm statements with a large number of operands, suncc for no
obvious reason fails to correctly substitute some of the operands.
Fortunately, some of the operands in these statements are plain
numbers which can be inserted directly into the code block instead
of passed as operands.
With these changes, the code builds correctly with both gcc and
suncc.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This code contains a C array of addresses of labels defined in
inline asm. To do this, the names must be declared as external
in C. The declared type does not matter since only the address is
used, and for some reason, the author of the code used the 'void'
type despite taking the address of a void expression being invalid.
Changing the type to char, a reasonable choice since the alignment
of the code labels cannot be known or guaranteed, eliminates gcc
warnings and allows building with suncc.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This macro is only used in two places, both in libavcodec, so this
is a more sensible place for it.
Two small tweaks to the macro are made:
- removing the trailing semicolon
- dropping unnecessary 'volatile' from the x86 asm
Signed-off-by: Mans Rullgard <mans@mansr.com>
yasm tolerates mismatch between movd/movq and source register size,
adjusting the instruction according to the register. nasm is more
strict.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Refactoring mmx2/mmxext YASM code with cpuflags will force renames.
So switching to a consistent naming scheme beforehand is sensible.
The name "mmxext" is more official and widespread and also the name
of the CPU flag, as reported e.g. by the Linux kernel.
This makes add_hfyu_left_prediction_sse4() handle sources that are not
16-byte aligned in its own function rather than by proxying the call to
add_hfyu_left_prediction_ssse3(). This fixes a crash on Win64, since the
sse4 version clobberes xmm6, but the ssse3 version (which uses MMX regs)
does not restore it, thus leading to XMM clobbering and RSP being off.
Fixes bug 342.
Currently there is a wild mix of 3dn2/3dnow2/3dnowext. Switching to
"3dnowext", which is a more common name of the CPU flag, as reported
e.g. by the Linux kernel, unifies this.
Some calculations were changed in b6a3849 to use mmsize, which was not correct
for the AVX version, which uses INIT_YMM and therefore has mmsize == 32.
Fixes Bug 341.
Signed-off-by: Justin Ruggles <justin.ruggles@gmail.com>
These functions are not faster than other mmx implementations on
any hardware I have been able to test on, and they are horribly
inaccurate. There is thus no reason to ever use them.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This completes the conversion of h264dsp to yasm; note that h264 also
uses some dsputil functions, most notably qpel. Performance-wise, the
yasm-version is ~10 cycles faster (182->172) on x86-64, and ~8 cycles
faster (201->193) on x86-32.
Without this, cglobal will expand "z" to "zh" to access the high byte
in a register's word, which causes a name collision with the ZH(x) macro
further up in this file.
Mixing yasm and inline asm is a bad idea, since if either yasm or inline
asm is not supported by your toolchain, all of the asm stops working.
Thus, better to use either one or the other alone.
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
In ff_put_pixels_clamped_mmx(), there are two assembly code blocks.
In the first block (in the unrolled loop), the instructions
"movq 8%3, %%mm1 \n\t", and so forth, have problems.
From above instruction, it is clear what the programmer wants: a load from
p + 8. But this assembly code doesn’t guarantee that. It only works if the
compiler puts p in a register to produce an instruction like this:
"movq 8(%edi), %mm1". During compiler optimization, it is possible that the
compiler will be able to constant propagate into p. Suppose p = &x[10000].
Then operand 3 can become 10000(%edi), where %edi holds &x. And the instruction
becomes "movq 810000(%edx)". That is, it will stride by 810000 instead of 8.
This will cause a segmentation fault.
This error was fixed in the second block of the assembly code, but not in
the unrolled loop.
How to reproduce:
This error is exposed when we build using Intel C++ Compiler, with
IPO+PGO optimization enabled. Crashed when decoding an MJPEG video.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
This moves all VP3-specific function pointers from dsputil to a
new vp3dsp context. There is no reason to ever use the VP3 IDCT
where an MPEG2 IDCT is expected or vice versa.
Signed-off-by: Mans Rullgard <mans@mansr.com>