1
0
mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-11-26 19:01:44 +02:00
Commit Graph

517 Commits

Author SHA1 Message Date
Ronald S. Bultje
bec207f9f9 snowdsp: explicitily state instruction size.
Fixes a compile error with clang at -O0.
2012-05-02 09:57:12 -07:00
Christophe GISQUET
e75d1d4f73 dsputil x86: revert a test back to its previous value
Commit 356ee8d caused the initial inversion.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 11:00:51 -07:00
Christophe Gisquet
fe5ed69dc7 rv34dsp x86: implement MMX2 inverse transform
141 cycles down to 51.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 10:58:47 -07:00
Roland Scheidegger
9b9df1cdff h264: new assembly version of get_cabac for x86_64 with PIC
This adds a hand-optimized assembly version for get_cabac much like the
existing one, but it works if the table offsets are RIP-relative.
Compared to the non-RIP-relative version this adds 2 lea instructions
and it needs one extra register. get_cabac() gets about 40% faster, for
an overall speedup of about 5%.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 09:43:25 -07:00
Roland Scheidegger
14e9ffc1e4 h264: use one table instead of several for cabac functions
The reason is this is easier for PIC code (in particular on darwin...).
Keep the old names as pointers (static in cabac_functions.h so gcc
knows these are just immediate offsets) so the c code can nicely stay the same
(alternatively could use offsets directly in the functions needing the
tables). This should produce the same code as before with non-pic and better
code (confirmed) with pic.

The assembly uses the new table but still won't work for PIC case.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 08:26:12 -07:00
Roland Scheidegger
444f47b55c h264: (trivial) remove unneeded macro argument in x86/cabac.h
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 08:24:56 -07:00
Mans Rullgard
2bcbd98459 Remove lowres video decoding
This feature is complex, of questionable utility, and slows down
normal decoding.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-04-21 18:56:19 +01:00
Mans Rullgard
95510be8c3 avcodec: remove AVCodecContext.dsp_mask
This removes all references to AVCodecContext.dsp_mask and marks
it for eviction at the next version bump.  It has been superseded
by av_set_cpu_flag_mask() which, unlike this field, works everywhere.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-04-21 18:30:01 +01:00
Ronald S. Bultje
87a246341b h264: use proper PROLOGUE statement for a function using 8 registers.
Fixes crashes when using biweight on win64.
2012-04-16 08:07:21 -07:00
Ronald S. Bultje
b089ca871a dsputil: fix optimized emu_edge function on Win64.
Recent register allocation changes (x86inc.asm update) changed the
register order and thus opcodes for the inner loops. One of them became
>128bytes, which confuses other parts of this function where it jumps
to fixed-offset positions to extend the edge by fixed amounts. A simple
register change fixes this.
2012-04-13 11:28:30 -07:00
Justin Ruggles
de7f22ab0c ac3dsp: call femms/emms at the end of float_to_fixed24() for 3DNow and SSE
Fixes ac3-encode and eac3-encode FATE test failures with SSE2 disabled.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-12 21:33:04 -07:00
Ronald S. Bultje
76538d7a78 h264: fix 10bit biweight functions after recent x86inc.asm fixes.
This should have been updated in the x86inc.asm update, but was
accidently forgotten.
2012-04-12 21:13:57 -07:00
Diego Biurrun
7bb3a302fe build: Consistently handle conditional compilation for all optimization OBJS. 2012-04-12 09:00:49 +02:00
Henrik Gramner
729f90e268 x86inc improvements for 64-bit
Add support for all x86-64 registers
Prefer caller-saved register over callee-saved on WIN64
Support up to 15 function arguments

Also (by Ronald S. Bultje)
Fix up our asm to work with new x86inc.asm.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Justin Ruggles <justin.ruggles@gmail.com>
2012-04-11 15:47:00 -04:00
Christophe GISQUET
2130bd8f5b rv40dsp x86: use only one register, for both increment and loop counter
Around 10 cycles faster for luma.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-10 10:07:09 -07:00
Christophe GISQUET
272b252c01 rv40dsp: implement prescaled versions for biweight.
Quite often, the original weights are multiple of 512. By prescaling them
by 1/512 when they are computed (once per frame), no intermediate shifting
is needed, and no prescaling on each call either.

The x86 code already used that trick.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-10 10:06:48 -07:00
Christophe GISQUET
6b81da2fd0 dsputil x86: use SSE float instruction instead of SSE2 integer equivalent
All the more required since the users are pure SSE functions.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-04 11:24:27 -07:00
Christophe GISQUET
cd88105f6f dsputil x86: remove deprecated parameter from scalarproduct_int16 prototype
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-04 11:24:08 -07:00
Christophe GISQUET
f9888520cc vp8dsp x86: perform rounding shift with a single instruction
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-04 11:23:36 -07:00
Ronald S. Bultje
a940198130 cabac: add overread protection to BRANCHLESS_GET_CABAC().
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
2012-03-28 08:01:29 -07:00
Ronald S. Bultje
448dc42571 cabac: increment jump locations by one in callers of BRANCHLESS_GET_CABAC(). 2012-03-28 08:01:29 -07:00
Ronald S. Bultje
16f6e83f74 cabac: remove unused argument from BRANCHLESS_GET_CABAC_UPDATE(). 2012-03-28 08:01:29 -07:00
Ronald S. Bultje
951014e5bb cabac: use struct+offset instead of memory operand in BRANCHLESS_GET_CABAC(). 2012-03-28 08:01:29 -07:00
Ronald S. Bultje
a0bdcb019e h264: add overread protection to get_cabac_bypass_sign_x86(). 2012-03-28 08:01:29 -07:00
Ronald S. Bultje
95bfa4ead7 h264: reindent get_cabac_bypass_sign_x86(). 2012-03-28 08:01:29 -07:00
Ronald S. Bultje
db025929f2 h264: use struct offsets in get_cabac_bypass_sign_x86(). 2012-03-28 08:01:29 -07:00
Diego Biurrun
ad0e31f134 build: prettyprinting cosmetics 2012-03-26 13:00:10 +02:00
Diego Biurrun
62ce9defb8 x86: dsputil: prettyprint gcc inline asm 2012-03-25 11:50:48 +02:00
Diego Biurrun
3b54912113 x86: K&R prettyprinting cosmetics for dsputil_mmx.c 2012-03-25 11:50:48 +02:00
Diego Biurrun
915a2a0a65 x86: conditionally compile H.264 QPEL optimizations 2012-03-25 11:50:45 +02:00
Diego Biurrun
3816642eab dsputil_mmx: Surround QPEL macros by "do { } while (0);" blocks.
This makes them safe to use in non-fully braced if-blocks and similar.
2012-03-25 11:48:37 +02:00
Ronald S. Bultje
71ea26811c aacsbr: handle m_max values smaller than 4.
Prevents a signflip in the counter, and a subsequent crash because of
overreads/overwrites.

Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
CC: libav-stable@libav.org
2012-03-23 12:56:08 -07:00
Ronald S. Bultje
a928ed3751 vp8: convert mbedge loopfilter x86 assembly to use named arguments. 2012-03-10 11:36:33 -08:00
Ronald S. Bultje
bee330e300 vp8: convert inner loopfilter x86 assembly to use named arguments. 2012-03-10 11:36:33 -08:00
Reimar Döffinger
6eda85e15b sbrdsp.asm: convert all instructions to float/SSE ones.
Since the values are floats, using the float operations
makes sense, improves performance on some CPUs and
makes the code SSE compatible instead of needing SSE2.

Based on suggestion by Jason.

Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-03-07 13:50:13 -08:00
Christophe GISQUET
7e1ce6a6ac dsputil: remove shift parameter from scalarproduct_int16
There is only one caller, which does not need the shifting. Other use cases
are situations where different roundings would be needed.

The x86 and neon versions are modified accordingly.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-03-07 10:29:52 -08:00
Diego Biurrun
1e9d55e45e x86: Remove duplicated AVG_3DNOW_OP / AVG_MMX2_OP macros from h264_qpel_mmx.c. 2012-03-07 09:36:04 +01:00
Reimar Döffinger
b5161908e0 SBR DSP: fix SSE code to not use SSE2 instructions.
movq from SSE register _to_ memory is an SSE2 instruction.
Use the SSE movlps function instead that does the same thing.

Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-03-06 13:40:35 -08:00
Mans Rullgard
356ee8d7de x86: clean up ff_dsputil_init_mmx()
This splits ff_dsputil_init_mmx() into multiple functions, one for
each MMX/SSE level, somewhat simplifying the nested conditions.

Signed-off-by: Mans Rullgard <mans@mansr.com>
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-03-05 14:40:03 +01:00
Ronald S. Bultje
b4188f0d46 vp8: convert simple loopfilter x86 assembly to use named arguments. 2012-03-03 20:40:00 -08:00
Ronald S. Bultje
8476ca3b4e vp8: convert idct x86 assembly to use named arguments. 2012-03-03 20:40:00 -08:00
Ronald S. Bultje
21ffc78fd7 vp8: convert mc x86 assembly to use named arguments. 2012-03-03 20:40:00 -08:00
Ronald S. Bultje
28170f1a39 vp8: convert loopfilter x86 assembly to use cpuflags(). 2012-03-03 20:40:00 -08:00
Ronald S. Bultje
e25be47154 vp8: convert idct/mc x86 assembly to use cpuflags(). 2012-03-03 20:39:59 -08:00
Ronald S. Bultje
291c9b6285 h264: change underread for 10bit QPEL to overread.
This prevents us from reading before the start of the buffer, and thus
prevents crashes resulting from this behaviour. Fixes bug 237.
2012-03-02 10:33:05 -08:00
Ronald S. Bultje
45549339bc vp8: disable mmx functions with sse/sse2 counterparts on x86-64.
x86-64 is guaranteed to have at least SSE2, therefore the MMX/MMX2
functions will never be used in practice.
2012-03-02 10:32:05 -08:00
Ronald S. Bultje
bd66f073fe vp8: change int stride to ptrdiff_t stride.
On 64bit platforms with 32bit int, this means we won't have to sign-
extend the integer anymore.
2012-03-02 10:31:50 -08:00
Ronald S. Bultje
b0c4f04338 h264: fix mmxext chroma deblock to use correct TC values. 2012-02-27 09:38:44 -08:00
Christophe GISQUET
2784d18791 SBR DSP x86: implement SSE sbr_hf_g_filt
Unrolling the main loop to process, instead of 4 elements:
- 8: minor gain of 2 cycles (not worth the extra object size)
- 2: loss of 8 cycles.

Assigning STEP to a register is a loss. Output address (Y) is almost always
unaligned.

Timings:
- C (32/64 bits): 117/109 cycles
- SSE: 57 cycles

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-02-23 15:50:09 -08:00
Christophe GISQUET
34454c761f SBR DSP x86: implement SSE sbr_sum_square_sse
The 32bits targets have been compiled with -mfpmath=sse for proper reference.
sbr_sum_square C  /32bits: 82c (unrolled)/102c
               C  /64bits: 69c (unrolled)/82c
               SSE/32bits: 42c
               SSE/64bits: 31c

Use of SSE4.1 dpps to perform the final sum is slower.
Not unrolling to perform 8 operations in a loop yields 10 more cycles.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-02-23 15:50:06 -08:00