These smaller samples do not need to be unpacked to double words
allowing the code to process more pixels every iteration (still 2 in MMX
but 6 in SSE2). It also avoids emulating the missing double word
instructions on older instruction sets.
Like with the previous code for 16-bit samples this has been tested on
an Athlon64 and a Core2Quad.
Athlon64:
1809275 decicycles in C, 32718 runs, 50 skips
911675 decicycles in mmx, 32727 runs, 41 skips, 2.0x faster
495284 decicycles in sse2, 32747 runs, 21 skips, 3.7x faster
Core2Quad:
921363 decicycles in C, 32756 runs, 12 skips
486537 decicycles in mmx, 32764 runs, 4 skips, 1.9x faster
293296 decicycles in sse2, 32759 runs, 9 skips, 3.1x faster
284910 decicycles in ssse3, 32759 runs, 9 skips, 3.2x faster
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This is a fairly dumb copy of the assembly for 8-bit samples but it
works and produces identical output to the C version. The options have
been tested on an Athlon64 and a Core2Quad.
Athlon64:
1810385 decicycles in C, 32726 runs, 42 skips
1080744 decicycles in mmx, 32744 runs, 24 skips, 1.7x faster
818315 decicycles in sse2, 32735 runs, 33 skips, 2.2x faster
Core2Quad:
924025 decicycles in C, 32750 runs, 18 skips
623995 decicycles in mmx, 32767 runs, 1 skips, 1.5x faster
406223 decicycles in sse2, 32764 runs, 4 skips, 2.3x faster
387842 decicycles in ssse3, 32767 runs, 1 skips, 2.4x faster
307726 decicycles in sse4, 32763 runs, 5 skips, 3.0x faster
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Always use the special filter for the first and last 3 columns (only).
Changes made in 64ed397 slowed the filter to just under 3/4 of what it
was. This commit restores the speed while maintaining identical output.
For reference, on my Athlon64:
1733222 decicycles in old
2358563 decicycles in new
1727558 decicycles in this
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '238614de679a71970c20d7c3fee08a322967ec40':
cdgraphics: do not rely on get_buffer() initializing the frame.
svq1: replace struct svq1_frame_size with an array.
vf_yadif: silence a warning.
Conflicts:
libavcodec/svq1dec.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Manually load registers to avoid using 8 registers on x86_32 with
compilers that do not align the stack (e.g. MSVC).
Signed-off-by: Diego Biurrun <diego@biurrun.de>
* commit 'fa8fcab1e0d31074c0644c4ac5194474c6c26415':
x86: h264_chromamc_10bit: drop pointless PAVG %define
x86: mmx2 ---> mmxext in function names
swscale: do not forget to swap data in formats with different endianness
Conflicts:
libavcodec/x86/dsputil_mmx.c
libavfilter/x86/gradfun.c
libswscale/input.c
libswscale/utils.c
libswscale/x86/swscale.c
tests/ref/lavfi/pixfmts_scale
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '6860b4081d046558c44b1b42f22022ea341a2a73':
x86: include x86inc.asm in x86util.asm
cng: Reindent some incorrectly indented lines
cngdec: Allow flushing the decoder
cngdec: Make the dbov variable have the right unit
cngdec: Fix the memset size to cover the full array
cngdec: Update the LPC coefficients after averaging the reflection coefficients
configure: fix print_config() with broke awks
Conflicts:
libavcodec/x86/ac3dsp.asm
libavcodec/x86/dct32.asm
libavcodec/x86/deinterlace.asm
libavcodec/x86/dsputil.asm
libavcodec/x86/dsputilenc.asm
libavcodec/x86/fft.asm
libavcodec/x86/fmtconvert.asm
libavcodec/x86/h264_chromamc.asm
libavcodec/x86/h264_deblock.asm
libavcodec/x86/h264_deblock_10bit.asm
libavcodec/x86/h264_idct.asm
libavcodec/x86/h264_idct_10bit.asm
libavcodec/x86/h264_intrapred.asm
libavcodec/x86/h264_intrapred_10bit.asm
libavcodec/x86/h264_weight.asm
libavcodec/x86/vc1dsp.asm
libavcodec/x86/vp3dsp.asm
libavcodec/x86/vp56dsp.asm
libavcodec/x86/vp8dsp.asm
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'f6c38c5f4ed6683a6a61db2ed418a68bbe5f5507':
avfilter: call x86 init functions under if (ARCH_X86), not if (HAVE_MMX)
rtspdec: Set the default port for listen mode, if none is specified
tscc2: Fix an out of array access
rtmpproto: Fix an out of array write
rtspdec: Fix use of uninitialized byte
vp8: reset loopfilter delta values at keyframes.
avutil: add yuva422p and yuva444p formats
Conflicts:
libavutil/pixdesc.c
libavutil/pixfmt.h
tests/ref/lavfi/pixdesc
tests/ref/lavfi/pixfmts_copy
tests/ref/lavfi/pixfmts_null
tests/ref/lavfi/pixfmts_scale
tests/ref/lavfi/pixfmts_vflip
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
MSS1 and MSS2: set final pixel format after common stuff has been initialised
MSS2 decoder
configure: handle --disable-asm before check_deps
x86: Split inline and external assembly #ifdefs
configure: x86: Separate inline from standalone assembler capabilities
pktdumper: Use a custom define instead of PATH_MAX for buffers
pktdumper: Use av_strlcpy instead of strncpy
pktdumper: Use sizeof(variable) instead of the direct buffer length
Conflicts:
Changelog
configure
libavcodec/allcodecs.c
libavcodec/avcodec.h
libavcodec/codec_desc.c
libavcodec/dct-test.c
libavcodec/imgconvert.c
libavcodec/mss12.c
libavcodec/version.h
libavfilter/x86/gradfun.c
libswscale/x86/yuv2rgb.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'ec36aa69448f20a78d8c4588265022e0b2272ab5':
x86: Fix linking with some or all of yasm, mmx, optimizations disabled
configure: Add more fine-grained SSE CPU capabilities flags
avfilter: x86: Use more precise compile template names
x86: cosmetics: Comment some #endifs for better readability
g723_1: add comfort noise generation
utvideoenc: Switch to dsputils' median prediction
utvideoenc: Avoid writing into the input picture
avtools: remove the distinction between func_arg and func2_arg.
avconv: make the -passlogfile option per-stream.
avconv: make the -pass option per-stream.
cmdutils: make -codecs print lossy/lossless flags.
lavc: add lossy/lossless codec properties.
Conflicts:
Changelog
cmdutils.c
configure
doc/APIchanges
ffmpeg.h
ffmpeg_opt.c
ffprobe.c
libavcodec/codec_desc.c
libavcodec/g723_1.c
libavcodec/utvideoenc.c
libavcodec/version.h
libavcodec/x86/mpegaudiodec.c
libavcodec/x86/rv40dsp_init.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
audio_frame_queue: Clean up ff_af_queue_log_state debug function
dwt: Remove unused code.
cavs: convert cavsdata.h to a .c file
cavs: Move inline functions only used in one file out of the header
cavs: Move data tables used in only one place to that file
fate: Add a single symbol Ut Video decoder test
vf_hqdn3d: x86 asm
vf_hqdn3d: support 16bit colordepth
avconv: prefer user-forced input framerate when choosing output framerate
Conflicts:
ffmpeg.c
libavcodec/audio_frame_queue.c
libavcodec/dwt.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
Fix even more missing includes after the common.h removal
build: Factor out rangecoder dependencies to CONFIG_RANGECODER
build: Factor out error resilience dependencies to CONFIG_ERROR_RESILIENCE
x86: avcodec: Consistently name all init files
Add more missing includes after removing the implicit common.h
Add some more missing includes after removing the implicit common.h
Don't include common.h from avutil.h
rtmp: Automatically compute the hash for SWFVerification
Conflicts:
configure
doc/APIchanges
doc/examples/decoding_encoding.c
libavcodec/Makefile
libavcodec/assdec.c
libavcodec/audio_frame_queue.c
libavcodec/avpacket.c
libavcodec/dv_profile.c
libavcodec/dwt.c
libavcodec/libtheoraenc.c
libavcodec/rawdec.c
libavcodec/rv40dsp.c
libavcodec/tiff.c
libavcodec/tiffenc.c
libavcodec/v210dec.h
libavcodec/vc1dsp.c
libavcodec/x86/Makefile
libavfilter/asrc_anullsrc.c
libavfilter/avfilter.c
libavfilter/buffer.c
libavfilter/formats.c
libavfilter/vf_ass.c
libavfilter/vf_drawtext.c
libavfilter/vf_fade.c
libavfilter/vf_select.c
libavfilter/video.c
libavfilter/vsrc_testsrc.c
libavformat/version.h
libavutil/audioconvert.c
libavutil/error.h
libavutil/version.h
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Under some circumstances, suncc will use a single register for the
address of all memory operands, inserting lea instructions loading
the correct address prior to each memory operand being used in the
code. In the yadif code, the branch in the asm block bypasses such
an lea instruction, causing an incorrect address to be used in the
following load.
This patch replaces the tmpX arrays with a single array and uses a
register operand to hold its address. Although this prevents using
offsets from the stack pointer to access these locations, the code
still builds as 32-bit PIC even with old compilers.
Signed-off-by: Mans Rullgard <mans@mansr.com>
* qatar/master:
mpegvideo: reduce excessive inlining of mpeg_motion()
mpegvideo: convert mpegvideo_common.h to a .c file
build: factor out mpegvideo.o dependencies to CONFIG_MPEGVIDEO
Move MASK_ABS macro to libavcodec/mathops.h
x86: move MANGLE() and related macros to libavutil/x86/asm.h
x86: rename libavutil/x86_cpu.h to libavutil/x86/asm.h
aacdec: Don't fall back to the old output configuration when no old configuration is present.
rtmp: Add message tracking
rtsp: Support mpegts in raw udp packets
rtsp: Support receiving plain data over UDP without any RTP encapsulation
rtpdec: Remove an unused include
rtpenc: Remove an av_abort() that depends on user-supplied data
vsrc_movie: discourage its use with avconv.
avconv: allow no input files.
avconv: prevent invalid reads in transcode_init()
avconv: rename OutputStream.is_past_recording_time to finished.
Conflicts:
configure
doc/filters.texi
ffmpeg.c
ffmpeg.h
libavcodec/Makefile
libavcodec/aacdec.c
libavcodec/mpegvideo.c
libavformat/version.h
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
lavr: fix handling of custom mix matrices
fate: force pix_fmt in lagarith-rgb32 test
fate: add tests for lagarith lossless video codec.
ARMv6: vp8: fix stack allocation with Apple's assembler
ARM: vp56: allow inline asm to build with clang
fft: 3dnow: fix register name typo in DECL_IMDCT macro
x86: dct32: port to cpuflags
x86: build: replace mmx2 by mmxext
Revert "wmapro: prevent division by zero when sample rate is unspecified"
wmapro: prevent division by zero when sample rate is unspecified
lagarith: fix color plane inversion for YUY2 output.
lagarith: pad RGB buffer by 1 byte.
dsputil: make add_hfyu_left_prediction_sse4() support unaligned src.
Conflicts:
doc/APIchanges
libavcodec/lagarith.c
libavfilter/x86/gradfun.c
libavutil/cpu.h
libavutil/version.h
libswscale/utils.c
libswscale/version.h
libswscale/x86/yuv2rgb.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Refactoring mmx2/mmxext YASM code with cpuflags will force renames.
So switching to a consistent naming scheme beforehand is sensible.
The name "mmxext" is more official and widespread and also the name
of the CPU flag, as reported e.g. by the Linux kernel.