This speeds up build_filter by ~ 50%. This gain should be pretty
consistent across all architectures and platforms.
Essentially, this relies on a observation that the filters have some
even/odd symmetry that may be exploited during the construction of the
polyphase filter bank. In particular, phases (scaled to [0, 1]) in [0.5, 1] are
easily derived from [0, 0.5] and expensive reevaluation of function
points are unnecessary. This requires some rather annoying even/odd
bookkeeping as can be seen from the patch.
I vaguely recall from signal processing theory more general symmetries allowing even greater
optimization of the construction. At a high level, "even functions"
correspond to 2, and one can imagine variations. Nevertheless, for the sake
of some generality and because of existing filters, this is all that is
being exploited.
Currently, this patch relies on phase_count being even or (trivially) 1,
though this is not an inherent limitation to the approach. This
assumption is safe as phase_count is 1 << phase_bits, and is hence a
power of two. There is no way for user API to set it to a nontrivial odd
number. This assumption has been placed as an assert in the code.
To repeat, this assumes even symmetry of the filters, which is the most common
way to get generalized linear phase anyway and is true of all currently
supported filters.
As a side note, accuracy should be identical or perhaps slightly better
due to this "forcing" filter symmetries leading to a better phase
characteristic. As before, I can't test this claim easily, though it may
be of interest.
Patch tested with FATE.
Sample benchmark (x86-64, Haswell, GNU/Linux):
test: swr-resample-dblp-44100-2626
new:
527376779 decicycles in build_filter(loop 1000), 256 runs, 0 skips
524361765 decicycles in build_filter(loop 1000), 512 runs, 0 skips
516552574 decicycles in build_filter(loop 1000), 1024 runs, 0 skips
old:
974178658 decicycles in build_filter(loop 1000), 256 runs, 0 skips
972794408 decicycles in build_filter(loop 1000), 512 runs, 0 skips
954350046 decicycles in build_filter(loop 1000), 1024 runs, 0 skips
Note that lower level optimizations are entirely possible, I focussed on
getting the high level semantics correct. In any case, this should
provide a good foundation.
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
This adds const-correctness when needed for the comparators.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
It is well known that fabs and fabsf are at least as fast and sometimes
faster than the FFABS macro, at least on the gcc+glibc combination.
For instance, see the reference:
http://patchwork.sourceware.org/patch/6735/.
This was a patch to glibc in order to remove their usages of a macro.
The reason essentially boils down to fabs using the __builtin_fabs of
the compiler, while FFABS needs to infer to not use a branch and to
simply change the sign bit. Usually the inference works, but sometimes
it does not. This may be easily checked by looking at the asm.
This also has the added benefit of reducing macro usage, which has
problems with side-effects.
Note that avcodec is not handled here, as it is huge and
most things there are integer arithmetic anyway.
Tested with FATE.
Reviewed-by: Clément Bœsch <u@pkh.me>
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
This will trigger a few warnings that need to be fixed.
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
Proper names should be capitalized in all user facing API as far as
possible. The option names themselves have not been changed since:
1. We consistently keep option names in lower case.
2. Changing them would break existing scripts.
3. I suspect that we want to be similar to Sox and its relevant options.
The converse is also true: improper names should not be capitalized
generally.
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This fixes a -Wabsolute-value reported by clang 3.5+ complaining about misuse of fabs() for integer absolute value.
An additional benefit is the removal of floating point calculations.
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Only two functions that use xop multiply-accumulate instructions where the
first operand is the same as the fourth actually took advantage of the macros.
This further reduces differences with x264's x86inc.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
or if no rematrix and no resampling is performed and the input is 16bit
note reampling and rematrix itself always use more than 16bit internally
the "internal" sampling format is the format between these steps
Its unlikely the difference from this commit is audible in any case
unless there is some bug either before or after the change.
but multiple people prefer this and it slightly improves the precission
of computations.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This avoids leaks if the user doest call swr_close() after a failed init
Found-by: James Almer <jamrial@gmail.com>
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Previous version reviewed-by: Pavel Koshevoy <pkoshevoy@gmail.com>
Previous version reviewed-by: wm4 <nfxjfg@googlemail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Fix crash when doing 8 ch conversion from apps compiled with MSVS
Thanks to Ronald for giving this hint:
https://ffmpeg.org/pipermail/ffmpeg-devel/2015-May/173049.html
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
There's no benefit from using blendps here except on CPUs with AVX, where
it's faster than shufps according to Intel's documentation.
As such, rename the sse4 functions to sse/sse2 and use shufps instead.
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
This avoids several issue like calculating sum/maxcoef
incorrectly due to adding up matrix entries that will
be overwritten, as well as out-of-range writes to
s->matrix if the maximum allowed number of channels is used.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
We only actually need to use a tiny part of it.
Unfortunately we seem to have no real test coverage on
the code, so this is a bit risky.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
Based on commit fb1ddcdc8f by Luca Barbato <lu_zero@gentoo.org>
Adapted for libswresample by Michael Niedermayer
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
The swresample_ prefix is not for internal functions
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Prototypes are not needed anymore now that the x86 functions don't
include resample_template.c
The DO_RESAMPLE_ONE macro is removed for that same reason as well.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Linear interpolation goes from 63 (llvm) or 58 (gcc) to 48 (yasm)
cycles/sample on 64bit, or from 66 (llvm/gcc) to 52 (yasm) cycles/
sample on 32bit. Bon-linear goes from 43 (llvm) or 38 (gcc) to
32 (yasm) cycles/sample on 64bit, or from 46 (llvm) or 44 (gcc) to
38 (yasm) cycles/sample on 32bit (all testing on OSX 10.9.2, llvm
5.1 and gcc 4.8/9).
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Should fix compilation failures with MSVC and any other compiler
without inline asm support.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
DSP bits of swri_resample go into their own mini-DSP functions; DSP
init goes from a per-call branch in multiple_resample to a proper
DSP init routine; x86 bits go into x86/; swri_resample() moves out of
resample_template.c into resample.c because it's independent of DSP
code or sample type; multiple_resample() is simplified.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
I don't see dst_incr/dst_incr_frac ever being changed from their
initial value (which is the inverse of this operation), so it seems
to me that this is a no-op.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
I think there's an off-by-one in terms of the switchpoint where we
switch from dst_incr to ideal_dst_incr, I don't think that's a massive
issue, but just be aware of that. It's probably trivial to prevent but
I don't care.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
I could not reproduce any off by 1 error, results are bit exact (michael)
This removes a branch at a performance-sensitive point (in the middle
of the loop). In fate-swr-resample-s32p-8000-2626, this makes the code
about 10% faster. It also simplifies the loops, allowing us to rewrite
it in yasm at some later point.
The compensation_distance != 0 code and index < 0 code are still kind
of hairy. For compensation_distance != 0, this should likely be handled
in the caller, so that it calls swri_resample twice (once until the
dst_incr switch-point, and once with the remainder of the samples). For
index < 0, the code should probably be rewritten to break out of the
loop once sample_index >= 0, and then resume (e.g. as a tail-call) to
the common or linear resampling loops.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This should avoid slight differences in the output causes by input
size alignment differences between archs
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Related to CID1197063
The limit choosen is arbitrary and much larger than what makes sense.
It avoids the need for checking arithmetic operations with the length for overflow
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
pshuf+paddd is slightly faster than phaddd.
The real gain is in pre-ssse3 processors like AMD K8 and K10, which get
a big boost in performance compared to the mmxext version
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
These are not supported by all compilers (gcc 2.95 but also older SPARC
compilers, see gcc bug #33304 for example), and there is no real need for them.
One use of this feature remains in libavdevice/v4l2.c which can't be
replaced quite as easily.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
The constant may change in libavutil but the library may be compiled
against an older version, thus rejecting a value which is otherwise
supported by the new libavutil.
INT_MAX is used here to denote the max allowed value for a sample format.
The opt-test code is changed to provide a valid reference example.
Originally written by James Almer <jamrial@gmail.com>
With the following contributions by Timothy Gu <timothygu99@gmail.com>
* Use descriptions of libraries from the pkg-config file generation function
* Use "FFmpeg Project" as CompanyName (suggested by Alexander Strasser)
* Use "FFmpeg" for ProductName as MSDN says "name of the product with which the
file is distributed" [1].
* Use FFmpeg's version (N-xxxxx-gxxxxxxx) for ProductVersion per MSDN [1].
* Only build the .rc files when --enable-small is not enabled.
[1] http://msdn.microsoft.com/en-us/library/windows/desktop/aa381058.aspx
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>