changes since v1:
- made into fate test
- fixed c90 warnings
- tests more intermediate formats
- tested on BE mips too
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Regression since: 3adffab073
-1 is consistent what other error paths return
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
We return 0 for this particular architecture but should instead be
returning the number of lines.
Fixes users who check the return value matches what they expect.
256 bits is just wide enough to fit all the operands needed to vectorize
the software implementation, but AVX2 is needed to for a couple of
instructions like cross-lane permutation.
Output is bit-for-bit identical to C.
Signed-off-by: Nelson Gomez <nelson.gomez@microsoft.com>
Extracting information from SwsContext in assembly is difficult, and
rearranging SwsContext just for asm access didn't look good. These
functions only need a couple of fields from it anyway, so just make
them parameters in their own right.
Signed-off-by: Nelson Gomez <nelson.gomez@microsoft.com>
The NEON hscale function only supports X8 filter sizes and should only
be selected when these are being used. At the moment filterAlign is
set to 8 but in the future when extra NEON assembly for specific sizes is
added they will need to have checks here too.
The immediate usecase for this change is making the hscale checkasm
test easier and without NEON specific edge-cases (x86 already has these
guards).
This applies the same fix from 718c8f9aa5
on the 32 bit arm version of the function, fixing fate-checkasm-sw_scale
there.
Signed-off-by: Martin Storsjö <martin@martin.st>
The NEON hscale function only supports X8 filter sizes and should only
be selected when these are being used. At the moment filterAlign is
set to 8 but in the future when extra NEON assembly for specific sizes is
added they will need to have checks here too.
The immediate usecase for this change is making the hscale checkasm
test easier and without NEON specific edge-cases (x86 already has these
guards).
Signed-off-by: Josh de Kock <josh@itanimul.li>
libswscale/vscale.c makes extensive use of function pointers and in
doing so it converts these function pointers to and from a pointer to
void. Yet this is actually against the C standard:
C90 only guarantees that one can convert a pointer to any incomplete
type or object type to void* and back with the result comparing equal
to the original which makes pointers to void generic pointers to
incomplete or object type. Yet C90 lacks a generic function pointer
type.
C99 additionally guarantees that a pointer to a function of one type may
be converted to a pointer to a function of another type with the result
and the original comparing equal when converting back.
This makes any function pointer type a generic function pointer type.
Yet even this does not make pointers to void generic function pointers.
Both GCC and Clang emit warnings for this when in pedantic mode.
This commit fixes this by using a union that can hold one member of any
of the required function pointer types to store the function pointer.
This works even for C90.
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
The x18 is a reserved platform register on Darwin and Windows.
x8/w8 seems to be unused in this function though (and same about
x10 and x14), so there's really no reason to use x18 here - just change
the uses of x18/w18 into x8/w8 instead without any further rewrites.
Signed-off-by: Martin Storsjö <martin@martin.st>
Fixes: signed integer overflow: 1169365504 + 981452800 cannot be represented in type 'int'
Fixes: ticket8293
Found-by: Suhwan
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: signed integer overflow: 524280 * 4432 cannot be represented in type 'int'
Fixes: ticket8322
Found-by: Suhwan
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Add swscale input support for Y210LE, output support and fate
test could be added later if there is requirement for software
CSC to this packed format.
Signed-off-by: Linjie Fu <linjie.fu@intel.com>
Tested using this command:
/ffmpeg -pix_fmt yuv420p -s 1920*1080 -i ArashRawYuv420.yuv \
-vcodec rawvideo -s 1920*1080 -pix_fmt rgb24 -f null /dev/null
The fps increase from 389 to 640 on Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Signed-off-by: Ting Fu <ting.fu@intel.com>
Bug #8255 points out a double free error in libwscale/utils.c file.
The double free is because the pointer to cascaded_context of an
sw_context is not set to NULL after freeing it. When the sw_context
is later freed, sws_freeContext is called on the cascaded_context,
causing a double free.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
The original inline assembly and nasm code have the same fps when called by command.
NASM code almost has no impact on the perfromance.
Signed-off-by: Ting Fu <ting.fu@intel.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This patch rewrites the innermost loop of ff_yuv2planeX_8_neon to avoid zips and
horizontal adds by using fused multiply adds. The patch also uses ld1r to load
one element and replicate it across all lanes of the vector. The patch also
improves the clipping code by removing the shift right instructions and
performing the shift with the shift-right narrow instructions.
I see 8% difference on an m6g instance with neoverse-n1 CPUs:
$ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
before: t:0.014015 avg:0.014096 max:0.015018 min:0.013971
after: t:0.012985 avg:0.013013 max:0.013996 min:0.012818
Tested with `make check` on aarch64-linux.
Signed-off-by: Sebastian Pop <spop@amazon.com>
Reviewed-by: Clément Bœsch <u@pkh.me>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This patch implements ff_hscale_8_to_15_neon with NEON fused multiply accumulate
and bumps the vectorization factor from 2 to 4.
The speedup is of 25% on Graviton1 A1 instances based on A-72 cpus:
$ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
before: t:0.040303 avg:0.040287 max:0.040371 min:0.039214
after: t:0.032168 avg:0.032215 max:0.033081 min:0.032146
The speedup is of 39% on Graviton2 m6g instances based on Neoverse-N1 cpus:
$ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
before: t:0.019446 avg:0.019423 max:0.019493 min:0.019181
after: t:0.014015 avg:0.014096 max:0.015018 min:0.013971
Tested with `make check` on aarch64-linux.
Signed-off-by: Sebastian Pop <spop@amazon.com>
Reviewed-by: Jean-Baptiste Kempf <jb@videolan.org>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
The argument to vec_splat_u16 must be a literal. By making the
function always inline and marking the arguments const, gcc can
turn those into literals, and avoid build errors like:
swscale_vsx.c:165:53: error: argument 1 must be a 5-bit signed literal
Fixes#7861.
Signed-off-by: Daniel Kolesa <daniel@octaforge.org>
Signed-off-by: Lauri Kasanen <cand@gmx.com>
While this technically compiles in current ffmpeg, this is only
because ffmpeg is compiled in strict ISO C mode, which disables
the builtin 'vector' keyword for AltiVec/VSX. Instead this gets
replaced with a macro inside altivec.h, which defines vector to
be actually __vector, which accepts random types.
Normally, the vector keyword should be used only with plain
scalar non-typedef types, such as unsigned int. But we have the
vec_(s|u)(8|16|32) macros, which can be used in a portable manner,
in util_altivec.h in libavutil.
This is also consistent with other AltiVec/VSX code elsewhere in
the tree.
Fixes#7861.
Signed-off-by: Daniel Kolesa <daniel@octaforge.org>
Signed-off-by: Lauri Kasanen <cand@gmx.com>
Affected the FATE-tests vsynth_lena-dv-411, vsynth1-dv-411,
vsynth2-dv-411 and hevc-paramchange-yuv420p.yuv420p10.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This affected many FATE-tests: The number of failing tests went down
from 663 to 344. (Both numbers exclude tests that failed because of
unaligned accesses in code that is inside #if HAVE_FAST_UNALIGNED.)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
gcc 6.x and 7.x generate wrong code for little endian machines
for the vec_lvsl/vec_perm instruction combos in some cases.
The bug was fixed in version 8.x
If these instructions are replaced with vec_xl, the problem goes
away for all versions of the compilers.
Fixes ticket #7124.
In libswcale/tests/swcale.c, the function fileTest() calls sscanf in
an argument of "%12s" on character srcStr[] and dstStr[], which are
only 12 bytes. So, if the input string is 12 characters, a
terminating null byte can be written past the end of these arrays.
This bug was found by cppcheck.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
The implementation is pretty straight-forward. Most of the existing
NV12 codepaths work regardless of subsampling and are re-used as is.
Where necessary I wrote the slightly different NV24 versions.
Finally, the one thing that confused me for a long time was the
asm specific x86 path that did an explicit exclusion check for NV12.
I replaced that with a semi-planar check and also updated the
equivalent PPC code, which Lauri kindly checked.
./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 -sws_flags fast_bilinear \
-s 1200x720 -f null -vframes 100 -pix_fmt $i -nostats \
-cpuflags 0 -v error -
32-bit mul, power8 only.
~2x speedup:
rgb24
24431 UNITS in yuv2packed2, 16384 runs, 0 skips
13783 UNITS in yuv2packed2, 16383 runs, 1 skips
bgr24
24396 UNITS in yuv2packed2, 16384 runs, 0 skips
14059 UNITS in yuv2packed2, 16384 runs, 0 skips
rgba
26815 UNITS in yuv2packed2, 16383 runs, 1 skips
12797 UNITS in yuv2packed2, 16383 runs, 1 skips
bgra
27060 UNITS in yuv2packed2, 16384 runs, 0 skips
13138 UNITS in yuv2packed2, 16384 runs, 0 skips
argb
26998 UNITS in yuv2packed2, 16384 runs, 0 skips
12728 UNITS in yuv2packed2, 16381 runs, 3 skips
bgra
26651 UNITS in yuv2packed2, 16384 runs, 0 skips
13124 UNITS in yuv2packed2, 16384 runs, 0 skips
This is a low speedup, but the x86 mmx version also gets only ~2x. The mmx version
is also heavily inaccurate, while the vsx version has high accuracy.
./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 -sws_flags fast_bilinear \
-s 1200x1440 -f null -vframes 100 -pix_fmt $i -nostats \
-cpuflags 0 -v error -
32-bit mul, power8 only.
1.8-2.3x speedup:
rgb24
18192 UNITS in yuv2packed1, 32767 runs, 1 skips
9983 UNITS in yuv2packed1, 32760 runs, 8 skips
bgr24
18665 UNITS in yuv2packed1, 32766 runs, 2 skips
9925 UNITS in yuv2packed1, 32763 runs, 5 skips
rgba
20239 UNITS in yuv2packed1, 32767 runs, 1 skips
8794 UNITS in yuv2packed1, 32759 runs, 9 skips
bgra
20354 UNITS in yuv2packed1, 32768 runs, 0 skips
8770 UNITS in yuv2packed1, 32761 runs, 7 skips
argb
20185 UNITS in yuv2packed1, 32768 runs, 0 skips
8761 UNITS in yuv2packed1, 32761 runs, 7 skips
bgra
20360 UNITS in yuv2packed1, 32766 runs, 2 skips
8759 UNITS in yuv2packed1, 32764 runs, 4 skips
This is a low speedup, but the x86 mmx version also gets only ~2x. The mmx version
is also heavily inaccurate, while the vsx version has high accuracy.
Signed-off-by: Dong, Jerry <jerry.dong@intel.com>
Signed-off-by: Decai Lin <decai.lin@intel.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 \
-s 1200x1440 -f null -vframes 100 -pix_fmt $i -nostats \
-cpuflags 0 -v error -
This uses 32-bit mul, so POWER8 only.
The following output formats get about 4.5x speedup:
rgb24
39980 UNITS in yuv2packed1, 32768 runs, 0 skips
8774 UNITS in yuv2packed1, 32768 runs, 0 skips
bgr24
40069 UNITS in yuv2packed1, 32768 runs, 0 skips
8772 UNITS in yuv2packed1, 32766 runs, 2 skips
rgba
39759 UNITS in yuv2packed1, 32768 runs, 0 skips
8681 UNITS in yuv2packed1, 32767 runs, 1 skips
bgra
39729 UNITS in yuv2packed1, 32768 runs, 0 skips
8696 UNITS in yuv2packed1, 32766 runs, 2 skips
argb
39766 UNITS in yuv2packed1, 32768 runs, 0 skips
8672 UNITS in yuv2packed1, 32766 runs, 2 skips
bgra
39784 UNITS in yuv2packed1, 32768 runs, 0 skips
8659 UNITS in yuv2packed1, 32767 runs, 1 skips
./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p16be \
-s 1920x1728 -f null -vframes 100 -v error -nostats -
9-14 bit funcs get about 6x speedup, 16-bit gets about 15x.
Fate passes, each format tested with an image to video conversion.
Only POWER8 includes 32-bit vector multiplies, so POWER7 is locked out
of the 16-bit function. This includes the vec_mulo/mule functions too,
not just vmuluwm.
With TIMER_REPORT skips disabled:
yuv420p9le
12412 UNITS in planarX, 131072 runs, 0 skips
73136 UNITS in planarX, 131072 runs, 0 skips
yuv420p9be
12481 UNITS in planarX, 131072 runs, 0 skips
73410 UNITS in planarX, 131072 runs, 0 skips
yuv420p10le
12322 UNITS in planarX, 131072 runs, 0 skips
72546 UNITS in planarX, 131072 runs, 0 skips
yuv420p10be
12291 UNITS in planarX, 131072 runs, 0 skips
72935 UNITS in planarX, 131072 runs, 0 skips
yuv420p12le
12316 UNITS in planarX, 131072 runs, 0 skips
72708 UNITS in planarX, 131072 runs, 0 skips
yuv420p12be
12319 UNITS in planarX, 131072 runs, 0 skips
72577 UNITS in planarX, 131072 runs, 0 skips
yuv420p14le
12259 UNITS in planarX, 131072 runs, 0 skips
72516 UNITS in planarX, 131072 runs, 0 skips
yuv420p14be
12440 UNITS in planarX, 131072 runs, 0 skips
72962 UNITS in planarX, 131072 runs, 0 skips
yuv420p16le
10548 UNITS in planarX, 131072 runs, 0 skips
73429 UNITS in planarX, 131072 runs, 0 skips
yuv420p16be
10634 UNITS in planarX, 131072 runs, 0 skips
150959 UNITS in planarX, 131072 runs, 0 skips
Signed-off-by: Lauri Kasanen <cand@gmx.com>
This function wouldn't benefit from VSX instructions, so I put it
under altivec.
./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \
-f null -vframes 100 -v error -nostats -
3743 UNITS in planar1, 65495 runs, 41 skips
-cpuflags 0
23511 UNITS in planar1, 65530 runs, 6 skips
grayf32be
4647 UNITS in planar1, 65449 runs, 87 skips
-cpuflags 0
28608 UNITS in planar1, 65530 runs, 6 skips
The native speedup is 6.28133, and the bswapping one 6.15623.
Fate passes, each format tested with an image to video conversion.
Signed-off-by: Lauri Kasanen <cand@gmx.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Passes fate on LE (with "lavc/jrevdct: Avoid an aliasing violation" applied).
Signed-off-by: Lauri Kasanen <cand@gmx.com>
Tested-by: Michael Kostylev on BE
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \
-f null -vframes 100 -v error -nostats -
1158 UNITS in planar1, 65528 runs, 8 skips
-cpuflags 0
19082 UNITS in planar1, 65533 runs, 3 skips
16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version
takes as many cycles as the x86 SSE2 version, yikes it's fast.
Note that this function uses VSX instructions, but is not marked so.
This is because several existing functions also make that mistake.
I'll submit a patch moving them once this is reviewed.
Signed-off-by: Lauri Kasanen <cand@gmx.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>