We introduced a ff_horiz_slice_avx2/512() implemented on a new algorithm.
In a nutshell, the new algorithm does three things, gathering data from
8/16 rows, blurring data, and scattering data back to the image buffer.
Here we used a customized transpose 8x8/16x16 to avoid the huge overhead
brought by gather and scatter instructions, which is dependent on the
temporary buffer called localbuf added newly.
Performance data:
ff_horiz_slice_avx2(old): 109.89
ff_horiz_slice_avx2(new): 666.67
ff_horiz_slice_avx512: 1000
Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
Co-authored-by: Jin Jun <jun.i.jin@intel.com>
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
The new vertical slice with AVX2/512 acceleration can significantly
improve the performance of Gaussian Filter 2D.
Performance data:
ff_verti_slice_c: 32.57
ff_verti_slice_avx2: 476.19
ff_verti_slice_avx512: 833.33
Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
Co-authored-by: Jin Jun <jun.i.jin@intel.com>
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
Up until now, an AVFilter's lists of input and output AVFilterPads
were terminated by a sentinel and the only way to get the length
of these lists was by using avfilter_pad_count(). This has two
drawbacks: first, sizeof(AVFilterPad) is not negligible
(i.e. 64B on 64bit systems); second, getting the size involves
a function call instead of just reading the data.
This commit therefore changes this. The sentinels are removed and new
private fields nb_inputs and nb_outputs are added to AVFilter that
contain the number of elements of the respective AVFilterPad array.
Given that AVFilter.(in|out)puts are the only arrays of zero-terminated
AVFilterPads an API user has access to (AVFilterContext.(in|out)put_pads
are not zero-terminated and they already have a size field) the argument
to avfilter_pad_count() is always one of these lists, so it just has to
find the filter the list belongs to and read said number. This is slower
than before, but a replacement function that just reads the internal numbers
that users are expected to switch to will be added soon; and furthermore,
avfilter_pad_count() is probably never called in hot loops anyway.
This saves about 49KiB from the binary; notice that these sentinels are
not in .bss despite being zeroed: they are in .data.rel.ro due to the
non-sentinels.
Reviewed-by: Nicolas George <george@nsup.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The current way of doing it involves writing the ctx parameter twice.
Reviewed-by: Nicolas George <george@nsup.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Several combinations of functions happen quite often in query_format
functions; e.g. ff_set_common_formats(ctx, ff_make_format_list(sample_fmts))
is very common. This commit therefore adds functions that are equivalent
to commonly used function combinations in order to reduce code
duplication.
Reviewed-by: Nicolas George <george@nsup.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is possible now that the next-API is gone.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Signed-off-by: James Almer <jamrial@gmail.com>
Removed by mistake in 2b4da1cb8c2984b37e5c912e103a1b8b734e7c1f where it
should have been replaced instead.
Signed-off-by: James Almer <jamrial@gmail.com>
x86_32 ABI does not pass float arguments directly on xmm regs, and the Win64
ABI uses only the first four regs for this purpose.
Signed-off-by: James Almer <jamrial@gmail.com>
The horizontal pass get ~2x performance with the patch
under single thread.
Tested overall performance using the command(avx2 enabled):
./ffmpeg -i 1080p.mp4 -vf gblur -f null /dev/null
./ffmpeg -i 1080p.mp4 -vf gblur=threads=1 -f null /dev/null
For single thread, the fps improves from 43 to 60, about 40%.
For multi-thread, the fps improves from 110 to 130, about 20%.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Instead of doing each column one by one, doing several columns
together gives about 30% better performance.
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Ruiling Song <ruiling.song@intel.com>