Up until now, an AVFilter's lists of input and output AVFilterPads
were terminated by a sentinel and the only way to get the length
of these lists was by using avfilter_pad_count(). This has two
drawbacks: first, sizeof(AVFilterPad) is not negligible
(i.e. 64B on 64bit systems); second, getting the size involves
a function call instead of just reading the data.
This commit therefore changes this. The sentinels are removed and new
private fields nb_inputs and nb_outputs are added to AVFilter that
contain the number of elements of the respective AVFilterPad array.
Given that AVFilter.(in|out)puts are the only arrays of zero-terminated
AVFilterPads an API user has access to (AVFilterContext.(in|out)put_pads
are not zero-terminated and they already have a size field) the argument
to avfilter_pad_count() is always one of these lists, so it just has to
find the filter the list belongs to and read said number. This is slower
than before, but a replacement function that just reads the internal numbers
that users are expected to switch to will be added soon; and furthermore,
avfilter_pad_count() is probably never called in hot loops anyway.
This saves about 49KiB from the binary; notice that these sentinels are
not in .bss despite being zeroed: they are in .data.rel.ro due to the
non-sentinels.
Reviewed-by: Nicolas George <george@nsup.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The current way of doing it involves writing the ctx parameter twice.
Reviewed-by: Nicolas George <george@nsup.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Several combinations of functions happen quite often in query_format
functions; e.g. ff_set_common_formats(ctx, ff_make_format_list(sample_fmts))
is very common. This commit therefore adds functions that are equivalent
to commonly used function combinations in order to reduce code
duplication.
Reviewed-by: Nicolas George <george@nsup.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is possible now that the next-API is gone.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Signed-off-by: James Almer <jamrial@gmail.com>
Remove the pdiff_lut_scale in nlmeans and increase weight_lut table size
from 2^9 to 500000, this change will avoid using pdiff_lut_scale in
nlmeans_slice() for weight_lut table search, improving the performance
by about 12%. (in 1080P size picture case).
Use the profiling command like:
perf stat -a -d -r 5 ./ffmpeg -i input -an -vf nlmeans=s=30 -vframes 10 \
-f null /dev/null
without this change:
when s=1.0(default value) 63s
s=30.0 72s
after this change:
s=1.0(default value) 56s
s=30.0 63s
Reviewed-by: Carl Eugen Hoyos <ceffmpeg@gmail.com>
Signed-off-by: Jun Zhao <mypopydev@gmail.com>
Signed-off-by: Clément Bœsch <u@pkh.me>
This helps figuring out where the filter is slow:
70.53% ffmpeg_g ffmpeg_g [.] nlmeans_slice
25.73% ffmpeg_g ffmpeg_g [.] compute_safe_ssd_integral_image_c
1.74% ffmpeg_g ffmpeg_g [.] compute_unsafe_ssd_integral_image
0.82% ffmpeg_g ffmpeg_g [.] ff_mjpeg_decode_sos
0.51% ffmpeg_g [unknown] [k] 0xffffffff91800a80
0.24% ffmpeg_g ffmpeg_g [.] weight_averages
(Tested with a large image that takes several seconds to process)
Since this function is irrelevant speed wise, the file's TODO is
updated.
before: ssd_integral_image_c: 49204.6
after: ssd_integral_image_c: 44272.8
Unrolling by 4 made the biggest difference on odroid-c2 (aarch64);
unrolling by 2 or 8 both raised 46k cycles vs 44k for 4.
Additionally, this is a much better reference when writing SIMD (SIMD
vectorization will just target 16 instead of 4).
SIMD code will not have to deal with padding itself. Overwriting in that
function may have been possible but involve large overreading of the
sources. Instead, we simply make sure the width to process is always a
multiple of 16. Additionally, there must be some actual area to process
so the SIMD code can have its boundary checks after processing the first
pixels.