This is a major rewrite of the existing nlmeans vulkan code, with bug
fixes and major performance improvements.
Fix visual artifacts found in tickets #10661 and #10733. Add OOB checks for
image loading and patch sized area around the border. Correct chroma
plane height, strength and buffer barrier index.
Improve parallelism with component workgroup axis and more but smaller
workgroups. Split weights pass into vertical/horizontal (integral) and
weights passes. Remove h/v order logic to always calculate sum on
vertical pass. Remove the atomic float requirement, which caused heavy
memory lock contention, at the cost of higher memory usage for the w/s buffer.
Use cache blocking in h pass to reduce memory bandwidth usage.
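As a rough illustration of the vertical/horizontal (integral) split, here is
a minimal CPU sketch in C, not the actual Vulkan shader code. The function
names and the flat row-major buffer layout are assumptions made for this
example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Vertical pass: each element becomes the sum of itself and all
 * elements above it in the same column. */
static void prefix_sum_vertical(uint32_t *buf, size_t w, size_t h)
{
    for (size_t y = 1; y < h; y++)
        for (size_t x = 0; x < w; x++)
            buf[y * w + x] += buf[(y - 1) * w + x];
}

/* Horizontal pass: each element becomes the sum of itself and all
 * elements to its left in the same row. Run after the vertical pass,
 * this yields the integral image. */
static void prefix_sum_horizontal(uint32_t *buf, size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++)
        for (size_t x = 1; x < w; x++)
            buf[y * w + x] += buf[y * w + x - 1];
}
```

Each pass only carries a dependency along one axis, so the other axis can be
processed in parallel, which is presumably what makes the split attractive
on a GPU.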
Write the moov tag at the end first, before overwriting the mdat size
at the start of the file.
In case writing the final moov box fails (e.g. due to being out
of disk), we haven't broken the initial moov box yet.
Thus if writing stops between these steps, we could end up with
a file with two moov boxes - which arguably is more feasible to
recover from, than from a file with no moov boxes at all.
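The ordering can be sketched in plain C with stdio. This is only an
illustration under simplifying assumptions, not the muxer's code:
finalize_file() and write_be32() are hypothetical helpers, the trailing
index is passed in as an opaque blob, and only a 32-bit size field is
modelled.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: store a 32-bit big-endian value. */
static int write_be32(FILE *f, uint32_t v)
{
    unsigned char b[4] = {
        (unsigned char)(v >> 24), (unsigned char)(v >> 16),
        (unsigned char)(v >> 8),  (unsigned char)v
    };
    return fwrite(b, 1, 4, f) == 4 ? 0 : -1;
}

/* Hypothetical helper: append the trailing index first, and only then
 * patch the size field at size_pos near the start of the file. */
static int finalize_file(FILE *f, long size_pos, uint32_t size,
                         const void *index, size_t index_len)
{
    if (fseek(f, 0, SEEK_END) < 0)
        return -1;
    if (fwrite(index, 1, index_len, f) != index_len || fflush(f) != 0)
        return -1; /* the start of the file is still untouched here */
    if (fseek(f, size_pos, SEEK_SET) < 0)
        return -1;
    return write_be32(f, size);
}
```

If the append fails, the function returns before the seek back to the start,
so the original header survives.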
The documentation states that this field is for enabling "extra" usage
flags. This conflicts with the implementation, and the rest of the comment,
though.
In resolving this ambiguity, I think it's better to lean towards the first
sentence and treat this field purely as specifying *extra* usage flags to
enable. Otherwise, this may break vulkan encoding or subsequent hwdownload
if the upstream filter did not specifically advertise this.
Change the default behavior and update the documentation slightly to more
clearly document the semantics.
This avoids having to fix up ABI violations via emms_c and
also leads to a 73% speedup for the line noise average version
here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
When using averaged noise with height > MAX_RES (i.e. 4096),
multiple threads would access the same prev_shift slot,
leading to races. Fix this by disabling slice threading
in such scenarios.
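A minimal sketch of the idea in C follows. Both helpers are hypothetical, and
the slot index wrapping modulo MAX_RES is an assumption made for
illustration; the filter's actual indexing may differ.

```c
#define MAX_RES 4096 /* value quoted in the commit message */

/* Assumption: the per-line slot index wraps modulo MAX_RES. */
static int shift_slot(int y)
{
    return y & (MAX_RES - 1);
}

/* Hypothetical helper: number of slice jobs to launch. */
static int noise_nb_jobs(int averaged, int height, int nb_threads)
{
    if (averaged && height > MAX_RES)
        return 1; /* rows y and y + MAX_RES would share a slot */
    return height < nb_threads ? height : nb_threads;
}
```

Under that assumption, rows 0 and 4096 map to the same slot, so two slices
covering them could race unless only one job runs.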
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is not only UB, but also leads to races and nondeterministic
output, because the write one past the end of the buffer
conflicts with accesses by the thread that actually owns it.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
"all" only exists to set options; it does not need the big arrays
contained in FilterParams.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This reverts commit 301141b576.
cluster[0].dts, pts and frag_info[0].time are already in the presentation
timeline, so they shouldn't be shifted by start_pts.
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
The diff and var functions benefit from psadbw, comb from wider
registers, which make it possible to avoid reloading values, reducing
the number of loads from 48 to 10. Performance increased by 117% (the loop
in compute_metric() has been timed); codesize decreased by 144B.
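For readers unfamiliar with psadbw: per 64-bit lane it sums the absolute
differences of eight unsigned bytes. A scalar C model of one lane is given
below; sad8() is a hypothetical name and this is not the filter's code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of one 64-bit lane of psadbw: the sum of absolute
 * differences of eight unsigned bytes. */
static unsigned sad8(const uint8_t *a, const uint8_t *b)
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (unsigned)abs(a[i] - b[i]);
    return sum;
}
```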
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This allows removing an emms_c from the filter. It also gives
25% speedup here (when timing the calls to store_slice using
START/STOP_TIMER).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Improves performance and no longer breaks the ABI (by forgetting
to call emms).
Old benchmarks:
add_8x8basis_c: 43.6 ( 1.00x)
add_8x8basis_ssse3: 12.3 ( 3.55x)
New benchmarks:
add_8x8basis_c: 43.0 ( 1.00x)
add_8x8basis_ssse3: 6.3 ( 6.79x)
Notice that the output of try_8x8basis_ssse3 changes a bit:
Before this commit, it computes certain values and adds the values
for i,i+1,i+4 and i+5 before right shifting them; now it adds
the values for i,i+1,i+8,i+9. The second pair in these lists
could be avoided (by shifting xmm0 and xmm1 before adding both together
instead of only shifting xmm0 after adding them), but the former
i,i+1 is inherent in using pmaddwd. This is the reason that this
function is not bitexact.
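To see why the i,i+1 grouping is inherent, consider a scalar C model of
pmaddwd on one XMM register (pmaddwd_model() is a hypothetical name used
only for illustration): the instruction multiplies signed words and adds
each adjacent pair of products, so any later shift applies to the pair sum.

```c
#include <stdint.h>

/* Scalar model of pmaddwd on one XMM register: eight signed words in,
 * four signed dwords out. Each output is the sum of two adjacent
 * products. The corner case where both words of a pair are -32768 in
 * both inputs (which wraps in hardware) is not modelled here. */
static void pmaddwd_model(const int16_t a[8], const int16_t b[8],
                          int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)a[2 * i]     * b[2 * i] +
                 (int32_t)a[2 * i + 1] * b[2 * i + 1];
}
```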
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The only requirement of this code (and essentially the pmulhrsw
instruction) is that the scaled scale fits into an int16_t.
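For reference, a scalar C model of what pmulhrsw computes per 16-bit lane,
together with a range check. Both function names are hypothetical, and the
model assumes an arithmetic right shift of negative values, which holds on
common compilers but is implementation-defined in C.

```c
#include <stdint.h>

/* Scalar model of one lane of pmulhrsw: multiply two signed words,
 * then round and scale the 32-bit product back down by 2^15. */
static int16_t pmulhrsw_model(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * b;
    return (int16_t)(((prod >> 14) + 1) >> 1);
}

/* Hypothetical check mirroring the stated requirement: the scaled
 * scale must be representable as an int16_t. */
static int fits_int16(int32_t v)
{
    return v >= INT16_MIN && v <= INT16_MAX;
}
```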
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This loosens the coupling between CBS and the decoder by no longer using
CodedBitstreamH266Context (containing the most recently parsed PSs & PH)
to retrieve the PSs & PH in the decoder. Doing so is beneficial in two
ways:
1. It improves robustness to the case in which an AVPacket doesn't
contain precisely one PU.
2. It allows the decoder parameter set manager to properly handle the
case in which a single PU (erroneously) contains conflicting
parameter sets.
Signed-off-by: Frank Plowman <post@frankplowman.com>
Check only on arches that need said check.
(Btw: I do not see how h_loop_filter benefits from alignment
at all and why h_loop_filter_unaligned exists.)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The old code operated on bytes and did lots of tricks
due to their limited range; it did not completely succeed,
which is why the old versions were not used when bitexact
output was requested.
In contrast, the new version is much simpler: It operates
on signed 16 bit words whose range is more than sufficient.
This means that these functions don't need a check for bitexactness
(and can be used in FATE).
Old benchmarks (for this, the AV_CODEC_FLAG_BITEXACT check has been
removed from checkasm):
h_loop_filter_c: 29.8 ( 1.00x)
h_loop_filter_mmxext: 32.2 ( 0.93x)
h_loop_filter_unaligned_c: 29.9 ( 1.00x)
h_loop_filter_unaligned_mmxext: 31.4 ( 0.95x)
v_loop_filter_c: 39.3 ( 1.00x)
v_loop_filter_mmxext: 14.2 ( 2.78x)
v_loop_filter_unaligned_c: 38.9 ( 1.00x)
v_loop_filter_unaligned_mmxext: 14.3 ( 2.72x)
New benchmarks:
h_loop_filter_c: 29.2 ( 1.00x)
h_loop_filter_sse2: 28.6 ( 1.02x)
h_loop_filter_unaligned_c: 29.0 ( 1.00x)
h_loop_filter_unaligned_sse2: 26.9 ( 1.08x)
v_loop_filter_c: 38.3 ( 1.00x)
v_loop_filter_sse2: 11.0 ( 3.47x)
v_loop_filter_unaligned_c: 35.5 ( 1.00x)
v_loop_filter_unaligned_sse2: 11.2 ( 3.18x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Add "license" as a long-form command line option alongside the existing
"L" short option for showing license information. This maintains
consistent option naming patterns with other commands that provide both
short and long forms (h/?/help, etc.) and improves command line
usability by providing more descriptive option names.
This SSSE3 function uses MMX registers (of course without emms
at the end) and processes eight bytes of input by unpacking
it into two MMX registers. This is very suboptimal given
that one can just use XMM registers to process eight words.
This commit switches them to using XMM registers.
Old benchmarks:
avg_pixels_tab[1][3]_c: 114.5 ( 1.00x)
avg_pixels_tab[1][3]_ssse3: 43.6 ( 2.62x)
put_pixels_tab[1][3]_c: 83.6 ( 1.00x)
put_pixels_tab[1][3]_ssse3: 34.0 ( 2.46x)
New benchmarks:
avg_pixels_tab[1][3]_c: 115.3 ( 1.00x)
avg_pixels_tab[1][3]_ssse3: 24.6 ( 4.69x)
put_pixels_tab[1][3]_c: 83.8 ( 1.00x)
put_pixels_tab[1][3]_ssse3: 19.7 ( 4.24x)
Reviewed-by: Kieran Kunhya <kieran@kunhya.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Given that one has to deal with 16 byte intermediates it is
unsurprising that SSE2 wins against MMX; the MMX version has
therefore been removed (as well as the now unused inline_asm.h).
The new function is even 32B smaller than the old MMX one.
Old benchmarks:
put_no_rnd_pixels_tab[1][3]_c: 84.1 ( 1.00x)
put_no_rnd_pixels_tab[1][3]_mmx: 41.1 ( 2.05x)
New benchmarks:
put_no_rnd_pixels_tab[1][3]_c: 84.0 ( 1.00x)
put_no_rnd_pixels_tab[1][3]_ssse3: 22.1 ( 3.80x)
Reviewed-by: Kieran Kunhya <kieran@kunhya.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Also remove the now superseded MMX versions (the new functions have the
exact same codesize as the removed ones).
Old benchmarks:
avg_no_rnd_pixels_tab[0][3]_c: 233.7 ( 1.00x)
avg_no_rnd_pixels_tab[0][3]_mmx: 121.5 ( 1.92x)
put_no_rnd_pixels_tab[0][3]_c: 171.4 ( 1.00x)
put_no_rnd_pixels_tab[0][3]_mmx: 82.6 ( 2.08x)
New benchmarks:
avg_no_rnd_pixels_tab[0][3]_c: 233.3 ( 1.00x)
avg_no_rnd_pixels_tab[0][3]_sse2: 45.0 ( 5.18x)
put_no_rnd_pixels_tab[0][3]_c: 172.1 ( 1.00x)
put_no_rnd_pixels_tab[0][3]_sse2: 40.9 ( 4.21x)
Reviewed-by: Kieran Kunhya <kieran@kunhya.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Hint: The parts of this patch in decode_block_progressive()
and decode_block_refinement() rely on the fact that GET_VLC
returns -1 on error, so that it enters the codepaths for
actually coded block coefficients.
Reviewed-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The v_lowpass wrappers (which are instantiated by this macro)
are only used in the put (and not the avg) form for SSSE3
(the avg form is only used for mc02, which doesn't exist
for SSSE3). Clang warns about the unused functions.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
When we parse a MakerNote, we first try to parse it as an IFD and if
that fails, we try to re-parse it as a binary blob. This is because
MakerNote is not well-documented in its nature.
However, if we fail to parse it the first time, we should not av_log
error messages about the parse failure, so instead we log these as
AV_LOG_DEBUG.
Signed-off-by: Leo Izen <leo.izen@gmail.com>
Reported-by: Ramiro Polla <ramiro.polla@gmail.com>
This value is only useful when the DTLS handshake runs in nonblocking
mode. The DTLS handshake only needs to call ffurl_handshake once, since
it forces blocking mode.
Signed-off-by: Jack Lau <jacklau1222@qq.com>
See RFC 5245 Section 4.3
If an agent is a lite implementation, it MUST include an "a=ice-lite"
session-level attribute in its SDP. If an agent is a full
implementation, it MUST NOT include this attribute.
Signed-off-by: Jack Lau <jacklau1222@qq.com>
The UDP buffer may be small enough to fill up temporarily, in which
case sending returns WSAEWOULDBLOCK.
The udp code handles the Windows error code
and converts it to AVERROR(EAGAIN).
This issue can only be reproduced on Windows.
Sleeping for an interval and retrying the send on
EAGAIN would increase latency, and an appropriate
interval is hard to define.
So this patch just reminds the user to increase the buffer
size via -buffer_size to avoid this issue.
Signed-off-by: Jack Lau <jacklau1222@qq.com>