The sequence section starts with a number, which tells how many sequences are present in the section.
If this number is 0, the section automatically ends.
The number 0 can be represented using either the 1-byte or the 2-byte format,
because the 2-byte format fully overlaps the 1-byte format.
However, when 0 is represented using the 2-byte format,
the decoder expected the sequence section to continue,
and went looking for FSE tables, which is incorrect.
Fixed this behavior in both the reference decoder and the educational decoder.
In practice, this behavior never happens,
because the encoder will always select the 1-byte format to represent 0,
since this is more efficient.
Completed the fix with a new golden sample for tests,
a clarification of the specification,
and a decoder errata paragraph.
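For context, the `Number_of_Sequences` byte formats decode like this (a minimal sketch following RFC 8878; names are illustrative, not the actual zstd source):

```c
#include <stddef.h>

/* Sketch of Number_of_Sequences decoding per RFC 8878. */
static size_t decodeNbSeq(const unsigned char* ip, unsigned* nbSeq)
{
    unsigned char const b0 = ip[0];
    if (b0 < 128) { *nbSeq = b0; return 1; }                        /* 1-byte format */
    if (b0 < 255) { *nbSeq = ((b0 - 128) << 8) + ip[1]; return 2; } /* 2-byte format */
    *nbSeq = ip[1] + (ip[2] << 8) + 0x7F00; return 3;               /* 3-byte format */
}
/* 0x80 0x00 is thus a valid 2-byte encoding of 0: the decoder must end the
 * section there, not go looking for FSE tables. */
```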
The split literals buffer patch increased streaming decompression memory
by 64KB (shrunk lit buffer from 128KB to 64KB, and added 128KB). This
patch removes the added 128KB buffer, because it isn't necessary.
The buffer was there because the literals compression code didn't know
the true `blockSizeMax` of the frame, and always placed the split literals so
they ended `128KB - 32` bytes from the beginning of the block. Instead, we can
pass down the true `blockSizeMax` and ensure that the split literals
end at `blockSizeMax - 32` bytes from the beginning of the block. We
already reserve a full `blockSizeMax` bytes in streaming mode, so we
won't be overwriting the extDict window.
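The invariant this establishes, as a sketch (variable names are assumed for illustration; the 32-byte margin is the one described above):

```c
#include <assert.h>
#include <stddef.h>
typedef unsigned char BYTE;

/* Sketch: where the split literals land in the block's reserved output
 * region after this patch. */
static void layoutLiterals(BYTE* op, size_t blockSizeMax, size_t litCapacity)
{
    BYTE* const blockStart = op;                              /* block output start */
    BYTE* const litEnd     = blockStart + blockSizeMax - 32;  /* ends at blockSizeMax - 32 */
    BYTE* const litStart   = litEnd - litCapacity;            /* literals written here */
    (void)litStart;
    assert(litEnd <= blockStart + blockSizeMax);  /* stays inside the reserved
                                                     blockSizeMax bytes, so the
                                                     extDict window is untouched */
}
```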
* add check for valid dest buffer and fuzz on random dest ptr when malloc 0
* add uptrval to linux-kernel
* remove bin files
* get rid of uptrval
* restrict the max pointer value check to platforms where sizeof(size_t) == sizeof(void*) (sketched below)
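A hedged sketch of what such a gated check can look like (the macro test and helper name are assumptions, not the actual zstd code):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reject a dst buffer whose end would wrap the
 * address space. Only meaningful where size_t is as wide as a pointer. */
static int dstWouldWrap(const void* dst, size_t dstCapacity)
{
#if defined(UINTPTR_MAX) && defined(SIZE_MAX) && (UINTPTR_MAX == SIZE_MAX)
    return (uintptr_t)dst > UINTPTR_MAX - dstCapacity;  /* dst + capacity overflows */
#else
    (void)dst; (void)dstCapacity;
    return 0;   /* check disabled on platforms with unusual pointer widths */
#endif
}
```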
Looking at the `__builtin_expect` in `ZSTD_decodeSequence`:
```c
{   size_t offset;
#if defined(__clang__)
    if (LIKELY(ofBits > 1)) {
#else
    if (ofBits > 1) {
#endif
        ZSTD_STATIC_ASSERT(ZSTD_lo_isLongOffset == 1);
```
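(`LIKELY` here is the usual `__builtin_expect` wrapper; a typical shape is sketched below.)

```c
/* Typical shape of the hint macros (sketch): */
#define LIKELY(x)   (__builtin_expect(!!(x), 1))  /* branch expected taken */
#define UNLIKELY(x) (__builtin_expect(!!(x), 0))  /* branch expected not taken */
```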
From profile-annotated assembly, the probability of `ofBits > 1` is about 75%
(101k counts out of 135k). This is much smaller than the recommended
likelihood for using `__builtin_expect`, which is ~99%. As a result, clang moved the
else block further away, which hurts cache locality. Removing this
`__builtin_expect`, along with two others in `ZSTD_decodeSequence`, gave better
performance when PGO is enabled. I suggest removing these branch hints and
relying on PGO, which leverages runtime profiles from actual workloads to calculate
branch probabilities instead.
* Mark all bufferless and block level functions as deprecated
* Update documentation to suggest not using these functions
* Add `_deprecated()` wrappers for functions that we use internally and
call those instead (see the sketch below)
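A minimal sketch of the wrapper pattern described above (names and attribute plumbing are illustrative, not the exact zstd declarations):

```c
#include <stddef.h>

#if defined(__GNUC__)
#  define MY_DEPRECATED(msg) __attribute__((deprecated(msg)))
#else
#  define MY_DEPRECATED(msg)
#endif

/* The public symbol carries the deprecation attribute... */
MY_DEPRECATED("prefer the regular streaming API")
size_t myLib_decompressBlock(void* dst, size_t dstCapacity,
                             const void* src, size_t srcSize);

/* ...while internal call sites use the _deprecated() twin, which has no
 * attribute, so the library builds cleanly with -Wdeprecated-declarations. */
size_t myLib_decompressBlock_deprecated(void* dst, size_t dstCapacity,
                                        const void* src, size_t srcSize);

size_t myLib_decompressBlock(void* dst, size_t dstCapacity,
                             const void* src, size_t srcSize)
{
    return myLib_decompressBlock_deprecated(dst, dstCapacity, src, srcSize);
}
```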
In 32-bit mode, ZSTD_getOffsetInfo() can be called when nbSeq == 0, and
in this case the offset table is uninitialized. The function should just
return 0 for both values, because there are no sequences.
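The shape of the fix, as a sketch (the struct and field names are hypothetical stand-ins for the real ones):

```c
typedef struct { unsigned longOffsetShare; unsigned maxNbAdditionalBits; } OffsetInfo;

/* Sketch: with nbSeq == 0 the offset table was never filled, so return
 * zeros instead of scanning uninitialized memory. */
static OffsetInfo getOffsetInfo(const void* offTable, int nbSeq)
{
    OffsetInfo info = { 0, 0 };
    if (nbSeq == 0) return info;   /* no sequences => both values are 0 */
    /* ... otherwise scan offTable ... */
    (void)offTable;
    return info;
}
```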
Credit to OSS-Fuzz
The 32-bit decoder could corrupt the regenerated data by using regular
offset mode when there were actually long offsets. This is because we
were only considering the window size in the calculation, not the
dictionary size. So a large dictionary could allow longer offsets.
Fix this in two ways:
1. Instead of looking at the window size, look at the total referencable
   bytes in the history buffer, and use this in the comparison instead of
   the window size. Additionally, we were comparing against the wrong
   value; it was too low. Fix that by computing exactly the maximum
   offset for regular sequence decoding.
2. If it is possible that we have long offsets due to (1), then check
the offset code decoding table, and if the decoding table's maximum
number of additional bits is no more than STREAM_ACCUMULATOR_MIN,
then we can't have long offsets.
This gates us to using the long-offsets decoder only when we are very
likely to actually have long offsets.
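Sketch of the resulting gate (variable names are illustrative; `STREAM_ACCUMULATOR_MIN` is zstd's guaranteed number of bits available in the bit-stream accumulator between reloads, 25 on 32-bit builds):

```c
#include <stddef.h>

/* Sketch: decide whether the long-offsets decoder is needed (32-bit mode),
 * following steps (1) and (2) above. */
static int needLongOffsets(size_t totalReferencableBytes, size_t maxRegularOffset,
                           unsigned maxNbAdditionalBits, unsigned streamAccumulatorMin)
{
    if (totalReferencableBytes <= maxRegularOffset) return 0;  /* (1) history fits */
    if (maxNbAdditionalBits <= streamAccumulatorMin) return 0; /* (2) table can't
                                                                  emit long offsets */
    return 1;   /* only now pay for the long-offsets decoder */
}
```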
Note that this bug only affects the decoding of the data: the original
compressed data, if re-read with a patched decoder, will correctly
regenerate the original data. The exception is that the encoder also had
the same issue previously.
This fixes both of the open OSS-Fuzz issues.
Credit to OSS-Fuzz
The previous code had an issue when `bitsConsumed == 32`: it would read 0
bits for the `ofBits` read, which violates the precondition of
`BIT_readBitsFast()`. This can happen when the stream is corrupted.
Fix this issue by always reading the maximum possible number of extra
bits. I've measured neutral decoding performance, likely because this
branch is rarely taken, but it should be faster anyway. And if not, it
only affects 32-bit decoding, so performance isn't as critical.
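Sketched in the context of the 32-bit long-offsets path (zstd-internal names; treat the exact arithmetic as illustrative of the described fix, not a verbatim quote):

```c
/* Sizing extraBits against the guaranteed accumulator minimum, rather than
 * against 32 - bitsConsumed (which can be 0), means the first read below is
 * never 0 bits, satisfying BIT_readBitsFast()'s precondition. This path is
 * only taken when ofBits >= STREAM_ACCUMULATOR_MIN_32. */
U32 const extraBits = ofBits - MIN(ofBits, STREAM_ACCUMULATOR_MIN_32);
offset = ofBase + (BIT_readBitsFast(&seqState->DStream, ofBits - extraBits) << extraBits);
BIT_reloadDStream(&seqState->DStream);
if (extraBits) offset += BIT_readBitsFast(&seqState->DStream, extraBits);
```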
Credit to OSS-Fuzz
Add generic C versions of the fast decoding loops to serve architectures
that don't have an assembly implementation. Also allow selecting the C
decoding loop over the assembly decoding loop through a zstd
decompression parameter `ZSTD_d_disableHuffmanAssembly`.
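Usage looks like any other advanced decompression parameter (sketch; the parameter lives in the experimental, static-linking-only section of the API, and error handling is omitted):

```c
#define ZSTD_STATIC_LINKING_ONLY   /* ZSTD_d_disableHuffmanAssembly is experimental */
#include <zstd.h>

void decompressWithoutAsm(void* dst, size_t dstCapacity,
                          const void* src, size_t srcSize)
{
    ZSTD_DCtx* const dctx = ZSTD_createDCtx();
    ZSTD_DCtx_setParameter(dctx, ZSTD_d_disableHuffmanAssembly, 1); /* force C loops */
    ZSTD_decompressDCtx(dctx, dst, dstCapacity, src, srcSize);
    ZSTD_freeDCtx(dctx);
}
```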
I benchmarked on my Intel i9-9900K and my Macbook Air with an M1 processor.
The benchmark command forces zstd to compress without any matches, using
only literals compression, and measures only Huffman decompression speed:
```
zstd -b1e1 --compress-literals --zstd=tlen=131072 silesia.tar
```
The new fast decoding loops uniformly outperform the previous implementation,
but don't beat the x86-64 assembly. Additionally, the fast C decoding loops suffer
from the same compiler-stability problems that we've seen in the past, which the
assembly version doesn't. So even though clang gets close to the assembly on x86-64,
it still has stability issues.
| Arch | Function | Compiler | Default (MB/s) | Assembly (MB/s) | Fast (MB/s) |
|---------|----------------|--------------|----------------|-----------------|-------------|
| x86-64 | decompress 4X1 | gcc-12.2.0 | 1029.6 | 1308.1 | 1208.1 |
| x86-64 | decompress 4X1 | clang-14.0.6 | 1019.3 | 1305.6 | 1276.3 |
| x86-64 | decompress 4X2 | gcc-12.2.0 | 1348.5 | 1657.0 | 1374.1 |
| x86-64 | decompress 4X2 | clang-14.0.6 | 1027.6 | 1659.9 | 1468.1 |
| aarch64 | decompress 4X1 | clang-12.0.5 | 1081.0 | N/A | 1234.9 |
| aarch64 | decompress 4X2 | clang-12.0.5 | 1270.0 | N/A | 1516.6 |
Reported by @shulib:
the specification for the 4-streams mode
doesn't work when the amount of literals to compress is 5 bytes.
Extending the analysis, it also doesn't work for sizes 1 or 2.
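The failure follows directly from the split rule in the specification (per RFC 8878, streams 1-3 each hold `(Regenerated_Size+3)/4` bytes and stream 4 holds the remainder), as this little check shows:

```c
#include <stdio.h>

/* Stream sizes under the 4-streams split rule of RFC 8878. */
int main(void)
{
    for (int total = 1; total <= 8; total++) {
        int const s123 = (total + 3) / 4;     /* streams 1-3, each */
        int const s4   = total - 3 * s123;    /* stream 4: the remainder */
        printf("total=%d: streams1-3=%d, stream4=%d%s\n",
               total, s123, s4, (s4 < 0) ? "  <-- impossible" : "");
    }
    return 0;
}
/* total = 1, 2 and 5 give stream4 < 0: exactly the sizes that don't work. */
```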
This patch updates the specification and the implementation
to require a minimum of 6 literals to trigger or accept the 4-streams mode.
The impact is expected to be a no-op:
the 4-streams mode is never triggered for such a small quantity of literals anyway,
since it would be wasteful (it costs ~7.3 bytes more than single-stream mode).
An informal lower limit is set at ~256 bytes,
so the technical minimum is very far from this limit.
This is just meant for completeness of the specification.
```
for f in $(find . \( -path ./.git -o -path ./tests/fuzz/corpora \) -prune -o -type f -print);
do
  sed -i 's/Facebook, Inc\./Meta Platforms, Inc. and affiliates./' "$f";
done
```
Fixes #3212.
Long literal and match lengths had an off-by-one error in ZSTD_getSequenceLength.
Fix the off-by-one error, and add a golden compression test that catches the bug.
Also run all the golden tests in the cli-tests framework.
Statistics gathered from the silesia test data files show
the chance of a position beyond highThreshold is very
low (~1.3% at level 8 in most cases, always <2.5%), i.e. in the
"low probability" area. Add the branch hint so the compiler can
generate better pipelined code.
With this change, an uplift of ~1% is observed on mozilla and
xml, and a slight (0.3%~0.8%) but consistent uplift on
the other files, on Arm N1.
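The hint applies to the table-spreading step, roughly of this shape (simplified sketch of the FSE loop; `UNLIKELY` wraps `__builtin_expect`):

```c
/* Simplified FSE table-spreading step: positions rarely land in the
 * high-threshold region, so mark the retry loop as cold. */
position = (position + step) & tableMask;
while (UNLIKELY(position > highThreshold))
    position = (position + step) & tableMask;   /* ~1.3%-2.5% of iterations */
```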
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: Id9ba1d5c767e975290b5c1bf0ecce906544f4ade
`match` is used for the following sequence copy. It is
only updated when extDict is needed, which is a
low-probability case. So it can be prefetched to
reduce cache misses.
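Sketch of the change (using the GCC/Clang builtin; zstd wraps it in a `PREFETCH_L1`-style macro):

```c
/* Prefetch the match source early: it is almost always final (the extDict
 * adjustment is the rare case), so the fetch overlaps later work. */
__builtin_prefetch(match);   /* read prefetch into cache */
```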
Benchmarks on various Arm platforms showed an
uplift of 1%~14% with gcc-11/clang-14.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: If201af4799d2455d74c79f8387404439d7f684ae
* Initial commit to address #3090. Added support to decompress an empty block
* Update zstd_decompress_block.c
Addressed review comments for the case of 'set_basic'
* Update lib/decompress/zstd_decompress_block.c
Co-authored-by: Nick Terrell <nickrterrell@gmail.com>
* Update lib/decompress/zstd_decompress_block.c
Co-authored-by: Nick Terrell <nickrterrell@gmail.com>
ZSTD_seqSymbol is a structure 64 bits wide in total,
so it can be loaded in one operation, with its
fields extracted by simple shifts or bitfield extracts
on aarch64.
GCC doesn't recognize this and generates
unnecessary ldr/ldrb/ldrh operations, causing a
performance drop.
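For reference, the structure packs into exactly 8 bytes, and the idea is a single 64-bit load plus shifts (sketch; assumes little-endian layout, with `MEM_read64` shown as the usual memcpy-based unaligned-load helper):

```c
#include <string.h>
typedef unsigned short     U16;
typedef unsigned char      BYTE;
typedef unsigned int       U32;
typedef unsigned long long U64;

typedef struct {
    U16  nextState;
    BYTE nbAdditionalBits;
    BYTE nbBits;
    U32  baseValue;
} ZSTD_seqSymbol;   /* exactly 8 bytes */

/* Unaligned 64-bit load, essentially zstd's MEM_read64: */
static U64 MEM_read64(const void* p) { U64 v; memcpy(&v, p, sizeof v); return v; }

/* One 64-bit load, then shifts/masks, instead of separate
 * ldrh/ldrb/ldrb/ldr per field (little-endian assumed): */
static U32 loadBaseValue(const ZSTD_seqSymbol* sym)
{
    U64 const v = MEM_read64(sym);
    return (U32)(v >> 32);   /* baseValue lives in the top 32 bits */
}
```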
With this change, an uplift of 2~4% is observed on
silesia and 2.5~6% on cantrbry, at level 8 on Arm N1.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I7748909204cf78a17eb9d4f2333692d53239daa8
This saves some 1.7KB in the rodata section (x86_64, zstd tool),
while the assembly code stays the same except for
the type of a few load/extend instructions.
It should not have negative performance implications.
* When dynamically dispatching to bmi2, add lzcnt and bmi to the
TARGET_ATTRIBUTE.
* Centralize the bmi2 TARGET_ATTRIBUTE definition into
BMI2_TARGET_ATTRIBUTE so we can change it in the future (sketched below).
* Only enable bmi2 when both bmi1 & bmi2 are supported. There shouldn't
be any case where bmi2 is supported but bmi1 isn't; but since we are
using its instructions, we should check bmi1 as well.
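The centralized definition looks roughly like this (x86-only sketch; the exact conditionals in zstd's compiler.h differ):

```c
/* Sketch: one place to define the function attribute used by the
 * dynamically-dispatched bmi2 code paths. */
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#  define BMI2_TARGET_ATTRIBUTE __attribute__((__target__("lzcnt,bmi,bmi2")))
#else
#  define BMI2_TARGET_ATTRIBUTE
#endif

/* A dynamically-dispatched variant opts in with one annotation: */
BMI2_TARGET_ATTRIBUTE
static unsigned ctz64_bmi2(unsigned long long v)
{
    return (unsigned)__builtin_ctzll(v);   /* lowers to tzcnt with bmi enabled */
}
```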
Switch to a macro, `ZSTD_FALLTHROUGH;`, instead of a comment. On supported
compilers this uses an attribute; otherwise it becomes a comment.
This is necessary to be compatible with clang's `-Wimplicit-fallthrough` and
gcc's `-Wimplicit-fallthrough=5`, which don't accept comments as fall-through
markers. Without this, the linux build emits a bunch of warnings.
Also add a test to CI to ensure that we don't regress.
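The macro's typical shape and use (sketch; zstd's real definition has more compiler-version plumbing, and `stepOne`/`stepTwo` are placeholders):

```c
/* Attribute where supported, otherwise a plain comment: */
#if defined(__has_attribute)
#  if __has_attribute(__fallthrough__)
#    define ZSTD_FALLTHROUGH __attribute__((__fallthrough__))
#  endif
#endif
#ifndef ZSTD_FALLTHROUGH
#  define ZSTD_FALLTHROUGH /* fall-through */
#endif

void stepOne(void);
void stepTwo(void);

static void run(int value)
{
    switch (value) {
    case 1:
        stepOne();
        ZSTD_FALLTHROUGH;   /* deliberate fall-through, warning-clean */
    case 2:
        stepTwo();
        break;
    }
}
```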
The new alignment setting is better for gcc-9 and gcc-10,
by about +5%.
Unfortunately, it's worse for essentially all other compilers.
Make the new alignment setting conditional on gcc-9+.
Changed strategy:
now unconditionally prefetch the first 2 cache lines,
instead of the cache lines corresponding to the first and last bytes of the match.
This better corresponds to cpu expectations:
the cpu should auto-prefetch the following cache lines once it detects the sequential nature of the read.
This is globally positive, by +5%,
though exact gains depend on the compiler (from -2% to +15%).
The only negative counter-example is gcc-9.
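Sketched with zstd's `PREFETCH_L1` replaced by the underlying builtin (64-byte cache lines assumed):

```c
/* Old: prefetch the first and last bytes of the match.
 * New: prefetch the first two cache lines; the cpu's stream prefetcher
 * picks up the rest of a sequential read on its own. */
__builtin_prefetch(match);
__builtin_prefetch(match + 64);   /* second cache line, 64B lines assumed */
```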
The prefetch pipeline was increased from 4 to 8 slots.
This change substantially improves decompression speed when there are long-distance offsets.
Example with enwik9 compressed at level 22:
gcc-9 : 947 -> 1039 MB/s
clang-10: 884 -> 946 MB/s
I also checked the "cold dictionary" scenario,
and found a smaller benefit, around ~2%
(measurements are more noisy for this scenario).
* Switch to yearless copyright per FB policy
* Fix up SPDX-License-Identifier lines in `contrib/linux-kernel` sources
* Add zstd copyright/license header to the `contrib/linux-kernel` sources
* Update the `tests/test-license.py` to check for yearless copyright
* Improvements to `tests/test-license.py`
* Check `contrib/linux-kernel` in `tests/test-license.py`
Following #2545,
I noticed that one field in `seq_t` is optional,
and only used in combination with prefetching.
(This may have contributed to the static analyzer's failure to detect correct initialization.)
I then wondered if it would be possible to rewrite the code
so that this optional part is handled directly by the prefetching code
rather than delegated as an option into `ZSTD_decodeSequence()`.
This resulted in this refactoring exercise,
where the prefetching responsibility is better isolated into its own function,
and `ZSTD_decodeSequence()` is streamlined to contain strictly sequence-decoding operations.
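The resulting decode-loop shape, sketched (the `prefetchMatch` helper name is illustrative, not the actual zstd symbol):

```c
/* Sketch: decoding and prefetching are now separate steps in the hot loop. */
for ( ; nbSeq > 0 ; nbSeq--) {
    seq_t const sequence = ZSTD_decodeSequence(&seqState);   /* decoding only */
    prefetchPos = prefetchMatch(prefetchPos, sequence,       /* prefetch step */
                                prefixStart, dictEnd);
    /* ... meanwhile execute a previously-decoded sequence ... */
}
```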
Incidentally, due to better code locality,
it reduces the need to pass information around,
leading to a simplified interface and smaller state structures.