Every 256 bytes the lazy match finders process without finding a match,
they will increase their step size by 1. So for bytes [0, 256) they search
every position, for bytes [256, 512) they search every other position,
and so on. However, they currently still insert every position into
their hash tables. This is different from fast & dfast, which only
insert the positions they search.
This PR changes that, so now after we've searched 2KB without finding
any matches, at which point we'll only be searching one in 9 positions,
we'll stop inserting every position, and only insert the positions we
search. The exact cutoff of 2KB isn't terribly important, I've just
selected a cutoff that is reasonably large, to minimize the impact on
"normal" data.
This PR only adds skipping to greedy, lazy, and lazy2, but does not
touch btlazy2.
| Dataset | Level | Compiler | CSize ∆ | Speed ∆ |
|---------|-------|--------------|---------|---------|
| Random | 5 | clang-14.0.6 | 0.0% | +704% |
| Random | 5 | gcc-12.2.0 | 0.0% | +670% |
| Random | 7 | clang-14.0.6 | 0.0% | +679% |
| Random | 7 | gcc-12.2.0 | 0.0% | +657% |
| Random | 12 | clang-14.0.6 | 0.0% | +1355% |
| Random | 12 | gcc-12.2.0 | 0.0% | +1331% |
| Silesia | 5 | clang-14.0.6 | +0.002% | +0.35% |
| Silesia | 5 | gcc-12.2.0 | +0.002% | +2.45% |
| Silesia | 7 | clang-14.0.6 | +0.001% | -1.40% |
| Silesia | 7 | gcc-12.2.0 | +0.007% | +0.13% |
| Silesia | 12 | clang-14.0.6 | +0.011% | +22.70% |
| Silesia | 12 | gcc-12.2.0 | +0.011% | -6.68% |
| Enwik8 | 5 | clang-14.0.6 | 0.0% | -1.02% |
| Enwik8 | 5 | gcc-12.2.0 | 0.0% | +0.34% |
| Enwik8 | 7 | clang-14.0.6 | 0.0% | -1.22% |
| Enwik8 | 7 | gcc-12.2.0 | 0.0% | -0.72% |
| Enwik8 | 12 | clang-14.0.6 | 0.0% | +26.19% |
| Enwik8 | 12 | gcc-12.2.0 | 0.0% | -5.70% |
The speed difference for clang at level 12 is real, but is probably
caused by some sort of alignment or codegen issues. clang is
significantly slower than gcc before this PR, but gets up to parity with
it.
I also measured the ratio difference for the HC match finder, and it
looks basically the same as the row-based match finder. The speedup on
random data looks similar. And performance is about neutral, without the
big difference at level 12 for either clang or gcc.
* Mark all bufferless and block level functions as deprecated
* Update documentation to suggest not using these functions
* Add `_deprecated()` wrappers for functions that we use internally and
call those instead
* patch-from speed optimization: only load portion of dictionary into normal matchfinders
* test regression for x8 multiplier
* fix off-by-one error for bit shift bound
* restrict patchfrom speed optimization to strategy < ZSTD_btultra
* update results.csv
* update regression test
Part 2 of #3528
Adds hash salt that helps to avoid regressions where consecutive compressions use the same tag space with similar data (running zstd -b5e7 enwik8 -B128K reproduces this regression).
- Adds memory type that is guaranteed to have been initialized at least once in the workspace's lifetime.
- Changes tag space in row hash to be based on init once memory.
#3543 decreases the size of the tagTable by a factor of 2, which requires using the first tag position in each row for head position instead of a tag.
Although position 0 stopped being a valid match, it still persisted in mask calculation resulting in the matches loops possibly terminating before it should have. The fix skips position 0 to solve this problem.
Allocate half the memory for tag space, which means that we get one less slot for an actual tag (needs to be used for next position index).
The results is a slight loss in compression ratio (up to 0.2%) and some regressions/improvements to speed depending on level and sample. In turn, we get to save 16% of the hash table's space (5 bytes per entry instead of 6 bytes per entry).
* Add ZSTD_setFParams() and ZSTD_setParams()
* Modify ZSTD_setCParams() to use ZSTD_setParameter() to avoid a second path setting parameters
* Add unit tests
* Update documentation to suggest using them to replace deprecated functions
Fixes#3396.
- Initializes clevel in `ZSTD_CCtxParams_init`
- Adds CI workflow for msan fuzzers runs without optimization (`-O0`)
- Fixes Makefile to correctly pass on user defined `MOREFLAGS` and `FUZZER_FLAGS` in cases they have been overwritten
The block splitter confuses sequences with literal length == 65536 that use a
repeat offset code. It interprets this as literal length == 0 when deciding the
meaning of the repeat offset, and corrupts the repeat offset history. This is
benign, merely causing suboptimal compression performance, if the confused
history is flushed before the end of the block, e.g. if there are 3 consecutive
non-repeat code sequences after the mistake. It also is only triggered if the
block splitter decided to split the block.
All that to say: This is a rare bug, and requires quite a few conditions to
trigger. However, the good news is that if you have a way to validate that the
decompressed data is correct, e.g. you've enabled zstd's checksum or have a
checksum elsewhere, the original data is very likely recoverable. So if you were
affected by this bug please reach out.
The fix is to remind the block splitter that the literal length is actually 64K.
The test case is a bit tricky to set up, but I've managed to reproduce the issue.
Thanks to @danlark1 for alerting us to the issue and providing us a reproducer!
* Fixes zstd-dll build (https://github.com/facebook/zstd/issues/3492):
- Adds pool.o and threading.o dependency to the zstd-dll target
- Moves custom allocation functions into header to avoid needing to add dependency on common.o
- Adds test target for zstd-dll
- Adds github workflow that buildis zstd-dll
* fix and test MSVC AVX2 build
* treat msbuild warnings as errors
* fix incorrect MSVC 2019 compiler warning
* fix MSVC error D9035: option 'Gm' has been deprecated and will be removed in a future release
In 32-bit mode, ZSTD_getOffsetInfo() can be called when nbSeq == 0, and
in this case the offset table is uninitialized. The function should just
return 0 for both values, because there are no sequences.
Credit to OSS-Fuzz
The 32-bit decoder could corrupt the regenerated data by using regular
offset mode when there were actually long offsets. This is because we
were only considering the window size in the calculation, not the
dictionary size. So a large dictionary could allow longer offsets.
Fix this in two ways:
1. Instead of looking at the window size, look at the total referencable
bytes in the history buffer. Use this in the comparison instead of
the window size. Additionally, we were comparing against the wrong
value, it was too low. Fix that by computing exactly the maximum
offset for regular sequence decoding.
2. If it is possible that we have long offsets due to (1), then check
the offset code decoding table, and if the decoding table's maximum
number of additional bits is no more than STREAM_ACCUMULATOR_MIN,
then we can't have long offsets.
This gates us to be using the long offsets decoder only when we are very
likely to actually have long offsets.
Note that this bug only affects the decoding of the data, and the
original compressed data, if re-read with a patched decoder, will
correctly regenerate the orginal data. Except that the encoder also had
the same issue previously.
This fixes both the open OSS-Fuzz issues.
Credit to OSS-Fuzz
The previous code had an issue when `bitsConsumed == 32` it would read 0
bits for the `ofBits` read, which violates the precondition of
`BIT_readBitsFast()`. This can happen when the stream is corrupted.
Fix thie issue by always reading the maximum possible number of extra
bits. I've measured neutral decoding performance, likely because this
branch is unlikely, but this should be faster anyways. And if not, it is
only 32-bit decoding, so performance isn't as critical.
Credit to OSS-Fuzz
Delete all unused FSE functions, now that we are no longer syncing
to/from upstream.
This avoids confusion about Zstd's stack usage like in Issue #3453.
It also removes dead code, which is always a plus.
The input bounds checks were buggy because they were only breaking from
the inner loop, not the outer loop. The fuzzers found this immediately.
The fix is to use `goto _out` instead of `break`.
This condition can happen on corrupted inputs.
I've benchmarked before and after on x86-64 and there were small changes
in performance, some positive, and some negative, and they end up about
balacing out.
Credit to OSS-Fuzz
Add generic C versions of the fast decoding loops to serve architectures
that don't have an assembly implementation. Also allow selecting the C
decoding loop over the assembly decoding loop through a zstd
decompression parameter `ZSTD_d_disableHuffmanAssembly`.
I benchmarked on my Intel i9-9900K and my Macbook Air with an M1 processor.
The benchmark command forces zstd to compress without any matches, using
only literals compression, and measures only Huffman decompression speed:
```
zstd -b1e1 --compress-literals --zstd=tlen=131072 silesia.tar
```
The new fast decoding loops outperform the previous implementation uniformly,
but don't beat the x86-64 assembly. Additionally, the fast C decoding loops suffer
from the same stability problems that we've seen in the past, where the assembly
version doesn't. So even though clang gets close to assembly on x86-64, it still
has stability issues.
| Arch | Function | Compiler | Default (MB/s) | Assembly (MB/s) | Fast (MB/s) |
|---------|----------------|--------------|----------------|-----------------|-------------|
| x86-64 | decompress 4X1 | gcc-12.2.0 | 1029.6 | 1308.1 | 1208.1 |
| x86-64 | decompress 4X1 | clang-14.0.6 | 1019.3 | 1305.6 | 1276.3 |
| x86-64 | decompress 4X2 | gcc-12.2.0 | 1348.5 | 1657.0 | 1374.1 |
| x86-64 | decompress 4X2 | clang-14.0.6 | 1027.6 | 1659.9 | 1468.1 |
| aarch64 | decompress 4X1 | clang-12.0.5 | 1081.0 | N/A | 1234.9 |
| aarch64 | decompress 4X2 | clang-12.0.5 | 1270.0 | N/A | 1516.6 |