The previous code had an issue when `bitsConsumed == 32` it would read 0
bits for the `ofBits` read, which violates the precondition of
`BIT_readBitsFast()`. This can happen when the stream is corrupted.
Fix thie issue by always reading the maximum possible number of extra
bits. I've measured neutral decoding performance, likely because this
branch is unlikely, but this should be faster anyways. And if not, it is
only 32-bit decoding, so performance isn't as critical.
Credit to OSS-Fuzz
Delete all unused FSE functions, now that we are no longer syncing
to/from upstream.
This avoids confusion about Zstd's stack usage like in Issue #3453.
It also removes dead code, which is always a plus.
The input bounds checks were buggy because they were only breaking from
the inner loop, not the outer loop. The fuzzers found this immediately.
The fix is to use `goto _out` instead of `break`.
This condition can happen on corrupted inputs.
I've benchmarked before and after on x86-64 and there were small changes
in performance, some positive, and some negative, and they end up about
balacing out.
Credit to OSS-Fuzz
Add generic C versions of the fast decoding loops to serve architectures
that don't have an assembly implementation. Also allow selecting the C
decoding loop over the assembly decoding loop through a zstd
decompression parameter `ZSTD_d_disableHuffmanAssembly`.
I benchmarked on my Intel i9-9900K and my Macbook Air with an M1 processor.
The benchmark command forces zstd to compress without any matches, using
only literals compression, and measures only Huffman decompression speed:
```
zstd -b1e1 --compress-literals --zstd=tlen=131072 silesia.tar
```
The new fast decoding loops outperform the previous implementation uniformly,
but don't beat the x86-64 assembly. Additionally, the fast C decoding loops suffer
from the same stability problems that we've seen in the past, where the assembly
version doesn't. So even though clang gets close to assembly on x86-64, it still
has stability issues.
| Arch | Function | Compiler | Default (MB/s) | Assembly (MB/s) | Fast (MB/s) |
|---------|----------------|--------------|----------------|-----------------|-------------|
| x86-64 | decompress 4X1 | gcc-12.2.0 | 1029.6 | 1308.1 | 1208.1 |
| x86-64 | decompress 4X1 | clang-14.0.6 | 1019.3 | 1305.6 | 1276.3 |
| x86-64 | decompress 4X2 | gcc-12.2.0 | 1348.5 | 1657.0 | 1374.1 |
| x86-64 | decompress 4X2 | clang-14.0.6 | 1027.6 | 1659.9 | 1468.1 |
| aarch64 | decompress 4X1 | clang-12.0.5 | 1081.0 | N/A | 1234.9 |
| aarch64 | decompress 4X2 | clang-12.0.5 | 1270.0 | N/A | 1516.6 |
Fix the following documentation bugs:
* Note that `ZSTD_estimate*` functions are not compatible with the external matchfinder API
* Note that `ZSTD_estimateCStreamSize_usingCCtxParams()` is not compatible with `nbWorkers >= 1`
* Remove incorrect warning that the legacy streaming API is incompatible with advanced parameters and/or dictionary compression
* Note that `ZSTD_initCStream()` is incompatible with dictionary compression
* Warn that
* Cap shortCache chainLog to 24
* Cap row match finder hashLog so that rowLog <= 24
* Add unit tests to expose all cases. The row match finder unit tests
are only run in 64-bit mode, because they allocate ~1GB.
Fixes#3336
* deprecate advanced streaming functions
* remove internal usage of the deprecated functions
* nit
* suppress warnings in tests/zstreamtest.c
* purge ZSTD_initDStream_usingDict
* nits
* c90 compat
* zstreamtest.c already disables deprecation warnings!
* fix initDStream() return value
* fix typo
* wasn't able to import private symbol properly, this commit works around that
* new strategy for zbuff
* undo zbuff deprecation warning changes
* move ZSTD_DISABLE_DEPRECATE_WARNINGS from .h to .c
* Add a function and macro ZSTD_decompressionMargin() that computes the
decompression margin for in-place decompression. The function computes
a tight margin that works in all cases, and the macro computes an upper
bound that will only work if flush isn't used.
* When doing in-place decompression, make sure that our output buffer
doesn't overlap with the input buffer. This ensures that we don't
decide to use the portion of the output buffer that overlaps the input
buffer for temporary memory, like for literals.
* Add a simple unit test.
* Add in-place decompression to the simple_round_trip and
stream_round_trip fuzzers. This should help verify that our margin stays
correct.
A minor change in 5434de0 changed a `<=` into a `<`,
and as an indirect consequence allowed compression attempt of literals when there are only 6 literals to compress
(previous limit was effectively 7 literals).
This is not in itself a problem, as the threshold is merely an heuristic,
but it emerged a bug that has always been there, and was just never triggered so far due to the previous limit.
This bug would make the literal compressor believes that all literals are the same symbol,
but for the exact case where nbLiterals==6, plus a pretty wild combination of other limit conditions,
this outcome could be false, resulting in data corruption.
Replaced the blind heuristic by an actual test for all limit cases,
so that even if the threshold is changed again in the future,
the detection of RLE mode will remain reliable.
fix#3328
In situations where the alphabet size is very small,
the evaluation of literal costs from the Optimal Parser is initially incorrect.
It takes some time to converge, during which compression is less efficient.
This is especially important for small files,
because there will not be enough data to converge,
so most of the parsing is selected based on incorrect metrics.
After this patch, the scenario ##3328 gets fixed,
delivering the expected 29 bytes compressed size (smallest known compressed size).