This is a bug in the streaming implementation of the v0.5 decoder.
The bug has always been there, but it requires an uncommon block
configuration which wasn't tested at the time.
v0.5 is deprecated now; the latest version to produce this format is
v0.5.1 from February 2016, and it was superseded in April 2016,
so the format is both short-lived and very old.
Another PR will remove support for this format, but it will still be
possible to explicitly request this support on demand, so it's better
to fix the issue.
Summary:
Completes the transition to disabling legacy support by default across all build systems. This follows up on the previous Makefile and CMake changes to ensure consistent default behavior regardless of the build system used.
Updated build configurations: Meson, tests/Makefile, Visual Studio 2008/2010 projects, and BUCK.
Test Plan:
Verified changes compile correctly via `make lib-release`. Build system configurations have been updated consistently across all platforms.
Revert a branch optimization that was based on an incorrect
assumption in the AArch64 part of ZSTD_decodeSequence. In extreme
cases the existing implementation could lead to data corruption.
Insert an UNLIKELY hint to guide the compilers toward generating more
efficient machine code.
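For context, such a hint is usually a thin wrapper around a compiler builtin. A minimal sketch, assuming a GCC/Clang-style `__builtin_expect` wrapper rather than the project's exact macro:
```c
/* Minimal sketch of an UNLIKELY-style hint, assuming GCC/Clang builtins;
 * not necessarily the project's exact macro definition. */
#if defined(__GNUC__) || defined(__clang__)
#  define UNLIKELY(expr) __builtin_expect(!!(expr), 0)
#else
#  define UNLIKELY(expr) (expr)
#endif

/* Usage: mark a branch as rarely taken so the compiler keeps the common
 * path contiguous in the generated machine code. */
int check_input(int bitsLeft)
{
    if (UNLIKELY(bitsLeft < 0)) return -1;   /* rare error path */
    return 0;                                /* hot path */
}
```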
When pthread_mutex_init() or pthread_cond_init() fails in the debug
implementation (DEBUGLEVEL >= 1), the previously allocated memory was
not freed, causing a memory leak.
This fix ensures that allocated memory is properly freed when pthread
initialization functions fail, preventing resource leaks in error
conditions.
The issue affects:
- ZSTD_pthread_mutex_init() at lib/common/threading.c:146
- ZSTD_pthread_cond_init() at lib/common/threading.c:167
This is particularly important for long-running applications or
scenarios with resource constraints where pthread initialization
might fail due to system limits.
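A minimal sketch of the corrected error path, using hypothetical wrapper names rather than the actual threading.c code, and assuming the debug build heap-allocates the mutex:
```c
#include <pthread.h>
#include <stdlib.h>

typedef struct { pthread_mutex_t* mtx; } debug_mutex_t;  /* illustrative wrapper */

int debug_mutex_init(debug_mutex_t* m)
{
    m->mtx = (pthread_mutex_t*)malloc(sizeof(pthread_mutex_t));
    if (m->mtx == NULL) return 1;             /* allocation failed */
    {   int const ret = pthread_mutex_init(m->mtx, NULL);
        if (ret != 0) {
            free(m->mtx);                     /* previously leaked on failure */
            m->mtx = NULL;
        }
        return ret;
    }
}
```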
Fixes the build on OpenBSD and NetBSD. It is too easy for _GNU_SOURCE
to be defined even on non-Linux systems; this was found via
py-zstandard, which ships an embedded copy of zstd, while Python
defines _GNU_SOURCE.
Also simplify the Linux check: there is no need to check the rest
of the symbol names.
Add a 4-way Neon implementation of the convertSequences_noRepcodes
function. Remove the 'static' keyword from all of its implementations
so that unit tests can be added.
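As a rough illustration of the 4-way style only, and not the actual convertSequences_noRepcodes kernel, a Neon loop can process four 32-bit elements per iteration:
```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative 4-way Neon loop: add a constant bias to four 32-bit
 * values per iteration; a scalar loop handles the tail. */
void add_bias_4way(uint32_t* dst, const uint32_t* src, size_t n, uint32_t bias)
{
    uint32x4_t const vbias = vdupq_n_u32(bias);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32x4_t const v = vld1q_u32(src + i);
        vst1q_u32(dst + i, vaddq_u32(v, vbias));
    }
    for (; i < n; i++) dst[i] = src[i] + bias;
}
```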
Relative performance to Clang-18 using: `./fullbench -b18 -l5 enwik5`
Neoverse-V2 before after
Clang-18: 100.000% 311.703%
Clang-19: 100.191% 311.714%
Clang-20: 100.181% 311.723%
GCC-13: 107.520% 252.309%
GCC-14: 107.652% 253.158%
GCC-15: 107.674% 253.168%
Cortex-A720 before after
Clang-18: 100.000% 204.512%
Clang-19: 102.825% 204.600%
Clang-20: 102.807% 204.558%
GCC-13: 110.668% 203.594%
GCC-14: 110.684% 203.978%
GCC-15: 102.864% 204.299%
Co-authored-by: Thomas Daubney <Thomas.Daubney@arm.com>
Add a faster scalar implementation of ZSTD_get1BlockSummary which
removes the data dependency between the accumulators in the hot loop
to exploit the superscalar potential of recent out-of-order CPUs.
The new algorithm uses SWAR (SIMD Within A Register) to exploit the
capabilities of 64-bit architectures: it packs two 32-bit data
elements into a single 64-bit register, so one 64-bit operation
advances both halves in parallel, while the 32-bit boundaries keep
the halves from overflowing into each other.
Corresponding unit tests are included.
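A minimal SWAR sketch of the packing idea, with illustrative field names rather than the actual ZSTD_get1BlockSummary code:
```c
#include <stddef.h>
#include <stdint.h>

/* Two 32-bit running sums live in the low and high halves of one 64-bit
 * accumulator, so a single 64-bit add advances both sums in parallel.
 * This stays correct as long as each half is guaranteed not to reach 2^32. */
typedef struct { uint32_t sumA; uint32_t sumB; } two_sums_t;  /* hypothetical */

two_sums_t swar_two_sums(const uint32_t* a, const uint32_t* b, size_t n)
{
    uint64_t acc = 0;   /* low 32 bits: sum of a[], high 32 bits: sum of b[] */
    size_t i;
    for (i = 0; i < n; i++) {
        acc += (uint64_t)a[i] | ((uint64_t)b[i] << 32);
    }
    {   two_sums_t r;
        r.sumA = (uint32_t)acc;
        r.sumB = (uint32_t)(acc >> 32);
        return r;
    }
}
```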
Relative performance to GCC-13 using: `./fullbench -b19 -l5 enwik5`
Neoverse-V2 before after
GCC-13: 100.000% 290.527%
GCC-14: 100.000% 291.714%
GCC-15: 99.914% 291.495%
Clang-18: 148.072% 264.524%
Clang-19: 148.075% 264.512%
Clang-20: 148.062% 264.490%
Cortex-A720 before after
GCC-13: 100.000% 235.261%
GCC-14: 101.064% 234.903%
GCC-15: 112.977% 218.547%
Clang-18: 127.135% 180.359%
Clang-19: 127.149% 180.297%
Clang-20: 127.154% 180.260%
Co-authored-by: Thomas Daubney <Thomas.Daubney@arm.com>
ZSTDMT_freeCCtx calls ZSTDMT_releaseAllJobResources. When ZSTDMT_freeCCtx is invoked after a failed initialization, ZSTDMT_releaseAllJobResources can end up dereferencing a NULL pointer.
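A minimal sketch of the defensive pattern, with hypothetical types and names rather than the actual ZSTDMT code:
```c
#include <stdlib.h>

typedef struct { void** jobs; unsigned nbJobs; } mt_ctx_t;  /* illustrative */

/* Bail out early if initialization never allocated the job table, so the
 * cleanup path cannot dereference NULL. */
void release_all_job_resources(mt_ctx_t* ctx)
{
    unsigned u;
    if (ctx == NULL || ctx->jobs == NULL) return;
    for (u = 0; u < ctx->nbJobs; u++) {
        free(ctx->jobs[u]);
        ctx->jobs[u] = NULL;
    }
}

void free_ctx(mt_ctx_t* ctx)
{
    if (ctx == NULL) return;
    release_all_job_resources(ctx);   /* safe even after a failed init */
    free(ctx->jobs);
    free(ctx);
}
```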
LLVM's alias-analysis sometimes fails to see that a static-array member
of a struct cannot alias other members. This patch:
- Reduces array accesses via struct indirection to aid load/store alias
analysis under Clang.
- Converts dynamic array indexing into conditional-move arithmetic
  (sketched after this list), eliminating branches and extra
  loads/stores on out-of-order CPUs.
- Reloads the bitstream only when match-length bits are consumed
(assuming each reload only needs to happen once per match-length
read), improving branch-prediction rates.
- Removes the UNLIKELY() hint, which recent compilers already handle
well without cost.
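A sketch of the conditional-move transformation referenced in the list above, using hypothetical example values rather than the patched decoder code:
```c
#include <stdint.h>

/* Before: the load address depends on a data-dependent index, so each
 * access costs a table load. */
uint32_t select_indexed(const uint32_t bits[2], int useLong)
{
    return bits[useLong & 1];
}

/* After: both candidates are already in registers and one is chosen with
 * mask arithmetic, which compilers typically lower to csel/cmov. */
uint32_t select_cmov(uint32_t shortBits, uint32_t longBits, int useLong)
{
    uint32_t const mask = (uint32_t)0u - (uint32_t)(useLong & 1);  /* 0 or 0xFFFFFFFF */
    return (shortBits & ~mask) | (longBits & mask);
}
```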
Decompression uplifts on a Neoverse V2 system, using Zstd-1.5.8
compiled with "-O3 -march=armv8.2-a+sve2":
Clang-19 Clang-20 Clang-* GCC-14 GCC-15
1#silesia.tar: +11.556% +16.203% +0.240% +2.216% +7.891%
2#silesia.tar: +15.493% +21.140% -0.041% +2.850% +9.926%
3#silesia.tar: +16.887% +22.570% -0.183% +3.056% +10.660%
4#silesia.tar: +17.785% +23.315% -0.262% +3.343% +11.187%
5#silesia.tar: +18.125% +24.175% -0.466% +3.350% +11.228%
6#silesia.tar: +17.607% +23.339% -0.591% +3.175% +10.851%
7#silesia.tar: +17.463% +22.837% -0.486% +3.292% +10.868%
* Requires Clang-21 support from LLVM commit hash
`a53003fe23cb6c871e72d70ff2d3a075a7490da2`
(Clang-21 hasn’t been released as of this writing)
Co-authored-by: David Sherwood <David.Sherwood@arm.com>
Co-authored-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
In the multi-stream, multi-symbol Huffman decoder, GCC generates
suboptimal code, emitting more loads than necessary for HUF_DEltX2
struct member accesses. Forcing it to use 32-bit loads and bit
arithmetic to extract the required fields (UBFX) improves the overall
decode speed.
Also avoid integer type conversions in the symbol decodes, which
leads to better instruction selection for the table lookup accesses.
On AArch64 the decoder no longer runs into register-pressure limits,
so we can simplify the hot path and improve throughput.
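A sketch of the load-then-extract pattern; the struct layout below is assumed for illustration and the field extraction assumes little-endian byte order, so this is not the exact HUF_DEltX2 handling:
```c
#include <stdint.h>
#include <string.h>

typedef struct { uint16_t sequence; uint8_t nbBits; uint8_t length; } elt_t; /* assumed layout */

/* One 32-bit load per table entry instead of separate member loads. */
static uint32_t elt_load_u32(const elt_t* table, uint32_t idx)
{
    uint32_t v;
    memcpy(&v, &table[idx], sizeof(v));
    return v;
}

/* Field extraction via shifts and masks; on AArch64 these typically
 * compile to UBFX bitfield extracts (little-endian assumed). */
static uint32_t elt_sequence(uint32_t v) { return v & 0xFFFFu; }
static uint32_t elt_nbBits(uint32_t v)   { return (v >> 16) & 0xFFu; }
static uint32_t elt_length(uint32_t v)   { return v >> 24; }
```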
Decompression uplifts on a Neoverse V2 system, using Zstd-1.5.8
compiled with "-O3 -march=armv8.2-a+sve2":
Clang-20 Clang-* GCC-13 GCC-14 GCC-15
1#silesia.tar: +0.820% +1.365% +2.480% +1.348% +0.987%
2#silesia.tar: +0.426% +0.784% +1.218% +0.665% +0.554%
3#silesia.tar: +0.112% +0.389% +0.508% +0.188% +0.261%
* Requires Clang-21 support from LLVM commit hash
`a53003fe23cb6c871e72d70ff2d3a075a7490da2`
(Clang-21 hasn’t been released as of this writing)
The existing scalar implementation uses a 4-way pipelined histogram
calculation which is very efficient on out-of-order CPUs. However,
this can be further accelerated using the SVE2 HISTSEG instructions,
which compute a histogram of a 16-byte chunk in a vector register.
On a system with 128-bit vectors (VL128) we need 16 HISTSEG executions
to compute the histogram over the whole symbol space (0..255) for 16
bytes of input. However, we can accumulate only 15 such 16-byte strips
before the 8-bit counters could overflow, so the 8-bit histogram
accumulators must be widened and saved to 16-bit after every 240-byte
chunk of input. To keep all accumulators in registers we would need
32 128-bit registers; longer SVE2 vectors could help here, if such
machines become available.
The maximum input block size in Zstd is 128 KiB, so 16-bit accumulators
would not be enough on their own. However, an LZ pass precedes the
histogram calculation, so it is impossible (my assumption) to overflow
the 16-bit accumulators.
The symbol distribution is also not uniform: lower values are more
common. We therefore use a 3-pass algorithm to prevent stack spilling.
The first pass computes histograms for only 64 symbols (4-way SIMD)
while also computing the maximum symbol value. If symbol values larger
than 64 are present, a second pass computes the next 96 elements of
the histogram, and a final pass calculates the remaining part of the
histogram (256 symbols in total) if needed. This split of the
histogram generation gave the best overall performance.
This implementation is the best performing of a number of different
cache blocking schemes tested.
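A scalar sketch of the chunk-and-widen schedule described above, with a plain byte count standing in for the HISTSEG strips; the real kernel uses SVE2 vector registers and intrinsics:
```c
#include <stddef.h>
#include <stdint.h>

/* Accumulate 8-bit partial counts over at most 240 bytes (15 strips of
 * 16 bytes), then widen them into the 16-bit totals, so the 8-bit
 * counters can never overflow. count16[] must be zero-initialized by
 * the caller. */
void histogram_240(const uint8_t* src, size_t srcSize, uint16_t count16[256])
{
    size_t pos = 0;
    while (pos < srcSize) {
        uint8_t count8[256] = {0};
        size_t const chunkEnd = (pos + 240 < srcSize) ? pos + 240 : srcSize;
        int s;
        for (; pos < chunkEnd; pos++) {
            count8[src[pos]]++;            /* stand-in for the HISTSEG strips */
        }
        for (s = 0; s < 256; s++) {        /* widen 8-bit partials to 16-bit */
            count16[s] = (uint16_t)(count16[s] + count8[s]);
        }
    }
}
```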
Compression uplifts on a Neoverse V2 system, using Zstd-1.5.8
(e26dde3d) as a baseline, compiled with "-O3 -march=armv8.2-a+sve2":
Clang-20 GCC-14
1#silesia.tar: +6.173% +5.987%
2#silesia.tar: +5.200% +5.011%
3#silesia.tar: +4.332% +5.031%
4#silesia.tar: +2.789% +3.064%
5#silesia.tar: +2.028% +1.838%
6#silesia.tar: +1.562% +1.340%
7#silesia.tar: +1.160% +0.959%
After the update to macOS 15.4, the dynamic loader dyld treats a
duplicated LC_RPATH as an error.
The `FLAGS` variable already contains `LDFLAGS`, so using both `FLAGS`
and `LDFLAGS` duplicates all `LDFLAGS`, including the `-Wl,-rpath`
parameters.
The duplicated LC_RPATH causes this kind of error:
```
dyld[29361]: Library not loaded: @loader_path/../lib/libzstd.1.dylib
Referenced from: <7131C877-3CF0-33AC-AA05-257BA4FDD770> /Users/foobar/...
Reason: tried: '/Users/foobar/..../lib/libzstd.1.dylib' (duplicate LC_RPATH '/usr/mypath.../lib')
```
Closes https://github.com/facebook/zstd/issues/4369
Signed-off-by: Etienne Cordonnier <ecordonnier@snap.com>
Otherwise dev-python/zstandard fails to build when compiling with
clang, as reported at https://bugs.gentoo.org/950259.
The root cause is pycparser, which has remained unfixed since it was
reported 2.5 years ago. :(
Signed-off-by: Z. Liu <zhixu.liu@gmail.com>