krak/zstd - zstd - Gitea: Git with a cup of tea

krak/zstd

mirror of https://github.com/facebook/zstd.git synced 2025-07-04 14:48:45 +02:00

Author	SHA1	Message	Date
W. Felix Handte	59c7b2a492	Reorder Definitions in zstd_lazy.c to Group Under Macro Guards	2023-05-22 12:37:03 -04:00
W. Felix Handte	50cdf84f58	Macro-Exclude Block Compressors from Declaration/Definition	2023-05-04 12:18:58 -04:00
Nick Terrell	a3c3a38b9b	[lazy] Skip over incompressible data Every 256 bytes the lazy match finders process without finding a match, they will increase their step size by 1. So for bytes [0, 256) they search every position, for bytes [256, 512) they search every other position, and so on. However, they currently still insert every position into their hash tables. This is different from fast & dfast, which only insert the positions they search. This PR changes that, so now after we've searched 2KB without finding any matches, at which point we'll only be searching one in 9 positions, we'll stop inserting every position, and only insert the positions we search. The exact cutoff of 2KB isn't terribly important, I've just selected a cutoff that is reasonably large, to minimize the impact on "normal" data. This PR only adds skipping to greedy, lazy, and lazy2, but does not touch btlazy2. \| Dataset \| Level \| Compiler \| CSize ∆ \| Speed ∆ \| \|---------\|-------\|--------------\|---------\|---------\| \| Random \| 5 \| clang-14.0.6 \| 0.0% \| +704% \| \| Random \| 5 \| gcc-12.2.0 \| 0.0% \| +670% \| \| Random \| 7 \| clang-14.0.6 \| 0.0% \| +679% \| \| Random \| 7 \| gcc-12.2.0 \| 0.0% \| +657% \| \| Random \| 12 \| clang-14.0.6 \| 0.0% \| +1355% \| \| Random \| 12 \| gcc-12.2.0 \| 0.0% \| +1331% \| \| Silesia \| 5 \| clang-14.0.6 \| +0.002% \| +0.35% \| \| Silesia \| 5 \| gcc-12.2.0 \| +0.002% \| +2.45% \| \| Silesia \| 7 \| clang-14.0.6 \| +0.001% \| -1.40% \| \| Silesia \| 7 \| gcc-12.2.0 \| +0.007% \| +0.13% \| \| Silesia \| 12 \| clang-14.0.6 \| +0.011% \| +22.70% \| \| Silesia \| 12 \| gcc-12.2.0 \| +0.011% \| -6.68% \| \| Enwik8 \| 5 \| clang-14.0.6 \| 0.0% \| -1.02% \| \| Enwik8 \| 5 \| gcc-12.2.0 \| 0.0% \| +0.34% \| \| Enwik8 \| 7 \| clang-14.0.6 \| 0.0% \| -1.22% \| \| Enwik8 \| 7 \| gcc-12.2.0 \| 0.0% \| -0.72% \| \| Enwik8 \| 12 \| clang-14.0.6 \| 0.0% \| +26.19% \| \| Enwik8 \| 12 \| gcc-12.2.0 \| 0.0% \| -5.70% \| The speed difference for clang at level 12 is real, but is probably caused by some sort of alignment or codegen issues. clang is significantly slower than gcc before this PR, but gets up to parity with it. I also measured the ratio difference for the HC match finder, and it looks basically the same as the row-based match finder. The speedup on random data looks similar. And performance is about neutral, without the big difference at level 12 for either clang or gcc.	2023-03-20 11:18:29 -07:00
Yonatan Komornik	91f4c23e63	Add salt into row hash (#3528 part 2) (#3533 ) Part 2 of #3528 Adds hash salt that helps to avoid regressions where consecutive compressions use the same tag space with similar data (running zstd -b5e7 enwik8 -B128K reproduces this regression).	2023-03-13 15:34:13 -07:00
Yonatan Komornik	a91e91d614	[Bugfix] row hash tries to match position 0 (#3548 ) #3543 decreases the size of the tagTable by a factor of 2, which requires using the first tag position in each row for head position instead of a tag. Although position 0 stopped being a valid match, it still persisted in mask calculation resulting in the matches loops possibly terminating before it should have. The fix skips position 0 to solve this problem.	2023-03-13 10:00:03 -07:00
Yonatan Komornik	33e39094e7	Reduce RowHash's tag space size by x2 (#3543 ) Allocate half the memory for tag space, which means that we get one less slot for an actual tag (needs to be used for next position index). The results is a slight loss in compression ratio (up to 0.2%) and some regressions/improvements to speed depending on level and sample. In turn, we get to save 16% of the hash table's space (5 bytes per entry instead of 6 bytes per entry).	2023-03-10 14:15:04 -08:00
Nick Terrell	666944fbe6	Cap hashLog & chainLog to ensure that we only use 32 bits of hash * Cap shortCache chainLog to 24 * Cap row match finder hashLog so that rowLog <= 24 * Add unit tests to expose all cases. The row match finder unit tests are only run in 64-bit mode, because they allocate ~1GB. Fixes #3336	2023-01-20 14:05:26 -08:00
W. Felix Handte	5d693cc38c	Coalesce Almost All Copyright Notices to Standard Phrasing ``` for f in $(find . $ -path ./.git -o -path ./tests/fuzz/corpora -o -path ./tests/regression/data-cache -o -path ./tests/regression/cache $ -prune -o -type f); do sed -i '/Copyright .* $Yann Collet$\\|$Meta Platforms$/ s/Copyright ./Copyright (c) Meta Platforms, Inc. and affiliates./' $f; done git checkout HEAD -- build/VS2010/libzstd-dll/libzstd-dll.rc build/VS2010/zstd/zstd.rc tests/test-license.py contrib/linux-kernel/test/include/linux/xxhash.h examples/streaming_compression_thread_pool.c lib/legacy/zstd_v0.c lib/legacy/zstd_v0*.h nano ./programs/windres/zstd.rc nano ./build/VS2010/zstd/zstd.rc nano ./build/VS2010/libzstd-dll/libzstd-dll.rc ```	2022-12-20 12:52:34 -05:00
W. Felix Handte	8927f985ff	Update Copyright Headers 'Facebook' -> 'Meta Platforms' ``` for f in $(find . $ -path ./.git -o -path ./tests/fuzz/corpora $ -prune -o -type f); do sed -i 's/Facebook, Inc\./Meta Platforms, Inc. and affiliates./' $f; done ```	2022-12-20 12:37:57 -05:00
Carl Woffenden	0547c3d3f8	Random edit to re-run the CI I don't believe the (x64) Mac failure is related to error since it would take the SSE path.	2022-11-19 19:04:08 +01:00
Carl Woffenden	0168914490	Fix for MSVC C4267 error	2022-11-18 11:31:17 +01:00
Nick Terrell	dcc7228de9	[lazy] Use switch instead of indirect function calls. (#3295 ) Use a switch statement to select the search function instead of an indirect function call. This results in a sizable performance win. This PR is a modification of the approach taken in PR #2828. When I measured performance for that commit, it was neutral. However, I now see a performance regression on gcc, but still neutral on clang. I'm measuring on the same platform, but with newer compilers. The new approach beats both the current dev branch and the baseline before PR #2828 was merged. This PR is necessary for Issue #3275, to update zstd in the kernel. Without this PR there is a large regression in greedy - btlazy2 compression speed. With this PR it is about neutral. gcc version: 12.2.0 clang version: 14.0.6 dataset: silesia.tar \| Compiler \| Level \| Dev Speed (MB/s) \| PR Speed (MB/s) \| Delta \| \|----------\|-------\|------------------\|-----------------\|--------\| \| gcc \| 5 \| 102.6 \| 113.7 \| +10.8% \| \| gcc \| 7 \| 66.6 \| 74.8 \| +12.3% \| \| gcc \| 9 \| 51.5 \| 58.9 \| +14.3% \| \| gcc \| 13 \| 14.3 \| 14.3 \| +0.0% \| \| clang \| 5 \| 108.1 \| 114.8 \| +6.2% \| \| clang \| 7 \| 68.5 \| 72.3 \| +5.5% \| \| clang \| 9 \| 53.2 \| 56.2 \| +5.6% \| \| clang \| 13 \| 14.3 \| 14.7 \| +2.8% \| The binary size stays just about the same for clang and gcc, measured using the `size` command: \| Compiler \| Branch \| Text \| Data \| BSS \| Total \| \|----------\|--------\|---------\|------\|-----\|---------\| \| gcc \| dev \| 1127950 \| 3312 \| 280 \| 1131542 \| \| gcc \| PR \| 1123422 \| 2512 \| 280 \| 1126214 \| \| clang \| dev \| 1046254 \| 3256 \| 216 \| 1049726 \| \| clang \| PR \| 1048198 \| 2296 \| 216 \| 1050710 \|	2022-10-21 17:14:02 -07:00
Jun He	ce52acd7dc	compress:check more bytes to reduce ZSTD_count call Comparing 4B instead of comparing 1B in ZSTD_noDict mode, thus it can avoid cases like match in match[ml] but mismatch in match[ml-3]..match[ml-1]. So the call count of ZSTD_count can be reduced. Signed-off-by: Jun He <jun.he@arm.com> Change-Id: I3449ea423d5c8e8344f75341f19a2d1643c703f6	2022-09-18 14:45:41 +08:00
Daniel Kutenin	05f3f415ce	Fix big endian ARM NEON path It is not using the NEON acceleration but the bit grouping was applied	2022-06-13 09:16:24 +01:00
Danila Kutenin	9166c6ae20	Again unused error warning. Fixed	2022-05-23 14:51:47 +00:00
Danila Kutenin	6b561d230f	Move NEON version to a separate function and fix indentation	2022-05-23 14:49:35 +00:00
Danila Kutenin	778f639be9	Disable unused variable warning	2022-05-22 10:50:33 +00:00
Danila Kutenin	e11783b04d	[lazy] Optimize ZSTD_row_getMatchMask for level 8-10 We found that movemask is not used properly or consumes too much CPU. This effort helps to optimize the movemask emulation on ARM. For level 8-9 we saw 3-5% improvements. For level 10 we say 1.5% improvement. The key idea is not to use pure movemasks but to have groups of bits. For rowEntries == 16, 32 we are going to have groups of size 4 and 2 respectively. It means that each bit will be duplicated within the group Then we do AND to have only one bit set in the group so that iteration with lowering bit `a &= (a - 1)` works as well. Also, aarch64 does not have rotate instructions for 16 bit, only for 32 and 64, that's why we see more improvements for level 8-9. vshrn_n_u16 instruction is used to achieve that: vshrn_n_u16 shifts by 4 every u16 and narrows to 8 lower bits. See the picture below. It's also used in [Folly](`c570259008/folly/container/detail/F14Table.h (L446)`). It also uses 2 cycles according to Neoverse-N{1,2} guidelines. 64 bit movemask is already well optimized. We have ongoing experiments but were not able to validate other implementations work reliably faster.	2022-05-22 10:44:24 +00:00
Elliot Gorokhovsky	3620a0a565	Nits	2022-05-12 12:53:15 -04:00
Elliot Gorokhovsky	97aabc496e	Correct and clarify repcode offset history logic	2022-05-09 21:01:38 -04:00
Dominique Pelle	b772f53952	Typo and grammar fixes	2022-03-12 08:58:04 +01:00
Elliot Gorokhovsky	529cd7b821	Fix nits	2022-02-14 14:24:50 -05:00
Elliot Gorokhovsky	db2f4a6532	Move bitwise builtins into bits.h	2022-02-14 11:16:03 -05:00
Yann Collet	03903f5701	fixed minor compression difference in btlazy2 subtle dependency on sumtype numeric representation	2021-12-29 18:51:03 -08:00
Yann Collet	7a18d709ae	updated all names to offBase convention	2021-12-29 17:30:43 -08:00
Yann Collet	8da414231d	found a few more places which were dependent on seqStore offcode sumtype numeric representation	2021-12-28 17:03:24 -08:00
Yann Collet	de9f52e945	regroup all mentions of ZSTD_REP_MOVE within zstd_compress_internal.h	2021-12-28 13:47:57 -08:00
Yann Collet	a34ccad9a6	fixed minor conversion warnings	2021-12-28 13:21:22 -08:00
Yann Collet	92a08eec72	abstracted storeSeq() sumtype numeric representation from zstd_lazy.c	2021-12-28 12:23:39 -08:00
Yann Collet	321583ccf5	fixed minor typecast warnings	2021-12-28 11:38:21 -08:00
Yann Collet	b7630a474b	abstracted usage of offBase sumtype within zstd_lazy.c	2021-12-28 10:59:47 -08:00
Yann Collet	1aed962216	introduce macros STORE_OFFSET() and STORE_REPCODE() this meant to abstract the sumtype representation required to transfert `offcode` to `ZSTD_storeSeq()`. Unfortunately, the sumtype numeric representation is currently a leaky abstraction that has permeated many other parts of the code, especially within `zstd_lazy.c` and also within `zstd_opt.c` and `zstd_compress.c`. While this PR makes a good job a transfering a large nb of call sites to using the new macros, there are still a few sites where this transformation is more complex, or where the numeric representation itself it used "as is". One of the problematics area is the decision to use the numeric format of the sumtype within the match finders of `zstd_lazy`. This commit doesn't change the behavior, it only introduces and employes the macros, but eventually the resulting code remains identical. At target, if the numeric representation of the sumtype can be completely abstracted and no other part of the code depends on it, it will be possible to move it towards something slightly more efficient.	2021-12-23 22:03:30 -08:00
Yann Collet	b77fcac61f	change ZSTD_storeSeq() interface to accept matchLength instead of mlBase. This removes the need to do `- MINMATCH` at every call site. The new interface contract is checked with an `assert()`.	2021-12-23 12:03:33 -08:00
Yann Collet	c8d6067615	fixed incorrect rowlog initialization the variable has only very limited usage, being only used once at the beginning of the block for prefetching only, hence the error had no impact on compression ratio.	2021-12-15 14:37:05 -08:00
Yann Collet	05430b25a8	roll SSE implementation of row_lazy match finder mostly for maintenance convenience. Performance wise, there is very little change, slightly faster for slog 3 & 4, neutral or very slightly negative for slot 5 & 6.	2021-12-14 10:44:23 -08:00
Nick Terrell	91f5891dd0	[CircleCI] Fix short-tests-0 short-tests-0 were silently failing. I think because of the && make clean construction. Switch to ; instead. Also fix all the test failures that were exposed. `make all` is failing on CircleCI because it is missing Docker. Move that test to GitHub actions, and switch the pedantic CircleCI test to `make allmost`.	2021-12-01 17:43:46 -08:00
Yann Collet	aba88fa996	Merge pull request #2829 from facebook/ZSTD_DECODER_INTERNAL_BUFFER minor : change build macro to ZSTD_DECODER_INTERNAL_BUFFER	2021-10-26 10:48:16 -07:00
Yann Collet	518f06b281	added minimum for decoder buffer also : introduced macro BOUNDED()	2021-10-26 08:21:31 -07:00
Nick Terrell	13cad3abb1	[lazy] Speed up compilation times Speed up compilation times by moving each specialized search function into its own function. This is faster because compilers can handle many smaller functions much faster than one gigantic function. The previous approach generated one giant function with `switch` statements and inlining to select the implementation. \| Compiler \| Flags \| Dev Time (s) \| PR Time (s) \| Delta \| \|----------\|-------------------------------------\|--------------\|-------------\|-------\| \| gcc \| -O3 \| 16.5 \| 5.6 \| -66% \| \| gcc \| -O3 -g -fsanitize=address,undefined \| 158.9 \| 38.2 \| -75% \| \| clang \| -O3 \| 36.5 \| 5.5 \| -85% \| \| clang \| -O3 -g -fsanitize=address,undefined \| 27.8 \| 17.5 \| -37% \| This also reduces the binary size because the search functions are no longer inlined into the main body. \| Compiler \| Dev libzstd.a Size (B) \| PR libzstd.a Size (B) \| Delta \| \|----------\|------------------------\|-----------------------\|-------\| \| gcc \| 1563868 \| 1308844 \| -16% \| \| clang \| 1924372 \| 1376020 \| -28% \| Finally, the performance is not impacted significantly by this change, in fact we generally see a small speed boost. \| Compiler \| Level \| Dev Speed (MB/s) \| PR Speed (MB/s) \| Delta \| \|----------\|-------\|------------------\|-----------------\|-------\| \| gcc \| 5 \| 110.6 \| 110.0 \| -0.5% \| \| gcc \| 7 \| 70.4 \| 72.2 \| +2.5% \| \| gcc \| 9 \| 53.2 \| 53.5 \| +0.5% \| \| gcc \| 13 \| 12.7 \| 12.9 \| +1.5% \| \| clang \| 5 \| 113.9 \| 110.4 \| -3.0% \| \| clang \| 7 \| 67.7 \| 70.6 \| +4.2% \| \| clang \| 9 \| 51.9 \| 52.2 \| +0.5% \| \| clang \| 13 \| 12.4 \| 13.3 \| +7.2% \| The compression strategy is unmodified in this PR, so the compressed size should be exactly the same. I may have a follow up PR to slightly improve the compression ratio, if it doesn't cost too much speed.	2021-10-22 13:45:26 -07:00
Yann Collet	9d62957b31	Merge pull request #2800 from animalize/fix_c89 Fix a C89 error in msvc	2021-10-18 14:32:04 -07:00
Nick Terrell	b77d95b053	Merge pull request #2820 from terrelln/nb-compares [binary-tree] Fix underflow of nbCompares	2021-10-11 09:59:57 -07:00
Nick Terrell	c6c482fe07	[binary-tree] Fix underflow of nbCompares Fix underflow of `nbCompares` by switching to an `int` and comparing `nbCompares > 0`. This is a minimal fix, because I don't want to change the logic. These loops seem to be doing `nbCompares + 1` comparisons. The bug was reported by Dan Carpenter and found by Smatch static checker. https://lore.kernel.org/all/20211008063704.GA5370@kili/	2021-10-08 13:22:55 -07:00
Nick Terrell	399644b1f1	[nit] Fix buggy indentation The bug was reported by Dan Carpenter and found by Smatch static checker. https://lore.kernel.org/all/20211008063704.GA5370@kili/	2021-10-08 11:13:11 -07:00
Sen Huang	4b7f45cb04	Pull hot loop into its own function	2021-09-28 08:19:44 -07:00
Sen Huang	ccdcbf4621	Try beginning and end of match	2021-09-28 08:19:44 -07:00
Sen Huang	b8fd6bf30c	Skip most long matches in lazy hash table update	2021-09-28 08:19:39 -07:00
Ma Lin	ae986fcdb8	Use __assume(0) for unreachable code path in msvc msvc will optimize away the condition check.	2021-09-27 19:23:57 +08:00
Ma Lin	e5ba858270	Don't initialize the first parameter of _BitScanForward* functions Like the document example, no need to initialize `r` to 0. https://docs.microsoft.com/en-us/cpp/intrinsics/bitscanforward-bitscanforward64	2021-09-25 16:36:53 +08:00
Ma Lin	cc22042da0	Fix a C89 error in msvc Variables (r) must be declared at the beginning of a code block. This causes msvc2012 to fail to compile 64-bit build.	2021-09-25 16:32:06 +08:00
Yann Collet	fa2a4d77c7	constify MatchState* parameter when possible turns out, it's possible to constify MatchState* parameter in some parts of the binary tree algorithm, making it a pure read-only parameter, as opposed to a mutable state. This is supposed to be helpful for both maintenance and the compiler.	2021-09-23 08:27:44 -07:00

1 2 3 4

184 Commits