1
0
mirror of https://github.com/facebook/zstd.git synced 2025-07-05 15:09:03 +02:00
Commit Graph

50 Commits

Author SHA1 Message Date
0932de54bc [dictBuilder] Fix deadlock in *COVER error case
The COVER and FASTCOVER dictionary builders can deadlock when
dictionary construction errors, likely because there are too few
samples, or too few distinct dmers. The deadlock only occurs when
there are errors.

Fixes #1746.
2019-08-26 18:19:29 -07:00
c55d2e7ba3 Adding shrinking flag for cover and fastcover (#1656)
* Changed ERROR(GENERIC) excluding inits

* editing git ignore

* Edited init functions to size_t returns

* moved declarations earlier

* resolved issues with changes to init functions

* fixed style and an error check

* attempting to add tests that might trigger changes

* added && die to cases expecting to fail

* resolved no die on expected failed command

* fixed accel to be incorrect value

* Adding an automated shrinking option

* Fixing build

* finalizing fixes

* fix?

* Removing added comment in cover.h

* Styling fixes

* Merging with fb dev

* removing megic number for default regression

* Requested revisions

* fixing support for fast cover

* fixing casting errors

* parenthesis fix

* fixing some build nits

* resolving travis ci syntax

* might resolve all compilation issues

* removed unused variable

* remodeling the selectDict function

* fixing bad memory access

* fixing error checks

* fixed erroring check in selectDict

* fixing mixed declarations

* modify mixed declaration

* fixing nits and adding test cases

* Adding requested changes + fixed bug for error checking

* switched double comparison from != to <

* fixed declaration typing

* refactoring COVER_best_finish() and changing shrinkDict

* removing the const's

* modifying ZDICT_optimizeTrainFromBuffer_cover functions

* fixing potential bad memcpy

* fixing the error function for dict size
2019-06-27 16:26:57 -07:00
cb47871a0a [dictBuilder] Be more specific than ERROR(generic) (#1616)
* Specify errors at a finer granularity than `ERROR(generic)`.
* Add tests for bad parameters in the dictionary builder.
2019-05-22 18:57:50 -07:00
a880ca239b Spelling (#1582)
* spelling: accidentally

* spelling: across

* spelling: additionally

* spelling: addresses

* spelling: appropriate

* spelling: assumed

* spelling: available

* spelling: builder

* spelling: capacity

* spelling: compiler

* spelling: compressibility

* spelling: compressor

* spelling: compression

* spelling: contract

* spelling: convenience

* spelling: decompress

* spelling: description

* spelling: deflate

* spelling: deterministically

* spelling: dictionary

* spelling: display

* spelling: eliminate

* spelling: preemptively

* spelling: exclude

* spelling: failure

* spelling: independence

* spelling: independent

* spelling: intentionally

* spelling: matching

* spelling: maximum

* spelling: meaning

* spelling: mishandled

* spelling: memory

* spelling: occasionally

* spelling: occurrence

* spelling: official

* spelling: offsets

* spelling: original

* spelling: output

* spelling: overflow

* spelling: overridden

* spelling: parameter

* spelling: performance

* spelling: probability

* spelling: receives

* spelling: redundant

* spelling: recompression

* spelling: resources

* spelling: sanity

* spelling: segment

* spelling: series

* spelling: specified

* spelling: specify

* spelling: subtracted

* spelling: successful

* spelling: return

* spelling: translation

* spelling: update

* spelling: unrelated

* spelling: useless

* spelling: variables

* spelling: variety

* spelling: verbatim

* spelling: verification

* spelling: visited

* spelling: warming

* spelling: workers

* spelling: with
2019-04-12 11:18:11 -07:00
e649fad7aa [dictBuilder] Fix displayLevel for corpus warning
Pass the displaylevel into the corpus warning, because it is used in
fast cover and cover, so it needs to respect the local level.
2019-04-08 20:00:18 -07:00
d0f5ba36fb [cover] Improvements for small or homogeneous data
* The algorithm would bail as soon as it found one epoch that
  contained no new segments. Change it so it now has to fail
  >= 10 times in a row (10 for fastcover, 10-100 for cover).
* The algorithm uses the `maxDict` size to decide the epoch size.
  When this size is absurdly large, it causes tiny epochs. Lower
  bound the epoch size at 10x the segment size, and warn the user
  that their training set is too small.

Fixes #1554
2019-03-22 14:14:46 -07:00
ededcfca57 fix confusion between unsigned <-> U32
as suggested in #1441.

generally U32 and unsigned are the same thing,
except when they are not ...

case : 32-bit compilation for MIPS (uint32_t == unsigned long)

A vast majority of transformation consists in transforming U32 into unsigned.
In rare cases, it's the other way around (typically for internal code, such as seeds).

Among a few issues this patches solves :
- some parameters were declared with type `unsigned` in *.h,
  but with type `U32` in their implementation *.c .
- some parameters have type unsigned*,
  but the caller user a pointer to U32 instead.

These fixes are useful.

However, the bulk of changes is about %u formating,
which requires unsigned type,
but generally receives U32 values instead,
often just for brevity (U32 is shorter than unsigned).
These changes are generally minor, or even annoying.

As a consequence, the amount of code changed is larger than I would expect for such a patch.

Testing is also a pain :
it requires manually modifying `mem.h`,
in order to lie about `U32`
and force it to be an `unsigned long` typically.
On a 64-bit system, this will break the equivalence unsigned == U32.
Unfortunately, it will also break a few static_assert(), controlling structure sizes.
So it also requires modifying `debug.h` to make `static_assert()` a noop.
And then reverting these changes.

So it's inconvenient, and as a consequence,
this property is currently not checked during CI tests.
Therefore, these problems can emerge again in the future.

I wonder if it is worth ensuring proper distinction of U32 != unsigned in CI tests.
It's another restriction for coding, adding more frustration during merge tests,
since most platforms don't need this distinction (hence contributor will not see it),
and while this can matter in theory, the number of platforms impacted seems minimal.

Thoughts ?
2018-12-21 18:09:41 -08:00
5c5c476338 Prevent deadlock on malloc() failure. 2018-11-08 10:29:31 +01:00
16db0337b1 Always use splitPoint=1.0 for non-optimize cover and fastcover 2018-08-30 14:59:22 -07:00
9d6ed9def3 Merge fastCover into DictBuilder (#1274)
* Minor fix

* Run non-optimize FASTCOVER 5 times in benchmark

* Merge fastCover into dictBuilder

* Fix mixed declaration issue

* Add fastcover to symbol.c

* Add fastCover.c and cover.h to build

* Change fastCover.c to fastcover.c

* Update benchmark to run FASTCOVER in dictBuilder

* Undo spliting fastcover_param into cover_param and f

* Remove convert param functions

* Assign f to parameter

* Add zdict.h to Makefile in lib

* Add cover.h to BUCK

* Cast 1 to U64 before shifting

* Remove trimming of zero freq head and tail in selectSegment and rebenchmark

* Remove f as a separate parameter of tryParam

* Read 8 bytes when d is 6

* Add trimming off zero frequency head and tail

* Use best functions from COVER and remove trimming part(which leads to worse compression ratio after previous bugs were fixed)

* Add finalize= argument to FASTCOVER to specify percentage of training samples passed to ZDICT_finalizeDictionary

* Change nbDmer to always read 8 bytes even when d=6

* Add skip=# argument to allow skipping dmers in computeFrequency in FASTCOVER

* Update comments and benchmarking result

* Change default method of ZDICT_trainFromBuffer to ZDICT_optimizeTrainFromBuffer_fastCover

* Add dictType enum and fix bug about passing zParam when converting to coverParam

* Combine finalize and skip into a single parameter

* Update acceleration parameters and benchmark on 3 sample sets

* Change default splitPoint of FASTCOVER to 0.75 and benchmark first 3 sample sets

* Initialize variables outside of for loop in benchmark.c

* Update benchmark result for hg-manifest

* Remove cover.h from install-includes

* Add explanation of f

* Set default compression level for trainFromBuffer to 3

* Add assertion of fastCoverParams in DiB_trainFromFiles

* Add checkTotalCompressedSize function + some minor fixes

* Add test for multithreading fastCovr

* Initialize segmentFreqs in every FASTCOVER_selectSegment and move mutex_unnlock to end of COVER_best_finish

* Free segmentFreqs

* Initialize segmentFreqs before calling FASTCOVER_buildDictionary instead of in FASTCOVER_selectSegment

* Add FASTCOVER_MEMMULT

* Minor fix

* Update benchmarking result
2018-08-23 12:06:20 -07:00
5021441d86 Change default splitPoint to 100 2018-07-10 11:19:33 -07:00
456f290e31 Change back to splitPoint<=0 2018-07-09 13:53:25 -07:00
7efabb2cf6 Only make 0.0 default splitPoint 2018-07-09 12:26:53 -07:00
015a00af0f Change cover_sum back to 2 parameters and fix splitPoint issues 2018-07-06 14:24:18 -07:00
0bbff01211 Fix testing parameter 2018-07-05 22:40:32 -07:00
a085d1aae1 Allow splitPoint==1.0 (using all samples for both training and testing) 2018-07-05 10:38:45 -07:00
0881184c89 Some edits based on pull request comments 2018-07-03 17:53:27 -07:00
16e75e8804 Update minimal training sample size 2018-07-03 12:07:06 -07:00
348e5f77a9 Add split=# to cli 2018-06-29 17:54:41 -07:00
52fbbbcb6b Explicitly cast double to unsigned 2018-06-29 16:17:20 -07:00
f9d19b83fb Fix variable declaration problem 2018-06-29 15:46:56 -07:00
e061d84016 Another fix to comparator 2018-06-29 15:38:08 -07:00
59797d3328 Fix splitPoint floating point comparison problem 2018-06-29 12:47:03 -07:00
0ef06f2e8a Split samples into train and test sets 2018-06-29 12:33:34 -07:00
7cbb8bbbbf [cover] Small compression ratio improvement
The cover algorithm selects one segment per epoch, and it selects the epoch
size such that `epochs * segmentSize ~= dictSize`. Selecting less epochs
gives the algorithm more candidates to choose from for each segment it
selects, and then it will loop back to the first epoch when it hits the
last one.

The trade off is that now it takes longer to select each segment, since it
has to look at more data before making a choice.

I benchmarked on the following data sets using this command:

```sh
$ZSTD -T0 -3 --train-cover=d=8,steps=256 $DIR -r -o dict && $ZSTD -3 -D dict -rc $DIR | wc -c
```

| Data set     | k (approx) |  Before  |  After   | % difference |
|--------------|------------|----------|----------|--------------|
| GitHub       | ~1000      |   738138 |   746610 |       +1.14% |
| hg-changelog | ~90        |  4295156 |  4285336 |       -0.23% |
| hg-commands  | ~500       |  1095580 |  1079814 |       -1.44% |
| hg-manifest  | ~400       | 16559892 | 16504346 |       -0.34% |

There is some noise in the measurements, since small changes to `k` can
have large differences, which is why I'm using `steps=256`, to try to
minimize the noise. However, the GitHub data set still has some noise.

If I run the GitHub data set on my Mac, which presumably lists directory
entries in a different order, so the dictionary builder sees the files in
a different order, or I use `steps=1024` I see these results.

| Run        | Before | After  | % difference |
|------------|--------|--------|--------------|
| steps=1024 | 738138 | 734470 |       -0.50% |
| MacBook    | 738451 | 737132 |       -0.18% |

Question: Should we expose this as a parameter? I don't think it is
necessary. Someone might want to turn it up to exchange a much longer
dictionary building time in exchange for a slightly better dictionary.
I tested `2`, `4`, and `16`, and `4` got most of the benefit of `16`
with a faster running time.
2018-05-18 16:15:27 -07:00
462aed6811 zstd requires a stable sort.
On OpenBSD qsort() is not guaranteed to be stable, their mergesort() is.
This fixes issue #1088. All the hard work has been done by @terrelln.
2018-04-05 07:59:16 +02:00
752bae4a48 added warning message
when pathological dataset is detected
(note : cover_optimize needs -v to display the warning)
2018-01-11 11:29:28 -08:00
218e9fe0fc added a test case for dictBuilder failure
cyclic data set makes the entropy stage fails
now, onto a fix for #304 ...
2018-01-11 09:42:38 -08:00
6c41adfb28 [libzstd] pthread function prefixed with ZSTD_
* `sed -i 's/pthread_/ZSTD_pthread_/g' lib/{,common,compress,decompress,dictBuilder}/*.[hc]`
* Fix up `lib/common/threading.[hc]`
* `sed -i s/PTHREAD_MUTEX_LOCK/ZSTD_PTHREAD_MUTEX_LOCK/g lib/compress/zstdmt_compress.c`
2017-09-27 11:48:48 -07:00
3128e03be6 updated license header
to clarify dual-license meaning as "or"
2017-09-08 00:09:23 -07:00
20f715d709 Fix displayLevel overflow 2017-08-23 15:56:15 +05:00
bd9c8ca146 Merge pull request #811 from terrelln/segmentSize
[cover] Fix end condition for small dictionary
2017-08-22 14:36:30 -07:00
29c2d9a4d0 [cover] Turn down notification for ZDICT subroutines 2017-08-21 14:28:31 -07:00
98de3f6847 [cover] Add dictionary size to compressed size 2017-08-21 14:23:17 -07:00
9a54a315aa [cover] Convert score to U32 and check for zero 2017-08-21 13:30:07 -07:00
d49eb40c03 [cover] Stop when segmentSize is less than d 2017-08-21 13:10:03 -07:00
f306d400c0 [cover] Fix divide by zero 2017-08-21 11:12:11 -07:00
32fb407c9d updated a bunch of headers
for the new license
2017-08-18 16:52:05 -07:00
b71363b967 check pthread_*_init() success condition 2017-07-19 01:05:40 -07:00
5b7fd7c422 [zdict] Make COVER the default algorithm 2017-06-26 21:09:22 -07:00
f376d47c11 [CLI] Switch dictionary builder on CLI to cover 2017-05-02 11:18:27 -07:00
020b960e13 [cover] Make optimization faster 2017-05-02 11:02:48 -07:00
f2d9ef1dc0 [cover] Optimize case where d <= 8 2017-05-02 11:02:43 -07:00
042ba122ae Change g_displayLevel to int and fix DISPLAYUPDATE flush 2017-03-23 11:21:59 -07:00
976e325b2e Fix COVER_optimizeTrainFromBuffer() resource leaks
Thanks to @nemequ for reporting the resource leaks.
2017-03-02 15:54:39 -08:00
71c5263c00 Attribute cover dictionary code 2017-02-07 11:35:07 -08:00
2fe9126591 Add multithread support to COVER 2017-01-27 11:56:02 -08:00
555e281637 Handle large input size in 32-bit mode correctly 2017-01-09 18:20:06 -08:00
3a1fefcf00 Simplify COVER parameters 2017-01-02 17:51:38 -08:00
96b39f65fa Add COVER dictionary builder 2017-01-02 13:22:51 -08:00