/*
 * Copyright (c) Yann Collet, Facebook, Inc.
 * All rights reserved.
 *
 * This source code is licensed under both the BSD-style license (found in the
 * LICENSE file in the root directory of this source tree) and the GPLv2 (found
 * in the COPYING file in the root directory of this source tree).
 * You may select, at your option, one of the above-listed licenses.
 */

#include "zstd_compress_internal.h"
#include "zstd_lazy.h"
#include "../common/bits.h"  /* ZSTD_countTrailingZeros64 */


/*-*************************************
*  Binary Tree search
***************************************/
/* Note (from the patch "first implementation of delayed update for btlazy2"):
 * This is a pretty nice speed win.
 * The new strategy consists of stacking new candidates as if they formed a hash chain.
 * Then, only if the chain actually needs to be consulted, candidates are batch-updated
 * before the match search itself starts.
 * This is expected to be beneficial when skipping positions,
 * which happens a lot when using a lazy strategy.
 * Baseline performance for btlazy2 on my laptop:
 * 15#calgary.tar :   3265536 ->   955985 (3.416),  7.06 MB/s , 618.0 MB/s
 * 15#enwik7      :  10000000 ->  3067341 (3.260),  4.65 MB/s , 521.2 MB/s
 * 15#silesia.tar : 211984896 -> 58095131 (3.649),  6.20 MB/s , 682.4 MB/s
 * (only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
 * After this patch, keeping all parameters identical,
 * speed increases by a good margin (+30-50%),
 * but compression ratio suffers a bit:
 * 15#calgary.tar :   3265536 ->   958060 (3.408),  9.12 MB/s , 621.1 MB/s
 * 15#enwik7      :  10000000 ->  3078318 (3.249),  6.37 MB/s , 525.1 MB/s
 * 15#silesia.tar : 211984896 -> 58444111 (3.627),  9.89 MB/s , 680.4 MB/s
 * That's because `1<<searchLog` was kept as the maximum number of candidates to update.
 * For a hash chain, this represents the total number of candidates in the chain,
 * while for the binary tree, it represents the maximum depth of searches.
 * Keep in mind that many candidates are never even visited in the btree,
 * since they are filtered out by the binary sort.
 * As a consequence, in the new implementation,
 * the effective depth of the binary tree is substantially shorter.
 * To compensate, it's enough to increase the `searchLog` value.
 * Result after adding just +1 to searchLog (level 15 setting in this patch):
 * 15#calgary.tar :   3265536 ->   956311 (3.415),  8.32 MB/s , 611.4 MB/s
 * 15#enwik7      :  10000000 ->  3067655 (3.260),  5.43 MB/s , 535.5 MB/s
 * 15#silesia.tar : 211984896 -> 58113144 (3.648),  8.35 MB/s , 679.3 MB/s
 * i.e. almost the same compression ratio as before,
 * but with a noticeable speed increase (+20-30%).
 * This modification makes btlazy2 more competitive.
 * A new round of paramgrill will be necessary to determine which levels are impacted
 * and could adopt the new strategy.
 */

static void
ZSTD_updateDUBT(ZSTD_matchState_t* ms,
                const BYTE* ip, const BYTE* iend,
                U32 mls)
{
    const ZSTD_compressionParameters* const cParams = &ms->cParams;
    U32* const hashTable = ms->hashTable;
    U32  const hashLog = cParams->hashLog;
    U32* const bt = ms->chainTable;
    U32  const btLog  = cParams->chainLog - 1;
    U32  const btMask = (1 << btLog) - 1;

    const BYTE* const base = ms->window.base;
    U32 const target = (U32)(ip - base);
    U32 idx = ms->nextToUpdate;

    if (idx != target)
        DEBUGLOG(7, "ZSTD_updateDUBT, from %u to %u (dictLimit:%u)",
                    idx, target, ms->window.dictLimit);
    assert(ip + 8 <= iend);   /* condition for ZSTD_hashPtr */
    (void)iend;

    assert(idx >= ms->window.dictLimit);   /* condition for valid base+idx */
    for ( ; idx < target ; idx++) {
        size_t const h = ZSTD_hashPtr(base + idx, hashLog, mls);   /* assumption : ip + 8 <= iend */
        U32  const matchIndex = hashTable[h];
        U32* const nextCandidatePtr = bt + 2*(idx & btMask);
        U32* const sortMarkPtr = nextCandidatePtr + 1;

        DEBUGLOG(8, "ZSTD_updateDUBT: insert %u", idx);
        hashTable[h] = idx;   /* Update Hash Table */
        *nextCandidatePtr = matchIndex;   /* update BT like a chain */
        *sortMarkPtr = ZSTD_DUBT_UNSORTED_MARK;
    }
    ms->nextToUpdate = target;
}

/** ZSTD_insertDUBT1() :
 *  sort one already inserted but unsorted position
 *  assumption : curr >= btlow == (curr - btmask)
 *  doesn't fail */
static void
ZSTD_insertDUBT1(const ZSTD_matchState_t* ms,
                 U32 curr, const BYTE* inputEnd,
                 U32 nbCompares, U32 btLow,
                 const ZSTD_dictMode_e dictMode)
{
    const ZSTD_compressionParameters* const cParams = &ms->cParams;
    U32* const bt = ms->chainTable;
    U32  const btLog  = cParams->chainLog - 1;
    U32  const btMask = (1 << btLog) - 1;
    size_t commonLengthSmaller = 0, commonLengthLarger = 0;
    const BYTE* const base = ms->window.base;
    const BYTE* const dictBase = ms->window.dictBase;
    const U32 dictLimit = ms->window.dictLimit;
    const BYTE* const ip = (curr >= dictLimit) ? base + curr : dictBase + curr;
    const BYTE* const iend = (curr >= dictLimit) ? inputEnd : dictBase + dictLimit;
    const BYTE* const dictEnd = dictBase + dictLimit;
    const BYTE* const prefixStart = base + dictLimit;
    const BYTE* match;
    U32* smallerPtr = bt + 2*(curr & btMask);
    U32* largerPtr  = smallerPtr + 1;
    U32 matchIndex = *smallerPtr;   /* this candidate is unsorted : next sorted candidate is reached through *smallerPtr, while *largerPtr contains the previous unsorted candidate (which is already saved and can be overwritten) */
    U32 dummy32;   /* to be nullified at the end */
    U32 const windowValid = ms->window.lowLimit;
    U32 const maxDistance = 1U << cParams->windowLog;
    U32 const windowLow = (curr - windowValid > maxDistance) ? curr - maxDistance : windowValid;

    DEBUGLOG(8, "ZSTD_insertDUBT1(%u) (dictLimit=%u, lowLimit=%u)",
                curr, dictLimit, windowLow);
    assert(curr >= btLow);
    assert(ip < iend);   /* condition for ZSTD_count */
2017-09-01 18:28:35 -07:00
2021-10-08 11:45:30 -07:00
for ( ; nbCompares & & ( matchIndex > windowLow ) ; - - nbCompares ) {
2017-09-01 18:28:35 -07:00
U32 * const nextPtr = bt + 2 * ( matchIndex & btMask ) ;
size_t matchLength = MIN ( commonLengthSmaller , commonLengthLarger ) ; /* guaranteed minimum nb of common bytes */
2020-08-11 14:31:09 -07:00
assert ( matchIndex < curr ) ;
2018-11-12 17:05:32 -08:00
/* note : all candidates are now supposed sorted,
* but it ' s still possible to have nextPtr [ 1 ] = = ZSTD_DUBT_UNSORTED_MARK
* when a real index has the same value as ZSTD_DUBT_UNSORTED_MARK */

        if ( (dictMode != ZSTD_extDict)
          || (matchIndex + matchLength >= dictLimit)  /* both in current segment */
          || (curr < dictLimit) /* both in extDict */) {
            const BYTE* const mBase = ( (dictMode != ZSTD_extDict)
                                     || (matchIndex + matchLength >= dictLimit)) ?
                                        base : dictBase;
            assert( (matchIndex + matchLength >= dictLimit)   /* might be wrong if extDict is incorrectly set to 0 */
                 || (curr < dictLimit) );
            match = mBase + matchIndex;
            matchLength += ZSTD_count(ip + matchLength, match + matchLength, iend);
        } else {
            match = dictBase + matchIndex;
            matchLength += ZSTD_count_2segments(ip + matchLength, match + matchLength, iend, dictEnd, prefixStart);
            if (matchIndex + matchLength >= dictLimit)
                match = base + matchIndex;   /* preparation for next read of match[matchLength] */
        }
2017-12-30 15:12:59 +01:00
DEBUGLOG ( 8 , " ZSTD_insertDUBT1: comparing %u with %u : found %u common bytes " ,
2020-08-11 14:31:09 -07:00
curr , matchIndex , ( U32 ) matchLength ) ;
2017-12-30 15:12:59 +01:00
2017-11-15 11:29:24 -08:00
if ( ip + matchLength = = iend ) { /* equal : no way to know if inf or sup */
2017-09-16 23:40:14 -07:00
break ; /* drop , to guarantee consistency ; miss a bit of compression, but other solutions can corrupt tree */
2017-11-15 11:29:24 -08:00
}
2017-09-01 18:28:35 -07:00
        if (match[matchLength] < ip[matchLength]) {  /* necessarily within buffer */
            /* match is smaller than current */
            *smallerPtr = matchIndex;             /* update smaller idx */
            commonLengthSmaller = matchLength;    /* all smaller will now have at least this guaranteed common length */
            if (matchIndex <= btLow) { smallerPtr=&dummy32; break; }   /* beyond tree size, stop searching */
            DEBUGLOG(8, "ZSTD_insertDUBT1: %u (>btLow=%u) is smaller : next => %u",
                        matchIndex, btLow, nextPtr[1]);
            smallerPtr = nextPtr+1;               /* new "candidate" => larger than match, which was smaller than target */
            matchIndex = nextPtr[1];              /* new matchIndex, larger than previous and closer to current */
        } else {
            /* match is larger than current */
            *largerPtr = matchIndex;
            commonLengthLarger = matchLength;
            if (matchIndex <= btLow) { largerPtr=&dummy32; break; }   /* beyond tree size, stop searching */
            DEBUGLOG(8, "ZSTD_insertDUBT1: %u (>btLow=%u) is larger => %u",
                        matchIndex, btLow, nextPtr[0]);
            largerPtr = nextPtr;
            matchIndex = nextPtr[0];
    }   }

    *smallerPtr = *largerPtr = 0;
}
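/* Editorial note (summary of the invariants used by ZSTD_insertDUBT1 above, not upstream commentary):
 * each position k owns a pair of U32 slots in the table bt:
 *   bt[2*(k & btMask)]     : sub-tree of candidates whose sequence sorts as smaller than k's
 *   bt[2*(k & btMask) + 1] : sub-tree of candidates whose sequence sorts as larger than k's
 * The insertion loop descends from the hash head, carrying commonLengthSmaller /
 * commonLengthLarger so each comparison can skip bytes already known to match on
 * that side, and finally terminates both dangling branches with 0 through
 * smallerPtr / largerPtr (redirected to dummy32 once a branch leaves the tree window). */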
static size_t
ZSTD_DUBT_findBetterDictMatch (
        const ZSTD_matchState_t* ms,
        const BYTE* const ip, const BYTE* const iend,
        size_t* offsetPtr,
        size_t bestLength,
        U32 nbCompares,
        U32 const mls,
        const ZSTD_dictMode_e dictMode)
{
    const ZSTD_matchState_t* const dms = ms->dictMatchState;
    const ZSTD_compressionParameters* const dmsCParams = &dms->cParams;
    const U32* const dictHashTable = dms->hashTable;
    U32 const hashLog = dmsCParams->hashLog;
    size_t const h  = ZSTD_hashPtr(ip, hashLog, mls);
    U32 dictMatchIndex = dictHashTable[h];

    const BYTE* const base = ms->window.base;
    const BYTE* const prefixStart = base + ms->window.dictLimit;
    U32 const curr = (U32)(ip-base);
    const BYTE* const dictBase = dms->window.base;
    const BYTE* const dictEnd = dms->window.nextSrc;
    U32 const dictHighLimit = (U32)(dms->window.nextSrc - dms->window.base);
    U32 const dictLowLimit = dms->window.lowLimit;
    U32 const dictIndexDelta = ms->window.lowLimit - dictHighLimit;

    U32* const dictBt = dms->chainTable;
    U32 const btLog  = dmsCParams->chainLog - 1;
    U32 const btMask = (1 << btLog) - 1;
    U32 const btLow = (btMask >= dictHighLimit - dictLowLimit) ? dictLowLimit : dictHighLimit - btMask;

    size_t commonLengthSmaller=0, commonLengthLarger=0;

    (void)dictMode;
    assert(dictMode == ZSTD_dictMatchState);

    for (; nbCompares && (dictMatchIndex > dictLowLimit); --nbCompares) {
        U32* const nextPtr = dictBt + 2*(dictMatchIndex & btMask);
        size_t matchLength = MIN(commonLengthSmaller, commonLengthLarger);   /* guaranteed minimum nb of common bytes */
        const BYTE* match = dictBase + dictMatchIndex;
        matchLength += ZSTD_count_2segments(ip+matchLength, match+matchLength, iend, dictEnd, prefixStart);
        if (dictMatchIndex+matchLength >= dictHighLimit)
            match = base + dictMatchIndex + dictIndexDelta;   /* to prepare for next usage of match[matchLength] */

        if (matchLength > bestLength) {
            U32 matchIndex = dictMatchIndex + dictIndexDelta;
            if ( (4*(int)(matchLength-bestLength)) > (int)(ZSTD_highbit32(curr - matchIndex + 1) - ZSTD_highbit32((U32)offsetPtr[0] + 1)) ) {
                DEBUGLOG(9, "ZSTD_DUBT_findBetterDictMatch(%u) : found better match length %u -> %u and offsetCode %u -> %u (dictMatchIndex %u, matchIndex %u)",
                    curr, (U32)bestLength, (U32)matchLength, (U32)*offsetPtr, OFFSET_TO_OFFBASE(curr - matchIndex), dictMatchIndex, matchIndex);
                bestLength = matchLength, *offsetPtr = OFFSET_TO_OFFBASE(curr - matchIndex);
            }
            if (ip+matchLength == iend) {   /* reached end of input : ip[matchLength] is not valid, no way to know if it's larger or smaller than match */
                break;   /* drop, to guarantee consistency (misses a little bit of compression) */
            }
        }

        if (match[matchLength] < ip[matchLength]) {
            if (dictMatchIndex <= btLow) { break; }   /* beyond tree size, stop the search */
            commonLengthSmaller = matchLength;   /* all smaller will now have at least this guaranteed common length */
            dictMatchIndex = nextPtr[1];         /* new matchIndex larger than previous (closer to current) */
        } else {
            /* match is larger than current */
            if (dictMatchIndex <= btLow) { break; }   /* beyond tree size, stop the search */
            commonLengthLarger = matchLength;
            dictMatchIndex = nextPtr[0];
        }
    }

    if (bestLength >= MINMATCH) {
        U32 const mIndex = curr - (U32)OFFBASE_TO_OFFSET(*offsetPtr); (void)mIndex;
        DEBUGLOG(8, "ZSTD_DUBT_findBetterDictMatch(%u) : found match of length %u and offsetCode %u (pos %u)",
                    curr, (U32)bestLength, (U32)*offsetPtr, mIndex);
    }
    return bestLength;
}
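/* Editorial note (reading of the acceptance test above, not upstream commentary):
 * a dictionary candidate only displaces the current best match when
 *     4 * (matchLength - bestLength) > log2(newOffset) - log2(prevOffset)
 * i.e. the length gain, weighted by 4, must outweigh the estimated extra cost
 * of encoding a larger offset, approximated by the difference of the offsets'
 * bit widths (ZSTD_highbit32). */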
static size_t
ZSTD_DUBT_findBestMatch (ZSTD_matchState_t* ms,
                        const BYTE* const ip, const BYTE* const iend,
                        size_t* offBasePtr,
                        U32 const mls,
                        const ZSTD_dictMode_e dictMode)
{
    const ZSTD_compressionParameters* const cParams = &ms->cParams;
    U32*   const hashTable = ms->hashTable;
    U32    const hashLog = cParams->hashLog;
    size_t const h  = ZSTD_hashPtr(ip, hashLog, mls);
    U32          matchIndex = hashTable[h];

    const BYTE* const base = ms->window.base;
    U32    const curr = (U32)(ip-base);
    U32    const windowLow = ZSTD_getLowestMatchIndex(ms, curr, cParams->windowLog);
    U32*   const bt = ms->chainTable;
    U32    const btLog  = cParams->chainLog - 1;
    U32    const btMask = (1 << btLog) - 1;
    U32    const btLow = (btMask >= curr) ? 0 : curr - btMask;
    U32    const unsortLimit = MAX(btLow, windowLow);
    U32*         nextCandidate = bt + 2*(matchIndex&btMask);
    U32*         unsortedMark = bt + 2*(matchIndex&btMask) + 1;
    U32          nbCompares = 1U << cParams->searchLog;
    U32          nbCandidates = nbCompares;
    U32          previousCandidate = 0;

    DEBUGLOG(7, "ZSTD_DUBT_findBestMatch (%u)", curr);
    assert(ip <= iend-8);   /* required for h calculation */
    assert(dictMode != ZSTD_dedicatedDictSearch);
    /* reach end of unsorted candidates list */
    while ( (matchIndex > unsortLimit)
         && (*unsortedMark == ZSTD_DUBT_UNSORTED_MARK)
         && (nbCandidates > 1) ) {
        DEBUGLOG(8, "ZSTD_DUBT_findBestMatch: candidate %u is unsorted",
                    matchIndex);
        *unsortedMark = previousCandidate;   /* the unsortedMark becomes a reversed chain, to move up back to original position */
previousCandidate = matchIndex ;
matchIndex = * nextCandidate ;
nextCandidate = bt + 2 * ( matchIndex & btMask ) ;
unsortedMark = bt + 2 * ( matchIndex & btMask ) + 1 ;
nbCandidates - - ;
}
2017-09-01 18:28:35 -07:00
2018-11-12 17:05:32 -08:00
    /* nullify last candidate if it's still unsorted
     * simplification, detrimental to compression ratio, beneficial for speed */
    if ( (matchIndex > unsortLimit)
      && (*unsortedMark == ZSTD_DUBT_UNSORTED_MARK) ) {
        DEBUGLOG(7, "ZSTD_DUBT_findBestMatch: nullify last unsorted candidate %u",
                    matchIndex);
        *nextCandidate = *unsortedMark = 0;
    }
    /* batch sort stacked candidates */
    matchIndex = previousCandidate;
    while (matchIndex) {  /* will end on matchIndex == 0 */
        U32* const nextCandidateIdxPtr = bt + 2*(matchIndex&btMask) + 1;
        U32 const nextCandidateIdx = *nextCandidateIdxPtr;
        ZSTD_insertDUBT1(ms, matchIndex, iend,
                         nbCandidates, unsortLimit, dictMode);
        matchIndex = nextCandidateIdx;
        nbCandidates++;
    }
    /* find longest match */
    {   size_t commonLengthSmaller = 0, commonLengthLarger = 0;
        const BYTE* const dictBase = ms->window.dictBase;
        const U32 dictLimit = ms->window.dictLimit;
        const BYTE* const dictEnd = dictBase + dictLimit;
        const BYTE* const prefixStart = base + dictLimit;
        U32* smallerPtr = bt + 2*(curr&btMask);
        U32* largerPtr  = bt + 2*(curr&btMask) + 1;
        U32 matchEndIdx = curr + 8 + 1;
        U32 dummy32;   /* to be nullified at the end */
        size_t bestLength = 0;

        matchIndex  = hashTable[h];
        hashTable[h] = curr;   /* Update Hash Table */
        for (; nbCompares && (matchIndex > windowLow); --nbCompares) {
            U32* const nextPtr = bt + 2*(matchIndex & btMask);
            size_t matchLength = MIN(commonLengthSmaller, commonLengthLarger);   /* guaranteed minimum nb of common bytes */
            const BYTE* match;

            if ((dictMode != ZSTD_extDict) || (matchIndex+matchLength >= dictLimit)) {
                match = base + matchIndex;
                matchLength += ZSTD_count(ip+matchLength, match+matchLength, iend);
            } else {
                match = dictBase + matchIndex;
                matchLength += ZSTD_count_2segments(ip+matchLength, match+matchLength, iend, dictEnd, prefixStart);
                if (matchIndex+matchLength >= dictLimit)
                    match = base + matchIndex;   /* to prepare for next usage of match[matchLength] */
            }
            if (matchLength > bestLength) {
                if (matchLength > matchEndIdx - matchIndex)
                    matchEndIdx = matchIndex + (U32)matchLength;
                if ( (4*(int)(matchLength-bestLength)) > (int)(ZSTD_highbit32(curr - matchIndex + 1) - ZSTD_highbit32((U32)*offBasePtr)) )
                    bestLength = matchLength, *offBasePtr = OFFSET_TO_OFFBASE(curr - matchIndex);
                if (ip+matchLength == iend) {   /* equal : no way to know if inf or sup */
                    if (dictMode == ZSTD_dictMatchState) {
                        nbCompares = 0; /* in addition to avoiding checking any
                                         * further in this loop, make sure we
                                         * skip checking in the dictionary. */
                    }
            break;   /* drop, to guarantee consistency (miss a little bit of compression) */
        }
    }
        if (match[matchLength] < ip[matchLength]) {
            /* match is smaller than current */
            *smallerPtr = matchIndex;             /* update smaller idx */
            commonLengthSmaller = matchLength;    /* all smaller will now have at least this guaranteed common length */
            if (matchIndex <= btLow) { smallerPtr=&dummy32; break; }  /* beyond tree size, stop the search */
            smallerPtr = nextPtr+1;               /* new "smaller" => larger of match */
            matchIndex = nextPtr[1];              /* new matchIndex, larger than previous (closer to current) */
        } else {
            /* match is larger than current */
            *largerPtr = matchIndex;
            commonLengthLarger = matchLength;
            if (matchIndex <= btLow) { largerPtr=&dummy32; break; }  /* beyond tree size, stop the search */
            largerPtr = nextPtr;
            matchIndex = nextPtr[0];
    }   }

    *smallerPtr = *largerPtr = 0;

    assert(nbCompares <= (1U << ZSTD_SEARCHLOG_MAX)); /* Check we haven't underflowed. */
    if (dictMode == ZSTD_dictMatchState && nbCompares) {
        bestLength = ZSTD_DUBT_findBetterDictMatch(
                ms, ip, iend,
                offBasePtr, bestLength, nbCompares,
                mls, dictMode);
    }

    assert(matchEndIdx > curr+8); /* ensure nextToUpdate is increased */
    ms->nextToUpdate = matchEndIdx - 8;   /* skip repetitive patterns */

    if (bestLength >= MINMATCH) {
        U32 const mIndex = curr - (U32)OFFBASE_TO_OFFSET(*offBasePtr); (void)mIndex;
        DEBUGLOG(8, "ZSTD_DUBT_findBestMatch(%u) : found match of length %u and offsetCode %u (pos %u)",
                    curr, (U32)bestLength, (U32)*offBasePtr, mIndex);
    }
        return bestLength;
    }
}


/** ZSTD_BtFindBestMatch() : Tree updater, providing best match */
FORCE_INLINE_TEMPLATE size_t
ZSTD_BtFindBestMatch( ZSTD_matchState_t* ms,
                const BYTE* const ip, const BYTE* const iLimit,
                      size_t* offBasePtr,
                const U32 mls /* template */,
                const ZSTD_dictMode_e dictMode)
{
    DEBUGLOG(7, "ZSTD_BtFindBestMatch");
    if (ip < ms->window.base + ms->nextToUpdate) return 0;   /* skipped area */
    ZSTD_updateDUBT(ms, ip, iLimit, mls);
    return ZSTD_DUBT_findBestMatch(ms, ip, iLimit, offBasePtr, mls, dictMode);
}

/***********************************
* Dedicated dict search
***********************************/
void ZSTD_dedicatedDictSearch_lazy_loadDictionary(ZSTD_matchState_t* ms, const BYTE* const ip)
{
    const BYTE* const base = ms->window.base;
    U32 const target = (U32)(ip - base);
    U32* const hashTable = ms->hashTable;
    U32* const chainTable = ms->chainTable;
    U32 const chainSize = 1 << ms->cParams.chainLog;
    U32 idx = ms->nextToUpdate;
    U32 const minChain = chainSize < target - idx ? target - chainSize : idx;
    U32 const bucketSize = 1 << ZSTD_LAZY_DDSS_BUCKET_LOG;
    U32 const cacheSize = bucketSize - 1;
    U32 const chainAttempts = (1 << ms->cParams.searchLog) - cacheSize;
    U32 const chainLimit = chainAttempts > 255 ? 255 : chainAttempts;

    /* We know the hashtable is oversized by a factor of `bucketSize`.
     * We are going to temporarily pretend `bucketSize == 1`, keeping only a
     * single entry. We will use the rest of the space to construct a temporary
     * chaintable.
     */
    U32 const hashLog = ms->cParams.hashLog - ZSTD_LAZY_DDSS_BUCKET_LOG;
    U32* const tmpHashTable = hashTable;
    U32* const tmpChainTable = hashTable + ((size_t)1 << hashLog);
    U32 const tmpChainSize = (U32)((1 << ZSTD_LAZY_DDSS_BUCKET_LOG) - 1) << hashLog;
    U32 const tmpMinChain = tmpChainSize < target ? target - tmpChainSize : idx;
    U32 hashIdx;

    assert(ms->cParams.chainLog <= 24);
    assert(ms->cParams.hashLog > ms->cParams.chainLog);
    assert(idx != 0);
    assert(tmpMinChain <= minChain);

    /* fill conventional hash table and conventional chain table */
    for ( ; idx < target; idx++) {
        U32 const h = (U32)ZSTD_hashPtr(base + idx, hashLog, ms->cParams.minMatch);
        if (idx >= tmpMinChain) {
            tmpChainTable[idx - tmpMinChain] = hashTable[h];
        }
        tmpHashTable[h] = idx;
    }

    /* sort chains into ddss chain table */
    {
        U32 chainPos = 0;
        for (hashIdx = 0; hashIdx < (1U << hashLog); hashIdx++) {
            U32 count;
            U32 countBeyondMinChain = 0;
            U32 i = tmpHashTable[hashIdx];
            for (count = 0; i >= tmpMinChain && count < cacheSize; count++) {
                /* skip through the chain to the first position that won't be
                 * in the hash cache bucket */
                if (i < minChain) {
                    countBeyondMinChain++;
                }
                i = tmpChainTable[i - tmpMinChain];
            }
            if (count == cacheSize) {
                for (count = 0; count < chainLimit;) {
                    if (i < minChain) {
                        if (!i || ++countBeyondMinChain > cacheSize) {
                            /* only allow pulling `cacheSize` number of entries
                             * into the cache or chainTable beyond `minChain`,
                             * to replace the entries pulled out of the
                             * chainTable into the cache. This lets us reach
                             * back further without increasing the total number
                             * of entries in the chainTable, guaranteeing the
                             * DDSS chain table will fit into the space
                             * allocated for the regular one. */
                            break;
                        }
                    }
                    chainTable[chainPos++] = i;
                    count++;
                    if (i < tmpMinChain) {
                        break;
                    }
                    i = tmpChainTable[i - tmpMinChain];
                }
            } else {
                count = 0;
            }
            if (count) {
                tmpHashTable[hashIdx] = ((chainPos - count) << 8) + count;
            } else {
                tmpHashTable[hashIdx] = 0;
            }
        }
        assert(chainPos <= chainSize); /* I believe this is guaranteed... */
    }

    /* move chain pointers into the last entry of each hash bucket */
    for (hashIdx = (1 << hashLog); hashIdx; ) {
        U32 const bucketIdx = --hashIdx << ZSTD_LAZY_DDSS_BUCKET_LOG;
        U32 const chainPackedPointer = tmpHashTable[hashIdx];
        U32 i;
        for (i = 0; i < cacheSize; i++) {
            hashTable[bucketIdx + i] = 0;
        }
        hashTable[bucketIdx + bucketSize - 1] = chainPackedPointer;
    }

    /* fill the buckets of the hash table */
    for (idx = ms->nextToUpdate; idx < target; idx++) {
        U32 const h = (U32)ZSTD_hashPtr(base + idx, hashLog, ms->cParams.minMatch)
                   << ZSTD_LAZY_DDSS_BUCKET_LOG;
        U32 i;
        /* Shift hash cache down 1. */
        for (i = cacheSize - 1; i; i--)
            hashTable[h + i] = hashTable[h + i - 1];
        hashTable[h] = idx;
    }

    ms->nextToUpdate = target;
}
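The last slot of each hash bucket stores a "packed pointer" built by the loader above as `((chainPos - count) << 8) + count`: the chain's start index in the upper bits and its length in the low 8 bits, which is why `chainLimit` is clamped to 255. A minimal pack/unpack sketch (the `dds_*` helper names are illustrative, not zstd's):

```c
#include <assert.h>
#include <stdint.h>

/* Pack a DDSS chain descriptor: start index in the upper 24 bits,
 * chain length in the low 8 bits (length therefore capped at 255). */
static uint32_t dds_packChainPointer(uint32_t chainStart, uint32_t chainLength) {
    assert(chainLength <= 0xFF);  /* length must fit in one byte */
    return (chainStart << 8) + chainLength;
}

static uint32_t dds_chainStart(uint32_t packed)  { return packed >> 8; }
static uint32_t dds_chainLength(uint32_t packed) { return packed & 0xFF; }
```

The search side decodes exactly this layout with `chainPackedPointer >> 8` and `chainPackedPointer & 0xFF`.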

/* Returns the longest match length found in the dedicated dict search structure.
 * If none are longer than the argument ml, then ml will be returned.
 */
FORCE_INLINE_TEMPLATE
size_t ZSTD_dedicatedDictSearch_lazy_search(size_t* offsetPtr, size_t ml, U32 nbAttempts,
                                            const ZSTD_matchState_t* const dms,
                                            const BYTE* const ip, const BYTE* const iLimit,
                                            const BYTE* const prefixStart, const U32 curr,
                                            const U32 dictLimit, const size_t ddsIdx) {
    const U32 ddsLowestIndex  = dms->window.dictLimit;
    const BYTE* const ddsBase = dms->window.base;
    const BYTE* const ddsEnd  = dms->window.nextSrc;
    const U32 ddsSize         = (U32)(ddsEnd - ddsBase);
    const U32 ddsIndexDelta   = dictLimit - ddsSize;
    const U32 bucketSize      = (1 << ZSTD_LAZY_DDSS_BUCKET_LOG);
    const U32 bucketLimit     = nbAttempts < bucketSize - 1 ? nbAttempts : bucketSize - 1;
    U32 ddsAttempt;
    U32 matchIndex;

    for (ddsAttempt = 0; ddsAttempt < bucketSize - 1; ddsAttempt++) {
        PREFETCH_L1(ddsBase + dms->hashTable[ddsIdx + ddsAttempt]);
    }

    {
        U32 const chainPackedPointer = dms->hashTable[ddsIdx + bucketSize - 1];
        U32 const chainIndex = chainPackedPointer >> 8;

        PREFETCH_L1(&dms->chainTable[chainIndex]);
    }

    for (ddsAttempt = 0; ddsAttempt < bucketLimit; ddsAttempt++) {
        size_t currentMl = 0;
        const BYTE* match;
        matchIndex = dms->hashTable[ddsIdx + ddsAttempt];
        match = ddsBase + matchIndex;

        if (!matchIndex) {
            return ml;
        }

        /* guaranteed by table construction */
        (void)ddsLowestIndex;
        assert(matchIndex >= ddsLowestIndex);
        assert(match+4 <= ddsEnd);
        if (MEM_read32(match) == MEM_read32(ip)) {
            /* assumption : matchIndex <= dictLimit-4 (by table construction) */
            currentMl = ZSTD_count_2segments(ip+4, match+4, iLimit, ddsEnd, prefixStart) + 4;
        }

        /* save best solution */
        if (currentMl > ml) {
            ml = currentMl;
            *offsetPtr = OFFSET_TO_OFFBASE(curr - (matchIndex + ddsIndexDelta));
            if (ip+currentMl == iLimit) {
                /* best possible, avoids read overflow on next attempt */
                return ml;
            }
        }
    }

    {
        U32 const chainPackedPointer = dms->hashTable[ddsIdx + bucketSize - 1];
        U32 chainIndex = chainPackedPointer >> 8;
        U32 const chainLength = chainPackedPointer & 0xFF;
        U32 const chainAttempts = nbAttempts - ddsAttempt;
        U32 const chainLimit = chainAttempts > chainLength ? chainLength : chainAttempts;
        U32 chainAttempt;

        for (chainAttempt = 0 ; chainAttempt < chainLimit; chainAttempt++) {
            PREFETCH_L1(ddsBase + dms->chainTable[chainIndex + chainAttempt]);
        }

        for (chainAttempt = 0 ; chainAttempt < chainLimit; chainAttempt++, chainIndex++) {
            size_t currentMl = 0;
            const BYTE* match;
            matchIndex = dms->chainTable[chainIndex];
            match = ddsBase + matchIndex;

            /* guaranteed by table construction */
            assert(matchIndex >= ddsLowestIndex);
            assert(match+4 <= ddsEnd);
            if (MEM_read32(match) == MEM_read32(ip)) {
                /* assumption : matchIndex <= dictLimit-4 (by table construction) */
                currentMl = ZSTD_count_2segments(ip+4, match+4, iLimit, ddsEnd, prefixStart) + 4;
            }

            /* save best solution */
            if (currentMl > ml) {
                ml = currentMl;
                *offsetPtr = OFFSET_TO_OFFBASE(curr - (matchIndex + ddsIndexDelta));
                if (ip+currentMl == iLimit) break; /* best possible, avoids read overflow on next attempt */
            }
        }
    }
    return ml;
}

/* *********************************
*  Hash Chain
***********************************/
#define NEXT_IN_CHAIN(d, mask)   chainTable[(d) & (mask)]

/* Update chains up to ip (excluded)
   Assumption : always within prefix (i.e. not within extDict) */
FORCE_INLINE_TEMPLATE U32 ZSTD_insertAndFindFirstIndex_internal(
                        ZSTD_matchState_t* ms,
                        const ZSTD_compressionParameters* const cParams,
                        const BYTE* ip, U32 const mls)
{
    U32* const hashTable  = ms->hashTable;
    const U32 hashLog = cParams->hashLog;
    U32* const chainTable = ms->chainTable;
    const U32 chainMask = (1 << cParams->chainLog) - 1;
    const BYTE* const base = ms->window.base;
    const U32 target = (U32)(ip - base);
    U32 idx = ms->nextToUpdate;

    while(idx < target) { /* catch up */
        size_t const h = ZSTD_hashPtr(base+idx, hashLog, mls);
        NEXT_IN_CHAIN(idx, chainMask) = hashTable[h];
        hashTable[h] = idx;
        idx++;
    }

    ms->nextToUpdate = target;
    return hashTable[ZSTD_hashPtr(ip, hashLog, mls)];
}

U32 ZSTD_insertAndFindFirstIndex(ZSTD_matchState_t* ms, const BYTE* ip) {
    const ZSTD_compressionParameters* const cParams = &ms->cParams;
    return ZSTD_insertAndFindFirstIndex_internal(ms, cParams, ip, ms->cParams.minMatch);
}
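The hash-chain layout above stores, for each inserted position, the previous position with the same hash in `chainTable[pos & mask]`, so following the chain walks candidates from most to least recent. A miniature standalone version of that update step (the `toy_*`/`TOY_*` names are illustrative, not zstd's):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_LOG 4
#define TOY_CHAIN_LOG 4
#define TOY_NEXT_IN_CHAIN(d, mask) toyChainTable[(d) & (mask)]

static uint32_t toyHashTable[1 << TOY_HASH_LOG];
static uint32_t toyChainTable[1 << TOY_CHAIN_LOG];

/* trivial stand-in for ZSTD_hashPtr: hash a single byte */
static uint32_t toy_hash(uint8_t b) { return b & ((1 << TOY_HASH_LOG) - 1); }

/* insert position `pos` holding byte `b`; returns the previous head of its chain
 * (0 meaning "no earlier candidate"), like ZSTD_insertAndFindFirstIndex() */
static uint32_t toy_insert(uint32_t pos, uint8_t b) {
    uint32_t const mask = (1 << TOY_CHAIN_LOG) - 1;
    uint32_t const h = toy_hash(b);
    uint32_t const prev = toyHashTable[h];
    TOY_NEXT_IN_CHAIN(pos, mask) = prev;  /* link new head to old head */
    toyHashTable[h] = pos;
    return prev;
}
```

Note that, as in the real code, the chain table is indexed modulo its size, so entries older than `chainSize` positions are silently overwritten; the search loop's `minChain` bound is what keeps it from following such recycled links.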

/* inlining is important to hardwire a hot branch (template emulation) */
FORCE_INLINE_TEMPLATE
size_t ZSTD_HcFindBestMatch(
                        ZSTD_matchState_t* ms,
                        const BYTE* const ip, const BYTE* const iLimit,
                        size_t* offsetPtr,
                        const U32 mls, const ZSTD_dictMode_e dictMode)
{
    const ZSTD_compressionParameters* const cParams = &ms->cParams;
    U32* const chainTable = ms->chainTable;
    const U32 chainSize = (1 << cParams->chainLog);
    const U32 chainMask = chainSize-1;
    const BYTE* const base = ms->window.base;
    const BYTE* const dictBase = ms->window.dictBase;
    const U32 dictLimit = ms->window.dictLimit;
    const BYTE* const prefixStart = base + dictLimit;
    const BYTE* const dictEnd = dictBase + dictLimit;
    const U32 curr = (U32)(ip-base);
    const U32 maxDistance = 1U << cParams->windowLog;
    const U32 lowestValid = ms->window.lowLimit;
    const U32 withinMaxDistance = (curr - lowestValid > maxDistance) ? curr - maxDistance : lowestValid;
    const U32 isDictionary = (ms->loadedDictEnd != 0);
    const U32 lowLimit = isDictionary ? lowestValid : withinMaxDistance;
    const U32 minChain = curr > chainSize ? curr - chainSize : 0;
    U32 nbAttempts = 1U << cParams->searchLog;
    size_t ml=4-1;

    const ZSTD_matchState_t* const dms = ms->dictMatchState;
    const U32 ddsHashLog = dictMode == ZSTD_dedicatedDictSearch
                         ? dms->cParams.hashLog - ZSTD_LAZY_DDSS_BUCKET_LOG : 0;
    const size_t ddsIdx = dictMode == ZSTD_dedicatedDictSearch
                        ? ZSTD_hashPtr(ip, ddsHashLog, mls) << ZSTD_LAZY_DDSS_BUCKET_LOG : 0;

    U32 matchIndex;

    if (dictMode == ZSTD_dedicatedDictSearch) {
        const U32* entry = &dms->hashTable[ddsIdx];
        PREFETCH_L1(entry);
    }

    /* HC4 match finder */
    matchIndex = ZSTD_insertAndFindFirstIndex_internal(ms, cParams, ip, mls);

    for ( ; (matchIndex>=lowLimit) & (nbAttempts>0) ; nbAttempts--) {
        size_t currentMl=0;
        if ((dictMode != ZSTD_extDict) || matchIndex >= dictLimit) {
            const BYTE* const match = base + matchIndex;
            assert(matchIndex >= dictLimit);   /* ensures this is true if dictMode != ZSTD_extDict */
            if (match[ml] == ip[ml])   /* potentially better */
                currentMl = ZSTD_count(ip, match, iLimit);
        } else {
            const BYTE* const match = dictBase + matchIndex;
            assert(match+4 <= dictEnd);
            if (MEM_read32(match) == MEM_read32(ip))   /* assumption : matchIndex <= dictLimit-4 (by table construction) */
                currentMl = ZSTD_count_2segments(ip+4, match+4, iLimit, dictEnd, prefixStart) + 4;
        }

        /* save best solution */
        if (currentMl > ml) {
            ml = currentMl;
            *offsetPtr = OFFSET_TO_OFFBASE(curr - matchIndex);
            if (ip+currentMl == iLimit) break; /* best possible, avoids read overflow on next attempt */
        }

        if (matchIndex <= minChain) break;
        matchIndex = NEXT_IN_CHAIN(matchIndex, chainMask);
    }

    assert(nbAttempts <= (1U << ZSTD_SEARCHLOG_MAX)); /* Check we haven't underflowed. */
    if (dictMode == ZSTD_dedicatedDictSearch) {
        ml = ZSTD_dedicatedDictSearch_lazy_search(offsetPtr, ml, nbAttempts, dms,
                                                  ip, iLimit, prefixStart, curr, dictLimit, ddsIdx);
    } else if (dictMode == ZSTD_dictMatchState) {
        const U32* const dmsChainTable = dms->chainTable;
        const U32 dmsChainSize         = (1 << dms->cParams.chainLog);
        const U32 dmsChainMask         = dmsChainSize - 1;
        const U32 dmsLowestIndex       = dms->window.dictLimit;
        const BYTE* const dmsBase      = dms->window.base;
        const BYTE* const dmsEnd       = dms->window.nextSrc;
        const U32 dmsSize              = (U32)(dmsEnd - dmsBase);
        const U32 dmsIndexDelta        = dictLimit - dmsSize;
        const U32 dmsMinChain = dmsSize > dmsChainSize ? dmsSize - dmsChainSize : 0;

        matchIndex = dms->hashTable[ZSTD_hashPtr(ip, dms->cParams.hashLog, mls)];

        for ( ; (matchIndex>=dmsLowestIndex) & (nbAttempts>0) ; nbAttempts--) {
            size_t currentMl=0;
            const BYTE* const match = dmsBase + matchIndex;
            assert(match+4 <= dmsEnd);
            if (MEM_read32(match) == MEM_read32(ip))   /* assumption : matchIndex <= dictLimit-4 (by table construction) */
                currentMl = ZSTD_count_2segments(ip+4, match+4, iLimit, dmsEnd, prefixStart) + 4;

            /* save best solution */
            if (currentMl > ml) {
                ml = currentMl;
                assert(curr > matchIndex + dmsIndexDelta);
                *offsetPtr = OFFSET_TO_OFFBASE(curr - (matchIndex + dmsIndexDelta));
                if (ip+currentMl == iLimit) break; /* best possible, avoids read overflow on next attempt */
            }

            if (matchIndex <= dmsMinChain) break;

            matchIndex = dmsChainTable[matchIndex & dmsChainMask];
        }
    }

    return ml;
}

/* *********************************
* (SIMD) Row-based matchfinder
***********************************/
/* Constants for row-based hash */
#define ZSTD_ROW_HASH_TAG_OFFSET 16     /* byte offset of hashes in the match state's tagTable from the beginning of a row */
#define ZSTD_ROW_HASH_TAG_BITS 8        /* nb bits to use for the tag */
#define ZSTD_ROW_HASH_TAG_MASK ((1u << ZSTD_ROW_HASH_TAG_BITS) - 1)
#define ZSTD_ROW_HASH_MAX_ENTRIES 64    /* absolute maximum number of entries per row, for all configurations */

#define ZSTD_ROW_HASH_CACHE_MASK (ZSTD_ROW_HASH_CACHE_SIZE - 1)

typedef U64 ZSTD_VecMask;   /* Clarifies when we are interacting with a U64 representing a mask of matches */

/* ZSTD_VecMask_next():
 * Starting from the LSB, returns the idx of the next non-zero bit.
 * Basically counting the nb of trailing zeroes.
 */
MEM_STATIC U32 ZSTD_VecMask_next(ZSTD_VecMask val) {
    return ZSTD_countTrailingZeros64(val);
}
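`ZSTD_VecMask_next()` defers to `ZSTD_countTrailingZeros64`, which maps to a compiler intrinsic where available. A portable bit-by-bit equivalent makes the semantics concrete (the `toy_ctz64` name is illustrative, not zstd's):

```c
#include <assert.h>
#include <stdint.h>

/* Portable count-trailing-zeros: index of the lowest set bit.
 * Like the intrinsic, the result is meaningless for val == 0,
 * so we assert against it. */
static uint32_t toy_ctz64(uint64_t val) {
    uint32_t n = 0;
    assert(val != 0);
    while ((val & 1) == 0) { val >>= 1; n++; }
    return n;
}
```

In the row matchfinder, each set bit of the `ZSTD_VecMask` marks a tag-matched entry, so repeatedly taking the lowest set bit and clearing it enumerates candidates in row order.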

/* ZSTD_rotateRight_*():
 * Rotates a bitfield to the right by "count" bits.
 * https://en.wikipedia.org/w/index.php?title=Circular_shift&oldid=991635599#Implementing_circular_shifts
 */
FORCE_INLINE_TEMPLATE
U64 ZSTD_rotateRight_U64(U64 const value, U32 count) {
    assert(count < 64);
    count &= 0x3F; /* for fickle pattern recognition */
    return (value >> count) | (U64)(value << ((0U - count) & 0x3F));
}

FORCE_INLINE_TEMPLATE
U32 ZSTD_rotateRight_U32(U32 const value, U32 count) {
    assert(count < 32);
    count &= 0x1F; /* for fickle pattern recognition */
    return (value >> count) | (U32)(value << ((0U - count) & 0x1F));
}

FORCE_INLINE_TEMPLATE
U16 ZSTD_rotateRight_U16(U16 const value, U32 count) {
    assert(count < 16);
    count &= 0x0F; /* for fickle pattern recognition */
    return (value >> count) | (U16)(value << ((0U - count) & 0x0F));
}
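The `(0U - count) & 0x3F` construction in the rotates above is the standard UB-free circular shift: for `count == 0` it yields a left shift by 0 instead of an undefined shift by the full word width. A standalone copy of the 64-bit variant (`toy_rotr64` is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right, safe for count == 0: `(0U - count) & 0x3F` keeps the
 * complementary left-shift amount in [0, 63], avoiding a shift by 64. */
static uint64_t toy_rotr64(uint64_t value, uint32_t count) {
    count &= 0x3F;
    return (value >> count) | (uint64_t)(value << ((0U - count) & 0x3F));
}
```

Compilers typically recognize this exact pattern and emit a single rotate instruction, which is what the "for fickle pattern recognition" comments are about.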

/* ZSTD_row_nextIndex():
 * Returns the next index to insert at within a tagTable row, and updates the "head"
 * value to reflect the update. Essentially cycles backwards from [0, {entries per row})
 */
FORCE_INLINE_TEMPLATE U32 ZSTD_row_nextIndex(BYTE* const tagRow, U32 const rowMask) {
  U32 const next = (*tagRow - 1) & rowMask;
  *tagRow = (BYTE)next;
  return next;
}
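The head byte cycles backwards through the row, so the slot returned is always the one holding the oldest entry, giving FIFO replacement without any extra bookkeeping. A standalone copy of the same arithmetic (`toy_row_nextIndex` is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>

/* Decrement the row "head" modulo the row size and return the new value;
 * starting from 0, the sequence is rowMask, rowMask-1, ..., 0, rowMask, ... */
static uint32_t toy_row_nextIndex(uint8_t* const tagRow, uint32_t const rowMask) {
    uint32_t const next = (*tagRow - 1) & rowMask;
    *tagRow = (uint8_t)next;
    return next;
}
```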
/* ZSTD_isAligned():
* Checks that a pointer is aligned to " align " bytes which must be a power of 2.
*/
MEM_STATIC int ZSTD_isAligned ( void const * ptr , size_t align ) {
assert ( ( align & ( align - 1 ) ) = = 0 ) ;
return ( ( ( size_t ) ptr ) & ( align - 1 ) ) = = 0 ;
}
/* ZSTD_row_prefetch():
* Performs prefetching for the hashTable and tagTable at a given row .
*/
FORCE_INLINE_TEMPLATE void ZSTD_row_prefetch ( U32 const * hashTable , U16 const * tagTable , U32 const relRow , U32 const rowLog ) {
PREFETCH_L1 ( hashTable + relRow ) ;
2021-06-03 10:54:31 +03:00
if ( rowLog > = 5 ) {
2020-11-02 17:52:29 -08:00
PREFETCH_L1 ( hashTable + relRow + 16 ) ;
2021-06-03 10:54:31 +03:00
/* Note: prefetching more of the hash table does not appear to be beneficial for 128-entry rows */
2020-11-02 17:52:29 -08:00
}
PREFETCH_L1 ( tagTable + relRow ) ;
2021-06-03 10:54:31 +03:00
if ( rowLog = = 6 ) {
PREFETCH_L1 ( tagTable + relRow + 32 ) ;
}
assert ( rowLog = = 4 | | rowLog = = 5 | | rowLog = = 6 ) ;
2020-11-02 17:52:29 -08:00
assert ( ZSTD_isAligned ( hashTable + relRow , 64 ) ) ; /* prefetched hash row always 64-byte aligned */
2021-06-03 10:54:31 +03:00
assert ( ZSTD_isAligned ( tagTable + relRow , ( size_t ) 1 < < rowLog ) ) ; /* prefetched tagRow sits on correct multiple of bytes (32,64,128) */
2020-11-02 17:52:29 -08:00
}
/* ZSTD_row_fillHashCache():
2021-04-08 19:54:19 -07:00
* Fill up the hash cache starting at idx , prefetching up to ZSTD_ROW_HASH_CACHE_SIZE entries ,
* but not beyond iLimit .
2020-11-02 17:52:29 -08:00
*/
2021-08-18 06:52:17 -07:00
FORCE_INLINE_TEMPLATE void ZSTD_row_fillHashCache ( ZSTD_matchState_t * ms , const BYTE * base ,
2020-11-02 17:52:29 -08:00
U32 const rowLog , U32 const mls ,
2021-04-08 19:54:19 -07:00
U32 idx , const BYTE * const iLimit )
2020-11-02 17:52:29 -08:00
{
U32 const * const hashTable = ms - > hashTable ;
U16 const * const tagTable = ms - > tagTable ;
U32 const hashLog = ms - > rowHashLog ;
2021-04-08 19:54:19 -07:00
U32 const maxElemsToPrefetch = ( base + idx ) > iLimit ? 0 : ( U32 ) ( iLimit - ( base + idx ) + 1 ) ;
2020-11-02 17:52:29 -08:00
U32 const lim = idx + MIN ( ZSTD_ROW_HASH_CACHE_SIZE , maxElemsToPrefetch ) ;
for ( ; idx < lim ; + + idx ) {
U32 const hash = ( U32 ) ZSTD_hashPtr ( base + idx , hashLog + ZSTD_ROW_HASH_TAG_BITS , mls ) ;
U32 const row = ( hash > > ZSTD_ROW_HASH_TAG_BITS ) < < rowLog ;
ZSTD_row_prefetch ( hashTable , tagTable , row , rowLog ) ;
ms - > hashCache [ idx & ZSTD_ROW_HASH_CACHE_MASK ] = hash ;
}
DEBUGLOG ( 6 , " ZSTD_row_fillHashCache(): [%u %u %u %u %u %u %u %u] " , ms - > hashCache [ 0 ] , ms - > hashCache [ 1 ] ,
ms - > hashCache [ 2 ] , ms - > hashCache [ 3 ] , ms - > hashCache [ 4 ] ,
ms - > hashCache [ 5 ] , ms - > hashCache [ 6 ] , ms - > hashCache [ 7 ] ) ;
}
/* ZSTD_row_nextCachedHash():
* Returns the hash of base + idx , and replaces the hash in the hash cache with the byte at
* base + idx + ZSTD_ROW_HASH_CACHE_SIZE . Also prefetches the appropriate rows from hashTable and tagTable .
*/
FORCE_INLINE_TEMPLATE U32 ZSTD_row_nextCachedHash ( U32 * cache , U32 const * hashTable ,
U16 const * tagTable , BYTE const * base ,
U32 idx , U32 const hashLog ,
U32 const rowLog , U32 const mls )
{
U32 const newHash = ( U32 ) ZSTD_hashPtr ( base + idx + ZSTD_ROW_HASH_CACHE_SIZE , hashLog + ZSTD_ROW_HASH_TAG_BITS , mls ) ;
U32 const row = ( newHash > > ZSTD_ROW_HASH_TAG_BITS ) < < rowLog ;
ZSTD_row_prefetch ( hashTable , tagTable , row , rowLog ) ;
{ U32 const hash = cache [ idx & ZSTD_ROW_HASH_CACHE_MASK ] ;
cache [ idx & ZSTD_ROW_HASH_CACHE_MASK ] = newHash ;
return hash ;
}
}
/* ZSTD_row_update_internalImpl():
 * Updates the hash table with positions starting from updateStartIdx until updateEndIdx.
 */
FORCE_INLINE_TEMPLATE void ZSTD_row_update_internalImpl(ZSTD_matchState_t* ms,
                                                        U32 updateStartIdx, U32 const updateEndIdx,
                                                        U32 const mls, U32 const rowLog,
                                                        U32 const rowMask, U32 const useCache)
{
    U32* const hashTable = ms->hashTable;
    U16* const tagTable = ms->tagTable;
    U32 const hashLog = ms->rowHashLog;
    const BYTE* const base = ms->window.base;

    DEBUGLOG(6, "ZSTD_row_update_internalImpl(): updateStartIdx=%u, updateEndIdx=%u", updateStartIdx, updateEndIdx);
    for (; updateStartIdx < updateEndIdx; ++updateStartIdx) {
        U32 const hash = useCache ? ZSTD_row_nextCachedHash(ms->hashCache, hashTable, tagTable, base, updateStartIdx, hashLog, rowLog, mls)
                                  : (U32)ZSTD_hashPtr(base + updateStartIdx, hashLog + ZSTD_ROW_HASH_TAG_BITS, mls);
        U32 const relRow = (hash >> ZSTD_ROW_HASH_TAG_BITS) << rowLog;
        U32* const row = hashTable + relRow;
        BYTE* tagRow = (BYTE*)(tagTable + relRow);  /* Though tagTable is laid out as a table of U16, each tag is only 1 byte.
                                                       Explicit cast allows us to get exact desired position within each row */
        U32 const pos = ZSTD_row_nextIndex(tagRow, rowMask);

        assert(hash == ZSTD_hashPtr(base + updateStartIdx, hashLog + ZSTD_ROW_HASH_TAG_BITS, mls));
        ((BYTE*)tagRow)[pos + ZSTD_ROW_HASH_TAG_OFFSET] = hash & ZSTD_ROW_HASH_TAG_MASK;
        row[pos] = updateStartIdx;
    }
}
/* ZSTD_row_update_internal():
 * Inserts the byte at ip into the appropriate position in the hash table, and updates ms->nextToUpdate.
 * Skips sections of long matches as is necessary.
 */
FORCE_INLINE_TEMPLATE void ZSTD_row_update_internal(ZSTD_matchState_t* ms, const BYTE* ip,
                                                    U32 const mls, U32 const rowLog,
                                                    U32 const rowMask, U32 const useCache)
{
    U32 idx = ms->nextToUpdate;
    const BYTE* const base = ms->window.base;
    const U32 target = (U32)(ip - base);
    const U32 kSkipThreshold = 384;
    const U32 kMaxMatchStartPositionsToUpdate = 96;
    const U32 kMaxMatchEndPositionsToUpdate = 32;

    if (useCache) {
        /* Only skip positions when using hash cache, i.e.
         * if we are loading a dict, don't skip anything.
         * If we decide to skip, then we only update a set number
         * of positions at the beginning and end of the match.
         */
        if (UNLIKELY(target - idx > kSkipThreshold)) {
            U32 const bound = idx + kMaxMatchStartPositionsToUpdate;
            ZSTD_row_update_internalImpl(ms, idx, bound, mls, rowLog, rowMask, useCache);
            idx = target - kMaxMatchEndPositionsToUpdate;
            ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip + 1);
        }
    }
    assert(target >= idx);
    ZSTD_row_update_internalImpl(ms, idx, target, mls, rowLog, rowMask, useCache);
    ms->nextToUpdate = target;
}
/* ZSTD_row_update():
 * External wrapper for ZSTD_row_update_internal(). Used for filling the hashtable during dictionary
 * processing.
 */
void ZSTD_row_update(ZSTD_matchState_t* const ms, const BYTE* ip) {
    const U32 rowLog = BOUNDED(4, ms->cParams.searchLog, 6);
    const U32 rowMask = (1u << rowLog) - 1;
    const U32 mls = MIN(ms->cParams.minMatch, 6 /* mls caps out at 6 */);

    DEBUGLOG(5, "ZSTD_row_update(), rowLog=%u", rowLog);
    ZSTD_row_update_internal(ms, ip, mls, rowLog, rowMask, 0 /* don't use cache */);
}
/* Returns the width, in bits, of the group of mask bits that is set to 1 for
 * each matching entry. Since not all architectures have an easy movemask
 * instruction, this makes it easier and faster to iterate over groups of bits.
 */
FORCE_INLINE_TEMPLATE U32
ZSTD_row_matchMaskGroupWidth(const U32 rowEntries)
{
    assert((rowEntries == 16) || (rowEntries == 32) || rowEntries == 64);
    assert(rowEntries <= ZSTD_ROW_HASH_MAX_ENTRIES);
    (void)rowEntries;
#if defined(ZSTD_ARCH_ARM_NEON)
    if (rowEntries == 16) {
        return 4;
    }
    if (rowEntries == 32) {
        return 2;
    }
    if (rowEntries == 64) {
        return 1;
    }
#endif
    return 1;
}
#if defined(ZSTD_ARCH_X86_SSE2)
FORCE_INLINE_TEMPLATE ZSTD_VecMask
ZSTD_row_getSSEMask(int nbChunks, const BYTE* const src, const BYTE tag, const U32 head)
{
    const __m128i comparisonMask = _mm_set1_epi8((char)tag);
    int matches[4] = {0};
    int i;
    assert(nbChunks == 1 || nbChunks == 2 || nbChunks == 4);
    for (i = 0; i < nbChunks; i++) {
        const __m128i chunk = _mm_loadu_si128((const __m128i*)(const void*)(src + 16*i));
        const __m128i equalMask = _mm_cmpeq_epi8(chunk, comparisonMask);
        matches[i] = _mm_movemask_epi8(equalMask);
    }
    if (nbChunks == 1) return ZSTD_rotateRight_U16((U16)matches[0], head);
    if (nbChunks == 2) return ZSTD_rotateRight_U32((U32)matches[1] << 16 | (U32)matches[0], head);
    assert(nbChunks == 4);
    return ZSTD_rotateRight_U64((U64)matches[3] << 48 | (U64)matches[2] << 32 | (U64)matches[1] << 16 | (U64)matches[0], head);
}
#endif
#if defined(ZSTD_ARCH_ARM_NEON)
FORCE_INLINE_TEMPLATE ZSTD_VecMask
ZSTD_row_getNEONMask(const U32 rowEntries, const BYTE* const src, const BYTE tag, const U32 headGrouped)
{
    assert((rowEntries == 16) || (rowEntries == 32) || rowEntries == 64);
    if (rowEntries == 16) {
        /* vshrn_n_u16 shifts by 4 every u16 and narrows to 8 lower bits.
         * After that groups of 4 bits represent the equalMask. We lower
         * all bits except the highest in these groups by doing AND with
         * 0x88 = 0b10001000.
         */
        const uint8x16_t chunk = vld1q_u8(src);
        const uint16x8_t equalMask = vreinterpretq_u16_u8(vceqq_u8(chunk, vdupq_n_u8(tag)));
        const uint8x8_t res = vshrn_n_u16(equalMask, 4);
        const U64 matches = vget_lane_u64(vreinterpret_u64_u8(res), 0);
        return ZSTD_rotateRight_U64(matches, headGrouped) & 0x8888888888888888ull;
    } else if (rowEntries == 32) {
        /* Same idea as with rowEntries == 16 but doing AND with
         * 0x55 = 0b01010101.
         */
        const uint16x8x2_t chunk = vld2q_u16((const uint16_t*)(const void*)src);
        const uint8x16_t chunk0 = vreinterpretq_u8_u16(chunk.val[0]);
        const uint8x16_t chunk1 = vreinterpretq_u8_u16(chunk.val[1]);
        const uint8x16_t dup = vdupq_n_u8(tag);
        const uint8x8_t t0 = vshrn_n_u16(vreinterpretq_u16_u8(vceqq_u8(chunk0, dup)), 6);
        const uint8x8_t t1 = vshrn_n_u16(vreinterpretq_u16_u8(vceqq_u8(chunk1, dup)), 6);
        const uint8x8_t res = vsli_n_u8(t0, t1, 4);
        const U64 matches = vget_lane_u64(vreinterpret_u64_u8(res), 0);
        return ZSTD_rotateRight_U64(matches, headGrouped) & 0x5555555555555555ull;
    } else { /* rowEntries == 64 */
        const uint8x16x4_t chunk = vld4q_u8(src);
        const uint8x16_t dup = vdupq_n_u8(tag);
        const uint8x16_t cmp0 = vceqq_u8(chunk.val[0], dup);
        const uint8x16_t cmp1 = vceqq_u8(chunk.val[1], dup);
        const uint8x16_t cmp2 = vceqq_u8(chunk.val[2], dup);
        const uint8x16_t cmp3 = vceqq_u8(chunk.val[3], dup);

        const uint8x16_t t0 = vsriq_n_u8(cmp1, cmp0, 1);
        const uint8x16_t t1 = vsriq_n_u8(cmp3, cmp2, 1);
        const uint8x16_t t2 = vsriq_n_u8(t1, t0, 2);
        const uint8x16_t t3 = vsriq_n_u8(t2, t2, 4);
        const uint8x8_t t4 = vshrn_n_u16(vreinterpretq_u16_u8(t3), 4);
        const U64 matches = vget_lane_u64(vreinterpret_u64_u8(t4), 0);
        return ZSTD_rotateRight_U64(matches, headGrouped);
    }
}
#endif
/* Returns a ZSTD_VecMask (U64) that has the nth group (determined by
 * ZSTD_row_matchMaskGroupWidth) of bits set to 1 if the newly-computed "tag"
 * matches the hash at the nth position in a row of the tagTable.
 * Each row is a circular buffer beginning at the value of "headGrouped". So we
 * must rotate the "matches" bitfield to match up with the actual layout of the
 * entries within the hashTable */
FORCE_INLINE_TEMPLATE ZSTD_VecMask
ZSTD_row_getMatchMask(const BYTE* const tagRow, const BYTE tag, const U32 headGrouped, const U32 rowEntries)
{
    const BYTE* const src = tagRow + ZSTD_ROW_HASH_TAG_OFFSET;
    assert((rowEntries == 16) || (rowEntries == 32) || rowEntries == 64);
    assert(rowEntries <= ZSTD_ROW_HASH_MAX_ENTRIES);
    assert(ZSTD_row_matchMaskGroupWidth(rowEntries) * rowEntries <= sizeof(ZSTD_VecMask) * 8);

#if defined(ZSTD_ARCH_X86_SSE2)

    return ZSTD_row_getSSEMask(rowEntries / 16, src, tag, headGrouped);

#else /* SW or NEON-LE */

# if defined(ZSTD_ARCH_ARM_NEON)
    /* This NEON path only works for little endian - otherwise use SWAR below */
    if (MEM_isLittleEndian()) {
        return ZSTD_row_getNEONMask(rowEntries, src, tag, headGrouped);
    }
# endif /* ZSTD_ARCH_ARM_NEON */
    /* SWAR */
    {   const size_t chunkSize = sizeof(size_t);
        const size_t shiftAmount = ((chunkSize * 8) - chunkSize);
        const size_t xFF = ~((size_t)0);
        const size_t x01 = xFF / 0xFF;
        const size_t x80 = x01 << 7;
        const size_t splatChar = tag * x01;
        ZSTD_VecMask matches = 0;
        int i = rowEntries - chunkSize;
        assert((sizeof(size_t) == 4) || (sizeof(size_t) == 8));
        if (MEM_isLittleEndian()) { /* runtime check so have two loops */
            const size_t extractMagic = (xFF / 0x7F) >> chunkSize;
            do {
                size_t chunk = MEM_readST(&src[i]);
                chunk ^= splatChar;
                chunk = (((chunk | x80) - x01) | chunk) & x80;
                matches <<= chunkSize;
                matches |= (chunk * extractMagic) >> shiftAmount;
                i -= chunkSize;
            } while (i >= 0);
        } else { /* big endian: reverse bits during extraction */
            const size_t msb = xFF ^ (xFF >> 1);
            const size_t extractMagic = (msb / 0x1FF) | msb;
            do {
                size_t chunk = MEM_readST(&src[i]);
                chunk ^= splatChar;
                chunk = (((chunk | x80) - x01) | chunk) & x80;
                matches <<= chunkSize;
                matches |= ((chunk >> 7) * extractMagic) >> shiftAmount;
                i -= chunkSize;
            } while (i >= 0);
        }
        matches = ~matches;
        if (rowEntries == 16) {
            return ZSTD_rotateRight_U16((U16)matches, headGrouped);
        } else if (rowEntries == 32) {
            return ZSTD_rotateRight_U32((U32)matches, headGrouped);
        } else {
            return ZSTD_rotateRight_U64((U64)matches, headGrouped);
        }
    }
#endif
}
/* The high-level approach of the SIMD row based match finder is as follows:
 * - Figure out where to insert the new entry:
 *      - Generate a hash from a byte along with an additional 1-byte "short hash". The additional byte is our "tag".
 *      - The hashTable is effectively split into groups or "rows" of 16 or 32 entries of U32, and the hash determines
 *        which row to insert into.
 *      - Determine the correct position within the row to insert the entry into. Each row of 16 or 32 can
 *        be considered as a circular buffer with a "head" index that resides in the tagTable.
 *      - Also insert the "tag" into the equivalent row and position in the tagTable.
 *          - Note: The tagTable has 17 or 33 1-byte entries per row, due to 16 or 32 tags, and 1 "head" entry.
 *                  The 17 or 33 entry rows are spaced out to occur every 32 or 64 bytes, respectively,
 *                  for alignment/performance reasons, leaving some bytes unused.
 * - Use SIMD to efficiently compare the tags in the tagTable to the 1-byte "short hash" and
 *   generate a bitfield that we can cycle through to check the collisions in the hash table.
 * - Pick the longest match.
 */
FORCE_INLINE_TEMPLATE
size_t ZSTD_RowFindBestMatch(
                        ZSTD_matchState_t* ms,
                        const BYTE* const ip, const BYTE* const iLimit,
                        size_t* offsetPtr,
                        const U32 mls, const ZSTD_dictMode_e dictMode,
                        const U32 rowLog)
{
    U32* const hashTable = ms->hashTable;
    U16* const tagTable = ms->tagTable;
    U32* const hashCache = ms->hashCache;
    const U32 hashLog = ms->rowHashLog;
    const ZSTD_compressionParameters* const cParams = &ms->cParams;
    const BYTE* const base = ms->window.base;
    const BYTE* const dictBase = ms->window.dictBase;
    const U32 dictLimit = ms->window.dictLimit;
    const BYTE* const prefixStart = base + dictLimit;
    const BYTE* const dictEnd = dictBase + dictLimit;
    const U32 curr = (U32)(ip - base);
    const U32 maxDistance = 1U << cParams->windowLog;
    const U32 lowestValid = ms->window.lowLimit;
    const U32 withinMaxDistance = (curr - lowestValid > maxDistance) ? curr - maxDistance : lowestValid;
    const U32 isDictionary = (ms->loadedDictEnd != 0);
    const U32 lowLimit = isDictionary ? lowestValid : withinMaxDistance;
    const U32 rowEntries = (1U << rowLog);
    const U32 rowMask = rowEntries - 1;
    const U32 cappedSearchLog = MIN(cParams->searchLog, rowLog); /* nb of searches is capped at nb entries per row */
    const U32 groupWidth = ZSTD_row_matchMaskGroupWidth(rowEntries);
    U32 nbAttempts = 1U << cappedSearchLog;
    size_t ml = 4-1;

    /* DMS/DDS variables that may be referenced later */
    const ZSTD_matchState_t* const dms = ms->dictMatchState;

    /* Initialize the following variables to satisfy static analyzer */
    size_t ddsIdx = 0;
    U32 ddsExtraAttempts = 0; /* cctx hash tables are limited in searches, but allow extra searches into DDS */
    U32 dmsTag = 0;
    U32* dmsRow = NULL;
    BYTE* dmsTagRow = NULL;
    if (dictMode == ZSTD_dedicatedDictSearch) {
        const U32 ddsHashLog = dms->cParams.hashLog - ZSTD_LAZY_DDSS_BUCKET_LOG;
        {   /* Prefetch DDS hashtable entry */
            ddsIdx = ZSTD_hashPtr(ip, ddsHashLog, mls) << ZSTD_LAZY_DDSS_BUCKET_LOG;
            PREFETCH_L1(&dms->hashTable[ddsIdx]);
        }
        ddsExtraAttempts = cParams->searchLog > rowLog ? 1U << (cParams->searchLog - rowLog) : 0;
    }

    if (dictMode == ZSTD_dictMatchState) {
        /* Prefetch DMS rows */
        U32* const dmsHashTable = dms->hashTable;
        U16* const dmsTagTable = dms->tagTable;
        U32 const dmsHash = (U32)ZSTD_hashPtr(ip, dms->rowHashLog + ZSTD_ROW_HASH_TAG_BITS, mls);
        U32 const dmsRelRow = (dmsHash >> ZSTD_ROW_HASH_TAG_BITS) << rowLog;
        dmsTag = dmsHash & ZSTD_ROW_HASH_TAG_MASK;
        dmsTagRow = (BYTE*)(dmsTagTable + dmsRelRow);
        dmsRow = dmsHashTable + dmsRelRow;
        ZSTD_row_prefetch(dmsHashTable, dmsTagTable, dmsRelRow, rowLog);
    }

    /* Update the hashTable and tagTable up to (but not including) ip */
    ZSTD_row_update_internal(ms, ip, mls, rowLog, rowMask, 1 /* useCache */);

    {   /* Get the hash for ip, compute the appropriate row */
        U32 const hash = ZSTD_row_nextCachedHash(hashCache, hashTable, tagTable, base, curr, hashLog, rowLog, mls);
        U32 const relRow = (hash >> ZSTD_ROW_HASH_TAG_BITS) << rowLog;
        U32 const tag = hash & ZSTD_ROW_HASH_TAG_MASK;
        U32* const row = hashTable + relRow;
        BYTE* tagRow = (BYTE*)(tagTable + relRow);
        U32 const headGrouped = (*tagRow & rowMask) * groupWidth;
        U32 matchBuffer[ZSTD_ROW_HASH_MAX_ENTRIES];
        size_t numMatches = 0;
        size_t currMatch = 0;
        ZSTD_VecMask matches = ZSTD_row_getMatchMask(tagRow, (BYTE)tag, headGrouped, rowEntries);

        /* Cycle through the matches and prefetch */
        for (; (matches > 0) && (nbAttempts > 0); --nbAttempts, matches &= (matches - 1)) {
            U32 const matchPos = ((headGrouped + ZSTD_VecMask_next(matches)) / groupWidth) & rowMask;
            U32 const matchIndex = row[matchPos];
            assert(numMatches < rowEntries);
            if (matchIndex < lowLimit)
                break;
            if ((dictMode != ZSTD_extDict) || matchIndex >= dictLimit) {
                PREFETCH_L1(base + matchIndex);
            } else {
                PREFETCH_L1(dictBase + matchIndex);
            }
            matchBuffer[numMatches++] = matchIndex;
        }

        /* Speed opt: insert current byte into hashtable too. This allows us to avoid one iteration of the loop
           in ZSTD_row_update_internal() at the next search. */
        {
            U32 const pos = ZSTD_row_nextIndex(tagRow, rowMask);
            tagRow[pos + ZSTD_ROW_HASH_TAG_OFFSET] = (BYTE)tag;
            row[pos] = ms->nextToUpdate++;
        }
        /* Return the longest match */
        for (; currMatch < numMatches; ++currMatch) {
            U32 const matchIndex = matchBuffer[currMatch];
            size_t currentMl = 0;
            assert(matchIndex < curr);
            assert(matchIndex >= lowLimit);

            if ((dictMode != ZSTD_extDict) || matchIndex >= dictLimit) {
                const BYTE* const match = base + matchIndex;
                assert(matchIndex >= dictLimit);   /* ensures this is true if dictMode != ZSTD_extDict */
                if (match[ml] == ip[ml])   /* potentially better */
                    currentMl = ZSTD_count(ip, match, iLimit);
            } else {
                const BYTE* const match = dictBase + matchIndex;
                assert(match+4 <= dictEnd);
                if (MEM_read32(match) == MEM_read32(ip))   /* assumption : matchIndex <= dictLimit-4 (by table construction) */
                    currentMl = ZSTD_count_2segments(ip+4, match+4, iLimit, dictEnd, prefixStart) + 4;
            }

            /* Save best solution */
            if (currentMl > ml) {
                ml = currentMl;
                *offsetPtr = OFFSET_TO_OFFBASE(curr - matchIndex);
                if (ip+currentMl == iLimit) break; /* best possible, avoids read overflow on next attempt */
            }
        }
    }

    assert(nbAttempts <= (1U << ZSTD_SEARCHLOG_MAX)); /* Check we haven't underflowed. */
    if (dictMode == ZSTD_dedicatedDictSearch) {
        ml = ZSTD_dedicatedDictSearch_lazy_search(offsetPtr, ml, nbAttempts + ddsExtraAttempts, dms,
                                                  ip, iLimit, prefixStart, curr, dictLimit, ddsIdx);
    } else if (dictMode == ZSTD_dictMatchState) {
        /* TODO: Measure and potentially add prefetching to DMS */
        const U32 dmsLowestIndex  = dms->window.dictLimit;
        const BYTE* const dmsBase = dms->window.base;
        const BYTE* const dmsEnd  = dms->window.nextSrc;
        const U32 dmsSize         = (U32)(dmsEnd - dmsBase);
        const U32 dmsIndexDelta   = dictLimit - dmsSize;

        {   U32 const headGrouped = (*dmsTagRow & rowMask) * groupWidth;
            U32 matchBuffer[ZSTD_ROW_HASH_MAX_ENTRIES];
            size_t numMatches = 0;
            size_t currMatch = 0;
            ZSTD_VecMask matches = ZSTD_row_getMatchMask(dmsTagRow, (BYTE)dmsTag, headGrouped, rowEntries);

            for (; (matches > 0) && (nbAttempts > 0); --nbAttempts, matches &= (matches - 1)) {
                U32 const matchPos = ((headGrouped + ZSTD_VecMask_next(matches)) / groupWidth) & rowMask;
                U32 const matchIndex = dmsRow[matchPos];
                if (matchIndex < dmsLowestIndex)
                    break;
                PREFETCH_L1(dmsBase + matchIndex);
                matchBuffer[numMatches++] = matchIndex;
            }

            /* Return the longest match */
            for (; currMatch < numMatches; ++currMatch) {
                U32 const matchIndex = matchBuffer[currMatch];
                size_t currentMl = 0;
                assert(matchIndex >= dmsLowestIndex);
                assert(matchIndex < curr);

                {   const BYTE* const match = dmsBase + matchIndex;
                    assert(match+4 <= dmsEnd);
                    if (MEM_read32(match) == MEM_read32(ip))
                        currentMl = ZSTD_count_2segments(ip+4, match+4, iLimit, dmsEnd, prefixStart) + 4;
                }

                if (currentMl > ml) {
                    ml = currentMl;
                    assert(curr > matchIndex + dmsIndexDelta);
                    *offsetPtr = OFFSET_TO_OFFBASE(curr - (matchIndex + dmsIndexDelta));
                    if (ip+currentMl == iLimit) break;
                }
            }
        }
    }
    return ml;
}
typedef size_t (*searchMax_f)(
                    ZSTD_matchState_t* ms,
                    const BYTE* ip, const BYTE* iLimit, size_t* offsetPtr);

/**
 * This struct contains the functions necessary for lazy to search.
 * Currently, that is only searchMax. However, it is still valuable to have the
 * VTable because this makes it easier to add more functions to the VTable later.
 *
 * TODO: The start of the search function involves loading and calculating a
 * bunch of constants from the ZSTD_matchState_t. These computations could be
 * done in an initialization function, and saved somewhere in the match state.
 * Then we could pass a pointer to the saved state instead of the match state,
 * and avoid duplicate computations.
 *
 * TODO: Move the match re-winding into searchMax. This improves compression
 * ratio, and unlocks further simplifications with the next TODO.
 *
 * TODO: Try moving the repcode search into searchMax. After the re-winding
 * and repcode search are in searchMax, there is no more logic in the match
 * finder loop that requires knowledge about the dictMode. So we should be
 * able to avoid force inlining it, and we can join the extDict loop with
 * the single segment loop. It should go in searchMax instead of its own
 * function to avoid having multiple virtual function calls per search.
 */
typedef struct {
    searchMax_f searchMax;
} ZSTD_LazyVTable;
#define GEN_ZSTD_BT_VTABLE(dictMode, mls)                                      \
    static size_t ZSTD_BtFindBestMatch_##dictMode##_##mls(                     \
            ZSTD_matchState_t* ms,                                             \
            const BYTE* ip, const BYTE* const iLimit,                          \
            size_t* offBasePtr)                                                \
    {                                                                          \
        assert(MAX(4, MIN(6, ms->cParams.minMatch)) == mls);                   \
        return ZSTD_BtFindBestMatch(ms, ip, iLimit, offBasePtr, mls, ZSTD_##dictMode); \
    }                                                                          \
    static const ZSTD_LazyVTable ZSTD_BtVTable_##dictMode##_##mls = {          \
        ZSTD_BtFindBestMatch_##dictMode##_##mls                                \
    };

#define GEN_ZSTD_HC_VTABLE(dictMode, mls)                                      \
    static size_t ZSTD_HcFindBestMatch_##dictMode##_##mls(                     \
            ZSTD_matchState_t* ms,                                             \
            const BYTE* ip, const BYTE* const iLimit,                          \
            size_t* offsetPtr)                                                 \
    {                                                                          \
        assert(MAX(4, MIN(6, ms->cParams.minMatch)) == mls);                   \
        return ZSTD_HcFindBestMatch(ms, ip, iLimit, offsetPtr, mls, ZSTD_##dictMode); \
    }                                                                          \
    static const ZSTD_LazyVTable ZSTD_HcVTable_##dictMode##_##mls = {          \
        ZSTD_HcFindBestMatch_##dictMode##_##mls                                \
    };

#define GEN_ZSTD_ROW_VTABLE(dictMode, mls, rowLog)                             \
    static size_t ZSTD_RowFindBestMatch_##dictMode##_##mls##_##rowLog(         \
            ZSTD_matchState_t* ms,                                             \
            const BYTE* ip, const BYTE* const iLimit,                          \
            size_t* offsetPtr)                                                 \
    {                                                                          \
        assert(MAX(4, MIN(6, ms->cParams.minMatch)) == mls);                   \
        assert(MAX(4, MIN(6, ms->cParams.searchLog)) == rowLog);               \
        return ZSTD_RowFindBestMatch(ms, ip, iLimit, offsetPtr, mls, ZSTD_##dictMode, rowLog); \
    }                                                                          \
    static const ZSTD_LazyVTable ZSTD_RowVTable_##dictMode##_##mls##_##rowLog = { \
        ZSTD_RowFindBestMatch_##dictMode##_##mls##_##rowLog                    \
    };

#define ZSTD_FOR_EACH_ROWLOG(X, dictMode, mls)                                 \
    X(dictMode, mls, 4)                                                        \
    X(dictMode, mls, 5)                                                        \
    X(dictMode, mls, 6)

#define ZSTD_FOR_EACH_MLS_ROWLOG(X, dictMode)                                  \
    ZSTD_FOR_EACH_ROWLOG(X, dictMode, 4)                                       \
    ZSTD_FOR_EACH_ROWLOG(X, dictMode, 5)                                       \
    ZSTD_FOR_EACH_ROWLOG(X, dictMode, 6)

#define ZSTD_FOR_EACH_MLS(X, dictMode)                                         \
    X(dictMode, 4)                                                             \
    X(dictMode, 5)                                                             \
    X(dictMode, 6)

#define ZSTD_FOR_EACH_DICT_MODE(X, ...)                                        \
    X(__VA_ARGS__, noDict)                                                     \
    X(__VA_ARGS__, extDict)                                                    \
    X(__VA_ARGS__, dictMatchState)                                             \
    X(__VA_ARGS__, dedicatedDictSearch)

/* Generate Row VTables for each combination of (dictMode, mls, rowLog) */
ZSTD_FOR_EACH_DICT_MODE(ZSTD_FOR_EACH_MLS_ROWLOG, GEN_ZSTD_ROW_VTABLE)
/* Generate Binary Tree VTables for each combination of (dictMode, mls) */
ZSTD_FOR_EACH_DICT_MODE(ZSTD_FOR_EACH_MLS, GEN_ZSTD_BT_VTABLE)
/* Generate Hash Chain VTables for each combination of (dictMode, mls) */
ZSTD_FOR_EACH_DICT_MODE(ZSTD_FOR_EACH_MLS, GEN_ZSTD_HC_VTABLE)
#define GEN_ZSTD_BT_VTABLE_ARRAY(dictMode)                                     \
    {                                                                          \
        &ZSTD_BtVTable_##dictMode##_4,                                         \
        &ZSTD_BtVTable_##dictMode##_5,                                         \
        &ZSTD_BtVTable_##dictMode##_6                                          \
    }

#define GEN_ZSTD_HC_VTABLE_ARRAY(dictMode)                                     \
    {                                                                          \
        &ZSTD_HcVTable_##dictMode##_4,                                         \
        &ZSTD_HcVTable_##dictMode##_5,                                         \
        &ZSTD_HcVTable_##dictMode##_6                                          \
    }

#define GEN_ZSTD_ROW_VTABLE_ARRAY_(dictMode, mls)                              \
    {                                                                          \
        &ZSTD_RowVTable_##dictMode##_##mls##_4,                                \
        &ZSTD_RowVTable_##dictMode##_##mls##_5,                                \
        &ZSTD_RowVTable_##dictMode##_##mls##_6                                 \
    }

#define GEN_ZSTD_ROW_VTABLE_ARRAY(dictMode)                                    \
    {                                                                          \
        GEN_ZSTD_ROW_VTABLE_ARRAY_(dictMode, 4),                               \
        GEN_ZSTD_ROW_VTABLE_ARRAY_(dictMode, 5),                               \
        GEN_ZSTD_ROW_VTABLE_ARRAY_(dictMode, 6)                                \
    }

#define GEN_ZSTD_VTABLE_ARRAY(X)                                               \
    {                                                                          \
        X(noDict),                                                             \
        X(extDict),                                                            \
        X(dictMatchState),                                                     \
        X(dedicatedDictSearch)                                                 \
    }
/* *******************************
*  Common parser - lazy strategy
*********************************/
typedef enum { search_hashChain=0, search_binaryTree=1, search_rowHash=2 } searchMethod_e;

/**
 * This table is indexed first by the four ZSTD_dictMode_e values, and then
 * by the searchMethod_e values. NULLs are placed for configurations
 * that should never occur (extDict modes go to the other implementation
 * below and there is no DDSS for binary tree search yet).
 */
static ZSTD_LazyVTable const*
ZSTD_selectLazyVTable(ZSTD_matchState_t const* ms, searchMethod_e searchMethod, ZSTD_dictMode_e dictMode)
{
    /* Fill the Hc/Bt VTable arrays with the right functions for the (dictMode, mls) combination. */
    ZSTD_LazyVTable const* const hcVTables[4][3] = GEN_ZSTD_VTABLE_ARRAY(GEN_ZSTD_HC_VTABLE_ARRAY);
    ZSTD_LazyVTable const* const btVTables[4][3] = GEN_ZSTD_VTABLE_ARRAY(GEN_ZSTD_BT_VTABLE_ARRAY);
    /* Fill the Row VTable array with the right functions for the (dictMode, mls, rowLog) combination. */
    ZSTD_LazyVTable const* const rowVTables[4][3][3] = GEN_ZSTD_VTABLE_ARRAY(GEN_ZSTD_ROW_VTABLE_ARRAY);

    U32 const mls = MAX(4, MIN(6, ms->cParams.minMatch));
    U32 const rowLog = MAX(4, MIN(6, ms->cParams.searchLog));
    switch (searchMethod) {
        case search_hashChain:
            return hcVTables[dictMode][mls - 4];
        case search_binaryTree:
            return btVTables[dictMode][mls - 4];
        case search_rowHash:
            return rowVTables[dictMode][mls - 4][rowLog - 4];
        default:
            return NULL;
    }
}
FORCE_INLINE_TEMPLATE size_t
ZSTD_compressBlock_lazy_generic(
                        ZSTD_matchState_t* ms, seqStore_t* seqStore,
                        U32 rep[ZSTD_REP_NUM],
                        const void* src, size_t srcSize,
                        const searchMethod_e searchMethod, const U32 depth,
                        ZSTD_dictMode_e const dictMode)
{
    const BYTE* const istart = (const BYTE*)src;
    const BYTE* ip = istart;
    const BYTE* anchor = istart;
    const BYTE* const iend = istart + srcSize;
    const BYTE* const ilimit = (searchMethod == search_rowHash) ? iend - 8 - ZSTD_ROW_HASH_CACHE_SIZE : iend - 8;
    const BYTE* const base = ms->window.base;
    const U32 prefixLowestIndex = ms->window.dictLimit;
    const BYTE* const prefixLowest = base + prefixLowestIndex;

    searchMax_f const searchMax = ZSTD_selectLazyVTable(ms, searchMethod, dictMode)->searchMax;
    U32 offset_1 = rep[0], offset_2 = rep[1];
    U32 offsetSaved1 = 0, offsetSaved2 = 0;

    const int isDMS = dictMode == ZSTD_dictMatchState;
    const int isDDS = dictMode == ZSTD_dedicatedDictSearch;
    const int isDxS = isDMS || isDDS;
    const ZSTD_matchState_t* const dms = ms->dictMatchState;
    const U32 dictLowestIndex    = isDxS ? dms->window.dictLimit : 0;
    const BYTE* const dictBase   = isDxS ? dms->window.base : NULL;
    const BYTE* const dictLowest = isDxS ? dictBase + dictLowestIndex : NULL;
    const BYTE* const dictEnd    = isDxS ? dms->window.nextSrc : NULL;
    const U32 dictIndexDelta     = isDxS ?
                                   prefixLowestIndex - (U32)(dictEnd - dictBase) :
                                   0;
    const U32 dictAndPrefixLength = (U32)((ip - prefixLowest) + (dictEnd - dictLowest));

    assert(searchMax != NULL);

    DEBUGLOG(5, "ZSTD_compressBlock_lazy_generic (dictMode=%u) (searchFunc=%u)", (U32)dictMode, (U32)searchMethod);
    ip += (dictAndPrefixLength == 0);
    if (dictMode == ZSTD_noDict) {
        U32 const curr = (U32)(ip - base);
        U32 const windowLow = ZSTD_getLowestPrefixIndex(ms, curr, ms->cParams.windowLog);
        U32 const maxRep = curr - windowLow;
        if (offset_2 > maxRep) offsetSaved2 = offset_2, offset_2 = 0;
        if (offset_1 > maxRep) offsetSaved1 = offset_1, offset_1 = 0;
    }
    if (isDxS) {
        /* dictMatchState repCode checks don't currently handle repCode == 0
         * disabling. */
        assert(offset_1 <= dictAndPrefixLength);
        assert(offset_2 <= dictAndPrefixLength);
    }

    if (searchMethod == search_rowHash) {
        const U32 rowLog = MAX(4, MIN(6, ms->cParams.searchLog));
        ZSTD_row_fillHashCache(ms, base, rowLog,
                            MIN(ms->cParams.minMatch, 6 /* mls caps out at 6 */),
                            ms->nextToUpdate, ilimit);
    }
    /* Match Loop */
#if defined(__GNUC__) && defined(__x86_64__)
    /* I've measured a random ~5% speed loss on levels 5 & 6 (greedy) when the
     * code alignment is perturbed. To fix the instability align the loop on 32-bytes.
     */
    __asm__(".p2align 5");
#endif
    while (ip < ilimit) {
        size_t matchLength = 0;
        size_t offBase = REPCODE1_TO_OFFBASE;
        const BYTE* start = ip + 1;
        DEBUGLOG(7, "search baseline (depth 0)");

        /* check repCode */
        if (isDxS) {
            const U32 repIndex = (U32)(ip - base) + 1 - offset_1;
            const BYTE* repMatch = ((dictMode == ZSTD_dictMatchState || dictMode == ZSTD_dedicatedDictSearch)
                                && repIndex < prefixLowestIndex) ?
                                   dictBase + (repIndex - dictIndexDelta) :
                                   base + repIndex;
            if (((U32)((prefixLowestIndex - 1) - repIndex) >= 3 /* intentional underflow */)
                && (MEM_read32(repMatch) == MEM_read32(ip + 1))) {
                const BYTE* repMatchEnd = repIndex < prefixLowestIndex ? dictEnd : iend;
                matchLength = ZSTD_count_2segments(ip + 1 + 4, repMatch + 4, iend, repMatchEnd, prefixLowest) + 4;
                if (depth == 0) goto _storeSequence;
            }
        }
        if (dictMode == ZSTD_noDict
            && ((offset_1 > 0) & (MEM_read32(ip + 1 - offset_1) == MEM_read32(ip + 1)))) {
            matchLength = ZSTD_count(ip + 1 + 4, ip + 1 + 4 - offset_1, iend) + 4;
            if (depth == 0) goto _storeSequence;
        }

        /* first search (depth 0) */
        {   size_t offbaseFound = 999999999;
            size_t const ml2 = searchMax(ms, ip, iend, &offbaseFound);
            if (ml2 > matchLength)
                matchLength = ml2, start = ip, offBase = offbaseFound;
        }

        if (matchLength < 4) {
            ip += ((ip - anchor) >> kSearchStrength) + 1;   /* jump faster over incompressible sections */
            continue;
        }

        /* let's try to find a better solution */
        if (depth >= 1)
        while (ip < ilimit) {
            DEBUGLOG(7, "search depth 1");
            ip++;
            if ((dictMode == ZSTD_noDict)
              && (offBase) && ((offset_1 > 0) & (MEM_read32(ip) == MEM_read32(ip - offset_1)))) {
                size_t const mlRep = ZSTD_count(ip + 4, ip + 4 - offset_1, iend) + 4;
                int const gain2 = (int)(mlRep * 3);
2021-12-29 17:30:43 -08:00
int const gain1 = ( int ) ( matchLength * 3 - ZSTD_highbit32 ( ( U32 ) offBase ) + 1 ) ;
2017-09-01 18:28:35 -07:00
if ( ( mlRep > = 4 ) & & ( gain2 > gain1 ) )
2021-12-29 17:30:43 -08:00
matchLength = mlRep , offBase = REPCODE1_TO_OFFBASE , start = ip ;
2017-09-01 18:28:35 -07:00
}
2020-09-10 18:18:50 -04:00
if ( isDxS ) {
2018-06-06 19:54:13 -04:00
const U32 repIndex = ( U32 ) ( ip - base ) - offset_1 ;
const BYTE * repMatch = repIndex < prefixLowestIndex ?
dictBase + ( repIndex - dictIndexDelta ) :
base + repIndex ;
if ( ( ( U32 ) ( ( prefixLowestIndex - 1 ) - repIndex ) > = 3 /* intentional underflow */ )
& & ( MEM_read32 ( repMatch ) = = MEM_read32 ( ip ) ) ) {
const BYTE * repMatchEnd = repIndex < prefixLowestIndex ? dictEnd : iend ;
size_t const mlRep = ZSTD_count_2segments ( ip + 4 , repMatch + 4 , iend , repMatchEnd , prefixLowest ) + 4 ;
int const gain2 = ( int ) ( mlRep * 3 ) ;
2021-12-29 17:30:43 -08:00
int const gain1 = ( int ) ( matchLength * 3 - ZSTD_highbit32 ( ( U32 ) offBase ) + 1 ) ;
2018-06-06 19:54:13 -04:00
if ( ( mlRep > = 4 ) & & ( gain2 > gain1 ) )
2021-12-29 17:30:43 -08:00
matchLength = mlRep , offBase = REPCODE1_TO_OFFBASE , start = ip ;
2018-06-06 19:54:13 -04:00
}
}
2021-12-29 17:30:43 -08:00
{ size_t ofbCandidate = 999999999 ;
size_t const ml2 = searchMax ( ms , ip , iend , & ofbCandidate ) ;
int const gain2 = ( int ) ( ml2 * 4 - ZSTD_highbit32 ( ( U32 ) ofbCandidate ) ) ; /* raw approx */
int const gain1 = ( int ) ( matchLength * 4 - ZSTD_highbit32 ( ( U32 ) offBase ) + 4 ) ;
2017-09-01 18:28:35 -07:00
if ( ( ml2 > = 4 ) & & ( gain2 > gain1 ) ) {
2021-12-29 17:30:43 -08:00
matchLength = ml2 , offBase = ofbCandidate , start = ip ;
2017-09-01 18:28:35 -07:00
continue ; /* search a better one */
} }
/* let's find an even better one */
if ( ( depth = = 2 ) & & ( ip < ilimit ) ) {
2021-12-28 16:18:44 -08:00
DEBUGLOG ( 7 , " search depth 2 " ) ;
2017-09-01 18:28:35 -07:00
ip + + ;
2018-06-06 19:54:13 -04:00
if ( ( dictMode = = ZSTD_noDict )
2021-12-29 17:30:43 -08:00
& & ( offBase ) & & ( ( offset_1 > 0 ) & ( MEM_read32 ( ip ) = = MEM_read32 ( ip - offset_1 ) ) ) ) {
2018-06-06 19:54:13 -04:00
size_t const mlRep = ZSTD_count ( ip + 4 , ip + 4 - offset_1 , iend ) + 4 ;
int const gain2 = ( int ) ( mlRep * 4 ) ;
2021-12-29 17:30:43 -08:00
int const gain1 = ( int ) ( matchLength * 4 - ZSTD_highbit32 ( ( U32 ) offBase ) + 1 ) ;
2018-06-06 19:54:13 -04:00
if ( ( mlRep > = 4 ) & & ( gain2 > gain1 ) )
2021-12-29 17:30:43 -08:00
matchLength = mlRep , offBase = REPCODE1_TO_OFFBASE , start = ip ;
2018-06-06 19:54:13 -04:00
}
2020-09-10 18:18:50 -04:00
if ( isDxS ) {
2018-06-06 19:54:13 -04:00
const U32 repIndex = ( U32 ) ( ip - base ) - offset_1 ;
const BYTE * repMatch = repIndex < prefixLowestIndex ?
dictBase + ( repIndex - dictIndexDelta ) :
base + repIndex ;
if ( ( ( U32 ) ( ( prefixLowestIndex - 1 ) - repIndex ) > = 3 /* intentional underflow */ )
& & ( MEM_read32 ( repMatch ) = = MEM_read32 ( ip ) ) ) {
const BYTE * repMatchEnd = repIndex < prefixLowestIndex ? dictEnd : iend ;
size_t const mlRep = ZSTD_count_2segments ( ip + 4 , repMatch + 4 , iend , repMatchEnd , prefixLowest ) + 4 ;
int const gain2 = ( int ) ( mlRep * 4 ) ;
2021-12-29 17:30:43 -08:00
int const gain1 = ( int ) ( matchLength * 4 - ZSTD_highbit32 ( ( U32 ) offBase ) + 1 ) ;
2018-06-06 19:54:13 -04:00
if ( ( mlRep > = 4 ) & & ( gain2 > gain1 ) )
2021-12-29 17:30:43 -08:00
matchLength = mlRep , offBase = REPCODE1_TO_OFFBASE , start = ip ;
2018-06-06 19:54:13 -04:00
}
2017-09-01 18:28:35 -07:00
}
2021-12-29 17:30:43 -08:00
{ size_t ofbCandidate = 999999999 ;
size_t const ml2 = searchMax ( ms , ip , iend , & ofbCandidate ) ;
int const gain2 = ( int ) ( ml2 * 4 - ZSTD_highbit32 ( ( U32 ) ofbCandidate ) ) ; /* raw approx */
int const gain1 = ( int ) ( matchLength * 4 - ZSTD_highbit32 ( ( U32 ) offBase ) + 7 ) ;
2017-09-01 18:28:35 -07:00
if ( ( ml2 > = 4 ) & & ( gain2 > gain1 ) ) {
2021-12-29 17:30:43 -08:00
matchLength = ml2 , offBase = ofbCandidate , start = ip ;
2017-09-01 18:28:35 -07:00
continue ;
} } }
break ; /* nothing found : store previous solution */
}
/* NOTE:
2021-12-28 16:18:44 -08:00
* Pay attention that ` start [ - value ] ` can lead to strange undefined behavior
2021-12-28 13:47:57 -08:00
* notably if ` value ` is unsigned , resulting in a large positive ` - value ` .
2017-09-01 18:28:35 -07:00
*/
/* catch up */
2021-12-29 17:30:43 -08:00
if ( OFFBASE_IS_OFFSET ( offBase ) ) {
2018-06-08 15:06:47 -04:00
if ( dictMode = = ZSTD_noDict ) {
2021-12-29 17:30:43 -08:00
while ( ( ( start > anchor ) & ( start - OFFBASE_TO_OFFSET ( offBase ) > prefixLowest ) )
& & ( start [ - 1 ] = = ( start - OFFBASE_TO_OFFSET ( offBase ) ) [ - 1 ] ) ) /* only search for offset within prefix */
2018-06-08 15:06:47 -04:00
{ start - - ; matchLength + + ; }
}
2020-09-10 18:18:50 -04:00
if ( isDxS ) {
2021-12-29 17:30:43 -08:00
U32 const matchIndex = ( U32 ) ( ( size_t ) ( start - base ) - OFFBASE_TO_OFFSET ( offBase ) ) ;
2018-06-08 15:06:47 -04:00
const BYTE * match = ( matchIndex < prefixLowestIndex ) ? dictBase + matchIndex - dictIndexDelta : base + matchIndex ;
const BYTE * const mStart = ( matchIndex < prefixLowestIndex ) ? dictLowest : prefixLowest ;
while ( ( start > anchor ) & & ( match > mStart ) & & ( start [ - 1 ] = = match [ - 1 ] ) ) { start - - ; match - - ; matchLength + + ; } /* catch up */
}
2021-12-29 17:30:43 -08:00
offset_2 = offset_1 ; offset_1 = ( U32 ) OFFBASE_TO_OFFSET ( offBase ) ;
2017-09-01 18:28:35 -07:00
}
/* store sequence */
_storeSequence :
2021-12-14 02:12:09 -08:00
{ size_t const litLength = ( size_t ) ( start - anchor ) ;
2021-12-29 17:30:43 -08:00
ZSTD_storeSeq ( seqStore , litLength , anchor , iend , ( U32 ) offBase , matchLength ) ;
2017-09-01 18:28:35 -07:00
anchor = ip = start + matchLength ;
}
/* check immediate repcode */
2020-09-10 18:18:50 -04:00
if ( isDxS ) {
2018-05-23 15:49:43 -04:00
while ( ip < = ilimit ) {
U32 const current2 = ( U32 ) ( ip - base ) ;
2018-06-06 19:54:13 -04:00
U32 const repIndex = current2 - offset_2 ;
2020-09-02 13:27:11 -04:00
const BYTE * repMatch = repIndex < prefixLowestIndex ?
2018-06-06 19:54:13 -04:00
dictBase - dictIndexDelta + repIndex :
base + repIndex ;
if ( ( ( U32 ) ( ( prefixLowestIndex - 1 ) - ( U32 ) repIndex ) > = 3 /* intentional overflow */ )
& & ( MEM_read32 ( repMatch ) = = MEM_read32 ( ip ) ) ) {
const BYTE * const repEnd2 = repIndex < prefixLowestIndex ? dictEnd : iend ;
matchLength = ZSTD_count_2segments ( ip + 4 , repMatch + 4 , iend , repEnd2 , prefixLowest ) + 4 ;
2021-12-29 17:30:43 -08:00
offBase = offset_2 ; offset_2 = offset_1 ; offset_1 = ( U32 ) offBase ; /* swap offset_2 <=> offset_1 */
ZSTD_storeSeq ( seqStore , 0 , anchor , iend , REPCODE1_TO_OFFBASE , matchLength ) ;
2018-05-23 15:49:43 -04:00
ip + = matchLength ;
anchor = ip ;
continue ;
}
break ;
}
}
if ( dictMode = = ZSTD_noDict ) {
while ( ( ( ip < = ilimit ) & ( offset_2 > 0 ) )
& & ( MEM_read32 ( ip ) = = MEM_read32 ( ip - offset_2 ) ) ) {
/* store sequence */
matchLength = ZSTD_count ( ip + 4 , ip + 4 - offset_2 , iend ) + 4 ;
2021-12-29 17:30:43 -08:00
offBase = offset_2 ; offset_2 = offset_1 ; offset_1 = ( U32 ) offBase ; /* swap repcodes */
ZSTD_storeSeq ( seqStore , 0 , anchor , iend , REPCODE1_TO_OFFBASE , matchLength ) ;
2018-05-23 15:49:43 -04:00
ip + = matchLength ;
anchor = ip ;
continue ; /* faster when present ... (?) */
} } }
2017-09-01 18:28:35 -07:00
2022-05-12 12:53:15 -04:00
/* If offset_1 started invalid (offsetSaved1 != 0) and became valid (offset_1 != 0),
* rotate saved offsets . See comment in ZSTD_compressBlock_fast_noDict for more context . */
offsetSaved2 = ( ( offsetSaved1 ! = 0 ) & & ( offset_1 ! = 0 ) ) ? offsetSaved1 : offsetSaved2 ;
2022-05-09 17:17:11 -04:00
/* save reps for next block */
rep [ 0 ] = offset_1 ? offset_1 : offsetSaved1 ;
rep [ 1 ] = offset_2 ? offset_2 : offsetSaved2 ;
2017-09-01 18:28:35 -07:00
2017-09-06 15:56:32 -07:00
/* Return the last literals size */
2019-08-02 14:42:53 +02:00
return ( size_t ) ( iend - anchor ) ;
2017-09-01 18:28:35 -07:00
}
size_t ZSTD_compressBlock_btlazy2(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_binaryTree, 2, ZSTD_noDict);
}
size_t ZSTD_compressBlock_lazy2(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_hashChain, 2, ZSTD_noDict);
}

size_t ZSTD_compressBlock_lazy(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_hashChain, 1, ZSTD_noDict);
}

size_t ZSTD_compressBlock_greedy(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_hashChain, 0, ZSTD_noDict);
}

size_t ZSTD_compressBlock_btlazy2_dictMatchState(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_binaryTree, 2, ZSTD_dictMatchState);
}

size_t ZSTD_compressBlock_lazy2_dictMatchState(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_hashChain, 2, ZSTD_dictMatchState);
}

size_t ZSTD_compressBlock_lazy_dictMatchState(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_hashChain, 1, ZSTD_dictMatchState);
}

size_t ZSTD_compressBlock_greedy_dictMatchState(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_hashChain, 0, ZSTD_dictMatchState);
}


size_t ZSTD_compressBlock_lazy2_dedicatedDictSearch(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_hashChain, 2, ZSTD_dedicatedDictSearch);
}

size_t ZSTD_compressBlock_lazy_dedicatedDictSearch(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_hashChain, 1, ZSTD_dedicatedDictSearch);
}

size_t ZSTD_compressBlock_greedy_dedicatedDictSearch(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_hashChain, 0, ZSTD_dedicatedDictSearch);
}

/* Row-based matchfinder */
size_t ZSTD_compressBlock_lazy2_row(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_rowHash, 2, ZSTD_noDict);
}

size_t ZSTD_compressBlock_lazy_row(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_rowHash, 1, ZSTD_noDict);
}

size_t ZSTD_compressBlock_greedy_row(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_rowHash, 0, ZSTD_noDict);
}

size_t ZSTD_compressBlock_lazy2_dictMatchState_row(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_rowHash, 2, ZSTD_dictMatchState);
}

size_t ZSTD_compressBlock_lazy_dictMatchState_row(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_rowHash, 1, ZSTD_dictMatchState);
}

size_t ZSTD_compressBlock_greedy_dictMatchState_row(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_rowHash, 0, ZSTD_dictMatchState);
}

size_t ZSTD_compressBlock_lazy2_dedicatedDictSearch_row(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_rowHash, 2, ZSTD_dedicatedDictSearch);
}

size_t ZSTD_compressBlock_lazy_dedicatedDictSearch_row(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_rowHash, 1, ZSTD_dedicatedDictSearch);
}

size_t ZSTD_compressBlock_greedy_dedicatedDictSearch_row(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_generic(ms, seqStore, rep, src, srcSize, search_rowHash, 0, ZSTD_dedicatedDictSearch);
}
FORCE_INLINE_TEMPLATE
size_t ZSTD_compressBlock_lazy_extDict_generic(
                        ZSTD_matchState_t* ms, seqStore_t* seqStore,
                        U32 rep[ZSTD_REP_NUM],
                        const void* src, size_t srcSize,
                        const searchMethod_e searchMethod, const U32 depth)
{
    const BYTE* const istart = (const BYTE*)src;
    const BYTE* ip = istart;
    const BYTE* anchor = istart;
    const BYTE* const iend = istart + srcSize;
    const BYTE* const ilimit = searchMethod == search_rowHash ? iend - 8 - ZSTD_ROW_HASH_CACHE_SIZE : iend - 8;
    const BYTE* const base = ms->window.base;
    const U32 dictLimit = ms->window.dictLimit;
    const BYTE* const prefixStart = base + dictLimit;
    const BYTE* const dictBase = ms->window.dictBase;
    const BYTE* const dictEnd = dictBase + dictLimit;
    const BYTE* const dictStart = dictBase + ms->window.lowLimit;
    const U32 windowLog = ms->cParams.windowLog;
    const U32 rowLog = ms->cParams.searchLog < 5 ? 4 : 5;

    searchMax_f const searchMax = ZSTD_selectLazyVTable(ms, searchMethod, ZSTD_extDict)->searchMax;
    U32 offset_1 = rep[0], offset_2 = rep[1];

    DEBUGLOG(5, "ZSTD_compressBlock_lazy_extDict_generic (searchFunc=%u)", (U32)searchMethod);

    /* init */
    ip += (ip == prefixStart);
    if (searchMethod == search_rowHash) {
        ZSTD_row_fillHashCache(ms, base, rowLog,
                               MIN(ms->cParams.minMatch, 6 /* mls caps out at 6 */),
                               ms->nextToUpdate, ilimit);
    }

    /* Match Loop */
#if defined(__GNUC__) && defined(__x86_64__)
    /* I've measured a random 5% speed loss on levels 5 & 6 (greedy) when the
     * code alignment is perturbed. To fix the instability align the loop on 32-bytes.
     */
    __asm__(".p2align 5");
#endif
    while (ip < ilimit) {
        size_t matchLength = 0;
        size_t offBase = REPCODE1_TO_OFFBASE;
        const BYTE* start = ip + 1;
        U32 curr = (U32)(ip - base);

        /* check repCode */
        {   const U32 windowLow = ZSTD_getLowestMatchIndex(ms, curr+1, windowLog);
            const U32 repIndex = (U32)(curr+1 - offset_1);
            const BYTE* const repBase = repIndex < dictLimit ? dictBase : base;
            const BYTE* const repMatch = repBase + repIndex;
            if ( ((U32)((dictLimit-1) - repIndex) >= 3) /* intentional overflow */
               & (offset_1 <= curr+1 - windowLow) ) /* note: we are searching at curr+1 */
            if (MEM_read32(ip+1) == MEM_read32(repMatch)) {
                /* repcode detected, we should take it */
                const BYTE* const repEnd = repIndex < dictLimit ? dictEnd : iend;
                matchLength = ZSTD_count_2segments(ip+1+4, repMatch+4, iend, repEnd, prefixStart) + 4;
                if (depth==0) goto _storeSequence;
        }   }

        /* first search (depth 0) */
        {   size_t ofbCandidate = 999999999;
            size_t const ml2 = searchMax(ms, ip, iend, &ofbCandidate);
            if (ml2 > matchLength)
                matchLength = ml2, start = ip, offBase = ofbCandidate;
        }

        if (matchLength < 4) {
            ip += ((ip-anchor) >> kSearchStrength) + 1;   /* jump faster over incompressible sections */
            continue;
        }

        /* let's try to find a better solution */
        if (depth>=1)
        while (ip<ilimit) {
            ip ++;
            curr++;
            /* check repCode */
            if (offBase) {
                const U32 windowLow = ZSTD_getLowestMatchIndex(ms, curr, windowLog);
                const U32 repIndex = (U32)(curr - offset_1);
                const BYTE* const repBase = repIndex < dictLimit ? dictBase : base;
                const BYTE* const repMatch = repBase + repIndex;
                if ( ((U32)((dictLimit-1) - repIndex) >= 3) /* intentional overflow : do not test positions overlapping 2 memory segments */
                   & (offset_1 <= curr - windowLow) ) /* equivalent to `curr > repIndex >= windowLow` */
                if (MEM_read32(ip) == MEM_read32(repMatch)) {
                    /* repcode detected */
                    const BYTE* const repEnd = repIndex < dictLimit ? dictEnd : iend;
                    size_t const repLength = ZSTD_count_2segments(ip+4, repMatch+4, iend, repEnd, prefixStart) + 4;
                    int const gain2 = (int)(repLength * 3);
                    int const gain1 = (int)(matchLength*3 - ZSTD_highbit32((U32)offBase) + 1);
                    if ((repLength >= 4) && (gain2 > gain1))
                        matchLength = repLength, offBase = REPCODE1_TO_OFFBASE, start = ip;
            }   }

            /* search match, depth 1 */
            {   size_t ofbCandidate = 999999999;
                size_t const ml2 = searchMax(ms, ip, iend, &ofbCandidate);
                int const gain2 = (int)(ml2*4 - ZSTD_highbit32((U32)ofbCandidate));   /* raw approx */
                int const gain1 = (int)(matchLength*4 - ZSTD_highbit32((U32)offBase) + 4);
                if ((ml2 >= 4) && (gain2 > gain1)) {
                    matchLength = ml2, offBase = ofbCandidate, start = ip;
                    continue;   /* search a better one */
            }   }

            /* let's find an even better one */
            if ((depth==2) && (ip<ilimit)) {
                ip ++;
                curr++;
                /* check repCode */
                if (offBase) {
                    const U32 windowLow = ZSTD_getLowestMatchIndex(ms, curr, windowLog);
                    const U32 repIndex = (U32)(curr - offset_1);
                    const BYTE* const repBase = repIndex < dictLimit ? dictBase : base;
                    const BYTE* const repMatch = repBase + repIndex;
                    if ( ((U32)((dictLimit-1) - repIndex) >= 3) /* intentional overflow : do not test positions overlapping 2 memory segments */
                       & (offset_1 <= curr - windowLow) ) /* equivalent to `curr > repIndex >= windowLow` */
                    if (MEM_read32(ip) == MEM_read32(repMatch)) {
                        /* repcode detected */
                        const BYTE* const repEnd = repIndex < dictLimit ? dictEnd : iend;
                        size_t const repLength = ZSTD_count_2segments(ip+4, repMatch+4, iend, repEnd, prefixStart) + 4;
                        int const gain2 = (int)(repLength * 4);
                        int const gain1 = (int)(matchLength*4 - ZSTD_highbit32((U32)offBase) + 1);
                        if ((repLength >= 4) && (gain2 > gain1))
                            matchLength = repLength, offBase = REPCODE1_TO_OFFBASE, start = ip;
                }   }

                /* search match, depth 2 */
                {   size_t ofbCandidate = 999999999;
                    size_t const ml2 = searchMax(ms, ip, iend, &ofbCandidate);
                    int const gain2 = (int)(ml2*4 - ZSTD_highbit32((U32)ofbCandidate));   /* raw approx */
                    int const gain1 = (int)(matchLength*4 - ZSTD_highbit32((U32)offBase) + 7);
                    if ((ml2 >= 4) && (gain2 > gain1)) {
                        matchLength = ml2, offBase = ofbCandidate, start = ip;
                        continue;
            }   }   }
            break;  /* nothing found : store previous solution */
        }

        /* catch up */
        if (OFFBASE_IS_OFFSET(offBase)) {
            U32 const matchIndex = (U32)((size_t)(start-base) - OFFBASE_TO_OFFSET(offBase));
            const BYTE* match = (matchIndex < dictLimit) ? dictBase + matchIndex : base + matchIndex;
            const BYTE* const mStart = (matchIndex < dictLimit) ? dictStart : prefixStart;
            while ((start>anchor) && (match>mStart) && (start[-1] == match[-1])) { start--; match--; matchLength++; }  /* catch up */
            offset_2 = offset_1; offset_1 = (U32)OFFBASE_TO_OFFSET(offBase);
        }

        /* store sequence */
_storeSequence:
        {   size_t const litLength = (size_t)(start - anchor);
            ZSTD_storeSeq(seqStore, litLength, anchor, iend, (U32)offBase, matchLength);
            anchor = ip = start + matchLength;
        }

        /* check immediate repcode */
        while (ip <= ilimit) {
            const U32 repCurrent = (U32)(ip-base);
            const U32 windowLow = ZSTD_getLowestMatchIndex(ms, repCurrent, windowLog);
            const U32 repIndex = repCurrent - offset_2;
            const BYTE* const repBase = repIndex < dictLimit ? dictBase : base;
            const BYTE* const repMatch = repBase + repIndex;
            if ( ((U32)((dictLimit-1) - repIndex) >= 3) /* intentional overflow : do not test positions overlapping 2 memory segments */
               & (offset_2 <= repCurrent - windowLow) ) /* equivalent to `curr > repIndex >= windowLow` */
            if (MEM_read32(ip) == MEM_read32(repMatch)) {
                /* repcode detected, we should take it */
                const BYTE* const repEnd = repIndex < dictLimit ? dictEnd : iend;
                matchLength = ZSTD_count_2segments(ip+4, repMatch+4, iend, repEnd, prefixStart) + 4;
                offBase = offset_2; offset_2 = offset_1; offset_1 = (U32)offBase;   /* swap offset history */
                ZSTD_storeSeq(seqStore, 0, anchor, iend, REPCODE1_TO_OFFBASE, matchLength);
                ip += matchLength;
                anchor = ip;
                continue;   /* faster when present ... (?) */
            }
            break;
    }   }

    /* Save reps for next block */
    rep[0] = offset_1;
    rep[1] = offset_2;

    /* Return the last literals size */
    return (size_t)(iend - anchor);
}
size_t ZSTD_compressBlock_greedy_extDict(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_extDict_generic(ms, seqStore, rep, src, srcSize, search_hashChain, 0);
}

size_t ZSTD_compressBlock_lazy_extDict(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_extDict_generic(ms, seqStore, rep, src, srcSize, search_hashChain, 1);
}

size_t ZSTD_compressBlock_lazy2_extDict(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_extDict_generic(ms, seqStore, rep, src, srcSize, search_hashChain, 2);
}

size_t ZSTD_compressBlock_btlazy2_extDict(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_extDict_generic(ms, seqStore, rep, src, srcSize, search_binaryTree, 2);
}

size_t ZSTD_compressBlock_greedy_extDict_row(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_extDict_generic(ms, seqStore, rep, src, srcSize, search_rowHash, 0);
}

size_t ZSTD_compressBlock_lazy_extDict_row(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_extDict_generic(ms, seqStore, rep, src, srcSize, search_rowHash, 1);
}

size_t ZSTD_compressBlock_lazy2_extDict_row(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize)
{
    return ZSTD_compressBlock_lazy_extDict_generic(ms, seqStore, rep, src, srcSize, search_rowHash, 2);
}