mirror of https://github.com/BurntSushi/ripgrep.git
synced 2025-04-19 09:02:15 +02:00
doc: add section on PCRE2 performance
parent 9df60e164e
commit 87a627631c
292 FAQ.md
@ -16,6 +16,7 @@
* [How do I get around the regex size limit?](#size-limit)
* [How do I make the `-f/--file` flag faster?](#dfa-size)
* [How do I make the output look like The Silver Searcher's output?](#silver-searcher-output)
* [Why does ripgrep get slower when I enable PCRE2 regexes?](#pcre2-slow)
* [When I run `rg`, why does it execute some other command?](#rg-other-cmd)
* [How do I create an alias for ripgrep on Windows?](#rg-alias-windows)
* [How do I create a PowerShell profile?](#powershell-profile)
@ -392,6 +393,297 @@ $ RIPGREP_CONFIG_PATH=$HOME/.config/ripgrep/rc rg foo
|
|||||||
```
|
```
|
||||||

<h3 name="pcre2-slow">
Why does ripgrep get slower when I enable PCRE2 regexes?
</h3>

When you use the `--pcre2` (`-P` for short) flag, ripgrep will use the PCRE2
regex engine instead of the default. Both regex engines are quite fast,
but PCRE2 provides a number of additional features such as look-around and
backreferences that many enjoy using. This is largely because PCRE2 uses
a backtracking implementation, whereas the default regex engine uses a
finite-automaton-based implementation. The former makes it possible to add lots
of bells and whistles over the latter, but the latter executes with worst-case
linear time complexity.

With that out of the way, if you've used `-P` with ripgrep, you may have
noticed that it can be slower. The reasons for this are quite complex,
and they are complex because the optimizations that ripgrep uses to implement
fast search are complex.

The task ripgrep has before it is somewhat simple: all it needs to do is search
a file for occurrences of some pattern and then print the lines containing
those occurrences. The problem lies in what is considered a valid match and how
exactly we read the bytes from a file.

In terms of what is considered a valid match, remember that ripgrep will only
report matches spanning a single line by default. The problem here is that
some patterns can match across multiple lines, and ripgrep needs to prevent
that from happening. For example, `foo\sbar` will match `foo\nbar`. The most
obvious way to achieve this is to read the data from a file, and then apply
the pattern search to each line of that data individually. The problem with
this approach is that it can be quite slow; it would be much faster to let the
pattern search across as much data as possible. It's faster because it gets rid
of the overhead of finding the boundaries of every line, and also because it
gets rid of the overhead of starting and stopping the pattern search for every
single line. (This is operating under the general assumption that matching
lines are much rarer than non-matching lines.)
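
To make the two strategies concrete, here is a minimal Python sketch. It is not
ripgrep's implementation: the function names are hypothetical, Python's `re`
module stands in for a regex engine, and the whole-buffer variant is only
correct under the assumption that the pattern can never match a `\n`.

```python
import re


def search_line_by_line(pattern: str, data: str) -> list[tuple[int, str]]:
    """Slow strategy: split into lines first, then run the regex on each line."""
    regex = re.compile(pattern)
    hits = []
    for number, line in enumerate(data.split("\n"), start=1):
        if regex.search(line):
            hits.append((number, line))
    return hits


def search_whole_buffer(pattern: str, data: str) -> list[tuple[int, str]]:
    """Fast strategy: run the regex over the whole buffer and only look for
    line boundaries around actual matches.

    Assumption: the pattern cannot match a newline.
    """
    regex = re.compile(pattern)
    hits = []
    pos = 0
    while pos <= len(data):
        match = regex.search(data, pos)
        if match is None:
            break
        # Expand the match to the enclosing line.
        start = data.rfind("\n", 0, match.start()) + 1
        end = data.find("\n", match.end())
        if end == -1:
            end = len(data)
        number = data.count("\n", 0, start) + 1
        hits.append((number, data[start:end]))
        # Resume searching at the start of the next line.
        pos = end + 1
    return hits


data = "foo bar\nbaz\nfoo\nbar\nxfoo\tbar\n"
pattern = r"foo[^\S\n]bar"  # whitespace between foo and bar, but never a newline
print(search_line_by_line(pattern, data))
print(search_whole_buffer(pattern, data))
```

Both functions report the same lines here; the second one simply avoids touching
line boundaries for the (usually far more common) lines that don't match.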

It turns out that we can use the faster approach by applying a very simple
restriction to the pattern: *statically prevent* the pattern from matching
through a `\n` character. Namely, when given a pattern like `foo\sbar`,
ripgrep will remove `\n` from the `\s` character class automatically. In some
cases, a simple removal is not possible. For example, ripgrep will return an
error when your pattern includes a `\n` literal:

```
$ rg '\n'
the literal '"\n"' is not allowed in a regex
```
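
As a rough illustration of what removing `\n` from `\s` means, the Python sketch
below compares `foo\sbar` with a hand-written equivalent whose whitespace class
excludes the newline. This is an assumption-laden stand-in: ripgrep performs the
rewrite on the parsed pattern inside its own regex engine, whereas here the
restricted pattern `foo[^\S\n]bar` is simply written out by hand.

```python
import re

# The pattern as the user wrote it: \s includes "\n".
unrestricted = re.compile(r"foo\sbar")

# Hand-written equivalent with "\n" removed from the class:
# [^\S\n] means "not a non-whitespace character and not a newline",
# i.e. any whitespace character except "\n".
restricted = re.compile(r"foo[^\S\n]bar")


def matches(regex: re.Pattern, text: str) -> bool:
    """Return True if the regex matches anywhere in text."""
    return regex.search(text) is not None


for text in ["foo bar", "foo\tbar", "foo\nbar"]:
    print(repr(text), matches(unrestricted, text), matches(restricted, text))
```

Only the restricted pattern is safe to run over a buffer containing many lines,
because it can never produce a match that crosses a line boundary.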

So what does this have to do with PCRE2? Well, ripgrep's default regex engine
exposes APIs for doing syntactic analysis on the pattern in a way that makes
it quite easy to strip `\n` from the pattern (or otherwise detect it and report
an error if stripping isn't possible). PCRE2 seemingly does not provide a
similar API, so ripgrep does not do any stripping when PCRE2 is enabled. This
forces ripgrep to use the "slow" search strategy of searching each line
individually.

OK, so if enabling PCRE2 slows down the default method of searching because it
forces ripgrep to search line by line, then why is PCRE2 also sometimes
slower when performing multiline searches? Well, that's because there are
*multiple* reasons why using PCRE2 in ripgrep can be slower than the default
regex engine. This time, blame PCRE2's Unicode support, which ripgrep enables
by default. In particular, PCRE2 cannot simultaneously enable Unicode support
and search arbitrary data. That is, when PCRE2's Unicode support is enabled,
the data **must** be valid UTF-8 (to do otherwise is to invoke undefined
behavior). This is in contrast to ripgrep's default regex engine, which can
enable Unicode support and still search arbitrary data. ripgrep's default
regex engine simply won't match invalid UTF-8 for a pattern that can otherwise
only match valid UTF-8. Why doesn't PCRE2 do the same? This author isn't
familiar with its internals, and so can't comment on it here.

The bottom line here is that we can't enable PCRE2's Unicode support without
simultaneously incurring a performance penalty for ensuring that we are
searching valid UTF-8. In particular, ripgrep will transcode the contents
of each file to UTF-8 while replacing invalid UTF-8 data with the Unicode
replacement codepoint. ripgrep then disables PCRE2's own internal UTF-8
checking, since we've guaranteed the data we hand it will be valid UTF-8.
ripgrep takes this approach because if we do hand PCRE2 invalid UTF-8, then it
will report a match error when it comes across an invalid UTF-8 sequence. This
is not good news for ripgrep, since it will stop it from searching the rest of
the file, and will also print potentially undesirable error messages to users.

All right, the above is a lot of information to swallow if you aren't already
familiar with ripgrep internals. Let's make this concrete with some examples.
First, let's get some data big enough to magnify the performance differences:

```
$ curl -O 'https://burntsushi.net/stuff/subtitles2016-sample.gz'
$ gzip -d subtitles2016-sample.gz
$ md5sum subtitles2016-sample
e3cb796a20bbc602fbfd6bb43bda45f5 subtitles2016-sample
```

To search this data, we will use the pattern `^\w{42}$`, which contains exactly
one hit in the file and has no literals. Having no literals is important,
because it ensures that the regex engine won't use literal optimizations to
speed up the search. In other words, it lets us reason coherently about the
actual task that the regex engine is performing.
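
For readers who want to see what the pattern accepts, here is a small Python
sketch. It uses Python's `re` module rather than either of ripgrep's engines,
so `re.MULTILINE` is needed to make `^` and `$` behave as line anchors; the
surrounding sample lines are made up for illustration.

```python
import re

# The one matching line from the sample file, as shown in the search output.
hit = "EverymajordevelopmentinthehistoryofAmerica"

# In Python, ^ and $ only act as line anchors with re.MULTILINE.
pattern = re.compile(r"^\w{42}$", re.MULTILINE)

# Hypothetical surrounding lines: one with a space, one that is 43 word
# characters long. Neither should match.
sample = "a short line\n" + hit + "\n" + hit + "x\n"

print(len(hit))
print(pattern.findall(sample))
```

Note that the pattern contains no literal characters at all, so there is nothing
for a literal prefilter to latch onto.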

Let's now walk through a few examples in light of the information above. First,
let's consider the default search using ripgrep's default regex engine and
then the same search with PCRE2:

```
$ time rg '^\w{42}$' subtitles2016-sample
21225780:EverymajordevelopmentinthehistoryofAmerica

real 0m1.783s
user 0m1.731s
sys 0m0.051s

$ time rg -P '^\w{42}$' subtitles2016-sample
21225780:EverymajordevelopmentinthehistoryofAmerica

real 0m2.458s
user 0m2.419s
sys 0m0.038s
```

In this particular example, both pattern searches are using a Unicode-aware
`\w` character class and both are counting lines in order to report line
numbers. The key difference here is that the first search will not search
line by line, but the second one will. We can observe which strategy ripgrep
uses by passing the `--trace` flag:

```
$ rg '^\w{42}$' subtitles2016-sample --trace
[... snip ...]
TRACE|grep_searcher::searcher|grep-searcher/src/searcher/mod.rs:622: Some("subtitles2016-sample"): searching via memory map
TRACE|grep_searcher::searcher|grep-searcher/src/searcher/mod.rs:712: slice reader: searching via slice-by-line strategy
TRACE|grep_searcher::searcher::core|grep-searcher/src/searcher/core.rs:61: searcher core: will use fast line searcher
[... snip ...]

$ rg -P '^\w{42}$' subtitles2016-sample --trace
[... snip ...]
TRACE|grep_searcher::searcher|grep-searcher/src/searcher/mod.rs:622: Some("subtitles2016-sample"): searching via memory map
TRACE|grep_searcher::searcher|grep-searcher/src/searcher/mod.rs:705: slice reader: needs transcoding, using generic reader
TRACE|grep_searcher::searcher|grep-searcher/src/searcher/mod.rs:685: generic reader: searching via roll buffer strategy
TRACE|grep_searcher::searcher::core|grep-searcher/src/searcher/core.rs:63: searcher core: will use slow line searcher
[... snip ...]
```

The first says it is using the "fast line searcher" whereas the latter says
it is using the "slow line searcher." The latter also shows that we are
decoding the contents of the file, which also impacts performance.

Interestingly, in this case, the pattern does not match a `\n` and the file
we're searching is valid UTF-8, so neither the slow line-by-line search
strategy nor the decoding is necessary. We could fix the former issue with
better PCRE2 introspection APIs. We can actually fix the latter issue with
ripgrep's `--no-encoding` flag, which prevents the automatic UTF-8 decoding,
but will enable PCRE2's own UTF-8 validity checking. Unfortunately, it's slower
in my build of ripgrep:

```
$ time rg -P '^\w{42}$' subtitles2016-sample --no-encoding
21225780:EverymajordevelopmentinthehistoryofAmerica

real 0m3.074s
user 0m3.021s
sys 0m0.051s
```

(Tip: use the `--trace` flag to verify that no decoding in ripgrep is
happening.)

A possible reason why PCRE2's UTF-8 checking is slower is that it might
not be as fast as the highly optimized UTF-8 checking routines found in the
[`encoding_rs`](https://github.com/hsivonen/encoding_rs) library, which is what
ripgrep uses for UTF-8 decoding. Moreover, my build of ripgrep enables
`encoding_rs`'s SIMD optimizations, which may be in play here.

Also, note that using the `--no-encoding` flag can cause PCRE2 to report
invalid UTF-8 errors, which causes ripgrep to stop searching the file:

```
$ cat invalid-utf8
foobar

$ xxd invalid-utf8
00000000: 666f 6fff 6261 720a foo.bar.

$ rg foo invalid-utf8
1:foobar

$ rg -P foo invalid-utf8
1:foo�bar

$ rg -P foo invalid-utf8 --no-encoding
invalid-utf8: PCRE2: error matching: UTF-8 error: illegal byte (0xfe or 0xff)
```

All right, so at this point, you might think that we could remove the penalty
for line-by-line searching by enabling multiline search. After all, our
particular pattern can't match across multiple lines anyway, so we'll still get
the results we want. Let's try it:

```
$ time rg -U '^\w{42}$' subtitles2016-sample
21225780:EverymajordevelopmentinthehistoryofAmerica

real 0m1.803s
user 0m1.748s
sys 0m0.054s

$ time rg -P -U '^\w{42}$' subtitles2016-sample
21225780:EverymajordevelopmentinthehistoryofAmerica

real 0m2.962s
user 0m2.246s
sys 0m0.713s
```

Search times remain about the same with the default regex engine, but the PCRE2
search gets _slower_. What happened? The secrets can be revealed with the
`--trace` flag once again. In the former case, ripgrep actually detects that
the pattern can't match across multiple lines, and so will fall back to the
"fast line search" strategy as with our search without `-U`.

However, for PCRE2, things are much worse. Namely, since Unicode mode is still
enabled, ripgrep is still going to decode UTF-8 to ensure that it hands only
valid UTF-8 to PCRE2. Unfortunately, one key downside of multiline search is
that ripgrep cannot do it incrementally. Since matches can be arbitrarily long,
ripgrep actually needs the entire file in memory at once. Normally, we can use
a memory map for this, but because we need to UTF-8 decode the file before
searching it, ripgrep winds up reading the entire contents of the file onto
the heap before executing a search. Ouch.
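
The difference between the two access patterns can be sketched in Python. This
is a loose analogy under stated assumptions, not ripgrep's code: the function
names and the temporary file are hypothetical, `mmap` plus a bytes regex stands
in for "memory map, no decoding", and `read()` plus `decode()` stands in for
"decode the whole file onto the heap first".

```python
import mmap
import os
import re
import tempfile


def search_mapped(path: str, pattern: bytes) -> bool:
    """Search the raw bytes in place through a memory map (no upfront copy)."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return re.search(pattern, mapped) is not None


def search_decoded(path: str, pattern: str) -> bool:
    """Read the entire file onto the heap, decode it, and only then search."""
    with open(path, "rb") as f:
        raw = f.read()  # whole file copied into memory
    text = raw.decode("utf-8", errors="replace")  # second full-size allocation
    return re.search(pattern, text) is not None


def demo() -> tuple[bool, bool]:
    """Run both strategies against a small temporary file."""
    line = "EverymajordevelopmentinthehistoryofAmerica"
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(("short\n" + line + "\n").encode("utf-8"))
    try:
        return (
            search_mapped(tmp.name, rb"(?m)^\w{42}$"),
            search_decoded(tmp.name, r"(?m)^\w{42}$"),
        )
    finally:
        os.unlink(tmp.name)


print(demo())
```

Both calls find the line, but the second one has to allocate memory
proportional to the file size before the regex engine sees a single byte.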

OK, so Unicode is killing us here. The file we're searching is _mostly_ ASCII,
so maybe we're OK with missing some data. (Try `rg '[\w--\p{ascii}]'` to see
non-ASCII word characters that an ASCII-only `\w` character class would miss.)
We can disable Unicode in both searches, but this is done differently depending
on the regex engine we use:

```
$ time rg '(?-u)^\w{42}$' subtitles2016-sample
21225780:EverymajordevelopmentinthehistoryofAmerica

real 0m1.714s
user 0m1.669s
sys 0m0.044s

$ time rg -P '^\w{42}$' subtitles2016-sample --no-pcre2-unicode
21225780:EverymajordevelopmentinthehistoryofAmerica

real 0m1.997s
user 0m1.958s
sys 0m0.037s
```

For the most part, ripgrep's default regex engine performs about the same.
PCRE2 does improve a little bit, and is now almost as fast as the default
regex engine. If you look at the output of `--trace`, you'll see that ripgrep
will no longer perform UTF-8 decoding, but it does still use the slow
line-by-line searcher.
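
To see what is given up when Unicode is disabled, here is a small Python sketch
of the difference between a Unicode-aware and an ASCII-only `\w`. Python's `re`
module and its `re.ASCII` flag are used as stand-ins for the two modes; the
sample text is made up.

```python
import re

# Hypothetical sample text with two non-ASCII letters (written as escapes so
# the code points are unambiguous): "naïve café".
text = "na\u00efve caf\u00e9"

# Unicode-aware \w (Python's default for str patterns).
unicode_words = re.findall(r"\w+", text)

# ASCII-only \w, analogous to (?-u) or --no-pcre2-unicode.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Word characters that only the Unicode-aware class matches; this is the same
# idea as the [\w--\p{ascii}] pattern mentioned above.
missed = [
    ch for ch in text
    if re.fullmatch(r"\w", ch) and not re.fullmatch(r"\w", ch, flags=re.ASCII)
]

print(unicode_words)
print(ascii_words)
print(missed)
```

With the ASCII-only class, words containing accented letters are split apart,
which is why a pattern like `^\w{42}$` could miss lines it would otherwise match.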

At this point, we can combine all of our insights above: let's try to get off
of the slow line-by-line searcher by enabling multiline mode, and let's stop
UTF-8 decoding by disabling Unicode support:

```
$ time rg -U '(?-u)^\w{42}$' subtitles2016-sample
21225780:EverymajordevelopmentinthehistoryofAmerica

real 0m1.714s
user 0m1.655s
sys 0m0.058s

$ time rg -P -U '^\w{42}$' subtitles2016-sample --no-pcre2-unicode
21225780:EverymajordevelopmentinthehistoryofAmerica

real 0m1.121s
user 0m1.071s
sys 0m0.048s
```

Ah, there's PCRE2's JIT shining! ripgrep's default regex engine once again
remains about the same, but PCRE2 no longer needs to search line-by-line and it
no longer needs to do any kind of UTF-8 checks. This allows the file to get
memory mapped and passed right through PCRE2's JIT at impressive speeds. (As
a brief and interesting historical note, the configuration of "memory map +
multiline + no-Unicode" is exactly the configuration used by The Silver
Searcher. This analysis perhaps sheds some light on why it converged on
that specific setting!)

In summary, if you want PCRE2 to go as fast as possible and you don't care
about Unicode and you don't care about matches possibly spanning across
multiple lines, then enable multiline mode with `-U` and disable PCRE2's
Unicode support with the `--no-pcre2-unicode` flag.


<h3 name="rg-other-cmd">
When I run <code>rg</code>, why does it execute some other command?
</h3>