mirror of
https://github.com/BurntSushi/ripgrep.git
synced 2024-12-02 02:56:32 +02:00
doc: add section on PCRE2 performance
This commit is contained in:
parent
9df60e164e
commit
87a627631c
292
FAQ.md
292
FAQ.md
@ -16,6 +16,7 @@
|
||||
* [How do I get around the regex size limit?](#size-limit)
|
||||
* [How do I make the `-f/--file` flag faster?](#dfa-size)
|
||||
* [How do I make the output look like The Silver Searcher's output?](#silver-searcher-output)
|
||||
* [Why does ripgrep get slower when I enabled PCRE2 regexes?](#pcre2-slow)
|
||||
* [When I run `rg`, why does it execute some other command?](#rg-other-cmd)
|
||||
* [How do I create an alias for ripgrep on Windows?](#rg-alias-windows)
|
||||
* [How do I create a PowerShell profile?](#powershell-profile)
|
||||
@ -392,6 +393,297 @@ $ RIPGREP_CONFIG_PATH=$HOME/.config/ripgrep/rc rg foo
|
||||
```
|
||||
|
||||
|
||||
<h3 name="pcre2-slow">
|
||||
Why does ripgrep get slower when I enable PCRE2 regexes?
|
||||
</h3>
|
||||
|
||||
When you use the `--pcre2` (`-P` for short) flag, ripgrep will use the PCRE2
|
||||
regex engine instead of the default. Both regex engines are quite fast,
|
||||
but PCRE2 provides a number of additional features such as look-around and
|
||||
backreferences that many enjoy using. This is largely because PCRE2 uses
|
||||
a backtracking implementation where as the default regex engine uses a finite
|
||||
automaton based implementation. The former provides the ability to add lots of
|
||||
bells and whistles over the latter, but the latter executes with worst case
|
||||
linear time complexity.
|
||||
|
||||
With that out of the way, if you've used `-P` with ripgrep, you may have
|
||||
noticed that it can be slower. The reasons for why this is are quite complex,
|
||||
and they are complex because the optimizations that ripgrep uses to implement
|
||||
fast search are complex.
|
||||
|
||||
The task ripgrep has before it is somewhat simple; all it needs to do is search
|
||||
a file for occurrences of some pattern and then print the lines containing
|
||||
those occurrences. The problem lies in what is considered a valid match and how
|
||||
exactly we read the bytes from a file.
|
||||
|
||||
In terms of what is considered a valid match, remember that ripgrep will only
|
||||
report matches spanning a single line by default. The problem here is that
|
||||
some patterns can match across multiple lines, and ripgrep needs to prevent
|
||||
that from happening. For example, `foo\sbar` will match `foo\nbar`. The most
|
||||
obvious way to achieve this is to read the data from a file, and then apply
|
||||
the pattern search to that data for each line. The problem with this approach
|
||||
is that it can be quite slow; it would be much faster to let the pattern
|
||||
search across as much data as possible. It's faster because it gets rid of the
|
||||
overhead of finding the boundaries of every line, and also because it gets rid
|
||||
of the overhead of starting and stopping the pattern search for every single
|
||||
line. (This is operating under the general assumption that matching lines are
|
||||
much rarer than non-matching lines.)
|
||||
|
||||
It turns out that we can use the faster approach by applying a very simple
|
||||
restriction to the pattern: *statically prevent* the pattern from matching
|
||||
through a `\n` character. Namely, when given a pattern like `foo\sbar`,
|
||||
ripgrep will remove `\n` from the `\s` character class automatically. In some
|
||||
cases, a simple removal is not so easy. For example, ripgrep will return an
|
||||
error when your pattern includes a `\n` literal:
|
||||
|
||||
```
|
||||
$ rg '\n'
|
||||
the literal '"\n"' is not allowed in a regex
|
||||
```
|
||||
|
||||
So what does this have to do with PCRE2? Well, ripgrep's default regex engine
|
||||
exposes APIs for doing syntactic analysis on the pattern in a way that makes
|
||||
it quite easy to strip `\n` from the pattern (or otherwise detect it and report
|
||||
an error if stripping isn't possible). PCRE2 seemingly does not provide a
|
||||
similar API, so ripgrep does not do any stripping when PCRE2 is enabled. This
|
||||
forces ripgrep to use the "slow" search strategy of searching each line
|
||||
individually.
|
||||
|
||||
OK, so if enabling PCRE2 slows down the default method of searching because it
|
||||
forces matches to be limited to a single line, then why is PCRE2 also sometimes
|
||||
slower when performing multiline searches? Well, that's because there are
|
||||
*multiple* reasons why using PCRE2 in ripgrep can be slower than the default
|
||||
regex engine. This time, blame PCRE2's Unicode support, which ripgrep enables
|
||||
by default. In particular, PCRE2 cannot simultaneously enable Unicode support
|
||||
and search arbitrary data. That is, when PCRE2's Unicode support is enabled,
|
||||
the data **must** be valid UTF-8 (to do otherwise is to invoke undefined
|
||||
behavior). This is in contrast to ripgrep's default regex engine, which can
|
||||
enable Unicode support and still search arbitrary data. ripgrep's default
|
||||
regex engine simply won't match invalid UTF-8 for a pattern that can otherwise
|
||||
only match valid UTF-8. Why doesn't PCRE2 do the same? This author isn't
|
||||
familiar with its internals, so we can't comment on it here.
|
||||
|
||||
The bottom line here is that we can't enable PCRE2's Unicode support without
|
||||
simultaneously incurring a performance penalty for ensuring that we are
|
||||
searching valid UTF-8. In particular, ripgrep will transcode the contents
|
||||
of each file to UTF-8 while replacing invalid UTF-8 data with the Unicode
|
||||
replacement codepoint. ripgrep then disables PCRE2's own internal UTF-8
|
||||
checking, since we've guaranteed the data we hand it will be valid UTF-8. The
|
||||
reason why ripgrep takes this approach is because if we do hand PCRE2 invalid
|
||||
UTF-8, then it will report a match error if it comes across an invalid UTF-8
|
||||
sequence. This is not good news for ripgrep, since it will stop it from
|
||||
searching the rest of the file, and will also print potentially undesirable
|
||||
error messages to users.
|
||||
|
||||
All right, the above is a lot of information to swallow if you aren't already
|
||||
familiar with ripgrep internals. Let's make this concrete with some examples.
|
||||
First, let's get some data big enough to magnify the performance differences:
|
||||
|
||||
```
|
||||
$ curl -O 'https://burntsushi.net/stuff/subtitles2016-sample.gz'
|
||||
$ gzip -d subtitles2016-sample
|
||||
$ md5sum subtitles2016-sample
|
||||
e3cb796a20bbc602fbfd6bb43bda45f5 subtitles2016-sample
|
||||
```
|
||||
|
||||
To search this data, we will use the pattern `^\w{42}$`, which contains exactly
|
||||
one hit in the file and has no literals. Having no literals is important,
|
||||
because it ensures that the regex engine won't use literal optimizations to
|
||||
speed up the search. In other words, it lets us reason coherently about the
|
||||
actual task that the regex engine is performing.
|
||||
|
||||
Let's now walk through a few examples in light of the information above. First,
|
||||
let's consider the default search using ripgrep's default regex engine and
|
||||
then the same search with PCRE2:
|
||||
|
||||
```
|
||||
$ time rg '^\w{42}$' subtitles2016-sample
|
||||
21225780:EverymajordevelopmentinthehistoryofAmerica
|
||||
|
||||
real 0m1.783s
|
||||
user 0m1.731s
|
||||
sys 0m0.051s
|
||||
|
||||
$ time rg -P '^\w{42}$' subtitles2016-sample
|
||||
21225780:EverymajordevelopmentinthehistoryofAmerica
|
||||
|
||||
real 0m2.458s
|
||||
user 0m2.419s
|
||||
sys 0m0.038s
|
||||
```
|
||||
|
||||
In this particular example, both pattern searches are using a Unicode aware
|
||||
`\w` character class and both are counting lines in order to report line
|
||||
numbers. The key difference here is that the first search will not search
|
||||
line by line, but the second one will. We can observe which strategy ripgrep
|
||||
uses by passing the `--trace` flag:
|
||||
|
||||
```
|
||||
$ rg '^\w{42}$' subtitles2016-sample --trace
|
||||
[... snip ...]
|
||||
TRACE|grep_searcher::searcher|grep-searcher/src/searcher/mod.rs:622: Some("subtitles2016-sample"): searching via memory map
|
||||
TRACE|grep_searcher::searcher|grep-searcher/src/searcher/mod.rs:712: slice reader: searching via slice-by-line strategy
|
||||
TRACE|grep_searcher::searcher::core|grep-searcher/src/searcher/core.rs:61: searcher core: will use fast line searcher
|
||||
[... snip ...]
|
||||
|
||||
$ rg -P '^\w{42}$' subtitles2016-sample --trace
|
||||
[... snip ...]
|
||||
TRACE|grep_searcher::searcher|grep-searcher/src/searcher/mod.rs:622: Some("subtitles2016-sample"): searching via memory map
|
||||
TRACE|grep_searcher::searcher|grep-searcher/src/searcher/mod.rs:705: slice reader: needs transcoding, using generic reader
|
||||
TRACE|grep_searcher::searcher|grep-searcher/src/searcher/mod.rs:685: generic reader: searching via roll buffer strategy
|
||||
TRACE|grep_searcher::searcher::core|grep-searcher/src/searcher/core.rs:63: searcher core: will use slow line searcher
|
||||
[... snip ...]
|
||||
```
|
||||
|
||||
The first says it is using the "fast line searcher" where as the latter says
|
||||
it is using the "slow line searcher." The latter also shows that we are
|
||||
decoding the contents of the file, which also impacts performance.
|
||||
|
||||
Interestingly, in this case, the pattern does not match a `\n` and the file
|
||||
we're searching is valid UTF-8, so neither the slow line-by-line search
|
||||
strategy nor the decoding are necessary. We could fix the former issue with
|
||||
better PCRE2 introspection APIs. We can actually fix the latter issue with
|
||||
ripgrep's `--no-encoding` flag, which prevents the automatic UTF-8 decoding,
|
||||
but will enable PCRE2's own UTF-8 validity checking. Unfortunately, it's slower
|
||||
in my build of ripgrep:
|
||||
|
||||
```
|
||||
$ time rg -P '^\w{42}$' subtitles2016-sample --no-encoding
|
||||
21225780:EverymajordevelopmentinthehistoryofAmerica
|
||||
|
||||
real 0m3.074s
|
||||
user 0m3.021s
|
||||
sys 0m0.051s
|
||||
```
|
||||
|
||||
(Tip: use the `--trace` flag to verify that no decoding in ripgrep is
|
||||
happening.)
|
||||
|
||||
A possible reason why PCRE2's UTF-8 checking is slower is because it might
|
||||
not be better than the highly optimized UTF-8 checking routines found in the
|
||||
[`encoding_rs`](https://github.com/hsivonen/encoding_rs) library, which is what
|
||||
ripgrep uses for UTF-8 decoding. Moreover, my build of ripgrep enables
|
||||
`encoding_rs`'s SIMD optimizations, which may be in play here.
|
||||
|
||||
Also, note that using the `--no-encoding` flag can cause PCRE2 to report
|
||||
invalid UTF-8 errors, which causes ripgrep to stop searching the file:
|
||||
|
||||
```
|
||||
$ cat invalid-utf8
|
||||
foobar
|
||||
|
||||
$ xxd invalid-utf8
|
||||
00000000: 666f 6fff 6261 720a foo.bar.
|
||||
|
||||
$ rg foo invalid-utf8
|
||||
1:foobar
|
||||
|
||||
$ rg -P foo invalid-utf8
|
||||
1:foo�bar
|
||||
|
||||
$ rg -P foo invalid-utf8 --no-encoding
|
||||
invalid-utf8: PCRE2: error matching: UTF-8 error: illegal byte (0xfe or 0xff)
|
||||
```
|
||||
|
||||
All right, so at this point, you might think that we could remove the penalty
|
||||
for line-by-line searching by enabling multiline search. After all, our
|
||||
particular pattern can't match across multiple lines anyway, so we'll still get
|
||||
the results we want. Let's try it:
|
||||
|
||||
```
|
||||
$ time rg -U '^\w{42}$' subtitles2016-sample
|
||||
21225780:EverymajordevelopmentinthehistoryofAmerica
|
||||
|
||||
real 0m1.803s
|
||||
user 0m1.748s
|
||||
sys 0m0.054s
|
||||
|
||||
$ time rg -P -U '^\w{42}$' subtitles2016-sample
|
||||
21225780:EverymajordevelopmentinthehistoryofAmerica
|
||||
|
||||
real 0m2.962s
|
||||
user 0m2.246s
|
||||
sys 0m0.713s
|
||||
```
|
||||
|
||||
Search times remain the same with the default regex engine, but the PCRE2
|
||||
search gets _slower_. What happened? The secrets can be revealed with the
|
||||
`--trace` flag once again. In the former case, ripgrep actually detects that
|
||||
the pattern can't match across multiple lines, and so will fall back to the
|
||||
"fast line search" strategy as with our search without `-U`.
|
||||
|
||||
However, for PCRE2, things are much worse. Namely, since Unicode mode is still
|
||||
enabled, ripgrep is still going to decode UTF-8 to ensure that it hands only
|
||||
valid UTF-8 to PCRE2. Unfortunately, one key downside of multiline search is
|
||||
that ripgrep cannot do it incrementally. Since matches can be arbitrarily long,
|
||||
ripgrep actually needs the entire file in memory at once. Normally, we can use
|
||||
a memory map for this, but because we need to UTF-8 decode the file before
|
||||
searching it, ripgrep winds up reading the entire contents of the file on to
|
||||
the heap before executing a search. Owch.
|
||||
|
||||
OK, so Unicode is killing us here. The file we're searching is _mostly_ ASCII,
|
||||
so maybe we're OK with missing some data. (Try `rg '[\w--\p{ascii}]'` to see
|
||||
non-ASCII word characters that an ASCII-only `\w` character class would miss.)
|
||||
We can disable Unicode in both searches, but this is done differently depending
|
||||
on the regex engine we use:
|
||||
|
||||
```
|
||||
$ time rg '(?-u)^\w{42}$' subtitles2016-sample
|
||||
21225780:EverymajordevelopmentinthehistoryofAmerica
|
||||
|
||||
real 0m1.714s
|
||||
user 0m1.669s
|
||||
sys 0m0.044s
|
||||
|
||||
[andrew@Cheetah 2016] time rg -P '^\w{42}$' subtitles2016-sample --no-pcre2-unicode
|
||||
21225780:EverymajordevelopmentinthehistoryofAmerica
|
||||
|
||||
real 0m1.997s
|
||||
user 0m1.958s
|
||||
sys 0m0.037s
|
||||
```
|
||||
|
||||
For the most part, ripgrep's default regex engine performs about the same.
|
||||
PCRE2 does improve a little bit, and is now almost as fast as the default
|
||||
regex engine. If you look at the output of `--trace`, you'll see that ripgrep
|
||||
will no longer perform UTF-8 decoding, but it does still use the slow
|
||||
line-by-line searcher.
|
||||
|
||||
At this point, we can combine all of our insights above: let's try to get off
|
||||
of the slow line-by-line searcher by enabling multiline mode, and let's stop
|
||||
UTF-8 decoding by disabling Unicode support:
|
||||
|
||||
```
|
||||
$ time rg -U '(?-u)^\w{42}$' subtitles2016-sample
|
||||
21225780:EverymajordevelopmentinthehistoryofAmerica
|
||||
|
||||
real 0m1.714s
|
||||
user 0m1.655s
|
||||
sys 0m0.058s
|
||||
|
||||
$ time rg -P -U '^\w{42}$' subtitles2016-sample --no-pcre2-unicode
|
||||
21225780:EverymajordevelopmentinthehistoryofAmerica
|
||||
|
||||
real 0m1.121s
|
||||
user 0m1.071s
|
||||
sys 0m0.048s
|
||||
```
|
||||
|
||||
Ah, there's PCRE2's JIT shining! ripgrep's default regex engine once again
|
||||
remains about the same, but PCRE2 no longer needs to search line-by-line and it
|
||||
no longer needs to do any kind of UTF-8 checks. This allows the file to get
|
||||
memory mapped and passed right through PCRE2's JIT at impressive speeds. (As
|
||||
a brief and interesting historical note, the configuration of "memory map +
|
||||
multiline + no-Unicode" is exactly the configuration used by The Silver
|
||||
Searcher. This analysis perhaps sheds some reasoning as to why it converged on
|
||||
that specific setting!)
|
||||
|
||||
In summary, if you want PCRE2 to go as fast as possible and you don't care
|
||||
about Unicode and you don't care about matches possibly spanning across
|
||||
multiple lines, then enable multiline mode with `-U` and disable PCRE2's
|
||||
Unicode support with the `--no-pcre2-unicode` flag.
|
||||
|
||||
|
||||
<h3 name="rg-other-cmd">
|
||||
When I run <code>rg</code>, why does it execute some other command?
|
||||
</h3>
|
||||
|
Loading…
Reference in New Issue
Block a user