1
0
mirror of https://github.com/BurntSushi/ripgrep.git synced 2025-04-24 17:12:16 +02:00

24 Commits

Author SHA1 Message Date
Jakub Wieczorek
b435eaafc8 grep-regex: fix inner literal extraction bug
This appears to be another transcription bug from copying this code from
the prefix literal detection from inside the regex crate. Namely, when
it comes to inner literals, we only want to treat counted repetition as
two separate cases: the case when the minimum match is 0 and the case
when the minimum match is more than 0. In the former case, we treat
`e{0,n}` as `e*` and in the latter we treat `e{m,n}` where `m >= 1` as
just `e`.

We could definitely do better here. e.g., This means regexes like
`(foo){10}` will only have `foo` extracted as a literal, where searching
for the full literal would likely be faster.

The actual bug here was that we were not implementing this logic
correctly. Namely, we weren't always "cutting" the literals in the
second case to prevent them from being expanded.

Fixes #1319, Closes #1367
2020-02-17 17:16:28 -05:00
Andrew Gallant
cd8ec38a68 grep-regex: add fast path for -w/--word-regexp
Previously, ripgrep would always defer to the regex engine's capturing
matches in order to implement word matching. Namely, ripgrep would
determine the correct match offsets via a capturing group, since the
word regex is itself generated from the user supplied regex.

Unfortunately, the regex engine's capturing mode is still fairly slow,
so this commit adds a fast path to avoid capturing mode in the vast
majority of cases. See comments in the code for details.
2020-02-17 17:16:28 -05:00
Andrew Gallant
6a0e0147e0 grep-regex: improve literal detection with -w
When the -w/--word-regexp was used, ripgrep would in many cases fail to
apply literal optimizations. This occurs specifically when the regex
given by the user is an alternation of literals with no common prefixes
or suffixes, e.g.,

    rg -w 'foo|bar|baz|quux'

In this case, the inner literal detector fails. Normally, this would
result in literal prefixes being detected by the regex engine. But
because of the -w/--word-regexp flag, the actual regex that we run ends
up looking like this:

    (^|\W)(foo|bar|baz|quux)($|\W)

which of course defeats any prefix or suffix literal optimizations in
the regex crate's somewhat naive extractor. (A better extractor could
still do literal optimizations in the above case.)

So this commit fixes this by falling back to prefix or suffix literals
when they're available instead of prematurely giving up and assuming the
regex engine will do the rest.
2020-02-17 17:16:28 -05:00
Andrew Gallant
ad97e9c93f grep-regex: improve inner literal detection
This fixes an interesting performance bug where the inner literal
extractor would sometimes choose a sub-optimal literal. For example,
consider the regex:

    \x20+Sherlock Holmes\x20+

(The `\x20` is the ASCII code for a space character, which we use here
to just make it clearer. It otherwise does not matter.)

Previously, this would see the initial \x20 and then stop collecting
literals after the `+` repetition operator. This was because the inner
literal detector was adapter from the prefix literal detector, which had
to stop here. Namely, while \x20S would be a valid prefix (for example),
\x20\x20S would also be a valid prefix. As would \x20\x20\x20S and so
on. So the prefix detector would have to stop at the repetition
operator. Otherwise, only searching for \x20S could potentially scan
farther then the starting position of the next match.

However, for inner literals, this calculus no longer makes sense. We can
freely search for, e.g., \x20S without missing matches that start with
\x20\x20S precisely because we know this is an inner literal which may
not correspond to the start of a match.

With this fix, the literal that is now detected is

    \x20Sherlock Holmes\x20

Which is much better. We achieve this by no longer "cutting" literals
after seeing a `+` repetition operator. Instead, we permit literals to
continue to be extended.

The reason why this is important is because using \x20 as the literal to
search for is generally bad juju since it is so common. In fact, we
should probably add more logic here to either avoid such things or give
up entirely on the inner literal optimization if it detected a literal
that we think is very common. But we punt on such things here.
2020-02-17 17:16:28 -05:00
Andrew Gallant
cb2f6ddc61 deps: update to thread_local 1.0
We also update the pcre2 and regex dependencies, which removes any other
lingering uses of thread_local 0.3.
2020-01-10 15:07:47 -05:00
Andrew Gallant
5c4584aa7c
grep-regex-0.1.5 2019-08-06 09:51:13 -04:00
Andrew Gallant
4de227fd9a
deps: update everything
Mostly this just updates regex and its assorted dependencies. This does
drop utf8-ranges and ucd-util, in accordance with changes to
regex-syntax and regex.
2019-08-05 13:50:55 -04:00
Andrew Gallant
9c220f9a9b
grep-regex: release 0.1.4 2019-08-01 17:47:45 -04:00
Andrew Gallant
adb9332f52
regex: fix -F aho-corasick optimization
It turns out that when the -F flag was used, if any of the patterns
contained a regex meta character (such as `.`), then we winded up
escaping the pattern first before handing it off to Aho-Corasick, which
treats all patterns literally.

We continue to apply band-aides here and just avoid Aho-Corasick if
there is an escape in any of the literal patterns. This is unfortunate,
but making this work better requires more refactoring, and the right
solution is to get this optimization pushed down into the regex engine.

Fixes #1334
2019-08-01 16:58:12 -04:00
Andrew Gallant
44cd344438
grep-regex: release 0.1.3 2019-04-15 17:56:04 -04:00
Andrew Gallant
e493e54b9b
grep-matcher: release 0.1.2 2019-04-15 17:53:29 -04:00
Andrew Gallant
bd222ae93f regex: fix HIR analysis bug
An alternate can be empty at this point, so we must handle it. We didn't
before because the regex engine actually disallows empty alternates,
however, this code runs before the regex compiler rejects the regex.
2019-04-14 19:29:27 -04:00
Andrew Gallant
09108b7fda regex: make multi-literal searcher faster
This makes the case of searching for a dictionary of a very large number
of literals much much faster. (~10x or so.) In particular, we achieve this
by short-circuiting the construction of a full regex when we know we have
a simple alternation of literals. Building the regex for a large dictionary
(>100,000 literals) turns out to be quite slow, even if it internally will
dispatch to Aho-Corasick.

Even that isn't quite enough. It turns out that even *parsing* such a regex
is quite slow. So when the -F/--fixed-strings flag is set, we short
circuit regex parsing completely and jump straight to Aho-Corasick.

We aren't quite as fast as GNU grep here, but it's much closer (less than
2x slower).

In general, this is somewhat of a hack. In particular, it seems plausible
that this optimization could be implemented entirely in the regex engine.
Unfortunately, the regex engine's internals are just not amenable to this
at all, so it would require a larger refactoring effort. For now, it's
good enough to add this fairly simple hack at a higher level.

Unfortunately, if you don't pass -F/--fixed-strings, then ripgrep will
be slower, because of the aforementioned missing optimization. Moreover,
passing flags like `-i` or `-S` will cause ripgrep to abandon this
optimization and fall back to something potentially much slower. Again,
this fix really needs to happen inside the regex engine, although we
might be able to special case -i when the input literals are pure ASCII
via Aho-Corasick's `ascii_case_insensitive`.

Fixes #497, Fixes #838
2019-04-07 19:11:03 -04:00
lesnyrumcajs
5962abc465 searcher: add option to disable BOM sniffing
This commit adds a new encoding feature where the -E/--encoding flag
will now accept a value of 'none'. When given this value, all encoding
related machinery is disabled and ripgrep will search the raw bytes of
the file, including the BOM if it's present.

Closes #1207, Closes #1208
2019-04-06 10:35:08 -04:00
Andrew Gallant
be7d6dd9ce regex: print out final regex in trace mode
This is useful for debugging to see what regex is actually being run.
We put this as a trace since the regex can be quite gnarly. (It is not
pretty printed.)
2019-04-05 23:24:08 -04:00
Andrew Gallant
9f15e3b671 regex: fix a perf bug when using -w flag
When looking for an inner literal to speed up searches, if only a prefix
is found, then we generally give up doing inner literal optimizations since
the regex engine will generally handle it for us. Unfortunately, this
decision was being made *before* we wrap the regex in (^|\W)...($|\W) when
using the -w/--word-regexp flag, which would then defeat the literal
optimizations inside the regex engine.

We fix this with a bit of a hack that says, "if we're doing a word regexp,
then give me back any literal you find, even if it's a prefix."
2019-04-05 23:24:08 -04:00
Andrew Gallant
c84cfb6756
grep-regex-0.1.2 2019-02-16 09:30:06 -05:00
Andrew Gallant
9d703110cf
regex: make CRLF hack more robust
This commit improves the CRLF hack to be more robust. In particular, in
addition to rewriting `$` as `(?:\r??$)`, we now strip `\r` from the end
of a match if and only if the regex has an ending line anchor required for
a match. This doesn't quite make the hack 100% correct, but should fix most
use cases in practice. An example of a regex that will still be incorrect
is `foo|bar$`, since the analysis isn't quite sophisticated enough to
determine that a `\r` can be safely stripped from any match. Even if we
fix that, regexes like `foo\r|bar$` still won't be handled correctly. Alas,
more work on this front should really be focused on enabling this in the
regex engine itself.

The specific cause of this bug was that grep-searcher was sneakily
stripping CRLF from matching lines when it really shouldn't have. We remove
that code now, and instead rely on better match semantics provided at a
lower level.

Fixes #1095
2019-01-26 12:34:28 -05:00
Andrew Gallant
63b0f31a22 deps: update various dependencies
We also increase the MSRV to 1.32, the current stable release, which sets
the stage for migrating to Rust 2018.
2019-01-19 10:44:30 -05:00
Andrew Gallant
ba503eb677 grep-regex: fix inner literal detection
It seems the inner literal detector fails spectacularly in cases of
concatenations that involve groups. The issue here is that if the prefix
of a group inside a concatenation can match the empty string, then any
literals generated to that point in the concatenation need to be cut
such that they are never extended. The detector isn't really built to
handle this case, so we just act conservative cut literals whenever we
see a sub-group. This may make some regexes slower, but the inner
literal detector already misses plenty of cases.

Literal detection (including in the regex engine) is a key component
that needs to be completely rethought at some point.

Fixes #1064
2018-09-25 16:56:04 -04:00
Andrew Gallant
d14f0b37d6
deps: update versions for all crates
I don't think every change here is needed, but this ensures we're using
the latest version of every direct dependency.
2018-09-07 14:00:22 -04:00
Andrew Gallant
3b5cdea862
doc: minor touchups to API docs 2018-09-07 12:06:05 -04:00
Andrew Gallant
afa06c518a
deps: update libripgrep crate versions
This prepares them for an initial 0.1.0 release.
2018-08-20 17:34:45 -04:00
Andrew Gallant
d9ca529356 libripgrep: initial commit introducing libripgrep
libripgrep is not any one library, but rather, a collection of libraries
that roughly separate the following key distinct phases in a grep
implementation:

  1. Pattern matching (e.g., by a regex engine).
  2. Searching a file using a pattern matcher.
  3. Printing results.

Ultimately, both (1) and (3) are defined by de-coupled interfaces, of
which there may be multiple implementations. Namely, (1) is satisfied by
the `Matcher` trait in the `grep-matcher` crate and (3) is satisfied by
the `Sink` trait in the `grep2` crate. The searcher (2) ties everything
together and finds results using a matcher and reports those results
using a `Sink` implementation.

Closes #162
2018-08-20 07:10:19 -04:00