1
0
mirror of https://github.com/BurntSushi/ripgrep.git synced 2024-12-12 19:18:24 +02:00
Commit Graph

16 Commits

Author SHA1 Message Date
Andrew Gallant
adb9332f52
regex: fix -F aho-corasick optimization
It turns out that when the -F flag was used, if any of the patterns
contained a regex meta character (such as `.`), then we winded up
escaping the pattern first before handing it off to Aho-Corasick, which
treats all patterns literally.

We continue to apply band-aides here and just avoid Aho-Corasick if
there is an escape in any of the literal patterns. This is unfortunate,
but making this work better requires more refactoring, and the right
solution is to get this optimization pushed down into the regex engine.

Fixes #1334
2019-08-01 16:58:12 -04:00
Andrew Gallant
44cd344438
grep-regex: release 0.1.3 2019-04-15 17:56:04 -04:00
Andrew Gallant
e493e54b9b
grep-matcher: release 0.1.2 2019-04-15 17:53:29 -04:00
Andrew Gallant
bd222ae93f regex: fix HIR analysis bug
An alternate can be empty at this point, so we must handle it. We didn't
before because the regex engine actually disallows empty alternates,
however, this code runs before the regex compiler rejects the regex.
2019-04-14 19:29:27 -04:00
Andrew Gallant
09108b7fda regex: make multi-literal searcher faster
This makes the case of searching for a dictionary of a very large number
of literals much much faster. (~10x or so.) In particular, we achieve this
by short-circuiting the construction of a full regex when we know we have
a simple alternation of literals. Building the regex for a large dictionary
(>100,000 literals) turns out to be quite slow, even if it internally will
dispatch to Aho-Corasick.

Even that isn't quite enough. It turns out that even *parsing* such a regex
is quite slow. So when the -F/--fixed-strings flag is set, we short
circuit regex parsing completely and jump straight to Aho-Corasick.

We aren't quite as fast as GNU grep here, but it's much closer (less than
2x slower).

In general, this is somewhat of a hack. In particular, it seems plausible
that this optimization could be implemented entirely in the regex engine.
Unfortunately, the regex engine's internals are just not amenable to this
at all, so it would require a larger refactoring effort. For now, it's
good enough to add this fairly simple hack at a higher level.

Unfortunately, if you don't pass -F/--fixed-strings, then ripgrep will
be slower, because of the aforementioned missing optimization. Moreover,
passing flags like `-i` or `-S` will cause ripgrep to abandon this
optimization and fall back to something potentially much slower. Again,
this fix really needs to happen inside the regex engine, although we
might be able to special case -i when the input literals are pure ASCII
via Aho-Corasick's `ascii_case_insensitive`.

Fixes #497, Fixes #838
2019-04-07 19:11:03 -04:00
lesnyrumcajs
5962abc465 searcher: add option to disable BOM sniffing
This commit adds a new encoding feature where the -E/--encoding flag
will now accept a value of 'none'. When given this value, all encoding
related machinery is disabled and ripgrep will search the raw bytes of
the file, including the BOM if it's present.

Closes #1207, Closes #1208
2019-04-06 10:35:08 -04:00
Andrew Gallant
be7d6dd9ce regex: print out final regex in trace mode
This is useful for debugging to see what regex is actually being run.
We put this as a trace since the regex can be quite gnarly. (It is not
pretty printed.)
2019-04-05 23:24:08 -04:00
Andrew Gallant
9f15e3b671 regex: fix a perf bug when using -w flag
When looking for an inner literal to speed up searches, if only a prefix
is found, then we generally give up doing inner literal optimizations since
the regex engine will generally handle it for us. Unfortunately, this
decision was being made *before* we wrap the regex in (^|\W)...($|\W) when
using the -w/--word-regexp flag, which would then defeat the literal
optimizations inside the regex engine.

We fix this with a bit of a hack that says, "if we're doing a word regexp,
then give me back any literal you find, even if it's a prefix."
2019-04-05 23:24:08 -04:00
Andrew Gallant
c84cfb6756
grep-regex-0.1.2 2019-02-16 09:30:06 -05:00
Andrew Gallant
9d703110cf
regex: make CRLF hack more robust
This commit improves the CRLF hack to be more robust. In particular, in
addition to rewriting `$` as `(?:\r??$)`, we now strip `\r` from the end
of a match if and only if the regex has an ending line anchor required for
a match. This doesn't quite make the hack 100% correct, but should fix most
use cases in practice. An example of a regex that will still be incorrect
is `foo|bar$`, since the analysis isn't quite sophisticated enough to
determine that a `\r` can be safely stripped from any match. Even if we
fix that, regexes like `foo\r|bar$` still won't be handled correctly. Alas,
more work on this front should really be focused on enabling this in the
regex engine itself.

The specific cause of this bug was that grep-searcher was sneakily
stripping CRLF from matching lines when it really shouldn't have. We remove
that code now, and instead rely on better match semantics provided at a
lower level.

Fixes #1095
2019-01-26 12:34:28 -05:00
Andrew Gallant
63b0f31a22 deps: update various dependencies
We also increase the MSRV to 1.32, the current stable release, which sets
the stage for migrating to Rust 2018.
2019-01-19 10:44:30 -05:00
Andrew Gallant
ba503eb677 grep-regex: fix inner literal detection
It seems the inner literal detector fails spectacularly in cases of
concatenations that involve groups. The issue here is that if the prefix
of a group inside a concatenation can match the empty string, then any
literals generated to that point in the concatenation need to be cut
such that they are never extended. The detector isn't really built to
handle this case, so we just act conservative cut literals whenever we
see a sub-group. This may make some regexes slower, but the inner
literal detector already misses plenty of cases.

Literal detection (including in the regex engine) is a key component
that needs to be completely rethought at some point.

Fixes #1064
2018-09-25 16:56:04 -04:00
Andrew Gallant
d14f0b37d6
deps: update versions for all crates
I don't think every change here is needed, but this ensures we're using
the latest version of every direct dependency.
2018-09-07 14:00:22 -04:00
Andrew Gallant
3b5cdea862
doc: minor touchups to API docs 2018-09-07 12:06:05 -04:00
Andrew Gallant
afa06c518a
deps: update libripgrep crate versions
This prepares them for an initial 0.1.0 release.
2018-08-20 17:34:45 -04:00
Andrew Gallant
d9ca529356 libripgrep: initial commit introducing libripgrep
libripgrep is not any one library, but rather, a collection of libraries
that roughly separate the following key distinct phases in a grep
implementation:

  1. Pattern matching (e.g., by a regex engine).
  2. Searching a file using a pattern matcher.
  3. Printing results.

Ultimately, both (1) and (3) are defined by de-coupled interfaces, of
which there may be multiple implementations. Namely, (1) is satisfied by
the `Matcher` trait in the `grep-matcher` crate and (3) is satisfied by
the `Sink` trait in the `grep2` crate. The searcher (2) ties everything
together and finds results using a matcher and reports those results
using a `Sink` implementation.

Closes #162
2018-08-20 07:10:19 -04:00