1
0
mirror of https://github.com/BurntSushi/ripgrep.git synced 2025-02-04 06:08:39 +02:00

48 Commits

Author SHA1 Message Date
Andrew Gallant
341a19e0d0
regex: fix fast path for -w/--word-regexp flag (#2576)
It turns out our fast path for -w/--word-regexp wasn't quite correct in
some cases. Namely, we use `(?m:^|\W)(<original-regex>)(?m:\W|$)` as the
implementation of -w/--word-regexp since `\b(<original-regex>)\b` has
some unintuitive results in certain cases, specifically when
<original-regex> matches non-word characters at match boundaries.

The problem is that using this formulation means that you need to
extract the capture group around <original-regex> to find the "real"
match, since the surrounding (^|\W) and (\W|$) aren't part of the match.
This is fine, but the capture group engine is usually slow, so we have a
fast path where we try to deduce the correct match boundary after an
initial match (before running capture groups). The problem is that doing
this is rather tricky because it's hard to know, in general, whether the
`^` or the `\W` matched.

This still doesn't seem quite right overall, but we at least fix one
more case.

Fixes #2574
2023-07-31 08:51:09 -04:00
Ludi Rehak
7c83b90f95 doc: fix typo
Closes #2153
2023-07-08 18:52:42 -04:00
Andrew Gallant
3ac4541e9f regex: remove old inner literal extractor
(It had already been removed from the crate.)
2023-07-05 14:04:29 -04:00
Andrew Gallant
a68db3ac02 deps: drop temporary patch and move to bstr 1.6
Now that regex 1.9 is out, we can depend on it from crates.io.
2023-07-05 14:04:29 -04:00
Andrew Gallant
ca740d9ace regex: add new inner literal extractor
This is mostly a copy of the prefix literal extractor in regex-syntax,
but with a tweaked notion of Seq that keeps track of whether it's a
prefix of an expression or not. If it isn't, then we can't cross it as a
suffix to another Seq.

This new extractor should be a lot more robust than the old one. We
actually will keep going through the regex to try and find the "best"
literals to search for (according to some heuristic).
2023-07-05 14:04:29 -04:00
Andrew Gallant
e80c102dee regex: tweak formatting of regex-automata version spec
This makes it easier to enable the `logging` feature for regex-automata.

I wish I could just enable it unconditionally, but it winds up producing
a lot of output because ripgrep uses regexes for things other than the
primary search (like every glob). Sigh.
2023-07-05 14:04:29 -04:00
Andrew Gallant
8ac66a9e04 regex: refactor matcher construction
This does a little bit of refactoring so that we can pass both a
ConfiguredHIR and a Regex to the inner literal extraction routine.

One downside of this approach is that a regex object hangs on to a
ConfiguredHIR. But the extra memory usage is probably negligible. A
benefit though is that converting the HIR to its concrete syntax is now
lazy and only happens when logging is enabled.
2023-07-05 14:04:29 -04:00
Andrew Gallant
04dde9a4eb regex: tweak DFA settings
This increases the limits a bit for when the regex engine will build and
use a fully compiled DFA. They can faster in some circumstances. For
example, '(?-u)^\w{30,}$' gets a nice speed boost from state
acceleration.

We are also able to remove `regex` proper as a dependency. Wow.
2023-07-05 14:04:29 -04:00
Andrew Gallant
81341702af regex: push more pattern handling to matcher construction
Previously, ripgrep core was responsible for escaping regex patterns and
implementing the --line-regexp flag. This commit moves that
responsibility down into the matchers such that ripgrep just needs to
hand the patterns it gets off to the matcher builder. The builder will
then take care of escaping and all that.

This was done to make pattern construction completely owned by the
matcher builders. With the arrival regex-automata, this means we can
move to the HIR very quickly and then never move back to the concrete
syntax. We can then build our regex directly from the HIR. This overall
can save quite a bit of time, especially when searching for large
dictionaries.

We still aren't quite as fast as GNU grep when searching something on
the scale of /usr/share/dict/words, but we are basically within spitting
distance. Prior to this, we were about an order of magnitude slower.

This architecture in particular lets us write a pretty simple fast path
that avoids AST parsing and HIR translation entirely: the case where one
is just searching for a literal. In that case, we can hand construct the
HIR directly.
2023-07-05 14:04:29 -04:00
Andrew Gallant
a775b493fd regex: small cleanups
Just some small polishing. We also get rid of thread_local in favor of
using regex-automata, mostly just in the name of reducing dependencies.
(We should eventually be able to drop thread_local completely.)
2023-07-05 14:04:29 -04:00
Andrew Gallant
a6dbff502f regex: s/locations/captures
Now that we use regex-automata, we no longer use any type with
"locations" in it. Instead, that's mostly legacy from the top-level
regex crate.
2023-07-05 14:04:29 -04:00
Andrew Gallant
51480d57a6 regex: simplify AST analysis a bit
The verbatim literal stuff hasn't been used for a while and I don't
foresee it being used. If it's really needed, it would probably better
to just implement it by looking at the pattern string itself, which
avoids parsing it into an AST altogether.
2023-07-05 14:04:29 -04:00
Andrew Gallant
d9bd261be8 regex: some small cleanup in 'strip.rs'
We also utilize bstr's methods to get rid of some helpers we had written
by hand.
2023-07-05 14:04:29 -04:00
Andrew Gallant
9d62eb997a BREAKING: regex: finally remove CRLF hack
Now that Rust's regex crate finally supports a CRLF mode, we can remove
this giant hack in ripgrep to enable it. (And assuredly did not work in
all cases.)

The way this works in the regex engine is actually subtly different than
what ripgrep previously did. Namely, --crlf would previously treat
either \r\n or \n as a line terminator. But now it treats \r\n, \n and
\r as line terminators. In effect, it is implemented by treating \r and
\n as line terminators, but ^ and $ will never match at a position
between a \r and a \n.

So basically this means that $ will end up matching in more cases than
it might be intended too, but I don't expect this to be a big problem in
practice.

Note that passing --crlf to ripgrep and enabling CRLF mode in the regex
via the `R` inline flag (e.g., `(?R:$)`) are subtly different. The `R`
flag just controls the regex engine, but --crlf instructs all of ripgrep
to use \r\n as a line terminator. There are likely some inconsistencies
or corner cases that are wrong as a result of this cognitive dissonance,
but we choose to leave well enough alone for now.

Fixing this for real will probably require re-thinking how line
terminators are handled in ripgrep. For example, one "problem" with how
they're handled now is that ripgrep will re-insert its own line
terminators when printing output instead of copying the input. This is
maybe not so great and perhaps unexpected. (ripgrep probably can't get
away with not inserting any line terminators. Users probably expect
files that don't end with a line terminator whose last line matches to
have a line terminator inserted.)
2023-07-05 14:04:29 -04:00
Andrew Gallant
e028ea3792 regex: migrate grep-regex to regex-automata
We just do a "basic" dumb migration. We don't try to improve anything
here.
2023-07-05 14:04:29 -04:00
Andrew Gallant
1035f6b1ff deps: initial migration steps to regex 1.9
This leaves the grep-regex crate in tatters. Pretty much the entire
thing needs to be re-worked. The upshot is that it should result in some
big simplifications. I hope.

The idea here is to drop down and actually use regex-automata 0.3
instead of the regex crate itself.
2023-07-05 14:04:29 -04:00
Andrew Gallant
81529288cf
grep-regex-0.1.11 2023-01-05 09:02:55 -05:00
Andrew Gallant
bcc7473a87
deps: update to grep-matcher 0.1.6 2023-01-05 09:02:40 -05:00
Andrew Gallant
ac8fecbbf2
deps: upgrade bstr to 1.1 2023-01-05 08:21:15 -05:00
Andrew Gallant
5e975c43f8
doc: appease rustdoc 2022-07-15 10:13:55 -04:00
Andrew Gallant
2cae30e399
grep-regex-0.1.10 2022-07-15 10:01:42 -04:00
Andrew Gallant
8e57989cd2
regex: fix matching bug when text anchors are used
It turns out that if there are text anchors (that is, \A or \z, or ^/$
when multi-line is disabled), then the "fast" line searching path isn't
quite correct. Since searching without multi-line mode is exceptionally
rare, we just look for the presence of text anchors and specifically
disable the line terminator option in 'grep-regex'. This in turn
inhibits the "fast" line searching path.

Fixes #2260
2022-07-15 09:53:39 -04:00
Alex Touchet
36d03b4101
cargo: use SPDX license format for all crates
This was done for the main crate in d11a3b33773620bcc593f82b557757f0c2ec8a05.

See also #987.

PR #2204
2022-05-09 07:52:11 -04:00
Andrew Gallant
bc76a30c23
regex: fix -w when regex can match empty string
This is a weird bug where our optimization for handling -w more quickly
than we would otherwise failed. In particular, if the original regex can
match the empty string, then our word boundary detection would produce
invalid indices to the start the next search at. We "fix" it by simply
bailing when the indices are known to be incorrect.

This wasn't a problem in a previous release since ripgrep 13 tweaked how
word boundaries are detected in commit efd9cfb2.

Fixes #1891
2021-06-12 14:18:53 -04:00
Andrew Gallant
7f3fd6f7ce
grep-regex-0.1.9 2021-06-12 08:03:56 -04:00
Andrew Gallant
6331a7ac18
deps/matcher: update minimal versions 2021-06-12 08:03:47 -04:00
Andrew Gallant
e824531e38 edition: manual changes
This is mostly just about removing 'extern crate' everywhere and fixing
the fallout.
2021-06-01 21:07:37 -04:00
Andrew Gallant
af54069c51 edition: run 'cargo fix --edition --edition-idioms --all' 2021-06-01 21:07:37 -04:00
Andrew Gallant
77a9e99964 edition: set edition=2018 2021-06-01 21:07:37 -04:00
Andrew Gallant
459a9c5637 edition: initial 'cargo fix --edition' run 2021-06-01 21:07:37 -04:00
Andrew Gallant
efd9cfb2fc grep: fix bugs in handling multi-line look-around
This commit hacks in a bug fix for handling look-around across multiple
lines. The main problem is that by the time the matching lines are sent
to the printer, the surrounding context---which some look-behind or
look-ahead might have matched---could have been dropped if it wasn't
part of the set of matching lines. Therefore, when the printer re-runs
the regex engine in some cases (to do replacements, color matches, etc
etc), it won't be guaranteed to see the same matches that the searcher
found.

Overall, this is a giant clusterfuck and suggests that the way I divided
the abstraction boundary between the printer and the searcher is just
wrong. It's likely that the searcher needs to handle more of the work of
matching and pass that info on to the printer. The tricky part is that
this additional work isn't always needed. Ultimately, this means a
serious re-design of the interface between searching and printing. Sigh.

The way this fix works is to smuggle the underlying buffer used by the
searcher through into the printer. Since these bugs only impact
multi-line search (otherwise, searches are only limited to matches
across a single line), and since multi-line search always requires
having the entire file contents in a single contiguous slice (memory
mapped or on the heap), it follows that the buffer we pass through when
we need it is, in fact, the entire haystack. So this commit refactors
the printer's regex searching to use that buffer instead of the intended
bundle of bytes containing just the relevant matching portions of that
same buffer.

There is one last little hiccup: PCRE2 doesn't seem to have a way to
specify an ending position for a search. So when we re-run the search to
find matches, we can't say, "but don't search past here." Since the
buffer is likely to contain the entire file, we really cannot do
anything here other than specify a fixed upper bound on the number of
bytes to search. So if look-ahead goes more than N bytes beyond the
match, this code will break by simply being unable to find the match. In
practice, this is probably pretty rare. I believe that if we did a
better fix for this bug by fixing the interfaces, then we'd probably try
to have PCRE2 find the pertinent matches up front so that it never needs
to re-discover them.

Fixes #1412
2021-05-31 21:51:18 -04:00
Marco Ieni
b3a6a69f9d ci: check docs for all crates
This also replaces '--all' in Cargo commands with '--workspace'. The
former has apparently been deprecated.

We also fix a couple warnings that this new step detected.

Closes #1848
2021-05-31 21:51:18 -04:00
Andrew Gallant
35b52d33b9 regex: add unit tests for non-matching anchor bytes
This is in addition to the integration level test added in
581a35e568c3acd32461d276a4cfe746524e17cd.
2021-05-31 21:51:18 -04:00
Andrew Gallant
581a35e568
impl: fix --multiline anchored match bug
This fixes a bug where using \A or (?-m)^ in combination with
-U/--multiline would permit matches that aren't anchored to the
beginning of the file. The underlying cause was an optimization that
occurred when mmaps couldn't be used. Namely, ripgrep tries to still
read the input incrementally if it knows the pattern can't match through
a new line. But the detection logic was flawed, since it didn't account
for line anchors. This commit fixes that.

Fixes #1878, Fixes #1879
2021-05-29 07:37:28 -04:00
Andrew Gallant
7899a4b931
regex: s/CachedThreadLocal/ThreadLocal
CachedThreadLocal has been deprecated. We bump thread_local's minimal
version corresponding to that deprecation as well.
2021-01-25 10:38:05 -05:00
Andrew Gallant
d97fb72d84
doc: update CI links in crate READMEs
I switched to GitHub Actions long ago, which replaces both Travis and
AppVeyor.

Fixes #1732
2020-11-16 19:07:16 -05:00
Alex Touchet
3ca324fda7
doc: update several links to use https
PR #1724
2020-11-03 10:33:36 -05:00
Martin Michlmayr
1b2c1dc675
doc: fix typos
PR #1605
2020-06-04 09:06:09 -04:00
Andrew Gallant
c0f0492b98
grep-regex-0.1.8 2020-05-09 10:31:29 -04:00
Andrew Gallant
1c4b5adb7b
regex: fix another inner literal bug
It looks like `is_simple` wasn't quite correct.

I can't wait until this code is rewritten. It is still not quite clearly
correct to me.

Fixes #1537
2020-04-01 20:37:48 -04:00
Andrew Gallant
543f99dbf1
grep-regex-0.1.7 2020-03-22 21:08:19 -04:00
Andrew Gallant
0ea65efd6d
regex: special case literal extraction
In a prior commit, we fixed a performance problem with the -w flag by
doing a little extra work to extract literals. It turns out that using
literals in this case when the -w flag is NOT used results in a
performance regression. The reasoning is that we end up using a "fast"
regex as a prefilter when the regex engine itself uses its own
equivalent prefilter, so ripgrep ends up redoing a fair amount of work.

Instead, we only do this extra work when we know the -w flag is enabled.
2020-03-22 21:02:51 -04:00
Andrew Gallant
92daa34eb3
ripgrep: release 12.0.0 2020-03-15 21:42:54 -04:00
Andrew Gallant
e772a95b58 regex: avoid using literal optimizations when whitespace is detected
If a literal is entirely whitespace, then it's quite likely that it is
very common. So when that case occurs, just don't do (inner) literal
optimizations at all.

The regex engine may still make sub-optimal decisions here, but that's a
problem for another day.

Fixes #1087
2020-03-15 13:19:14 -04:00
Andrew Gallant
9dd4bf8d7f style: fix rust-analyzer lint warnings 2020-03-15 13:19:14 -04:00
chip
50d2047ae2
crates: update URLs in Cargo.toml
This corrects an oversight when the repo was re-organized to
have its crates moved into a 'crates' sub-directory.

PR #1505
2020-02-28 20:31:43 -05:00
Andrew Gallant
0874aa115c repo: make ripgrep build with the new organization 2020-02-17 19:24:53 -05:00
Andrew Gallant
fdd8510fdd repo: move all source code in crates directory
The top-level listing was just getting a bit too long for my taste. So
put all of the code in one directory and shrink the large top-level mess
to a small top-level mess.

NOTE: This commit only contains renames. The subsequent commit will
actually make ripgrep build again. We do it this way with the naive hope
that this will make it easier for git history to track the renames.
Sigh.
2020-02-17 19:24:53 -05:00