In some rare cases, it was possible for ripgrep's inner literal detector
to extract a set of literals that could produce a false negative. #2884
gives an example: `(?i:e.x|ex)`. In this case, the set extracted can be
discovered by running `rg '(?i:e.x|ex)' --trace`:
    Seq[E("EX"), E("Ex"), E("eX"), E("ex")]
This extraction leads to building a multi-substring matcher for `EX`,
`Ex`, `eX` and `ex`. Searching the haystack `e-x` produces no match,
and thus, ripgrep shows no matches. But the regex `(?i:e.x|ex)` matches
`e-x`.
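The false negative can be reproduced with a toy prefilter. This is only a sketch: `prefilter_may_match` is a hypothetical helper that stands in for the multi-substring matcher, not ripgrep's actual code. It reports whether any extracted literal occurs in the haystack, which is the condition under which the real regex would be run at all.

```rust
/// Hypothetical stand-in for a multi-substring prefilter: the haystack is
/// only handed to the full regex when at least one literal occurs in it.
fn prefilter_may_match(haystack: &str, literals: &[&str]) -> bool {
    literals.iter().any(|lit| haystack.contains(lit))
}

/// The literal set reported by `--trace` for `(?i:e.x|ex)` before the fix.
const BUGGY_LITERALS: [&str; 4] = ["EX", "Ex", "eX", "ex"];

/// The literal set described later in this message, after the fix.
const FIXED_LITERALS: [&str; 2] = ["X", "x"];
```

With the buggy set, `prefilter_may_match("e-x", &BUGGY_LITERALS)` is false, so the line is skipped even though the regex matches it. With the fixed set the haystack passes the prefilter and the regex gets a chance to confirm the match.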
The issue at play here was that when two extracted literal sequences
were unioned, we were incorrectly unioning their "prefix" attribute.
And this in turn leads to those literal sequences being combined
incorrectly via cross product. This case in particular triggers it
because two different optimizations combine to produce an incorrect
result. Firstly, the regex has a common prefix extracted and is
rewritten as `(?i:e(?:.x|x))`. Secondly, the `x` in the first branch of
the alternation has its `prefix` attribute set to `false` (correctly),
which means it can't be cross producted with another concatenation. But
in this case, it is unioned with the `x` from the second branch, and
this results in the union result having `prefix` set to `true`. This
in turn pops up and lets it get cross producted with the `e` prefix,
producing an incorrect literal sequence.
We fix this by changing the implementation of `union` to return
`prefix` set to `true` only when *both* literal sequences being unioned
have `prefix` set to `true`.
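The rule can be sketched with a simplified model. Everything here is an assumption made for illustration: `LitSeq`, its fields, and the `cross` method are hypothetical and much simpler than the real extractor. The sketch only shows that `union` keeps `prefix` when both inputs have it, and that a cross product is refused when the right-hand side is not a prefix.

```rust
/// Hypothetical, simplified literal sequence (not ripgrep's real type).
#[derive(Clone, Debug, PartialEq)]
struct LitSeq {
    lits: Vec<String>,
    /// Assumed meaning: the literals start exactly where the
    /// sub-expression they were extracted from starts.
    prefix: bool,
}

impl LitSeq {
    fn new(lits: &[&str], prefix: bool) -> LitSeq {
        LitSeq { lits: lits.iter().map(|s| s.to_string()).collect(), prefix }
    }

    /// Union of two sequences. The result is a prefix only when BOTH
    /// inputs are prefixes, which is the fix described above.
    fn union(mut self, other: LitSeq) -> LitSeq {
        self.prefix = self.prefix && other.prefix;
        self.lits.extend(other.lits);
        self.lits.sort();
        self.lits.dedup();
        self
    }

    /// Cross product `self x rhs`, refused when `rhs` is not a prefix.
    fn cross(&self, rhs: &LitSeq) -> Option<LitSeq> {
        if !rhs.prefix {
            return None;
        }
        let mut lits = Vec::with_capacity(self.lits.len() * rhs.lits.len());
        for a in &self.lits {
            for b in &rhs.lits {
                lits.push(format!("{}{}", a, b));
            }
        }
        Some(LitSeq { lits, prefix: self.prefix })
    }
}
```

In this model, the `x` from `.x` is `LitSeq::new(&["X", "x"], false)` and the `x` from the second branch is `LitSeq::new(&["X", "x"], true)`. Their union has `prefix` set to `false`, so crossing it with the `e` prefix is refused instead of producing `EX`, `Ex`, `eX` and `ex`.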
Doing this exposed a second bug that was present, but was purely
cosmetic: the extracted literals in this case, after the fix, are
`X` and `x`. They were considered "exact" (i.e., lead to a match),
but of course they are not. Observing an `X` or an `x` does not mean
there is a match. This was fixed by making `choose` always return
an inexact literal sequence. This is perhaps too conservative in
aggregate in some cases, but always correct. The idea here is that if
one is choosing between two concatenations, then it is likely the case
that the sequence returned should be considered inexact. The issue
is that this can lead to avoiding cross products in some cases that
would otherwise be correct. This is bad because it means extracting
shorter literals in some cases. (In general, the longer the literal the
better.) But we prioritize correctness for now and fix it. You can see
a few tests where this shortens some extracted literals.
Fixes #2884
This removes `once_cell` (a dependency of `cc`) but adds `shlex` (also a
dependency of `cc`). AFAIK, ripgrep does not utilize anything in `cc`
that requires `shlex`, so it is pretty unfortunate that we have to spend
time compiling it. (We use `cc` only when the `pcre2` feature is
enabled.)
Rewrites the char_to_escaped_literal and bytes_to_escaped_literal
functions in a way that minimizes heap allocations. After this, the
resulting string is the only allocation remaining.
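As a rough illustration of the single-allocation idea, here is a sketch. It is not the actual rewrite: `escape_bytes`, `push_escaped_str`, and the meta character list are hypothetical, and the `\xNN` output format is an assumption. The point is that every piece is appended directly to one output `String`, with `write!` formatting hex escapes in place rather than building temporary strings.

```rust
use std::fmt::Write;

/// Hypothetical meta character set, chosen only for this sketch.
const META: &str = r"\.+*?()|[]{}^$";

/// Append `s` to `out`, backslash-escaping assumed meta characters.
fn push_escaped_str(out: &mut String, s: &str) {
    for ch in s.chars() {
        if META.contains(ch) {
            out.push('\\');
        }
        out.push(ch);
    }
}

/// Escape arbitrary bytes into a pattern-like string. Valid UTF-8 is
/// copied through (with meta characters escaped); invalid bytes become
/// `\xNN`. The only heap allocation is `out` (which may grow).
fn escape_bytes(mut bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len());
    while !bytes.is_empty() {
        match std::str::from_utf8(bytes) {
            Ok(s) => {
                push_escaped_str(&mut out, s);
                break;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // `valid` is UTF-8 by construction, so this cannot fail.
                push_escaped_str(&mut out, std::str::from_utf8(valid).unwrap());
                // `None` means the input ended mid-sequence.
                let bad = err.error_len().unwrap_or(rest.len());
                for &b in &rest[..bad] {
                    // Writing to a `String` never returns an error.
                    write!(out, "\\x{:02X}", b).unwrap();
                }
                bytes = &rest[bad..];
            }
        }
    }
    out
}
```

The output string can still reallocate if escapes make it longer than the input, so "only allocation" here means one buffer, not necessarily one call to the allocator.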
I believe when this code was originally written, the routines available
to avoid heap allocations didn't exist.
I'm skeptical that this matters in the grand scheme of things, but I
think this is still worth doing for "good sense" reasons.
PR #2833
This should hopefully avoid confusion where the use of the version
number in the issue template is mistaken for the implication that the
version must therefore be recent.
Ref #2824
Stdin heuristic detection is complicated and opaque enough that it's
worth having easy access to the complete story that leads ripgrep to
decide whether to search stdin or not.
Ref #2806
This seems to be causing confusion. And since we don't use it as of
ripgrep 14, let's just remove it.
Man page generation is now done by ripgrep itself. That is:
    rg --generate man > rg.1
Closes #2801
Some of the new hyperlink work caused ripgrep to stop compiling
on non-{Unix,Windows} platforms. The most popular of which is WASI.
This commit makes non-{Unix,Windows} compile again. And we add a
very basic WASI test in CI to catch regressions.
More work is needed to make tests on non-{Unix,Windows} platforms
work. And of course, this commit specifically takes the path of disabling
hyperlink support for non-{Unix,Windows} platforms.
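One common way to gate a feature like this by platform is conditional compilation. The sketch below shows the general pattern only. The function name `hyperlinks_supported` is hypothetical and is not taken from ripgrep's source.

```rust
/// Hypothetical capability check: compiled on Unix and Windows targets.
#[cfg(any(unix, windows))]
fn hyperlinks_supported() -> bool {
    true
}

/// Fallback for every other target (for example WASI), where the
/// feature is simply reported as unavailable so the crate still builds.
#[cfg(not(any(unix, windows)))]
fn hyperlinks_supported() -> bool {
    false
}
```

Because exactly one of the two definitions is compiled for any given target, callers can use the function unconditionally and the platform-specific code never needs to exist on unsupported targets.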
Notably, this removes winapi in favor of windows-sys, as a result of
winapi-util switching over to windows-sys[1].
Annoyingly, when PCRE2 is enabled, this brings in a dependency on
`once_cell`[2]. I had worked to remove it from my dependencies and now
it's back. Gah. I suppose I could disable the `parallel` feature of
`cc`, but that doesn't seem like a good trade-off.
[1]: https://github.com/BurntSushi/winapi-util/pull/13
[2]: https://github.com/rust-lang/cc-rs/pull/1037
This feature causes nothing but problems and is frequently broken. The
only optimizations it was enabling were SIMD optimizations for
transcoding. In particular, for UTF-16 transcoding. This is performed by
the [`encoding_rs`](https://github.com/hsivonen/encoding_rs) crate,
which specifically uses unstable portable SIMD APIs instead of the
stable non-portable SIMD APIs.
SIMD optimizations that apply to search have long been making use of
stable APIs, and are automatically enabled when your target supports
them. This is, IMO, the correct user experience and one that
`encoding_rs` refuses to support. I'm done dealing with it, so
transcoding will only use scalar code until the SIMD optimizations in
`encoding_rs` work on stable. (This doesn't mean that `encoding_rs` has
to change. This could also be fixed by stabilizing `std::simd`.)
Fixes #2748
In effect, we switch from `path.is_file()` to `!path.is_dir()`. In cases
where process substitution is used, for example, the path can actually
have type "fifo" instead of "file." Even if it's a fifo, we want to
treat it as-if it were a file. The real key here is that we basically
always want to consider a lone argument as a file so long as we know it
isn't a directory. Because a directory is the only thing that will
cause us to (potentially) search more than one thing.
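The difference between the two checks can be sketched as follows. The function name `is_one_file` and its shape are assumptions for illustration. Only the switch from `path.is_file()` to `!path.is_dir()` comes from the description above.

```rust
use std::path::Path;

/// Hypothetical helper: treat a lone path argument as a single file
/// unless it is known to be a directory. A fifo (as produced by process
/// substitution) is not a regular file, so `path.is_file()` would
/// reject it, but `!path.is_dir()` accepts it.
fn is_one_file(path: &Path) -> bool {
    // Note: in this sketch a path that does not exist also counts as
    // "not a directory", since `is_dir()` returns false for it.
    !path.is_dir()
}
```

With this check, an argument like `<(echo foo)` in a shell that supports process substitution is treated the same way as a regular file argument.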
Fixes #2736