When searching subdirectories the path was not correctly built and
included duplicate parts. This fix will remove the duplicate part if
possible.
Fixes#1757, Closes#2295
It turns out our fast path for -w/--word-regexp wasn't quite correct in
some cases. Namely, we use `(?m:^|\W)(<original-regex>)(?m:\W|$)` as the
implementation of -w/--word-regexp since `\b(<original-regex>)\b` has
some unintuitive results in certain cases, specifically when
<original-regex> matches non-word characters at match boundaries.
The problem is that using this formulation means that you need to
extract the capture group around <original-regex> to find the "real"
match, since the surrounding (^|\W) and (\W|$) aren't part of the match.
This is fine, but the capture group engine is usually slow, so we have a
fast path where we try to deduce the correct match boundary after an
initial match (before running capture groups). The problem is that doing
this is rather tricky because it's hard to know, in general, whether the
`^` or the `\W` matched.
This still doesn't seem quite right overall, but we at least fix one
more case.
Fixes#2574
Previously, sorting worked by sorting the parents and then sorting the
children within each parent. This was done during traversal, but it only
works when sorting parents preserves the overall order. This generally
only works for '--sort path' in ascending order.
This commit fixes the rest of the sorting behavior by collecting all of
the paths to search and then sorting them before searching. We only
collect all of the paths when sorting was requested.
Fixes#2243, Closes#2361
This causes ripgrep to stop searching an individual file after it has
found a non-matching line. But this only occurs after it has found a
matching line.
Fixes#1790, Closes#1930
*.adb and *.ads are the usual extensions for Ada source code,
and *.gpr indicates a GPRbuild project file used for Ada, and
these days often being combined with alire for package dependency
resolution. Alire stores a bunch of files named alire.toml in
different directories in your (gitignored) cache/dependencies/...
Closes#2013
When the extra-verbose style is set for the types tag, completed types
are displayed along with the patterns they correspond to. This can be
enabled by e.g. adding the following to .zshrc:
zstyle ':completion:*:rg:*:types' extra-verbose true
This change also makes _rg_types use the actual rg specified on the
command line to look up types, and it fixes a mangled complete-all
style check
Fixes#2195
Eclasses are "ebuild libraries" and generally if you're filtering
for/filtering out an ebuild/eclass, you don't want the other either.
Followup to 4dfea016b9Closes#2437
This matches the behavior of GNU grep which does not ignore
before-context and after-context completely if the context flag is also
provided.
Note that this change wasn't done just to match GNU grep. In this case,
GNU grep has the more sensible behavior.
Fixes#2288, Closes#2451
This was originally fixed by using non-capturing groups when joining
patterns in crates/core/args.rs, but before that landed, it ended up
getting fixed via a refactor in the course of migrating to regex 1.9.
Namely, it's now fixed by pushing pattern joining down into the regex
layer, so that patterns can be joined in the most effective way
possible.
Still, #2488 contains a useful test, so we bring that in here. The
test actually failed for `rg -e ')('`, since it expected the command to
fail with a syntax error. But my refactor actually causes this command
to succeed. And indeed, #2488 worked around this by special casing a
single pattern. That work-around fixes it for the single pattern case,
but doesn't fix it for the -w or -X or multi-pattern case. So for now,
we're content to leave well enough alone. The only real way to fix this
for real is to parse each regexp individual and verify that each is
valid on its own. It's not clear that doing so is worth it.
Fixes#2480, Closes#2488
GraphQL file extensions: .graphql and .graphqls (schema)
We could also add `.gql`, but perhaps it's less correct to do so. We'll
start conservatively here, and we can always add `.gql` later.
Closes#2439, Closes#2508
When `resolve_binary()` attempts to resolve a path to a program on
Windows while searching for a program in `PATH` without an extension,
`ripgrep` will assume the extension of the file to be `.exe` as it's
the *de facto* standard, which will work most (99.99%) of the time...
...unless the binary is a COM executable (we're on Windows, duh).
Closes#2523
When a glob pattern ended with a \/, and since we permit backslash
escapes, the glob parser gave a "dangling escape" error. Which is weird,
because the \ is clearly not dangling.
The issue is that the layer above the glob parser, the gitignore parser,
was stripping the trailing / so that it wouldn't be part of the matching
logic. Of course, stripping the trailing / while it is escaped without
removing the backslash escape is wrong. So we do that here.
Fixes#2236
Sadly, PCRE2 has different behavior (but doesn't panic). We should look
into that, but for now, this is good enough.
Also, update the CHANGELOG.
Ref #1891
In a previous release, I announced that Fish completions were being
removed. But the Fish project decided to remove theirs and have
ripgrep's stay.
Closes#1577
This fixes a bug where PCRE2 look-around could change the result of a
match if it observed a line terminator in the printer. And in
particular, this is precisely how the searcher operates: the line is
considered unto itself *without* the line terminator.
Fixes#1401
This commit hacks in a bug fix for handling look-around across multiple
lines. The main problem is that by the time the matching lines are sent
to the printer, the surrounding context---which some look-behind or
look-ahead might have matched---could have been dropped if it wasn't
part of the set of matching lines. Therefore, when the printer re-runs
the regex engine in some cases (to do replacements, color matches, etc
etc), it won't be guaranteed to see the same matches that the searcher
found.
Overall, this is a giant clusterfuck and suggests that the way I divided
the abstraction boundary between the printer and the searcher is just
wrong. It's likely that the searcher needs to handle more of the work of
matching and pass that info on to the printer. The tricky part is that
this additional work isn't always needed. Ultimately, this means a
serious re-design of the interface between searching and printing. Sigh.
The way this fix works is to smuggle the underlying buffer used by the
searcher through into the printer. Since these bugs only impact
multi-line search (otherwise, searches are only limited to matches
across a single line), and since multi-line search always requires
having the entire file contents in a single contiguous slice (memory
mapped or on the heap), it follows that the buffer we pass through when
we need it is, in fact, the entire haystack. So this commit refactors
the printer's regex searching to use that buffer instead of the intended
bundle of bytes containing just the relevant matching portions of that
same buffer.
There is one last little hiccup: PCRE2 doesn't seem to have a way to
specify an ending position for a search. So when we re-run the search to
find matches, we can't say, "but don't search past here." Since the
buffer is likely to contain the entire file, we really cannot do
anything here other than specify a fixed upper bound on the number of
bytes to search. So if look-ahead goes more than N bytes beyond the
match, this code will break by simply being unable to find the match. In
practice, this is probably pretty rare. I believe that if we did a
better fix for this bug by fixing the interfaces, then we'd probably try
to have PCRE2 find the pertinent matches up front so that it never needs
to re-discover them.
Fixes#1412
This commit fixes a subtle bug in multi-line replacement of line
terminators.
The problem is that even though ripgrep supports multi-line searches, it
is *still* line oriented. It still needs to print line numbers, for
example. For this reason, there are various parts in the printer that
iterate over lines in order to format them into the desired output.
This turns out to be problematic in some cases. #1311 documents one of
those cases (with line numbers enabled to highlight a point later):
$ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
1:hello?
2:world?
But the desired output is this:
$ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
1:hello?world?
At first I had thought that the main problem was that the printer was
taking ownership of writing line terminators, even if the input already
had them. But it's more subtle than that. If we fix that issue, we get
output like this instead:
$ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
1:hello?2:world?
Notice how '2:' is printed before 'world?'. The reason it works this way
is because matches are reported to the printer in a line oriented way.
That is, the printer gets a block of lines. The searcher guarantees that
all matches that start or end in any of those lines also end or start in
another line in that same block. As a result, the printer uses this
assumption: once it has processed a block of lines, the next match will
begin on a new and distinct line. Thus, things like '2:' are printed.
This is generally all fine and good, but an impedance mismatch arises
when replacements are used. Because now, the replacement can be used to
change the "block of lines" approach. Now, in terms of the output, the
subsequent match might actually continue the current line since the
replacement might get rid of the concept of lines altogether.
We can sometimes work around this. For example:
$ printf "hello\nworld\n" | rg -U "\n(.)?" -r '?$1'
hello?world?
Why does this work? It's because the '(.)' after the '\n' causes the
match to overlap between lines. Thus, the searcher guarantees that the
block sent to the printer contains every line.
And there in lay the solution: all we need to do is tweak the multi-line
searcher so that it combines lines with matches that directly adjacent,
instead of requiring at least one byte of overlap. Fixing that solves
the issue above. It does cause some tests to fail:
* The binary3 test in the searcher crate fails because adjacent line
matches are now one part of block, and that block is scanned for
binary data. To preserve the essence of the test, we insert a couple
dummy lines to split up the blocks.
* The JSON CRLF test. It was testing that we didn't output any messages
with an empty 'submatches' array. That is indeed still the case. The
difference is that the messages got combined because of the adjacent
line merging behavior. This is a slight change to the output, but is
still correct.
Fixes#1311
It turns out that the vimgrep format really only wants one line per
match, even when that match spans multiple lines.
We continue to support the previous behavior (print all lines in a
match) in the `grep-printer` crate. We add a new option to enable the
"only print the first line" behavior, and unconditionally enable it in
ripgrep. We can do that because the option has no effect in single-line
mode, since, well, in that case matches are guaranteed to span one line
anyway.
Fixes#1866
These flags permit configuring the bytes used to delimit fields in match
or context lines, where "fields" are things like the file path, line
number, column number and the match/context itself.
Fixes#1842, Closes#1871
Previous, 'foo/**' would match 'foo', but it shouldn't have. In this
case, not matching 'foo' is what is documented and also seems consistent
with other recursive globbing implementations (like that in zsh).
This also updates the prefix extractor to pull 'foo/' out of 'foo/**'.
Closes#1756
This seems like an obvious optimization but becomes critical when
filesystem operations even as simple as stat can result in significant
overheads; an example of this was a bespoke filesystem layer in Windows
that hosted files remotely and would download them on-demand when
particular filesystem operations occurred. Users of this system who
ensured correct file-type fileters were being used could still get
unnecessary file access resulting in large downloads.
Fixes#1657, Closes#1660
In the case where after-context is requested with a match count limit,
we need to be careful not to reset the state tracking the remaining
context lines.
Fixes#1380, Closes#1642
Previously, we were only looking for the UTF-16 BOM for determining
whether to do transcoding or not. But we should also look for the UTF-8
BOM as well.
Fixes#1638, Closes#1697
This is somewhat non-standard, but it seems nice on the surface: short
flag names are in short supply, --hidden is probably somewhat common and
-. has an obvious connection with how hidden files are named on Unix.
Closes#1680
This fixes a bug where it was assumed that 'is_suffix' when CRLF
handling was enabled mean that '\r\n' was present. But that's not the
case, and it is intentional that 'is_suffix' only looks for '\n'. (Which
is why #1803 wasn't taken, which tries to fix this by changing
'is_suffix'.)
Fixes#1765, Closes#1803
This was once part of ripgrep, but at some point, was unintentionally
removed. The value of this warning is that since ripgrep tries to be
"smart" by default, it can be surprising if it doesn't search certain
things. This warning covers the case when ripgrep searches *nothing*,
which happens somewhat more frequently than you might expect. e.g., If
you're searching within an ignore directory.
Note that for now, we only print this message when the user has not
supplied any explicit paths. It's not clear that we want to print this
otherwise, and in particular, it seems that the message shows up too
eagerly. e.g., 'rg foo does-not-exist' will both print an error about
'does-not-exist' not existing, *and* the message about no files being
searched, which seems annoying in this case. We can always refine this
logic later.
Fixes#1404, Closes#1762
This fixes a bug only present on Windows that would permit someone to
execute an arbitrary program if they crafted an appropriate directory
tree. Namely, if someone put an executable named 'xz.exe' in the root of
a directory tree and one ran 'rg -z foo' from the root of that tree,
then the 'xz.exe' executable in that tree would execute if there are any
'xz' files anywhere in the tree.
The root cause of this problem is that 'CreateProcess' on Windows will
implicitly look in the current working directory for an executable when
it is given a relative path to a program. Rust's standard library allows
this behavior to occur, so we work around it here. We work around it by
explicitly resolving programs like 'xz' via 'PATH'. That way, we only
ever pass an absolute path to 'CreateProcess', which avoids the implicit
behavior of checking the current working directory.
This fix doesn't apply to non-Windows systems as it is believed to only
impact Windows. In theory, the bug could apply on Unix if '.' is in
one's PATH, but at that point, you reap what you sow.
While the extent to which this is a security problem isn't clear, I
think users generally expect to be able to download or clone
repositories from the Internet and run ripgrep on them without fear of
anything too awful happening. Being able to execute an arbitrary program
probably violates that expectation. Therefore, CVE-2021-3013[1] was
created for this issue.
We apply the same logic to the --pre command, since the --pre command is
likely in a user's config file and it would be surprising for something
that the user is searching to modify which preprocessor command is used.
The --pre and -z/--search-zip flags are the only two ways that ripgrep
will invoke external programs, so this should cover any possible
exploitable cases of this bug.
[1] - https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3013
This fixes a bug where using \A or (?-m)^ in combination with
-U/--multiline would permit matches that aren't anchored to the
beginning of the file. The underlying cause was an optimization that
occurred when mmaps couldn't be used. Namely, ripgrep tries to still
read the input incrementally if it knows the pattern can't match through
a new line. But the detection logic was flawed, since it didn't account
for line anchors. This commit fixes that.
Fixes#1878, Fixes#1879
It turned out that --vimgrep wasn't quite getting the column of each
match correctly. Instead of printing column numbers relative to the
current line, it was printing column numbers as byte offsets relative to
where the match began. To fix this, we simply subtract the offset of the
line number from the beginning of the match. If the beginning of the
match came before the start of the current line, then there's really
nothing sensible we can do other than to use a column number of 1, which
we now document.
Interestingly, existing tests were checking that the previous behavior
was intended. My only defense is that I somehow tricked myself into
thinking it was a byte offset instead of a column number.
Kudos to @bfrg for calling this out in #1866:
https://github.com/BurntSushi/ripgrep/issues/1866#issuecomment-841635553
It seems that PowerShell uses sockets instead of FIFOs to redirect the
output between commands. So add `is_socket` to our `is_readable_stdin`
check.
This seems unlikely to cause problems and it probably more generally
correct than what we had before. In theory, it could cause problems if
it produces false positives, in which case, ripgrep will try to read
stdin when it should search the current working directory. (And this
usually winds up manifesting as ripgrep blocking forever.) But, if the
stdin handle reports itself as a socket, then it seems like we should
read it.
Fixes#1741, Closes#1742
We use '+++' syntax to output a literal '**' for a '--glob' example.
This '+++' syntax is pretty ugly when rendered literally via --help. We
fix this by hackily inserting the '+++' syntax for its one specific case
that we need it during man page generation.
Not ideal but it works. And --help still has some '*foo*' markup, but we
live with that for now.
Fixes#1581
While Linux distributions (at least Arch Linux, RHEL, Debian) do not support
compressing files with compress(1), macOS & AIX do (the utility is part of
POSIX). Additionally, gzip is able to uncompress such compressed files and
provides an `uncompress` binary.
Closes#1547
When a pattern with invalid UTF-8 is given, the error message suggests
unqualified use of hex escape sequences to match arbitrary bytes. But
you *also* need to disable Unicode mode. So include that in the error
message.
Fixes#1339
In order to implement --count-matches, we simply re-execute the regex on
the spans reported by the searcher. The spans always correspond to the
lines that participated in the match. This is the correct thing to do,
except when the regex contains look-ahead (or look-behind).
In particular, the look-around permits the regex's match success to
depends on an arbitrary point before or after the lines actually
reported as participating in the match. Since only the matched lines are
reported to the printer, it is possible for subsequent searching on
those lines to fail.
A true fix for this would somehow make the total span available to the
printer. But that seems tricky since it isn't always available. For
PCRE2's case in multiline mode, it is available because we force it to
be so for correctness.
For now, we simply detect this corner case heuristically. If the match
count is zero, then it necessarily means there is some kind of
look-around that isn't matching. So we set the match count to 1. This is
probably incorrect in some cases, although my brain can't quite come up
with a concrete example. Nevertheless, this is strictly better than the
status quo.
Fixes#1573
We should not assume that the commondir file actually exists. If it
doesn't, then just move on. This otherwise emits an error message when
searching normal submodules, which is not OK.
This regression was introduced in #1446.
Fixes#1520
If a literal is entirely whitespace, then it's quite likely that it is
very common. So when that case occurs, just don't do (inner) literal
optimizations at all.
The regex engine may still make sub-optimal decisions here, but that's a
problem for another day.
Fixes#1087
The purpose of this flag is to force ripgrep to ignore all --ignore-file
flags (whether they come before or after --no-ignore-files).
This flag can be overridden with --ignore-files.
Fixes#1466
It doesn't really belong in the man page since it's an artifact of a
build/runtime configuration. Moreover, it inhibits reproducible builds.
Fixes#1441
This permits switching between the different regex engine modes that
ripgrep supports. The purpose of this flag is to make it easier to
extend ripgrep with additional regex engines.
Closes#1488, Closes#1502
Due to how walkdir works if symlinks are not followed, symlinks to
directories are seen as simple files by ripgrep. This caused a panic
in some cases due to receiving a WalkEvent::Exit event without a
corresponding WalkEvent::Dir event.
This is fixed by looking at the metadata of the file in the case of a
symlink to determine if it's a directory. We are careful to only do
this stat check when the depth of the entry is 0, as this bug only
impacts us when 1) we aren't following symlinks generally and 2) the
user provides a symlinked directory that we do follow as a top-level
path to search.
Fixes#1389, Closes#1397
This adds a universal --no-unicode flag that is intended to work for all
supported regex engines. There is no point in retaining
--no-pcre2-unicode, so we make them aliases to the new flags and
deprecate them.
This flag prevents ripgrep from requiring one to search a git repository
in order to respect git-related ignore rules (global, .gitignore and
local excludes). This actually corresponds to behavior ripgrep had long
ago, but #934 changed that. It turns out that users were relying on this
buggy behavior. In most cases, fixing it as simple as converting one's
rules to .ignore or .rgignore files. Unfortunately, there are other use
cases---like Perforce automatically respecting .gitignore files---that
make a strong case for ripgrep to at least support this.
The UX of a flag like this is absolutely atrocious. It's so obscure that
it's really not worth explicitly calling it out anywhere. Moreover, the
error cases that occur when this flag isn't used (but its behavior is
desirable) will not be intuitive, do not seem easily detectable and will
not guide users to this flag. Nevertheless, the motivation for this is
just barely strong enough for me to begrudgingly accept this.
Fixes#1414, Closes#1416
This appears to be another transcription bug from copying this code from
the prefix literal detection from inside the regex crate. Namely, when
it comes to inner literals, we only want to treat counted repetition as
two separate cases: the case when the minimum match is 0 and the case
when the minimum match is more than 0. In the former case, we treat
`e{0,n}` as `e*` and in the latter we treat `e{m,n}` where `m >= 1` as
just `e`.
We could definitely do better here. e.g., This means regexes like
`(foo){10}` will only have `foo` extracted as a literal, where searching
for the full literal would likely be faster.
The actual bug here was that we were not implementing this logic
correctly. Namely, we weren't always "cutting" the literals in the
second case to prevent them from being expanded.
Fixes#1319, Closes#1367
Git looks for this file in GIT_COMMON_DIR, which is usually the same
as GIT_DIR (.git). However, when searching inside a linked worktree,
.git is usually a file that contains the path of the actual git dir,
which in turn contains a file "commondir" which references the directory
where info/exclude may reside, alongside other configuration shared across
all worktrees. This directory is usually the git dir of the main worktree.
Unlike git this does *not* read environment variables GIT_DIR and
GIT_COMMON_DIR, because it is not clear how to interpret them when
searching multiple repositories.
Fixes#1445, Closes#1446
It turns out that querying the CWD while in a directory that no longer
exists results in an error. Since the CWD is queried every time ripgrep
starts---whether it needs it or not---for dealing with glob matching,
ripgrep winds up being completely useless inside a non-existent
directory.
We fix this in a few different ways:
* Firstly, if std::env::current_dir() fails, then we fall back to trying
to read the `PWD` environment variable.
* If that fails, that we return a more sensible error message so that a
user can at least react to the problem. Previously, the error message
was inscrutable.
* Finally, we try to avoid the problem altogether by building empty glob
matchers if not globs were provided, thus side-stepping querying the
CWD completely.
Fixes#1291, Closes#1400
This commit adds a new --no-ignore-exclude flag that permits disabling
the use of .git/info/exclude filtering. Local exclusions are manual
configurations to a repository and are not shared, so it is sometimes
useful to disable to get a consistent view of a repository.
This also adds a new section to the man page that describes automatic
filtering.
Closes#1420
Previously, ripgrep would always defer to the regex engine's capturing
matches in order to implement word matching. Namely, ripgrep would
determine the correct match offsets via a capturing group, since the
word regex is itself generated from the user supplied regex.
Unfortunately, the regex engine's capturing mode is still fairly slow,
so this commit adds a fast path to avoid capturing mode in the vast
majority of cases. See comments in the code for details.
When the -w/--word-regexp was used, ripgrep would in many cases fail to
apply literal optimizations. This occurs specifically when the regex
given by the user is an alternation of literals with no common prefixes
or suffixes, e.g.,
rg -w 'foo|bar|baz|quux'
In this case, the inner literal detector fails. Normally, this would
result in literal prefixes being detected by the regex engine. But
because of the -w/--word-regexp flag, the actual regex that we run ends
up looking like this:
(^|\W)(foo|bar|baz|quux)($|\W)
which of course defeats any prefix or suffix literal optimizations in
the regex crate's somewhat naive extractor. (A better extractor could
still do literal optimizations in the above case.)
So this commit fixes this by falling back to prefix or suffix literals
when they're available instead of prematurely giving up and assuming the
regex engine will do the rest.
This fixes an interesting performance bug where the inner literal
extractor would sometimes choose a sub-optimal literal. For example,
consider the regex:
\x20+Sherlock Holmes\x20+
(The `\x20` is the ASCII code for a space character, which we use here
to just make it clearer. It otherwise does not matter.)
Previously, this would see the initial \x20 and then stop collecting
literals after the `+` repetition operator. This was because the inner
literal detector was adapter from the prefix literal detector, which had
to stop here. Namely, while \x20S would be a valid prefix (for example),
\x20\x20S would also be a valid prefix. As would \x20\x20\x20S and so
on. So the prefix detector would have to stop at the repetition
operator. Otherwise, only searching for \x20S could potentially scan
farther then the starting position of the next match.
However, for inner literals, this calculus no longer makes sense. We can
freely search for, e.g., \x20S without missing matches that start with
\x20\x20S precisely because we know this is an inner literal which may
not correspond to the start of a match.
With this fix, the literal that is now detected is
\x20Sherlock Holmes\x20
Which is much better. We achieve this by no longer "cutting" literals
after seeing a `+` repetition operator. Instead, we permit literals to
continue to be extended.
The reason why this is important is because using \x20 as the literal to
search for is generally bad juju since it is so common. In fact, we
should probably add more logic here to either avoid such things or give
up entirely on the inner literal optimization if it detected a literal
that we think is very common. But we punt on such things here.