mirror of https://github.com/BurntSushi/ripgrep.git synced 2024-12-12 19:18:24 +02:00
Commit Graph

218 Commits

Author SHA1 Message Date
Andrew Gallant
fbb2cfed28 printer: trim line terminator before doing replacements
This is basically the same bug as #1401, but applied to replacements
instead of --only-matching.

Fixes #1739
2021-05-31 21:51:18 -04:00
Andrew Gallant
af8b27ffae changelog: fish completions are staying
In a previous release, I announced that Fish completions were being
removed. But the Fish project decided to remove theirs and have
ripgrep's stay.

Closes #1577
2021-05-31 21:51:18 -04:00
Andrew Gallant
ee23ab5173 printer: trim line terminator before finding submatches
This fixes a bug where PCRE2 look-around could change the result of a
match if it observed a line terminator in the printer. And in
particular, this is precisely how the searcher operates: the line is
considered unto itself *without* the line terminator.
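
A minimal sketch of why the terminator matters, using Python's `re`
module as a stand-in for PCRE2 (an assumption; the pattern and sample
line are made up for illustration and are not taken from the commit).
A negative look-ahead observes the trailing `\n` when the raw line is
searched, but not when the line is searched without its terminator:

```python
import re

# Hypothetical pattern: "foo" that is not followed by a newline.
pattern = re.compile(r"foo(?!\n)")

line_with_terminator = "foo\n"
line_without_terminator = line_with_terminator.rstrip("\n")

# Raw line: the look-ahead sees "\n" and rejects the match.
raw_match = pattern.search(line_with_terminator)

# Trimmed line: the look-ahead sees end-of-input and succeeds.
trimmed_match = pattern.search(line_without_terminator)

print(raw_match is None, trimmed_match is not None)
```

Trimming the terminator before the second search is what keeps the
two phases in agreement.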

Fixes #1401
2021-05-31 21:51:18 -04:00
Andrew Gallant
efd9cfb2fc grep: fix bugs in handling multi-line look-around
This commit hacks in a bug fix for handling look-around across multiple
lines. The main problem is that by the time the matching lines are sent
to the printer, the surrounding context---which some look-behind or
look-ahead might have matched---could have been dropped if it wasn't
part of the set of matching lines. Therefore, when the printer re-runs
the regex engine in some cases (to do replacements, color matches, etc
etc), it won't be guaranteed to see the same matches that the searcher
found.

Overall, this is a giant clusterfuck and suggests that the way I divided
the abstraction boundary between the printer and the searcher is just
wrong. It's likely that the searcher needs to handle more of the work of
matching and pass that info on to the printer. The tricky part is that
this additional work isn't always needed. Ultimately, this means a
serious re-design of the interface between searching and printing. Sigh.

The way this fix works is to smuggle the underlying buffer used by the
searcher through into the printer. Since these bugs only impact
multi-line search (otherwise, searches are only limited to matches
across a single line), and since multi-line search always requires
having the entire file contents in a single contiguous slice (memory
mapped or on the heap), it follows that the buffer we pass through when
we need it is, in fact, the entire haystack. So this commit refactors
the printer's regex searching to use that buffer instead of the intended
bundle of bytes containing just the relevant matching portions of that
same buffer.

There is one last little hiccup: PCRE2 doesn't seem to have a way to
specify an ending position for a search. So when we re-run the search to
find matches, we can't say, "but don't search past here." Since the
buffer is likely to contain the entire file, we really cannot do
anything here other than specify a fixed upper bound on the number of
bytes to search. So if look-ahead goes more than N bytes beyond the
match, this code will break by simply being unable to find the match. In
practice, this is probably pretty rare. I believe that if we did a
better fix for this bug by fixing the interfaces, then we'd probably try
to have PCRE2 find the pertinent matches up front so that it never needs
to re-discover them.
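
The idea can be sketched with Python's `re` module standing in for the
regex engine (an assumption, as are the pattern, the sample haystack and
the `MAX_LOOKAHEAD` bound). Unlike the PCRE2 situation described above,
Python's `Pattern.search` accepts an end position, which is used here
only to emulate the fixed upper bound:

```python
import re

pattern = re.compile(r"(?<=foo\n)bar")  # hypothetical multi-line look-behind
haystack = "foo\nbar\nbaz\n"

# The searcher reports only the matching line: bytes 4..8, i.e. "bar\n".
line_start, line_end = 4, 8
matching_line = haystack[line_start:line_end]

# Re-running the regex on just that slice loses the look-behind context.
on_slice = pattern.search(matching_line)

# Searching the whole buffer, starting at the line, keeps the context
# visible. MAX_LOOKAHEAD is a made-up stand-in for the fixed bound.
MAX_LOOKAHEAD = 128
end_bound = min(len(haystack), line_end + MAX_LOOKAHEAD)
on_buffer = pattern.search(haystack, line_start, end_bound)

print(on_slice, on_buffer.span())
```

The slice search finds nothing, while the search over the full buffer
recovers the match the searcher originally saw.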

Fixes #1412
2021-05-31 21:51:18 -04:00
Andrew Gallant
656aa12649 printer: fix multi-line replacement bug
This commit fixes a subtle bug in multi-line replacement of line
terminators.

The problem is that even though ripgrep supports multi-line searches, it
is *still* line oriented. It still needs to print line numbers, for
example. For this reason, there are various parts in the printer that
iterate over lines in order to format them into the desired output.

This turns out to be problematic in some cases. #1311 documents one of
those cases (with line numbers enabled to highlight a point later):

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?
    2:world?

But the desired output is this:

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?world?

At first I had thought that the main problem was that the printer was
taking ownership of writing line terminators, even if the input already
had them. But it's more subtle than that. If we fix that issue, we get
output like this instead:

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?2:world?

Notice how '2:' is printed before 'world?'. The reason it works this way
is because matches are reported to the printer in a line oriented way.
That is, the printer gets a block of lines. The searcher guarantees that
all matches that start or end in any of those lines also end or start in
another line in that same block. As a result, the printer uses this
assumption: once it has processed a block of lines, the next match will
begin on a new and distinct line. Thus, things like '2:' are printed.

This is generally all fine and good, but an impedance mismatch arises
when replacements are used. Because now, the replacement can be used to
change the "block of lines" approach. Now, in terms of the output, the
subsequent match might actually continue the current line since the
replacement might get rid of the concept of lines altogether.

We can sometimes work around this. For example:

    $ printf "hello\nworld\n" | rg -U "\n(.)?" -r '?$1'
    hello?world?

Why does this work? It's because the '(.)' after the '\n' causes the
match to overlap between lines. Thus, the searcher guarantees that the
block sent to the printer contains every line.

And therein lies the solution: all we need to do is tweak the multi-line
searcher so that it combines lines with matches that are directly adjacent,
instead of requiring at least one byte of overlap. Fixing that solves
the issue above. It does cause some tests to fail:

* The binary3 test in the searcher crate fails because adjacent line
  matches are now part of one block, and that block is scanned for
  binary data. To preserve the essence of the test, we insert a couple
  dummy lines to split up the blocks.
* The JSON CRLF test. It was testing that we didn't output any messages
  with an empty 'submatches' array. That is indeed still the case. The
  difference is that the messages got combined because of the adjacent
  line merging behavior. This is a slight change to the output, but is
  still correct.
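
The adjacent-line merging described above can be modelled roughly as
follows. This is a simplified sketch, not the searcher's actual code:
`line_span`, `group_blocks` and `render` are hypothetical helpers, and
Python's `re` stands in for the regex engine.

```python
import re


def line_span(haystack, start, end):
    """Expand a match [start, end) to the full lines it touches."""
    line_start = haystack.rfind("\n", 0, start) + 1
    if end > start and haystack[end - 1] == "\n":
        # The match already ends on a line boundary.
        line_end = end
    else:
        nl = haystack.find("\n", end)
        line_end = len(haystack) if nl == -1 else nl + 1
    return line_start, line_end


def group_blocks(haystack, matches, merge_adjacent):
    """Group match line spans into blocks handed to the printer."""
    blocks = []
    for start, end in matches:
        s, e = line_span(haystack, start, end)
        if blocks:
            prev_s, prev_e = blocks[-1]
            # Old rule: merge only on overlap. New rule: also merge
            # when the next span starts exactly where the last ended.
            touches = s <= prev_e if merge_adjacent else s < prev_e
            if touches:
                blocks[-1] = (prev_s, max(prev_e, e))
                continue
        blocks.append((s, e))
    return blocks


def render(haystack, regex, replacement, merge_adjacent):
    """Print one numbered output line per block, after replacement."""
    matches = [m.span() for m in re.finditer(regex, haystack)]
    out = []
    for s, e in group_blocks(haystack, matches, merge_adjacent):
        line_number = haystack.count("\n", 0, s) + 1
        out.append(f"{line_number}:" + re.sub(regex, replacement, haystack[s:e]))
    return out


print(render("hello\nworld\n", r"\n", "?", merge_adjacent=False))
print(render("hello\nworld\n", r"\n", "?", merge_adjacent=True))
```

With overlap required, the model yields two numbered lines; with
adjacent spans merged, it yields the single line the issue asked for.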

Fixes #1311
2021-05-31 21:51:18 -04:00
Andrew Gallant
fc31aedcf3 printer: vimgrep now only prints one line
It turns out that the vimgrep format really only wants one line per
match, even when that match spans multiple lines.

We continue to support the previous behavior (print all lines in a
match) in the `grep-printer` crate. We add a new option to enable the
"only print the first line" behavior, and unconditionally enable it in
ripgrep. We can do that because the option has no effect in single-line
mode, since, well, in that case matches are guaranteed to span one line
anyway.

Fixes #1866
2021-05-31 21:51:18 -04:00
Anthony Huang
578e1992fa cli: add --field-{context,match}-separator flags
These flags permit configuring the bytes used to delimit fields in match
or context lines, where "fields" are things like the file path, line
number, column number and the match/context itself.

Fixes #1842, Closes #1871
2021-05-31 21:51:18 -04:00
Austin Wise
46d0130597 cargo: statically link binary on Windows/MSVC
Before this change, rg.exe depended on vcruntime140.dll, which does not
exist on a fresh install of Windows.

Closes #1613
2021-05-31 21:51:18 -04:00
Andres Suarez
7534d5144f globset: fix recursive suffix over matching
Previously, 'foo/**' would match 'foo', but it shouldn't have. In this
case, not matching 'foo' is what is documented and also seems consistent
with other recursive globbing implementations (like that in zsh).

This also updates the prefix extractor to pull 'foo/' out of 'foo/**'.

Closes #1756
2021-05-31 21:51:18 -04:00
Richard Khoury
a28e664abd ignore: check ignore rules before issuing stat calls
This seems like an obvious optimization but becomes critical when
filesystem operations even as simple as stat can result in significant
overheads; an example of this was a bespoke filesystem layer in Windows
that hosted files remotely and would download them on-demand when
particular filesystem operations occurred. Users of this system who
ensured correct file-type filters were being used could still get
unnecessary file access resulting in large downloads.

Fixes #1657, Closes #1660
2021-05-31 21:51:18 -04:00
Pen Tree
0ca96e004c printer: fix context bug when --max-count is used
In the case where after-context is requested with a match count limit,
we need to be careful not to reset the state tracking the remaining
context lines.
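
A rough model of the state involved, as a sketch only: `search` is a
hypothetical function, plain substring matching replaces the regex, and
it is an assumption that lines matching after the limit are shown as
ordinary context while the window is still open.

```python
def search(lines, needle, after_context, max_count):
    """Return grep-style output lines with a match limit and -A context."""
    out = []
    matches = 0
    remaining = 0  # after-context lines still owed for the last match
    for number, line in enumerate(lines, 1):
        is_match = needle in line
        if is_match and matches < max_count:
            matches += 1
            remaining = after_context
            out.append(f"{number}:{line}")
        elif remaining > 0:
            # Past the limit, a matching line must NOT reset `remaining`;
            # it only consumes one slot of the existing context window.
            remaining -= 1
            out.append(f"{number}-{line}")
        elif matches >= max_count:
            break
    return out


print(search(["a", "a", "x", "a", "y"], "a", after_context=2, max_count=1))
```

If the second branch reset `remaining` on every matching line, the
context would keep extending past the limit.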

Fixes #1380, Closes #1642
2021-05-31 21:51:18 -04:00
Alessandro Menezes
2295061e80 searcher: do UTF-8 BOM sniffing like UTF-16
Previously, we were only looking for the UTF-16 BOM for determining
whether to do transcoding or not. But we should also look for the UTF-8
BOM as well.
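
As an illustration, BOM sniffing over the first bytes of a file might
look like the sketch below. `sniff_bom` and the returned labels are
hypothetical names; the byte sequences come from Python's `codecs`
module.

```python
import codecs


def sniff_bom(prefix):
    """Guess an encoding from a leading byte order mark, if any."""
    if prefix.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    if prefix.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16le"
    if prefix.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16be"
    return None


print(sniff_bom(b"\xef\xbb\xbfhello"))
```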

Fixes #1638, Closes #1697
2021-05-31 21:51:18 -04:00
Raimon Grau
53c4855517 ignore/types: add red
See: https://www.red-lang.org/

Closes #1663
2021-05-31 21:51:18 -04:00
Simon Morgan
121e0135c1 ignore/types: replace duplicate glob with *.aspx.vb
*.aspx.cs was listed twice and the VB variant is missing.

Closes #1683
2021-05-31 21:51:18 -04:00
João Marcos
4566882521 cli: add -. as short option for --hidden
This is somewhat non-standard, but it seems nice on the surface: short
flag names are in short supply, --hidden is probably somewhat common and
-. has an obvious connection with how hidden files are named on Unix.

Closes #1680
2021-05-31 21:51:18 -04:00
Andrew Gallant
12dd455ee9 printer: fix \r\n line terminator handling
This fixes a bug where it was assumed that 'is_suffix' when CRLF
handling was enabled meant that '\r\n' was present. But that's not the
case, and it is intentional that 'is_suffix' only looks for '\n'. (Which
is why #1803 wasn't taken, which tries to fix this by changing
'is_suffix'.)
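
A small sketch of the corrected trimming logic, under the assumption
that lines are byte strings; `trim_line_terminator` is a hypothetical
helper, not the printer's API. Seeing a trailing `\n` with CRLF mode on
does not prove a `\r` precedes it, so the `\r` is checked separately:

```python
def trim_line_terminator(line, crlf):
    """Strip one trailing line terminator from a bytes line."""
    if line.endswith(b"\n"):
        line = line[:-1]
        # Only drop "\r" if it is really there; blindly removing two
        # bytes would eat the last character of a plain "\n" line.
        if crlf and line.endswith(b"\r"):
            line = line[:-1]
    return line


print(trim_line_terminator(b"foo\n", crlf=True))
print(trim_line_terminator(b"foo\r\n", crlf=True))
```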

Fixes #1765, Closes #1803
2021-05-31 21:51:18 -04:00
goto-engineering
e6cac8b119 cli: print warning if nothing was searched
This was once part of ripgrep, but at some point, was unintentionally
removed. The value of this warning is that since ripgrep tries to be
"smart" by default, it can be surprising if it doesn't search certain
things. This warning covers the case when ripgrep searches *nothing*,
which happens somewhat more frequently than you might expect. e.g., If
you're searching within an ignored directory.

Note that for now, we only print this message when the user has not
supplied any explicit paths. It's not clear that we want to print this
otherwise, and in particular, it seems that the message shows up too
eagerly. e.g., 'rg foo does-not-exist' will both print an error about
'does-not-exist' not existing, *and* the message about no files being
searched, which seems annoying in this case. We can always refine this
logic later.

Fixes #1404, Closes #1762
2021-05-31 21:51:18 -04:00
Ilya Grigoriev
51d2db7f19 doc: document '{a,b}' glob syntax
This syntax does not exist in `git`, so it is not documented in `man
gitignore`. There is a question of whether it *should* exist, but as
long as it does, it should be documented somewhere.

See also:
https://github.com/BurntSushi/ripgrep/issues/1221
https://github.com/BurntSushi/ripgrep/issues/1368

Closes #1816
2021-05-31 21:51:18 -04:00
Jade
26a29c750e doc: clarify --files-with-matches and --files-without-match
Ref https://github.com/BurntSushi/ripgrep/issues/103#issuecomment-763083510

Closes #1869
2021-05-31 21:51:18 -04:00
Andrew Gallant
a77b914e7a args: make --passthru and -A/-B/-C override each other
Fixes #1868
2021-05-31 21:51:18 -04:00
Andrew Gallant
2e2af50a4d
doc: add vulnerability report docs
Fixes #1773
2021-05-29 09:53:18 -04:00
Andrew Gallant
229d1a8d41
cli: fix arbitrary execution of program bug
This fixes a bug only present on Windows that would permit someone to
execute an arbitrary program if they crafted an appropriate directory
tree. Namely, if someone put an executable named 'xz.exe' in the root of
a directory tree and one ran 'rg -z foo' from the root of that tree,
then the 'xz.exe' executable in that tree would execute if there are any
'xz' files anywhere in the tree.

The root cause of this problem is that 'CreateProcess' on Windows will
implicitly look in the current working directory for an executable when
it is given a relative path to a program. Rust's standard library allows
this behavior to occur, so we work around it here. We work around it by
explicitly resolving programs like 'xz' via 'PATH'. That way, we only
ever pass an absolute path to 'CreateProcess', which avoids the implicit
behavior of checking the current working directory.
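
The workaround can be sketched as below. This is an illustration in
Python rather than the actual Rust code; `resolve_via_path` is a
hypothetical helper, and Windows details such as `PATHEXT` are left out.
The lookup is written by hand so that the current directory is never
consulted implicitly:

```python
import os


def resolve_via_path(program, path_var=None):
    """Resolve a bare program name to an absolute path using PATH only."""
    if os.path.dirname(program):
        # Sketch choice: only bare names like "xz" are resolved here.
        return None
    if path_var is None:
        path_var = os.environ.get("PATH", "")
    for directory in path_var.split(os.pathsep):
        # Skip empty and relative entries, which would re-introduce
        # the current working directory into the search.
        if not directory or not os.path.isabs(directory):
            continue
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


print(resolve_via_path("xz"))
```

The caller would then spawn the returned absolute path, or report an
error if nothing was found.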

This fix doesn't apply to non-Windows systems as it is believed to only
impact Windows. In theory, the bug could apply on Unix if '.' is in
one's PATH, but at that point, you reap what you sow.

While the extent to which this is a security problem isn't clear, I
think users generally expect to be able to download or clone
repositories from the Internet and run ripgrep on them without fear of
anything too awful happening. Being able to execute an arbitrary program
probably violates that expectation. Therefore, CVE-2021-3013[1] was
created for this issue.

We apply the same logic to the --pre command, since the --pre command is
likely in a user's config file and it would be surprising for something
that the user is searching to modify which preprocessor command is used.

The --pre and -z/--search-zip flags are the only two ways that ripgrep
will invoke external programs, so this should cover any possible
exploitable cases of this bug.

[1] - https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3013
2021-05-29 09:36:48 -04:00
Andrew Gallant
8ec6ef373f
changelog: sync with commits since last release
I'm hoping to get a release out soon, and this is the first step.
2021-05-29 08:26:46 -04:00
Andrew Gallant
581a35e568
impl: fix --multiline anchored match bug
This fixes a bug where using \A or (?-m)^ in combination with
-U/--multiline would permit matches that aren't anchored to the
beginning of the file. The underlying cause was an optimization that
occurred when mmaps couldn't be used. Namely, ripgrep tries to still
read the input incrementally if it knows the pattern can't match through
a new line. But the detection logic was flawed, since it didn't account
for line anchors. This commit fixes that.
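
The failure mode is easy to reproduce in miniature. In this sketch,
Python's `re` stands in for the regex engine and the sample input is
made up: when the input is fed to the regex one line at a time, every
line looks like the beginning of the file to `\A`.

```python
import re

pattern = re.compile(r"\Afoo")  # anchored to the very start of the input
haystack = "bar\nfoo\n"

# Whole input at once: no match, because the input starts with "bar".
whole = pattern.search(haystack)

# Incremental, line-by-line search: the second line wrongly matches.
per_line = [
    bool(pattern.search(line))
    for line in haystack.splitlines(keepends=True)
]

print(whole, per_line)
```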

Fixes #1878, Fixes #1879
2021-05-29 07:37:28 -04:00
Andrew Gallant
94e4b8e301
printer: fix --vimgrep for multi-line mode
It turned out that --vimgrep wasn't quite getting the column of each
match correctly. Instead of printing column numbers relative to the
current line, it was printing column numbers as byte offsets relative to
where the match began. To fix this, we simply subtract the offset of the
current line from the beginning of the match. If the beginning of the
match came before the start of the current line, then there's really
nothing sensible we can do other than to use a column number of 1, which
we now document.
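
In other words, the computation amounts to something like this sketch,
where `vimgrep_column` is a hypothetical name and columns are assumed
to be 1-based byte offsets within the line:

```python
def vimgrep_column(match_start, line_start):
    """Column of a match relative to the line being printed."""
    if match_start < line_start:
        # The match began on an earlier line; fall back to column 1.
        return 1
    return match_start - line_start + 1


print(vimgrep_column(10, 6), vimgrep_column(3, 6))
```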

Interestingly, existing tests were checking that the previous behavior
was intended. My only defense is that I somehow tricked myself into
thinking it was a byte offset instead of a column number.

Kudos to @bfrg for calling this out in #1866:
https://github.com/BurntSushi/ripgrep/issues/1866#issuecomment-841635553
2021-05-15 08:27:59 -04:00
Roey Darwish Dror
020c5453a5
cli: fix stdin detection for Powershell on Unix
It seems that PowerShell uses sockets instead of FIFOs to redirect the
output between commands. So add `is_socket` to our `is_readable_stdin`
check.

This seems unlikely to cause problems and it is probably more generally
correct than what we had before. In theory, it could cause problems if
it produces false positives, in which case, ripgrep will try to read
stdin when it should search the current working directory. (And this
usually winds up manifesting as ripgrep blocking forever.) But, if the
stdin handle reports itself as a socket, then it seems like we should
read it.
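
The shape of such a check, sketched in Python rather than Rust:
`is_readable_stdin_mode` is a hypothetical helper that takes a file
mode, and the inclusion of regular files alongside FIFOs and sockets is
an assumption about the existing check.

```python
import stat


def is_readable_stdin_mode(mode):
    """Decide from a file mode whether stdin looks like piped input."""
    return (
        stat.S_ISREG(mode)      # redirected from a file (assumed)
        or stat.S_ISFIFO(mode)  # classic shell pipe
        or stat.S_ISSOCK(mode)  # PowerShell-style redirection
    )


print(is_readable_stdin_mode(stat.S_IFSOCK))
```

In practice the mode would come from something like
`os.fstat(sys.stdin.fileno()).st_mode`.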

Fixes #1741, Closes #1742
2020-11-23 10:23:34 -05:00
Andrew Gallant
2819212f89 printer: tweak binary detection message format
This roughly matches similar changes made in GNU grep recently.
2020-11-02 10:52:51 -05:00
Josh Soref
def993bad1
spelling: fix various misspellings
These were found by the check spelling action[1] and reported
here[2].

PR #1685 

[1] - https://github.com/marketplace/actions/check-spelling
[2] - 6f02d05671 (commitcomment-42625778)
2020-09-22 10:29:16 -04:00
Andrew Gallant
e6e50054b0
doc: document cygwin path translation behavior
Kudos to @Pyker for posting more details about this.

Closes #1277
2020-09-13 09:29:28 -04:00
Martin Michlmayr
1b2c1dc675
doc: fix typos
PR #1605
2020-06-04 09:06:09 -04:00
Andrew Gallant
b1e3de246c
changelog: add empty TBD section to CHANGELOG
And update the release checklist to mention this process.
2020-05-29 09:49:45 -04:00
Andrew Gallant
a73c0a21d9 changelog: 12.1.1
2020-05-29 09:26:33 -04:00
Andrew Gallant
a700b75843
doc: clarify capture group indices
And in particular, note the special $0 index, which corresponds to the
entire match.

Fixes #1591
2020-05-21 22:22:51 -04:00
Andrew Gallant
1980630f17
doc: fix egregious markup output
We use '+++' syntax to output a literal '**' for a '--glob' example.
This '+++' syntax is pretty ugly when rendered literally via --help. We
fix this by hackily inserting the '+++' syntax, for the one specific case
where we need it, during man page generation.

Not ideal but it works. And --help still has some '*foo*' markup, but we
live with that for now.

Fixes #1581
2020-05-13 08:13:05 -04:00
Andrew Gallant
6162b000a3 changelog: 12.1.0
2020-05-09 11:36:44 -04:00
Andrew Gallant
b56315ea84 changelog: add #1550 to CHANGELOG
2020-05-08 23:37:17 -04:00
Andrew Gallant
e02bb6b99a changelog: add downstream notices
2020-05-08 23:24:40 -04:00
Chayoung You
16a1221fc7 doc: use asciidoctor instead of a2x
AsciiDoc development is continued under asciidoctor. See
https://github.com/asciidoc/asciidoc.

We do however fallback to a2x if asciidoctor is not present. This is to
ease migration, but at some point, it's likely that support for a2x will
be dropped.

Originally reported downstream:
https://github.com/Homebrew/linuxbrew-core/issues/19885

Closes #1544
2020-05-08 23:24:40 -04:00
Wieland Hoffmann
df7a3bfc7f grep-cli: support files compressed by compress(1)
While Linux distributions (at least Arch Linux, RHEL, Debian) do not support
compressing files with compress(1), macOS & AIX do (the utility is part of
POSIX). Additionally, gzip is able to uncompress such compressed files and
provides an `uncompress` binary.

Closes #1547
2020-05-08 23:24:40 -04:00
Andrew Gallant
0eb2501b6e doc: add a section about --pre to the GUIDE
Fixes #1252
2020-05-08 23:24:40 -04:00
Andrew Gallant
64a4dee495 cli: improve invalid UTF-8 pattern error message
When a pattern with invalid UTF-8 is given, the error message suggests
unqualified use of hex escape sequences to match arbitrary bytes. But
you *also* need to disable Unicode mode. So include that in the error
message.

Fixes #1339
2020-05-08 23:24:40 -04:00
Andrew Gallant
50840ea43b doc: note how to escape a '$' in --replace
Fixes #1524
2020-05-08 23:24:40 -04:00
Andrew Gallant
9a858e4909 doc: add config file note for --type-{add,clear}
This clarifies that persistence is possible via a configuration file.

Fixes #1571
2020-05-08 23:24:40 -04:00
Andrew Gallant
7ed9a31819 printer: fix --count-matches output
In order to implement --count-matches, we simply re-execute the regex on
the spans reported by the searcher. The spans always correspond to the
lines that participated in the match. This is the correct thing to do,
except when the regex contains look-ahead (or look-behind).

In particular, the look-around permits the regex's match success to
depend on an arbitrary point before or after the lines actually
reported as participating in the match. Since only the matched lines are
reported to the printer, it is possible for subsequent searching on
those lines to fail.

A true fix for this would somehow make the total span available to the
printer. But that seems tricky since it isn't always available. For
PCRE2's case in multiline mode, it is available because we force it to
be so for correctness.

For now, we simply detect this corner case heuristically. If the match
count is zero, then it necessarily means there is some kind of
look-around that isn't matching. So we set the match count to 1. This is
probably incorrect in some cases, although my brain can't quite come up
with a concrete example. Nevertheless, this is strictly better than the
status quo.
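
The heuristic can be sketched as follows, with Python's `re` standing
in for the regex engine; `count_matches` and the sample pattern are
hypothetical:

```python
import re


def count_matches(pattern, reported_lines):
    """Count matches by re-running the regex on the reported lines."""
    count = sum(1 for _ in pattern.finditer(reported_lines))
    # The searcher already said these lines matched, so zero can only
    # mean look-around context was lost. Report one match instead.
    return count if count > 0 else 1


# The look-behind needs "foo\n", which is not part of the reported line.
print(count_matches(re.compile(r"(?<=foo\n)bar"), "bar\n"))
```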

Fixes #1573
2020-05-08 23:24:40 -04:00
Andrew Gallant
1c4b5adb7b
regex: fix another inner literal bug
It looks like `is_simple` wasn't quite correct.

I can't wait until this code is rewritten. It is still not quite clearly
correct to me.

Fixes #1537
2020-04-01 20:37:48 -04:00
Andrew Gallant
1bb30b72fc changelog: prepare for 12.0.1 release, redux
2020-03-29 18:50:31 -04:00
Andrew Gallant
58c428827d changelog: prepare for 12.0.1 release
2020-03-29 18:47:46 -04:00
Andrew Gallant
34edb8123a
ignore: squash noisy error message
We should not assume that the commondir file actually exists. If it
doesn't, then just move on. This otherwise emits an error message when
searching normal submodules, which is not OK.

This regression was introduced in #1446.

Fixes #1520
2020-03-16 18:50:02 -04:00
Andrew Gallant
a8c1fb7c88 changelog: prepare for 12.0.0 release
2020-03-15 21:06:45 -04:00
Andrew Gallant
e772a95b58 regex: avoid using literal optimizations when whitespace is detected
If a literal is entirely whitespace, then it's quite likely that it is
very common. So when that case occurs, just don't do (inner) literal
optimizations at all.

The regex engine may still make sub-optimal decisions here, but that's a
problem for another day.

Fixes #1087
2020-03-15 13:19:14 -04:00
Andrew Gallant
c4c43c733e cli: add --no-ignore-files flag
The purpose of this flag is to force ripgrep to ignore all --ignore-file
flags (whether they come before or after --no-ignore-files).

This flag can be overridden with --ignore-files.

Fixes #1466
2020-03-15 13:19:14 -04:00
Andrew Gallant
447506ebe0 doc: clarify globbing behavior
Fixes #1442, Fixes #1478
2020-03-15 13:19:14 -04:00
Andrew Gallant
12e4180985 doc: remove CPU features from man pages
It doesn't really belong in the man page since it's an artifact of a
build/runtime configuration. Moreover, it inhibits reproducible builds.

Fixes #1441
2020-03-15 13:19:14 -04:00
Andrew Gallant
daa8319398 doc: note ripgrep's stdin behavior
Fixes #1439
2020-03-15 13:19:14 -04:00
pierrenn
3a6a24a52a
cli: add engine flag
This permits switching between the different regex engine modes that
ripgrep supports. The purpose of this flag is to make it easier to
extend ripgrep with additional regex engines.

Closes #1488, Closes #1502
2020-03-15 09:30:58 -04:00
Andrew Gallant
66f045e055
changelog: add commit links
... now that we have stable identifiers.
2020-02-17 17:34:19 -05:00
Andrew Gallant
52d7f47420 ignore: treat symbolic links to directories as directories
Due to how walkdir works if symlinks are not followed, symlinks to
directories are seen as simple files by ripgrep. This caused a panic
in some cases due to receiving a WalkEvent::Exit event without a
corresponding WalkEvent::Dir event.

This is fixed by looking at the metadata of the file in the case of a
symlink to determine if it's a directory. We are careful to only do
this stat check when the depth of the entry is 0, as this bug only
impacts us when 1) we aren't following symlinks generally and 2) the
user provides a symlinked directory that we do follow as a top-level
path to search.

Fixes #1389, Closes #1397
2020-02-17 17:16:28 -05:00
Andrew Gallant
75cbe88fa2 cli: add --no-unicode, deprecate --no-pcre2-unicode
This adds a universal --no-unicode flag that is intended to work for all
supported regex engines. There is no point in retaining
--no-pcre2-unicode, so we make them aliases to the new flags and
deprecate them.
2020-02-17 17:16:28 -05:00
Andrew Gallant
711426a632 cli: add --no-require-git flag
This flag prevents ripgrep from requiring one to search a git repository
in order to respect git-related ignore rules (global, .gitignore and
local excludes). This actually corresponds to behavior ripgrep had long
ago, but #934 changed that. It turns out that users were relying on this
buggy behavior. In most cases, fixing it is as simple as converting one's
rules to .ignore or .rgignore files. Unfortunately, there are other use
cases---like Perforce automatically respecting .gitignore files---that
make a strong case for ripgrep to at least support this.

The UX of a flag like this is absolutely atrocious. It's so obscure that
it's really not worth explicitly calling it out anywhere. Moreover, the
error cases that occur when this flag isn't used (but its behavior is
desirable) will not be intuitive, do not seem easily detectable and will
not guide users to this flag. Nevertheless, the motivation for this is
just barely strong enough for me to begrudgingly accept this.

Fixes #1414, Closes #1416
2020-02-17 17:16:28 -05:00
Andrew Gallant
01eeec56bb deb: fix fish completion install location
It looks like `completions` is owned by Fish itself. Third party
completions should go in `vendor_completions.d`.

Fixes #1485
2020-02-17 17:16:28 -05:00
Jakub Wieczorek
b435eaafc8 grep-regex: fix inner literal extraction bug
This appears to be another transcription bug from copying this code from
the prefix literal detection from inside the regex crate. Namely, when
it comes to inner literals, we only want to treat counted repetition as
two separate cases: the case when the minimum match is 0 and the case
when the minimum match is more than 0. In the former case, we treat
`e{0,n}` as `e*` and in the latter we treat `e{m,n}` where `m >= 1` as
just `e`.

We could definitely do better here. e.g., This means regexes like
`(foo){10}` will only have `foo` extracted as a literal, where searching
for the full literal would likely be faster.

The actual bug here was that we were not implementing this logic
correctly. Namely, we weren't always "cutting" the literals in the
second case to prevent them from being expanded.

Fixes #1319, Closes #1367
2020-02-17 17:16:28 -05:00
Andrew Gallant
5c1eac41a3 changelog: highlight a bad performance regression
2020-02-17 17:16:28 -05:00
Johannes Altmanninger
6f2b79f584 ignore: use git commondir for sourcing .git/info/exclude
Git looks for this file in GIT_COMMON_DIR, which is usually the same
as GIT_DIR (.git). However, when searching inside a linked worktree,
.git is usually a file that contains the path of the actual git dir,
which in turn contains a file "commondir" which references the directory
where info/exclude may reside, alongside other configuration shared across
all worktrees. This directory is usually the git dir of the main worktree.

Unlike git this does *not* read environment variables GIT_DIR and
GIT_COMMON_DIR, because it is not clear how to interpret them when
searching multiple repositories.
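
A sketch of the lookup in Python, not the actual Rust implementation:
`info_exclude_path` is a hypothetical helper. It relies on two facts
about git's layout: in a linked worktree, `.git` is a file of the form
`gitdir: <path>`, and `commondir` holds a path relative to that git dir.
A missing `commondir` simply falls back to the git dir itself.

```python
from pathlib import Path


def info_exclude_path(worktree):
    """Locate info/exclude for a worktree, honoring git's commondir."""
    worktree = Path(worktree)
    dot_git = worktree / ".git"
    if dot_git.is_dir():
        git_dir = dot_git
    elif dot_git.is_file():
        text = dot_git.read_text().strip()
        if not text.startswith("gitdir:"):
            return None
        # A relative gitdir is resolved against the worktree.
        git_dir = worktree / text[len("gitdir:"):].strip()
    else:
        return None

    common_dir = git_dir
    commondir_file = git_dir / "commondir"
    if commondir_file.is_file():
        common_dir = git_dir / commondir_file.read_text().strip()
    return (common_dir / "info" / "exclude").resolve()


print(info_exclude_path("."))
```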

Fixes #1445, Closes #1446
2020-02-17 17:16:28 -05:00
Andrew Gallant
0c3b673e4c cli: make ripgrep work in non-existent directories
It turns out that querying the CWD while in a directory that no longer
exists results in an error. Since the CWD is queried every time ripgrep
starts---whether it needs it or not---for dealing with glob matching,
ripgrep winds up being completely useless inside a non-existent
directory.

We fix this in a few different ways:

* Firstly, if std::env::current_dir() fails, then we fall back to trying
  to read the `PWD` environment variable.
* If that fails, then we return a more sensible error message so that a
  user can at least react to the problem. Previously, the error message
  was inscrutable.
* Finally, we try to avoid the problem altogether by building empty glob
  matchers if no globs were provided, thus side-stepping querying the
  CWD completely.
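
The first two steps above can be sketched in Python, as an illustration
only: `current_dir` is a hypothetical helper and the error text is made
up. The `getcwd` and `environ` parameters exist purely so the fallback
can be exercised without deleting a directory.

```python
import os


def current_dir(getcwd=os.getcwd, environ=os.environ):
    """Return the CWD, falling back to $PWD if the directory is gone."""
    try:
        return getcwd()
    except OSError as err:
        pwd = environ.get("PWD")
        if pwd:
            return pwd
        raise RuntimeError(
            f"failed to get current working directory: {err}; "
            "did your CWD get deleted?"
        ) from err


print(current_dir())
```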

Fixes #1291, Closes #1400
2020-02-17 17:16:28 -05:00
Naveen Nathan
297b428c8c cli: add --no-ignore-exclude flag
This commit adds a new --no-ignore-exclude flag that permits disabling
the use of .git/info/exclude filtering. Local exclusions are manual
configurations to a repository and are not shared, so it is sometimes
useful to disable to get a consistent view of a repository.

This also adds a new section to the man page that describes automatic
filtering.

Closes #1420
2020-02-17 17:16:28 -05:00
Andrew Gallant
cd8ec38a68 grep-regex: add fast path for -w/--word-regexp
Previously, ripgrep would always defer to the regex engine's capturing
matches in order to implement word matching. Namely, ripgrep would
determine the correct match offsets via a capturing group, since the
word regex is itself generated from the user supplied regex.

Unfortunately, the regex engine's capturing mode is still fairly slow,
so this commit adds a fast path to avoid capturing mode in the vast
majority of cases. See comments in the code for details.
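
For illustration, the capture-group approach that the fast path avoids
can be sketched like this. Python's `re` stands in for the regex
engine, `word_match_span` is a hypothetical helper, and the wrapper is
the `(^|\W)(...)($|\W)` form that ripgrep generates for -w:

```python
import re


def word_match_span(user_pattern, line):
    """Find the user's match inside the generated word regex."""
    wrapped = re.compile(r"(^|\W)(" + user_pattern + r")($|\W)")
    m = wrapped.search(line)
    # The overall match includes the surrounding non-word characters,
    # so the real offsets have to come from capture group 2.
    return None if m is None else m.span(2)


print(word_match_span("foo|bar", "a foo."))
```

Resolving group 2 requires the engine to track captures, which is the
slow mode the commit tries to sidestep.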
2020-02-17 17:16:28 -05:00
Andrew Gallant
6a0e0147e0 grep-regex: improve literal detection with -w
When the -w/--word-regexp was used, ripgrep would in many cases fail to
apply literal optimizations. This occurs specifically when the regex
given by the user is an alternation of literals with no common prefixes
or suffixes, e.g.,

    rg -w 'foo|bar|baz|quux'

In this case, the inner literal detector fails. Normally, this would
result in literal prefixes being detected by the regex engine. But
because of the -w/--word-regexp flag, the actual regex that we run ends
up looking like this:

    (^|\W)(foo|bar|baz|quux)($|\W)

which of course defeats any prefix or suffix literal optimizations in
the regex crate's somewhat naive extractor. (A better extractor could
still do literal optimizations in the above case.)

So this commit fixes this by falling back to prefix or suffix literals
when they're available instead of prematurely giving up and assuming the
regex engine will do the rest.
2020-02-17 17:16:28 -05:00
Andrew Gallant
ad97e9c93f grep-regex: improve inner literal detection
This fixes an interesting performance bug where the inner literal
extractor would sometimes choose a sub-optimal literal. For example,
consider the regex:

    \x20+Sherlock Holmes\x20+

(The `\x20` is the ASCII code for a space character, which we use here
to just make it clearer. It otherwise does not matter.)

Previously, this would see the initial \x20 and then stop collecting
literals after the `+` repetition operator. This was because the inner
literal detector was adapted from the prefix literal detector, which had
to stop here. Namely, while \x20S would be a valid prefix (for example),
\x20\x20S would also be a valid prefix. As would \x20\x20\x20S and so
on. So the prefix detector would have to stop at the repetition
operator. Otherwise, only searching for \x20S could potentially scan
farther than the starting position of the next match.

However, for inner literals, this calculus no longer makes sense. We can
freely search for, e.g., \x20S without missing matches that start with
\x20\x20S precisely because we know this is an inner literal which may
not correspond to the start of a match.

With this fix, the literal that is now detected is

    \x20Sherlock Holmes\x20

Which is much better. We achieve this by no longer "cutting" literals
after seeing a `+` repetition operator. Instead, we permit literals to
continue to be extended.

The reason why this is important is because using \x20 as the literal to
search for is generally bad juju since it is so common. In fact, we
should probably add more logic here to either avoid such things or give
up entirely on the inner literal optimization if it detected a literal
that we think is very common. But we punt on such things here.
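
As a rough illustration of the reasoning above (not the actual grep-regex
code), the sketch below uses the inner literal only as a cheap candidate
filter and leaves it to a full matcher to find where a match really starts.
The names `is_candidate`, `find_match` and `search` are hypothetical, and
`find_match` is a hand-written stand-in for the regex engine:

```rust
/// Inner literal for the regex `\x20+Sherlock Holmes\x20+` (illustrative).
const INNER_LITERAL: &str = " Sherlock Holmes ";

/// Cheap prefilter: a line can only match if it contains the inner literal.
/// The literal does not need to line up with the start of the match.
fn is_candidate(line: &str) -> bool {
    line.contains(INNER_LITERAL)
}

/// Hand-written stand-in for the regex engine. Returns the byte range of the
/// leftmost match of `\x20+Sherlock Holmes\x20+` in `line`, if any.
fn find_match(line: &str) -> Option<(usize, usize)> {
    const NAME: &str = "Sherlock Holmes";
    let bytes = line.as_bytes();
    let mut from = 0;
    while let Some(offset) = line[from..].find(NAME) {
        let name_start = from + offset;
        let name_end = name_start + NAME.len();
        // Extend the match over every space before and after the name.
        let mut start = name_start;
        while start > 0 && bytes[start - 1] == b' ' {
            start -= 1;
        }
        let mut end = name_end;
        while end < bytes.len() && bytes[end] == b' ' {
            end += 1;
        }
        // Both `+` repetitions need at least one space.
        if start < name_start && end > name_end {
            return Some((start, end));
        }
        // `NAME` starts with an ASCII byte, so this is a char boundary.
        from = name_start + 1;
    }
    None
}

/// Only run the (comparatively expensive) matcher on candidate lines.
fn search(line: &str) -> Option<(usize, usize)> {
    if is_candidate(line) {
        find_match(line)
    } else {
        None
    }
}
```

Note that a match may begin several spaces before the position where the
literal was found, which is fine here because the matcher, not the literal
search, decides the match boundaries.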
2020-02-17 17:16:28 -05:00
Robert Irelan
24f8a3e5ec doc: document all file type
This adds it to the guide and the docs for the --type flag.

Fixes #1344, Closes #1472
2020-02-17 17:16:28 -05:00
Collin Styles
a070722ff2 cli: add --include-zero flag
This flag, when used in conjunction with --count or --count-matches,
will print a result for each file searched even if there were zero
matches in that file. This is off by default but can be enabled to make
ripgrep behave more like grep.

This also clarifies some of the defaults for the
grep-printer::SummaryBuilder type.

Closes #1370, Closes #1405
2020-02-17 17:16:28 -05:00
Matěj Cepl
4628d77808 ignore/types: add spec file type
This is for RPM package SPEC files.

Fixes #946, Closes #1449
2020-02-17 17:16:28 -05:00
luh2
040ca45ba0 ignore/types: add xhtml to xml file type
Closes #1426
2020-02-17 17:16:28 -05:00
Andrew Gallant
91470572cd changelog: add notes about new file types 2020-02-17 17:16:28 -05:00
Sven-Hendrik Haase
027adbf485 ignore/types: add 'diff' file type
This includes .patch and .diff files.

Fixes #1418, Closes #1419
2020-02-17 17:16:28 -05:00
Mohammad AlSaleh
e71eedf0eb cli: add --no-context-separator flag
--context-separator='' still adds a new line separator, which could
still potentially be useful. So we add a new `--no-context-separator`
flag that completely disables context separators even when the -A/-B/-C
context flags are used.

Closes #1390
2020-02-17 17:16:28 -05:00
sharkdp
a18cf6ec39 ignore: add existence check for ignore files
This commit adds a simple `.exists()` check for `.gitignore`,
`.ignore`, and other similar files before actually calling
`File::open(…)` in `GitIgnoreBuilder::add`.

The reason is that a simple existence check via `stat` can be faster
than actually trying to `open` the file, see
https://stackoverflow.com/a/12774387/704831. As we typically expect(?)
the number of directories *without* ignore files to be much larger
than the number of directories *with* ignore files, this leads to an
overall speedup.

The performance gain is not huge for `rg`, but can be quite significant
if more `.gitignore`-like files are added via
`add_custom_ignore_filename`. The speedup is *larger* for folders with
*low* files-per-directory ratios.

Note though that we do not do this check on Windows until a specific
analysis there suggests this is beneficial. Namely, Windows generally
has slower file system operations, so it's not clear whether this
speculative check is actually a benefit or not.
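
A minimal sketch of the idea, assuming a hypothetical helper
`read_ignore_file` rather than the real `GitIgnoreBuilder::add` code. It
does the cheap existence check first (except on Windows) and still
tolerates the file being absent when it is opened:

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Reads an ignore file if it is present. Returns `Ok(None)` when the file
/// does not exist. (Hypothetical helper, for illustration only.)
fn read_ignore_file(path: &Path) -> io::Result<Option<String>> {
    // Speculative existence check, skipped on Windows as described above.
    if !cfg!(windows) && !path.exists() {
        return Ok(None);
    }
    let mut file = match File::open(path) {
        Ok(file) => file,
        // The check was skipped, or the file vanished in the meantime.
        Err(err) if err.kind() == io::ErrorKind::NotFound => return Ok(None),
        Err(err) => return Err(err),
    };
    let mut contents = String::new();
    file.read_to_string(&mut contents)?;
    Ok(Some(contents))
}
```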

Benchmark results
-----------------

`rg --files` in my home folder (200k results, 6.5 files per directory):

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./rg-master --files` | 396.4 ± 3.2 | 390.9 | 400.0 | 1.05 |
| `./rg-feature --files` | 376.0 ± 3.6 | 369.3 | 383.5 | 1.00 |

`rg --files --hidden` in my home folder (800k results, 5.4
files per directory)

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `./rg-master --files --hidden` | 1.575 ± 0.012 | 1.560 | 1.597 | 1.06 |
| `./rg-feature --files --hidden` | 1.479 ± 0.011 | 1.464 | 1.496 | 1.00 |

`rg --files` in the chromium-79.0.3915.2 source tree (300k results, 12.7 files per
directory)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `~/rg-master --files` | 445.2 ± 5.3 | 435.6 | 453.0 | 1.04 |
| `~/rg-feature --files` | 428.9 ± 7.0 | 418.2 | 440.0 | 1.00 |

`rg --files` in the linux-5.3 source tree (65k results, 15.1
files per directory)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./rg-master --files` | 94.5 ± 1.9 | 89.8 | 98.5 | 1.02 |
| `./rg-feature --files` | 92.6 ± 2.7 | 88.4 | 98.7 | 1.00 |

Closes #1381
2020-02-17 17:16:28 -05:00
Andrew Gallant
5b10328f41
changelog: update with bug fix 2019-08-02 07:37:27 -04:00
Andrew Gallant
931ab35f76
changelog: start work on 11.0.2 release 2019-08-01 17:42:38 -04:00
Andrew Gallant
d1389db2e3
search: better errors for preprocessor commands
If a preprocessor command could not be started, we now show some
additional context with the error message. Previously, it showed
something like this:

  some/file: No such file or directory (os error 2)

Which is itself pretty misleading. Now it shows:

  some/file: preprocessor command could not start: '"nonexist" "some/file"': No such file or directory (os error 2)

Fixes #1302
2019-06-16 19:02:02 -04:00
Andrew Gallant
e7829c05d3
cli: fix bug where last byte was stripped
In an effort to strip line terminators, we assumed their existence. But
a pattern file may not end with a line terminator, so we shouldn't
unconditionally strip them.

We fix this by moving to bstr's line handling, which does this for us
automatically.
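
To illustrate the difference, here is a std-only sketch (it does not use
bstr, and both function names are hypothetical). The first function shows
the kind of unconditional stripping described above; the second only
removes a terminator that is actually there. Handling `\r\n` as well as
`\n` is an assumption of this sketch:

```rust
/// Buggy approach: assumes the final byte is always a line terminator.
fn strip_last_byte(line: &[u8]) -> &[u8] {
    &line[..line.len().saturating_sub(1)]
}

/// Fixed approach: strip `\n` (and a preceding `\r`) only when present.
fn trim_line_terminator(mut line: &[u8]) -> &[u8] {
    if let [rest @ .., b'\n'] = line {
        line = rest;
        if let [rest @ .., b'\r'] = line {
            line = rest;
        }
    }
    line
}
```

For a final line `foo` with no trailing newline, the first function yields
`fo` while the second leaves `foo` intact.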
2019-04-19 07:11:44 -04:00
Andrew Gallant
5f8805a496
ripgrep: release 11.0.1 2019-04-16 13:10:29 -04:00
Andrew Gallant
d7f57d9aab
ripgrep: release 11.0.0 2019-04-15 18:09:40 -04:00
Andrew Gallant
ef1611b5f5
ripgrep: max-column-preview --> max-columns-preview
Credit to @okdana for catching this. This naming is a bit more
consistent with the existing --max-columns flag.
2019-04-15 06:51:51 -04:00
Andrew Gallant
45d12abbc5
changelog: small fixups 2019-04-14 20:21:55 -04:00
Andrew Gallant
5fde8391f9
changelog: backfill it
I went through every commit since the 0.10.0 release and added anything
that I thought was missing.
2019-04-14 20:04:01 -04:00
Andrew Gallant
967e7ad0de ripgrep: add --auto-hybrid-regex flag
This flag, when set, will automatically dispatch to PCRE2 if the given
regex cannot be compiled by Rust's regex engine. If both engines fail to
compile the regex, then both errors are surfaced.
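
The dispatch logic can be sketched roughly as follows. This is not
ripgrep's implementation: `compile_hybrid` is a hypothetical name, the two
engines are passed in as plain closures, and errors are simplified to
strings:

```rust
/// Try the default engine first and fall back to the second engine only if
/// the first fails to compile the pattern. If both fail, report both errors.
fn compile_hybrid<M>(
    pattern: &str,
    default_engine: impl Fn(&str) -> Result<M, String>,
    fallback_engine: impl Fn(&str) -> Result<M, String>,
) -> Result<M, String> {
    let default_err = match default_engine(pattern) {
        Ok(matcher) => return Ok(matcher),
        Err(err) => err,
    };
    fallback_engine(pattern).map_err(|fallback_err| {
        format!(
            "default engine: {}\nfallback engine: {}",
            default_err, fallback_err
        )
    })
}
```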

Closes #1155
2019-04-14 19:29:27 -04:00
Andrew Gallant
8f14cb18a5 ripgrep: increase pcre2's default JIT stack size
The default stack size is 32KB, and this increases it to 10MB. 32KB is
pretty paltry in the environments in which ripgrep runs, and 10MB is
easily afforded as a maximum size. (The size limit we set for Rust's
regex engine is considerably larger.)

This was motivated by the fact that JIT stack limits have been
observed to be hit in the wild:
https://github.com/Microsoft/vscode/issues/64606
2019-04-14 19:29:27 -04:00
Andrew Gallant
da9d720431 ripgrep: add --pcre2-version flag
This flag will output details about the version of PCRE2 that ripgrep
is using (if any).
2019-04-14 19:29:27 -04:00
Andrew Gallant
5a565354f8 versioning: next version will be ripgrep 11
This sets up the release announcement and briefly describes the
versioning change. The actual version change itself won't happen until
the release.

Closes #1172
2019-04-14 19:29:27 -04:00
Andrew Gallant
2a6532ae71 doc: note cases of exorbitant memory usage
Fixes #1189
2019-04-14 19:29:27 -04:00
Andrew Gallant
ece1f50cfe printer: support previews for long lines
This commit adds support for showing a preview of long lines. While the
default still remains as completely suppressing the entire line, this
new functionality will show the first N graphemes of a matching line,
including the number of matches that are suppressed.

This was unfortunately a fairly invasive change to the printer that
required a bit of refactoring. On the bright side, the single line
and multi-line coloring are now more unified than they were before.

Closes #1078
2019-04-14 19:29:27 -04:00
Andrew Gallant
a7d26c8f14 binary: rejigger ripgrep's handling of binary files
This commit attempts to surface binary filtering in a slightly more
user friendly way. Namely, before, ripgrep would silently stop
searching a file if it detected a NUL byte, even if it had previously
printed a match. This can lead to the user quite reasonably assuming
that there are no more matches, since a partial search is fairly
unintuitive. (ripgrep has this behavior by default because it really
wants to NOT search binary files at all, just like it doesn't search
gitignored or hidden files.)

With this commit, if a match has already been printed and ripgrep detects
a NUL byte, then it will print a warning message indicating that the search
stopped prematurely.

Moreover, this commit adds a new flag, --binary, which causes ripgrep to
stop filtering binary files, but in a way that still avoids dumping
binary data into terminals. That is, the --binary flag makes ripgrep
behave more like grep's default behavior.

For files explicitly specified in a search, e.g., `rg foo some-file`,
no binary filtering is applied (just like no gitignore and no
hidden file filtering is applied). Instead, ripgrep behaves as if you
gave the --binary flag for all explicitly given files.

This was a fairly invasive change, and potentially increases the UX
complexity of ripgrep around binary files. (Before, there were two
binary modes, whereas now there are three.) However, ripgrep is now a
bit louder with warning messages when binary file detection might
otherwise be hiding potential matches, so hopefully this is a net
improvement.

Finally, the `-uuu` convenience now maps to `--no-ignore --hidden
--binary`, since this is closer to the actual intent of the
`--unrestricted` flag, i.e., to reduce ripgrep's smart filtering. As a
consequence, `rg -uuu foo` should now search roughly the same number of
bytes as `grep -r foo`, and `rg -uuua foo` should search roughly the
same number of bytes as `grep -ra foo`. (The "roughly" weasel word is
used because grep's and ripgrep's binary file detection might differ
somewhat---perhaps based on buffer sizes---which can impact exactly what
is and isn't searched.)

See the numerous tests in tests/binary.rs for intended behavior.

Fixes #306, Fixes #855
2019-04-14 19:29:27 -04:00
Andrew Gallant
09108b7fda regex: make multi-literal searcher faster
This makes the case of searching for a dictionary of a very large number
of literals much much faster. (~10x or so.) In particular, we achieve this
by short-circuiting the construction of a full regex when we know we have
a simple alternation of literals. Building the regex for a large dictionary
(>100,000 literals) turns out to be quite slow, even if it internally will
dispatch to Aho-Corasick.

Even that isn't quite enough. It turns out that even *parsing* such a regex
is quite slow. So when the -F/--fixed-strings flag is set, we short
circuit regex parsing completely and jump straight to Aho-Corasick.

We aren't quite as fast as GNU grep here, but it's much closer (less than
2x slower).

In general, this is somewhat of a hack. In particular, it seems plausible
that this optimization could be implemented entirely in the regex engine.
Unfortunately, the regex engine's internals are just not amenable to this
at all, so it would require a larger refactoring effort. For now, it's
good enough to add this fairly simple hack at a higher level.

Unfortunately, if you don't pass -F/--fixed-strings, then ripgrep will
be slower, because of the aforementioned missing optimization. Moreover,
passing flags like `-i` or `-S` will cause ripgrep to abandon this
optimization and fall back to something potentially much slower. Again,
this fix really needs to happen inside the regex engine, although we
might be able to special case -i when the input literals are pure ASCII
via Aho-Corasick's `ascii_case_insensitive`.

Fixes #497, Fixes #838
2019-04-07 19:11:03 -04:00
Andrew Gallant
de0bc78982
deps: bump encoding_rs to 0.8.16
This brings in an updated `encoding_rs` crate that uses `packed_simd`,
which compiles on the latest nightly. Compilation times do appear to be
impacted significantly though.

Fixes #1175 (again)
2019-02-07 17:05:14 -05:00
Andrew Gallant
386dd2806d
changelog: BUG #916
This was fixed by bumping the MSRV above Rust 1.28.

Fixes #916
2019-01-27 13:15:17 -05:00
Andrew Gallant
5fe9a954e6
changelog: BUG #1154 2019-01-27 13:05:50 -05:00
Andrew Gallant
0df71240ff
search: fix -F and -f interaction bug
This fixes what appears to be a pretty egregious regression where the
`-F/--fixed-strings` flag wasn't being applied to patterns supplied via
the `-f/--file` flag. The same bug existed for the `-x/--line-regexp`
flag as well, which we fix here.

Fixes #1176
2019-01-26 16:01:52 -05:00
Andrew Gallant
f3164f2615
exit: tweak exit status logic
This changes how ripgrep emits exit status codes. In particular, any error
that occurs while searching will now cause ripgrep to emit a `2` exit
code, whereas it previously would emit either a `0` or a `1` code based
on whether it matched or not. That is, ripgrep would only emit a `2` exit
code for a catastrophic error.

This tweak includes additional logic that GNU grep adheres to, which seems
like good sense. Namely, if -q/--quiet is given, and an error occurs and
a match occurs, then ripgrep will emit a `0` exit code.
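
The rules described in this message can be summarized in a small sketch.
The function name and its three boolean parameters are hypothetical and do
not reflect how ripgrep actually tracks this state:

```rust
/// Computes the process exit code from the outcome of a search.
fn exit_code(matched: bool, errored: bool, quiet: bool) -> i32 {
    if quiet && matched {
        // With -q/--quiet, a match wins even if an error also occurred.
        0
    } else if errored {
        // Any error during the search now results in `2`.
        2
    } else if matched {
        0
    } else {
        1
    }
}
```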

Closes #1159
2019-01-26 15:44:49 -05:00
Andrew Gallant
31d3e24130
args: prevent panicking in 'rg -h | rg'
Previously, we relied on clap to handle printing either an error
message, or --help/--version output, in addition to setting the exit
status code. Unfortunately, for --help/--version output, clap was
panicking if the write failed, which can happen in fairly common
scenarios via a broken pipe error. e.g., `rg -h | head`.

We fix this by using clap's "safe" API and doing the printing ourselves.
We also set the exit code to `2` when an invalid command has been given.
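
As a sketch of what doing the printing ourselves can look like, the
hypothetical `print_help` below writes the text and reports failures as
values instead of panicking. Treating a broken pipe as a normal, successful
exit is an assumption of this sketch, not something stated above.
`ClosedPipe` is a stand-in writer that simulates a reader such as `head`
having gone away:

```rust
use std::io::{self, Write};

/// Writes help or version text without panicking when the write fails.
fn print_help<W: Write>(mut out: W, text: &str) -> io::Result<()> {
    match out.write_all(text.as_bytes()).and_then(|_| out.flush()) {
        // Assumption: a closed pipe is not worth reporting as an error.
        Err(err) if err.kind() == io::ErrorKind::BrokenPipe => Ok(()),
        result => result,
    }
}

/// Stand-in writer whose reader has already closed the pipe.
struct ClosedPipe;

impl Write for ClosedPipe {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::BrokenPipe, "pipe closed"))
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```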

Fixes #1125 and partially addresses #1159
2019-01-26 14:39:40 -05:00
Andrew Gallant
bf842dbc7f
doc: add note about inverted flags
Fixes #1091
2019-01-26 14:13:06 -05:00