This PR introduces a new sub-crate, `ignore`, which primarily provides a
fast recursive directory iterator that respects ignore files like
gitignore and other configurable filtering rules based on globs or even
file types.
This results in a substantial source of complexity moved out of ripgrep's
core and into a reusable component that others can now (hopefully)
benefit from.
While much of the ignore code carried over from ripgrep's core, a
substantial portion of it was rewritten with the following goals in
mind:
1. Reuse matchers built from gitignore files across directory iteration.
2. Design the matcher data structure to be amenable for parallelizing
directory iteration. (Indeed, writing the parallel iterator is the
next step.)
Fixes#9, #44, #45
This particular bug was triggered whenever a search was run in a directory
with a parent directory that contains a relevant .gitignore file. In
particular, before matching against a parent directory's gitignore rules,
a path's leading `./` was not stripped, which results in errant matching.
We now make sure `./` is stripped.
Fixes#184.
This commit completes the initial move of glob matching to an external
crate, including fixing up cross platform support, polishing the
external crate for others to use and fixing a number of bugs in the
process.
Fixes#87, #127, #131
This commit goes a long way toward refactoring glob sets so that the
code is easier to maintain going forward. In particular, it makes the
literal optimizations that glob sets used a lot more structured and much
easier to extend. Tests have also been modified to include glob sets.
There's still a bit of polish work left to do before a release.
This also fixes the immediate issue where large gitignore files were
causing ripgrep to slow way down. While we don't technically fix it for
good, we're a lot better about reducing the number of regexes we
compile. In particular, if a gitignore file contains thousands of
patterns that can't be matched more simply using literals, then ripgrep
will slow down again. We could fix this for good by avoiding RegexSet if
the number of regexes grows too large.
Fixes#134.
This helps #134 by avoiding a slow regex execution path, but doesn't
actually fix the problem. Namely, we've gone from "so slow I'm not going
to keep waiting for rg to finish" to "wow that was slow but at least it
finished before I lost my patience."
It didn't make sense for --quiet to be part of the printer, because --quiet
doesn't just mean "don't print," it also means, "stop after the first
match is found." This needs to be wired all the way up through directory
traversal, and it also needs to cause all of the search workers to quit
as well. We do it with an atomic that is only checked with --quiet is
given.
Fixes#116.
This was already working correctly in multithreaded mode, but in single
threaded mode, a file failing to open caused search to stop. That's bad.
Fixes#98.
This was a result of misinterpreting a feature in grep where NUL bytes
are replaced with \n. The primary reason for doing this is to avoid
excessive memory usage on truly binary data. However, grep only does this
when searching binary files as if they were binary, and which only reports
whether the file matched or not. When grep is told to search binary data
as text (the -a/--text flag), then it doesn't do any replacement so we
shouldn't either.
In general, this makes sense, because the user is essentially asserting
that a particular file that looks like binary is actually text. In that
case, we shouldn't try to replace any NUL bytes.
ripgrep doesn't actually support searching binary data for whether it
matches or not, so we don't actually need the replace_buf function.
However, it does seem like a potentially useful feature.
It seems silly, but on *nix, we can just dump the bytes of the path
straight to the terminal. There's no need to do a UTF-8 check, which
can be costly when printing lots of matches.
This means that `rg pat < file` won't do the expected thing and search
`fil`. Instead, it will recursively search the current directory for `pat`.
This isn't ideal, but is better than the previous behavior, which was to
wait for stdin when running `rg pat`, given the appearance of hanging
forever. The former is an important use case, but the latter is the
*central* use case of ripgrep, so we should make that work.
`rg` can still be used to search stdin on Windows, it just needs to be
done explicitly. e.g., `rg pat - < file` will search for `pat` in `file`.
Fixes#19