This adds a new walk type in the `ignore` crate, `WalkParallel`, which
provides a way for recursively iterating over a set of paths in parallel
while respecting various ignore rules.
The API is a bit strange, as a closure producing a closure isn't
something one often sees, but it does seem to work well.
This also allowed us to simplify much of the worker logic in ripgrep
proper, where MultiWorker is now gone.
If a user hits Ctrl-C to exit out of a search in the middle of printing
a line, we don't want to leave the terminal colors screwed up for them.
Catch Ctrl-C using the ctrlc crate, obtain a stdout lock to ensure that
other threads don't continue writing after we do so, reset the terminal,
and exit the program.
Closes#119
This PR introduces a new sub-crate, `ignore`, which primarily provides a
fast recursive directory iterator that respects ignore files like
gitignore and other configurable filtering rules based on globs or even
file types.
This results in a substantial source of complexity moved out of ripgrep's
core and into a reusable component that others can now (hopefully)
benefit from.
While much of the ignore code carried over from ripgrep's core, a
substantial portion of it was rewritten with the following goals in
mind:
1. Reuse matchers built from gitignore files across directory iteration.
2. Design the matcher data structure to be amenable for parallelizing
directory iteration. (Indeed, writing the parallel iterator is the
next step.)
Fixes#9, #44, #45
This particular bug was triggered whenever a search was run in a directory
with a parent directory that contains a relevant .gitignore file. In
particular, before matching against a parent directory's gitignore rules,
a path's leading `./` was not stripped, which results in errant matching.
We now make sure `./` is stripped.
Fixes#184.
This commit completes the initial move of glob matching to an external
crate, including fixing up cross platform support, polishing the
external crate for others to use and fixing a number of bugs in the
process.
Fixes#87, #127, #131
This commit goes a long way toward refactoring glob sets so that the
code is easier to maintain going forward. In particular, it makes the
literal optimizations that glob sets used a lot more structured and much
easier to extend. Tests have also been modified to include glob sets.
There's still a bit of polish work left to do before a release.
This also fixes the immediate issue where large gitignore files were
causing ripgrep to slow way down. While we don't technically fix it for
good, we're a lot better about reducing the number of regexes we
compile. In particular, if a gitignore file contains thousands of
patterns that can't be matched more simply using literals, then ripgrep
will slow down again. We could fix this for good by avoiding RegexSet if
the number of regexes grows too large.
Fixes#134.
This helps #134 by avoiding a slow regex execution path, but doesn't
actually fix the problem. Namely, we've gone from "so slow I'm not going
to keep waiting for rg to finish" to "wow that was slow but at least it
finished before I lost my patience."
It didn't make sense for --quiet to be part of the printer, because --quiet
doesn't just mean "don't print," it also means, "stop after the first
match is found." This needs to be wired all the way up through directory
traversal, and it also needs to cause all of the search workers to quit
as well. We do it with an atomic that is only checked with --quiet is
given.
Fixes#116.
This was already working correctly in multithreaded mode, but in single
threaded mode, a file failing to open caused search to stop. That's bad.
Fixes#98.
This was a result of misinterpreting a feature in grep where NUL bytes
are replaced with \n. The primary reason for doing this is to avoid
excessive memory usage on truly binary data. However, grep only does this
when searching binary files as if they were binary, and which only reports
whether the file matched or not. When grep is told to search binary data
as text (the -a/--text flag), then it doesn't do any replacement so we
shouldn't either.
In general, this makes sense, because the user is essentially asserting
that a particular file that looks like binary is actually text. In that
case, we shouldn't try to replace any NUL bytes.
ripgrep doesn't actually support searching binary data for whether it
matches or not, so we don't actually need the replace_buf function.
However, it does seem like a potentially useful feature.
It seems silly, but on *nix, we can just dump the bytes of the path
straight to the terminal. There's no need to do a UTF-8 check, which
can be costly when printing lots of matches.
This means that `rg pat < file` won't do the expected thing and search
`fil`. Instead, it will recursively search the current directory for `pat`.
This isn't ideal, but is better than the previous behavior, which was to
wait for stdin when running `rg pat`, given the appearance of hanging
forever. The former is an important use case, but the latter is the
*central* use case of ripgrep, so we should make that work.
`rg` can still be used to search stdin on Windows, it just needs to be
done explicitly. e.g., `rg pat - < file` will search for `pat` in `file`.
Fixes#19
If no paths are given to ripgrep, only read from stdin if it's a file or
a FIFO. In particular, if something like `rg foo < /dev/null` is used,
then don't try to read from stdin.
Fixes#35, #81
Closes#26.
Acts like --count but emits only the paths of files with matches,
suitable for piping to xargs. Both mmap and no-mmap searches terminate
after the first match is found. Documentation updated and tests added.
Files like /proc/cpuinfo will advertise themselves as a normal file with
size 0. Normally, this isn't a problem, but if ripgrep decides to use a
memory map, it skipped searching if the file was empty since it's an error
to memory map an empty file. Instead of returning 0, we should just fall
back to standard read calls.
Fixes#55.
A standard glob of `foo/**` will match `foo`, but gitignore semantics
specify that `foo/**` should only match the contents of `foo` and not
`foo` itself. We capture those semantics by translating `foo/**` to
`foo/**/*`.
Fixes#30.
This is kind of a ticky-tack change. I do think ./ as a prefix is
reasonable default, *but* we strip ./ when showing search results, so it
does make sense to be consistent.
Fixes#21.
If a gitignore file in a *parent* directory is used, then it must be
matched relative to the directory it's in. ripgrep wasn't actually
adhering to this rule. Consider an example:
.gitignore
src
llvm
foo
Where `.gitignore` contains `/llvm/` and `foo` contains `test`. When
running `rg test` at the top-level directory, `foo` is correctly searched.
If you `cd` into `src` and re-run the same search, `foo` is ignored because
the `/llvm/` pattern is interpreted with respect to the current working
directory, which is wrong. The problem is that the path of `llvm` is
`./llvm`, which makes it look like it should match.
We fix this by rebuilding the directory path of each file when traversing
gitignores in parent directories. This does come with a small performance
hit.
Fixes#25.
Namely, if a .gitignore inside a sub-directory has an absolute pattern,
e.g., `/foo/`, then we should match it relative to the directory containing
the .gitignore.
We were erroneously neglecting to prefix a pattern like `foo/`
with `**/` (to make `**/foo/`) because it had a slash in it. In fact, the
only reason to neglect a **/ prefix is if the pattern already starts
with **/, or if the pattern is absolute.
Fixes#16, #49, #50, #65
Use-case: While not a vogue technology, VB is still a common file type taught in many university settings and used in many commercial settings. Working with VB files out-of-the-box would provide a lot of value to `ripgrep` users.
Example: I'm working on converting a legacy app to a modern infrastructure. The legacy app mixes CS and VB files liberally, so I always need to check both. For portability, it would be nice to just be able to ask for `-tcs -tvb` without registering with `--type-add` first.
Tests: I didn't notice any coverage aimed at this part of the code, but if I'm mistaken I'll amend the PR.
If you're in a directory that has a parent .gitignore (like, your $HOME),
then it can cause ripgrep to simply not do anything depending on your
ignore rules.
There are probably other scenarios where ripgrep applies some filter that
an end user doesn't expect, so try to catch the worst case (when ripgrep
doesn't search anything).
I don't like having multiple flags do the same thing, but -u, -uu and -uuu
are much easier to remember, particularly with -uuu meaning "search
everything."
For example, when only a single file (or stdin) is being searched, then we
should be able to print directly to the terminal instead of intermediate
buffers. (The buffers are only necessary for parallelism.)
Closes#4.
We do this by avoiding using a RegexSet (*sigh*). In particular, file
type matching has much simpler semantics than gitignore files, so we don't
actually need to care which file type matched. Therefore, we can get away
with a single regex with a giant alternation.
I though plain `read` had usurped them, but when searching a very small
number of files, mmaps can be around 20% faster on Linux. It'd be really
unfortunate to leave that on the table.
Mmap searching doesn't support contexts yet, but we probably don't really
care. And duplicating that logic doesn't sound fun. Without contexts, mmap
searching is delightfully simple.
- Refactored interaction between CLI args and rest of xrep.
- Filling in a lot more options, including file type filtering.
- Fixing some bugs in globbing/ignoring.
- More documentation.
Memory maps appear to degrade quite a bit in the presence of multithreading.
Also, switch to lock free data structures for synchronization. Give each
worker an input and output buffer which require no synchronization.
I'm pretty disappointed by the performance of regex sets. They are
apparently spending a lot of their time in construction of the DFA,
which probably means that the DFA is just too big.
It turns out that it's actually faster to build an *additional* normal
regex with the alternation of every glob and use it as a first-pass
filter over every file path. If there's a match, only then do we try the
more expensive RegexSet.