1
0
mirror of https://github.com/BurntSushi/ripgrep.git synced 2024-12-12 19:18:24 +02:00
Commit Graph

277 Commits

Author SHA1 Message Date
Andrew Gallant
8bbe58d623 Add support for additional text encodings.
This includes, but is not limited to, UTF-16, latin-1, GBK, EUC-JP and
Shift_JIS. (Courtesy of the `encoding_rs` crate.)

Specifically, this feature enables ripgrep to search files that are
encoded in an encoding other than UTF-8. The list of available encodings
is tied directly to what the `encoding_rs` crate supports, which is in
turn tied to the Encoding Standard. The full list of available encodings
can be found here: https://encoding.spec.whatwg.org/#concept-encoding-get

This pull request also introduces the notion that text encodings can be
automatically detected on a best effort basis. Currently, the only
support for this is checking for a UTF-16 bom. In all other cases, a
text encoding of `auto` (the default) implies a UTF-8 or ASCII
compatible source encoding. When a text encoding is otherwise specified,
it is unconditionally used for all files searched.

Since ripgrep's regex engine is fundamentally built on top of UTF-8,
this feature works by transcoding the files to be searched from their
source encoding to UTF-8. This transcoding only happens when:

1. `auto` is specified and a non-UTF-8 encoding is detected.
2. A specific encoding is given by end users (including UTF-8).

When transcoding occurs, errors are handled by automatically inserting
the Unicode replacement character. In this case, ripgrep's output is
guaranteed to be valid UTF-8 (excluding non-UTF-8 file paths, if they
are printed).

In all other cases, the source text is searched directly, which implies
an assumption that it is at least ASCII compatible, but where UTF-8 is
most useful. In this scenario, encoding errors are not detected. In this
case, ripgrep's output will match the input exactly, byte-for-byte.

This design may not be optimal in all cases, but it has some advantages:

1. In the happy path ("UTF-8 everywhere") remains happy. I have not been
   able to witness any performance regressions.
2. In the non-UTF-8 path, implementation complexity is kept relatively
   low. The cost here is transcoding itself. A potentially superior
   implementation might build decoding of any encoding into the regex
   engine itself. In particular, the fundamental problem with
   transcoding everything first is that literal optimizations are nearly
   negated.

Future work should entail improving the user experience. For example, we
might want to auto-detect more text encodings. A more elaborate UX
experience might permit end users to specify multiple text encodings,
although this seems hard to pull off in an ergonomic way.

Fixes #1
2017-03-12 19:54:48 -04:00
Andrew Gallant
7c37065911 update deps 2017-03-08 20:23:12 -05:00
Andrew Gallant
4e8c0fc4ad bump clap to 2.20.5
Fixes #383
2017-02-25 18:43:13 -05:00
Andrew Gallant
a114b86063 update termcolor dep 2017-02-18 15:09:25 -05:00
Andrew Gallant
d825648b86 Remove lazy_static from globset 2017-02-12 15:37:50 -05:00
Andrew Gallant
fecef10c1c update deps 2017-01-17 19:36:23 -05:00
Andrew Gallant
f5a2d022ec Replace internal atty module with atty crate.
This removes all use of explicit unsafe in ripgrep proper except for
one: accessing the contents of a memory map. (Which may never go away.)
2017-01-15 16:32:30 -05:00
Andrew Gallant
057ed6305a 0.4.0 2017-01-13 23:46:21 -05:00
Andrew Gallant
a7ca2d6563 update same-file dep 2017-01-13 20:01:06 -05:00
Andrew Gallant
c3de1f58ea another bytecount update, weird 2017-01-10 21:13:40 -05:00
Andrew Gallant
e940bc956d update bytecount
Fixes #313
2017-01-10 18:30:16 -05:00
Andrew Gallant
461e0c4e33 Don't search stdout redirected file.
When running ripgrep like this:

    rg foo > output

we must be careful not to search `output` since ripgrep is actively writing
to it. Searching it can cause massive blowups where the file grows without
bound.

While this is conceptually easy to fix (check the inode of the redirection
and the inode of the file you're about to search), there are a few problems
with it.

First, inodes are a Unix thing, so we need a Windows specific solution to
this as well. To resolve this concern, I created a new crate, `same-file`,
which provides a cross platform abstraction.

Second, stat'ing every file is costly. This is not avoidable on Windows,
but on Unix, we can get the inode number directly from directory traversal.
However, this information wasn't exposed, but now it is (through both the
ignore and walkdir crates).

Fixes #286
2017-01-09 16:12:08 -05:00
Andrew Gallant
aed315e80a bump deps 2017-01-03 07:27:51 -05:00
Andrew Gallant
163e00677a Update to regex 0.2. 2017-01-01 01:03:21 -05:00
Andrew Gallant
d58236fbdc bump various versions 2016-12-30 15:44:08 -05:00
Andrew Gallant
b65bb37b14 Remove superfluous memmap dependency in grep crate.
Fixes #295.
2016-12-27 15:46:40 -05:00
Andrew Gallant
de5cb7d22e Remove special ^C handling.
This means that ripgrep will no longer try to reset your colors in your
terminal if you kill it while searching. This could result in messing up
the colors in your terminal, and the fix is to simply run some other
command that resets them for you. For example:

    $ echo -ne "\033[0m"

The reason why the ^C handling was removed is because it is irrevocably
broken on Windows and is impossible to do correctly and efficiently in
ANSI terminals.

Fixes #281
2016-12-24 12:53:09 -05:00
Andrew Gallant
82ceb818f3 update deps 2016-12-24 08:32:32 -05:00
Lilian Anatolie Moraru
cbacf4f19e Update Cargo.lock to bring in clap 2.19.2 fix for ZSH completions. (#287) 2016-12-23 06:47:55 -05:00
Andrew Gallant
de33003527 0.3.2 2016-12-07 10:59:06 -05:00
Andrew Gallant
d66812102b Fix leading hypen bug by updating clap.
Fixes #270
2016-12-06 17:29:34 -05:00
Andrew Gallant
86f8c3c818 update Cargo.lock 2016-12-05 20:15:45 -05:00
Andrew Gallant
7282706b42 Fix bug reading root symlink.
When give an explicit file path on the command line like `foo` where `foo`
is a symlink, ripgrep should follow it even if `-L` isn't set. This is
consistent with the behavior of `foo/`.

Fixes #256
2016-12-05 20:05:57 -05:00
Andrew Gallant
c4a6733f3b 0.3.1 2016-11-21 20:53:52 -05:00
Andrew Gallant
05b26d5986 bump termcolor 2016-11-21 20:33:57 -05:00
Andrew Gallant
aef46beaf2 0.3.0 2016-11-20 16:07:25 -05:00
Andrew Gallant
e8a30cb893 Completely re-work colored output and tty handling.
This commit completely guts all of the color handling code and replaces
most of it with two new crates: wincolor and termcolor. wincolor
provides a simple API to coloring using the Windows console and
termcolor provides a platform independent coloring API tuned for
multithreaded command line programs. This required a lot more
flexibility than what the `term` crate provided, so it was dropped.
We instead switch to writing ANSI escape sequences directly and ignore
the TERMINFO database.

In addition to fixing several bugs, this commit also permits end users
to customize colors to a certain extent. For example, this command will
set the match color to magenta and the line number background to yellow:

    rg --colors 'match:fg:magenta' --colors 'line:bg:yellow' foo

For tty handling, we've adopted a hack from `git` to do tty detection in
MSYS/mintty terminals. As a result, ripgrep should get both color
detection and piping correct on Windows regardless of which terminal you
use.

Finally, switch to line buffering. Performance doesn't seem to be
impacted and it's an otherwise more user friendly option.

Fixes #37, Fixes #51, Fixes #94, Fixes #117, Fixes #182, Fixes #231
2016-11-20 11:14:52 -05:00
Andrew Gallant
92dc402f7f Switch from Docopt to Clap.
There were two important reasons for the switch:

1. Performance. Docopt does poorly when the argv becomes large, which is
   a reasonable common use case for search tools. (e.g., use with xargs)
2. Better failure modes. Clap knows a lot more about how a particular
   argv might be invalid, and can therefore provide much clearer error
   messages.

While both were important, (1) made it urgent.

Note that since Clap requires at least Rust 1.11, this will in turn
increase the minimum Rust version supported by ripgrep from Rust 1.9 to
Rust 1.11. It is therefore a breaking change, so the soonest release of
ripgrep with Clap will have to be 0.3.

There is also at least one subtle breaking change in real usage.
Previous to this commit, this used to work:

    rg -e -foo

Where this would cause ripgrep to search for the string `-foo`. Clap
currently has problems supporting this use case
(see: https://github.com/kbknapp/clap-rs/issues/742),
but it can be worked around by using this instead:

    rg -e [-]foo

or even

    rg [-]foo

and this still works:

    rg -- -foo

This commit also adds Bash, Fish and PowerShell completion files to the
release, fixes a bug that prevented ripgrep from working on file
paths containing invalid UTF-8 and shows short descriptions in the
output of `-h` but longer descriptions in the output of `--help`.

Fixes #136, Fixes #189, Fixes #210, Fixes #230
2016-11-17 19:53:41 -05:00
Andrew Gallant
5462af4434 Pin rustc-serialize to 0.3.19.
See: https://github.com/rust-lang-nursery/rustc-serialize/pull/159
2016-11-09 20:28:58 -05:00
Andrew Gallant
d2e70da040 0.2.9 2016-11-09 19:07:25 -05:00
Andrew Gallant
64dc9b6709 update deps 2016-11-09 18:54:22 -05:00
Andrew Gallant
18943b9317 0.2.8 2016-11-06 16:16:48 -05:00
Andrew Gallant
2daef51fe5 0.2.7 2016-11-06 15:49:25 -05:00
Andrew Gallant
dada75d2a7 Update sub-crate dependency versions. 2016-11-06 15:48:40 -05:00
Andrew Gallant
5bd0edbbe1 Actually use simd/avx optimizations in bytecount crate.
Also update compile script.
2016-11-05 22:44:33 -04:00
Andre Bogus
02de97b8ce Use the bytecount crate for fast line counting.
Fixes #128
2016-11-05 22:29:26 -04:00
Andrew Gallant
b272be25fa Add parallel recursive directory iterator.
This adds a new walk type in the `ignore` crate, `WalkParallel`, which
provides a way for recursively iterating over a set of paths in parallel
while respecting various ignore rules.

The API is a bit strange, as a closure producing a closure isn't
something one often sees, but it does seem to work well.

This also allowed us to simplify much of the worker logic in ripgrep
proper, where MultiWorker is now gone.
2016-11-05 21:45:55 -04:00
Andrew Gallant
1aeae3e22d update ripgrep 2016-11-04 21:12:08 -04:00
Andrew Gallant
d85a6dd5c8 update ignore dependency 2016-10-31 20:01:31 -04:00
Andrew Gallant
c8e2fa1869 update Cargo.lock 2016-10-31 19:54:38 -04:00
Andrew Gallant
1aae2759ad update deps 2016-10-29 22:27:29 -04:00
Brian Campbell
79a8d0ab3f Reset the terminal when Ctrl-C is pressed
If a user hits Ctrl-C to exit out of a search in the middle of printing
a line, we don't want to leave the terminal colors screwed up for them.
Catch Ctrl-C using the ctrlc crate, obtain a stdout lock to ensure that
other threads don't continue writing after we do so, reset the terminal,
and exit the program.

Closes #119
2016-10-29 21:23:05 -04:00
Andrew Gallant
d79add341b Move all gitignore matching to separate crate.
This PR introduces a new sub-crate, `ignore`, which primarily provides a
fast recursive directory iterator that respects ignore files like
gitignore and other configurable filtering rules based on globs or even
file types.

This results in a substantial source of complexity moved out of ripgrep's
core and into a reusable component that others can now (hopefully)
benefit from.

While much of the ignore code carried over from ripgrep's core, a
substantial portion of it was rewritten with the following goals in
mind:

1. Reuse matchers built from gitignore files across directory iteration.
2. Design the matcher data structure to be amenable for parallelizing
   directory iteration. (Indeed, writing the parallel iterator is the
   next step.)

Fixes #9, #44, #45
2016-10-29 20:48:59 -04:00
Andrew Gallant
94d600e6e1 Update deps. 2016-10-16 10:12:49 -04:00
Andrew Gallant
247a9398f4 Switch to thread_local crate in lieu of thread_local!.
This is to work around a bug where using a thread_local! was causing
a segfault on macos.

Fixes #164.
2016-10-11 18:23:49 -04:00
Andrew Gallant
4737326ed3 Update regex-syntax for bug fix.
The bug fix was in expression pretty printing. ripgrep parses the regex
into an AST and may do some modifications to it, which requires the
ability to go from string -> AST -> string' -> AST' where string == string'
implies AST == AST'.

Also, add a regression test for the specific regex that tripped the bug.

Fixes #156.
2016-10-10 22:04:29 -04:00
Andrew Gallant
e96d93034a Finish overhaul of glob matching.
This commit completes the initial move of glob matching to an external
crate, including fixing up cross platform support, polishing the
external crate for others to use and fixing a number of bugs in the
process.

Fixes #87, #127, #131
2016-10-10 19:24:18 -04:00
Andrew Gallant
175406df01 Refactor and test glob sets.
This commit goes a long way toward refactoring glob sets so that the
code is easier to maintain going forward. In particular, it makes the
literal optimizations that glob sets used a lot more structured and much
easier to extend. Tests have also been modified to include glob sets.

There's still a bit of polish work left to do before a release.

This also fixes the immediate issue where large gitignore files were
causing ripgrep to slow way down. While we don't technically fix it for
good, we're a lot better about reducing the number of regexes we
compile. In particular, if a gitignore file contains thousands of
patterns that can't be matched more simply using literals, then ripgrep
will slow down again. We could fix this for good by avoiding RegexSet if
the number of regexes grows too large.

Fixes #134.
2016-10-04 20:28:56 -04:00
Andrew Gallant
fdf24317ac Move glob implementation to new crate.
It is isolated and complex enough that it deserves attention all on its
own. It's also eminently reusable.
2016-09-30 19:42:41 -04:00
Andrew Gallant
316ffd87b3 bump docopt to 0.6.86 2016-09-28 15:56:59 -04:00
Andrew Gallant
de79be2db2 0.2.1 2016-09-26 20:02:58 -04:00
Andrew Gallant
b1c52b52d6 0.2.0 2016-09-25 22:32:14 -04:00
Andrew Gallant
109bc3f78e bump grep to 0.1.3 2016-09-25 22:30:17 -04:00
Andrew Gallant
af4dc78537 Update to docopt 0.6.85.
The new version won't panic if printing to stdout fails.

Fixes #22.
2016-09-24 19:14:19 -04:00
Andrew Gallant
b33e9cba69 0.1.17 2016-09-23 11:26:23 -04:00
Andrew Gallant
25c259112b 0.1.16 2016-09-22 21:32:41 -04:00
Andrew Gallant
2115774c6e 0.1.15 2016-09-22 19:20:11 -04:00
Andrew Gallant
1b14e245be 0.1.14 2016-09-22 17:48:49 -04:00
Andrew Gallant
263e2b012f 0.1.13 2016-09-21 21:07:40 -04:00
Andrew Gallant
525d051172 0.1.12 2016-09-21 20:47:44 -04:00
Andrew Gallant
fe84928c85 0.1.11 2016-09-21 19:37:37 -04:00
Andrew Gallant
c1c92e4fee 0.1.10 2016-09-21 19:27:16 -04:00
Andrew Gallant
b0d8ff6f4a 0.1.9 2016-09-21 16:41:28 -04:00
Andrew Gallant
0263a401f6 0.1.8 2016-09-21 07:08:37 -04:00
Andrew Gallant
f9bff90842 0.1.7 2016-09-20 22:13:49 -04:00
Andrew Gallant
9e2f10b893 0.1.6 2016-09-20 20:25:51 -04:00
Andrew Gallant
e7fb0fd267 0.1.5 2016-09-19 21:56:00 -04:00
Andrew Gallant
6cb604f38f 0.1.3 2016-09-17 12:55:09 -04:00
Andrew Gallant
8f87a4e8ac 0.1.2 2016-09-17 11:36:11 -04:00
Andrew Gallant
d27d3e675f bump grep 2016-09-17 11:34:27 -04:00
Andrew Gallant
e9ec52b7f9 Update walkdir 2016-09-16 17:56:44 -04:00
Andrew Gallant
0d14c74e63 Some minor performance tweaks.
This includes moving basename-only globs into separate regexes. The hope
is that if the regex processes less input, it will be faster.
2016-09-16 16:13:28 -04:00
Andrew Gallant
0e46171e3b Rework glob sets.
We try to reduce the pressure on regexes and offload some of it to
Aho-Corasick or exact lookups.
2016-09-15 22:06:04 -04:00
Andrew Gallant
c24f8fd50f Replace crossbeam with deque.
deque appears faster.
2016-09-14 07:40:46 -04:00
Andrew Gallant
4212a8b9cb 0.1.1 2016-09-13 21:21:45 -04:00
Andrew Gallant
983c7fd6f9 We don't use thread_local any more, so remove it. 2016-09-13 21:21:36 -04:00
Andrew Gallant
cf3a33cea7 commit Cargo.lock 2016-09-11 19:06:05 -04:00