This includes, but is not limited to, UTF-16, latin-1, GBK, EUC-JP and
Shift_JIS. (Courtesy of the `encoding_rs` crate.)
Specifically, this feature enables ripgrep to search files that are
encoded in an encoding other than UTF-8. The list of available encodings
is tied directly to what the `encoding_rs` crate supports, which is in
turn tied to the Encoding Standard. The full list of available encodings
can be found here: https://encoding.spec.whatwg.org/#concept-encoding-get
This pull request also introduces the notion that text encodings can be
automatically detected on a best effort basis. Currently, the only
support for this is checking for a UTF-16 bom. In all other cases, a
text encoding of `auto` (the default) implies a UTF-8 or ASCII
compatible source encoding. When a text encoding is otherwise specified,
it is unconditionally used for all files searched.
Since ripgrep's regex engine is fundamentally built on top of UTF-8,
this feature works by transcoding the files to be searched from their
source encoding to UTF-8. This transcoding only happens when:
1. `auto` is specified and a non-UTF-8 encoding is detected.
2. A specific encoding is given by end users (including UTF-8).
When transcoding occurs, errors are handled by automatically inserting
the Unicode replacement character. In this case, ripgrep's output is
guaranteed to be valid UTF-8 (excluding non-UTF-8 file paths, if they
are printed).
In all other cases, the source text is searched directly, which implies
an assumption that it is at least ASCII compatible, but where UTF-8 is
most useful. In this scenario, encoding errors are not detected. In this
case, ripgrep's output will match the input exactly, byte-for-byte.
This design may not be optimal in all cases, but it has some advantages:
1. In the happy path ("UTF-8 everywhere") remains happy. I have not been
able to witness any performance regressions.
2. In the non-UTF-8 path, implementation complexity is kept relatively
low. The cost here is transcoding itself. A potentially superior
implementation might build decoding of any encoding into the regex
engine itself. In particular, the fundamental problem with
transcoding everything first is that literal optimizations are nearly
negated.
Future work should entail improving the user experience. For example, we
might want to auto-detect more text encodings. A more elaborate UX
experience might permit end users to specify multiple text encodings,
although this seems hard to pull off in an ergonomic way.
Fixes#1
The --max-filesize option allows filtering files which are larger than
the specified limit. This is potentially useful if one is attempting to
search a number of large files without common file-types/suffixes.
See #369.
Previously, ripgrep would only emit the 'bold' ANSI escape sequence if
no foreground or background color was set. Instead, it would convert colors
to their "intense" versions if bold was set. The intent was to do the same
thing on Windows and Unix. However, this had a few negative side effects:
1. Omitting the 'bold' ANSI escape when 'bold' was set is surprising.
2. Intense colors can look quite bad and be hard to read.
To fix this, we introduce a new setting called 'intense' in the --colors
flag, and thread that down through to the public API of the `termcolor`
crate. The 'intense' setting has environment specific behavior:
1. In ANSI mode, it will convert the selected color to its "intense"
variant.
2. In the Windows console, it will make the text "intense."
There is no longer any "smart" handling of the 'bold' style. The 'bold'
ANSI escape is always emitted when it is selected. In the Windows
console, the 'bold' setting now has no effect. Note that this is a
breaking change.
Fixes#266, #293
This commit completely guts all of the color handling code and replaces
most of it with two new crates: wincolor and termcolor. wincolor
provides a simple API to coloring using the Windows console and
termcolor provides a platform independent coloring API tuned for
multithreaded command line programs. This required a lot more
flexibility than what the `term` crate provided, so it was dropped.
We instead switch to writing ANSI escape sequences directly and ignore
the TERMINFO database.
In addition to fixing several bugs, this commit also permits end users
to customize colors to a certain extent. For example, this command will
set the match color to magenta and the line number background to yellow:
rg --colors 'match:fg:magenta' --colors 'line:bg:yellow' foo
For tty handling, we've adopted a hack from `git` to do tty detection in
MSYS/mintty terminals. As a result, ripgrep should get both color
detection and piping correct on Windows regardless of which terminal you
use.
Finally, switch to line buffering. Performance doesn't seem to be
impacted and it's an otherwise more user friendly option.
Fixes#37, Fixes#51, Fixes#94, Fixes#117, Fixes#182, Fixes#231
There were two important reasons for the switch:
1. Performance. Docopt does poorly when the argv becomes large, which is
a reasonable common use case for search tools. (e.g., use with xargs)
2. Better failure modes. Clap knows a lot more about how a particular
argv might be invalid, and can therefore provide much clearer error
messages.
While both were important, (1) made it urgent.
Note that since Clap requires at least Rust 1.11, this will in turn
increase the minimum Rust version supported by ripgrep from Rust 1.9 to
Rust 1.11. It is therefore a breaking change, so the soonest release of
ripgrep with Clap will have to be 0.3.
There is also at least one subtle breaking change in real usage.
Previous to this commit, this used to work:
rg -e -foo
Where this would cause ripgrep to search for the string `-foo`. Clap
currently has problems supporting this use case
(see: https://github.com/kbknapp/clap-rs/issues/742),
but it can be worked around by using this instead:
rg -e [-]foo
or even
rg [-]foo
and this still works:
rg -- -foo
This commit also adds Bash, Fish and PowerShell completion files to the
release, fixes a bug that prevented ripgrep from working on file
paths containing invalid UTF-8 and shows short descriptions in the
output of `-h` but longer descriptions in the output of `--help`.
Fixes#136, Fixes#189, Fixes#210, Fixes#230