I did this in the course of trying to optimize it. I don't believe I
made it any faster, but the refactoring led to code that I think is
more readable.
Many of these functions should be inlineable, but I'm not 100% sure
that they can be inlined without these annotations. We don't want to
force things, but we do try and nudge the compiler in the right
direction.
It seems like a trifle, but if the match frequency is high enough, the
allocation+formatting of line numbers (and columns and byte offsets)
starts to matter. We squash that part of the profile in this commit by
doing our own decimal formatting. I speculate that we get a speed-up
from this by avoiding the formatting machinery and also a possible
allocation.
An alternative would be to use the `itoa` crate, and it is indeed
marginally faster in ad hoc benchmarks, but I'm satisfied enough with
this solution.
ripgrep does not, and likely never will, report which pattern matched.
Because of that, we can dedup the patterns via just their concrete
syntax without any fuss.
This is somewhat of a pathological case because you don't expect the end
user to pass duplicate patterns in general. But if the end user
generated a list of, say, names and did not dedup them, then ripgrep
could end up spending a lot of extra time on those duplicates if there
are many of them. By deduping them explicitly in the application, we
essentially remove their extra cost completely.
It was apparently using a format specific to a particular plugin. I did
know that, but apparently the plugin is not ubiquitous or de facto
standard[1]. Thus, including it I think just leads to more confusion. We
definitely do not want to be in the business of bundling aliases for
every conceivable plugin to different editors, so just drop it. We
expose the ability to write your own format for exactly this sort of
reason.
[1]: https://github.com/BurntSushi/ripgrep/discussions/2611#discussioncomment-7138302
In the time before, we just used a RegexSet from the regex crate. That
allocated unconditionally, so there was nothing we could do and it
didn't expose any APIs to reuse that memory. But now that we're using
the lower level regex-automata, we can reuse a PatternSet.
Ideally we would just provide a way for the caller to build a PatternSet
(perhaps via an opaque type) so that we don't have to shuffle data into
a PatternSet and then back into the caller's `Vec<usize>`. But this at
least avoids allocating for every search.
This brings the code in line with my current style. It also inlines the
dozen or so lines of code for FNV hashing instead of bringing in a
micro-crate for it. Finally, it drops the dependency on regex in favor
of using regex-syntax and regex-automata directly.
This essentially takes the work done in #2483 and does a bit of a
facelift. A brief summary:
* We reduce the hyperlink API we expose to just the format, a
configuration and an environment.
* We move buffer management into a hyperlink-specific interpolator.
* We expand the documentation on --hyperlink-format.
* We rewrite the hyperlink format parser to be a simple state machine
with support for escaping '{{' and '}}'.
* We remove the 'gethostname' dependency and instead insist on the
caller to provide the hostname. (So grep-printer doesn't get it
itself, but the application will.) Similarly for the WSL prefix.
* Probably some other things.
Overall, the general structure of #2483 was kept. The biggest change is
probably requiring the caller to pass in things like a hostname instead
of having the crate do it. I did this for a couple reasons:
1. I feel uncomfortable with code deep inside the printing logic
reaching out into the environment to assume responsibility for
retrieving the hostname. This feels more like an application-level
responsibility. Arguably, path canonicalization falls into this same
bucket, but it is more difficult to rip that out. (And we can do it
in the future in a backwards compatible fashion I think.)
2. I wanted to permit end users to tell ripgrep about their system's
hostname in their own way, e.g., by running a custom executable. I
want this because I know at least for my own use cases, I sometimes
log into systems using an SSH hostname that is distinct from the
system's actual hostname (usually because the system is shared in
some way or changing its hostname is not allowed/practical).
I think that's about it.
Closes#665, Closes#2483
I originally did not put PathPrinter into grep-printer because I
considered it somewhat extraneous to what a "grep" program does, and
also that its implementation was rather simple. But now with hyperlink
support, its implementation has grown a smidge more complicated. And
more importantly, its existence required exposing a lot more of the
hyperlink guts. Without it, we can keep things like HyperlinkPath and
HyperlinkSpan completely private.
We can now also keep `PrinterPath` completely private as well. And this
is a breaking change.
Like a previous commit did for the grep-cli crate, this does some
polishing to the grep-printer crate. We aren't able to achieve as much
as we did with grep-cli, but we at least eliminate all rust-analyzer
lints and group imports in the way I've been doing recently.
Next we'll start doing some more invasive changes.
This will enable us to query for the current system's hostname in both
Unix and Windows environments.
We could have pulled in the 'gethostname' crate for this, but:
1. I'm not a huge fan of micro-crates.
2. The 'gethostname' crate panics if an error occurs. (Which, to be
fair, an error should never occur, but it seems plausible on borked
systems? ripgrep runs in a lot of places, so I'd rather not take the
chance of a panic bringing down ripgrep for an optional convenience
feature.)
3. The 'gethostname' crate uses the 'windows-targets' crate from
Microsoft. This is arguably the "right" thing to do, but ripgrep
doesn't use them yet and they appear high-churn.
So I just added a safe wrapper to do this to winapi-util[1] and then
inlined the Unix version here. This brings in no extra dependencies and
the routine is fallible so that callers can recover from potentially
strange failures.
[1]: https://github.com/BurntSushi/winapi-util/pull/14
This does a variety of polishing.
1. Deprecate the tty methods in favor of std's IsTerminal trait.
2. Trim down un-needed dependencies.
3. Use bstr to implement escaping.
4. Various aesthetic polishing.
I'm doing this as prep work before adding more to this crate. And as
part of a general effort toward reducing ripgrep's dependencies.
This commit represents the initial work to get hyperlinks working and
was submitted as part of PR #2483. Subsequent commits largely retain the
functionality and structure of the hyperlink support added here, but
rejigger some things around.
This represents yet another iteration on how `ignore` enqueues and
distributes work in parallel. The original implementation used a
multi-producer/multi-consumer thread safe queue from crossbeam. At some
point, I migrated to a simple `Arc<Mutex<Vec<_>>>` and treated it as a
stack so that we did depth first traversal. This helped with memory
usage in very wide directories.
But it turns out that a naive stack-behind-a-mutex can be quite a bit
slower than something that's a little smarter, such as a work-stealing
stack used in this commit. My hypothesis for why this helps is that
without the stealing component, work distribution can get stuck in
sub-optimal configurations that depend on which directory entries get
assigned to a particular worker. It's likely that this can result in
some workers getting "more" work than others, just by chance, and thus
remain idle. But the work-stealing approach heads that off.
This does re-introduce a dependency on parts of crossbeam which is kind
of a bummer, but it's carrying its weight for now.
Closes#1823, Closes#2591
Ref https://github.com/sharkdp/fd/issues/28
When searching subdirectories the path was not correctly built and
included duplicate parts. This fix will remove the duplicate part if
possible.
Fixes#1757, Closes#2295
As of the memchr 2.6 release, its Iterator::count method is specialized
to only count the number of occurrences instead of finding the offset of
each occurrence. This replaces ripgrep's use of the bytecount crate.
While micro-benchmarks suggest that memchr's method has better
throughput than bytecount, it turned out to be an illusion. Namely, on a
~13GB haystack prior to this change:
$ time rg-bytecount 'You killed my friend, my best friend, my lifelong friend!' OpenSubtitles2018.raw.en --line-number
441450441:- You killed my friend, my best friend, my lifelong friend!
real 1.473
user 1.186
sys 0.286
maxmem 12512 MB
faults 0
And then after:
$ time rg 'You killed my friend, my best friend, my lifelong friend!' OpenSubtitles2018.raw.en --line-number
441450441:- You killed my friend, my best friend, my lifelong friend!
real 1.532
user 1.280
sys 0.250
maxmem 12512 MB
faults 0
But perf is just about in the same ballpark. That's good enough for me
at the moment in order to drop the extra dependency.
I did this because the marginal cost of adding the Iterator::count()
specialization to memchr was extremely small.
We currently implement globs by converting them to regexes, and in doing
so, sometimes use grouping. In all but one case, we used non-capturing
groups. But for alternations, we used capturing groups, which was likely
just an oversight. We don't make use of capture groups at all, and while
they usually don't have any overhead, they lead to weird cases like this
one: https://github.com/rust-lang/regex/issues/1059
That particular issue is also a bug in the regex crate itself, which is
fixed in https://github.com/rust-lang/regex/pull/1062. Note though that
the bug fix in the regex crate is required. Even with this patch to
globset, memory usage is reduced (by about half in rust-lang/regex#1059)
but is not returned to where it was prior to the regex 1.9 release.
It turns out our fast path for -w/--word-regexp wasn't quite correct in
some cases. Namely, we use `(?m:^|\W)(<original-regex>)(?m:\W|$)` as the
implementation of -w/--word-regexp since `\b(<original-regex>)\b` has
some unintuitive results in certain cases, specifically when
<original-regex> matches non-word characters at match boundaries.
The problem is that using this formulation means that you need to
extract the capture group around <original-regex> to find the "real"
match, since the surrounding (^|\W) and (\W|$) aren't part of the match.
This is fine, but the capture group engine is usually slow, so we have a
fast path where we try to deduce the correct match boundary after an
initial match (before running capture groups). The problem is that doing
this is rather tricky because it's hard to know, in general, whether the
`^` or the `\W` matched.
This still doesn't seem quite right overall, but we at least fix one
more case.
Fixes#2574
This PR adds `*.bat` and `*.cmd` file types.
In doing so, it makes a distinction between batch files (old standard
from the MS-DOS era) and command scripts (new flavor - can operate on
batch files, although `*.cmd` is preferred for various reasons, the
main one being batch files will set `ERRORLEVEL` following inconsistent
MS-DOS style rules[1]).
PR #2556
[1]: https://groups.google.com/g/microsoft.public.win2000.cmdprompt.admin/c/XHeUq8oe2wk/m/LIEViGNmkK0J#i106
Previously, sorting worked by sorting the parents and then sorting the
children within each parent. This was done during traversal, but it only
works when sorting parents preserves the overall order. This generally
only works for '--sort path' in ascending order.
This commit fixes the rest of the sorting behavior by collecting all of
the paths to search and then sorting them before searching. We only
collect all of the paths when sorting was requested.
Fixes#2243, Closes#2361
This causes ripgrep to stop searching an individual file after it has
found a non-matching line. But this only occurs after it has found a
matching line.
Fixes#1790, Closes#1930
Adds a new eprintln_locked macro which locks STDOUT before logging
to STDERR. This patch also replaces instances of eprintln with
eprintln_locked to avoid interleaving lines.
Fixes#1941, Closes#1968
Previously, we were only doing a binary existence check on Windows. And
in fact, the main point there wasn't binary existence, but ensuring we
didn't accidentally resolve a binary name relative to the CWD, which
could result in executing a program one didn't mean to run.
However, it is useful to be able to check whether a binary exists on any
platform when associating a glob with a binary. If the binary doesn't
exist, then the association can fail eagerly and let some other glob
apply.
Closes#1946
We also make py/python, md/markdown and ts/typescript aliases of one
another.
Note that this only introduces aliases at the point where default types
are defined. This just makes them a bit easier to read/write, and also
makes it easier to expose more names that describe the same thing.
Fixes#1857, Closes#1895
*.adb and *.ads are the usual extensions for Ada source code,
and *.gpr indicates a GPRbuild project file used for Ada, and
these days often being combined with alire for package dependency
resolution. Alire stores a bunch of files named alire.toml in
different directories in your (gitignored) cache/dependencies/...
Closes#2013
Eclasses are "ebuild libraries" and generally if you're filtering
for/filtering out an ebuild/eclass, you don't want the other either.
Followup to 4dfea016b9Closes#2437
This matches the behavior of GNU grep which does not ignore
before-context and after-context completely if the context flag is also
provided.
Note that this change wasn't done just to match GNU grep. In this case,
GNU grep has the more sensible behavior.
Fixes#2288, Closes#2451
I have reservations about this, but it looks useful and doesn't seem
terribly onerous to support. The `ignore` crate will really always need
to have some kind of logic supporting this in some form I think.
Closes#2482
This removes most of the Unicode features as they aren't currently
used. We can always add them back later if necessary.
We can avoid the unicode-perl feature by changing `\s` to `[[:space:]]`,
which uses the ASCII-only definition of `\s`. Since we don't expect
non-ASCII whitespace in git config files, this seems okay.
Closes#2502
GraphQL file extensions: .graphql and .graphqls (schema)
We could also add `.gql`, but perhaps it's less correct to do so. We'll
start conservatively here, and we can always add `.gql` later.
Closes#2439, Closes#2508
When `resolve_binary()` attempts to resolve a path to a program on
Windows while searching for a program in `PATH` without an extension,
`ripgrep` will assume the extension of the file to be `.exe` as it's
the *de facto* standard, which will work most (99.99%) of the time...
...unless the binary is a COM executable (we're on Windows, duh).
Closes#2523
This is mostly a copy of the prefix literal extractor in regex-syntax,
but with a tweaked notion of Seq that keeps track of whether it's a
prefix of an expression or not. If it isn't, then we can't cross it as a
suffix to another Seq.
This new extractor should be a lot more robust than the old one. We
actually will keep going through the regex to try and find the "best"
literals to search for (according to some heuristic).
This makes it easier to enable the `logging` feature for regex-automata.
I wish I could just enable it unconditionally, but it winds up producing
a lot of output because ripgrep uses regexes for things other than the
primary search (like every glob). Sigh.
This does a little bit of refactoring so that we can pass both a
ConfiguredHIR and a Regex to the inner literal extraction routine.
One downside of this approach is that a regex object hangs on to a
ConfiguredHIR. But the extra memory usage is probably negligible. A
benefit though is that converting the HIR to its concrete syntax is now
lazy and only happens when logging is enabled.
This increases the limits a bit for when the regex engine will build and
use a fully compiled DFA. They can faster in some circumstances. For
example, '(?-u)^\w{30,}$' gets a nice speed boost from state
acceleration.
We are also able to remove `regex` proper as a dependency. Wow.
Previously, ripgrep core was responsible for escaping regex patterns and
implementing the --line-regexp flag. This commit moves that
responsibility down into the matchers such that ripgrep just needs to
hand the patterns it gets off to the matcher builder. The builder will
then take care of escaping and all that.
This was done to make pattern construction completely owned by the
matcher builders. With the arrival regex-automata, this means we can
move to the HIR very quickly and then never move back to the concrete
syntax. We can then build our regex directly from the HIR. This overall
can save quite a bit of time, especially when searching for large
dictionaries.
We still aren't quite as fast as GNU grep when searching something on
the scale of /usr/share/dict/words, but we are basically within spitting
distance. Prior to this, we were about an order of magnitude slower.
This architecture in particular lets us write a pretty simple fast path
that avoids AST parsing and HIR translation entirely: the case where one
is just searching for a literal. In that case, we can hand construct the
HIR directly.
0.2.4 updates to PCRE2 10.42 and has a few other nice changes. For
example, when `utf` is enabled, the crate will always set the
PCRE2_MATCH_INVALID_UTF option. That means we no longer need to do
transcoding or UTF-8 validity checks.
Because of this, we actually get to remove one of the two uses of
`unsafe` in ripgrep's `main` program.
(This also updates a couple other dependencies for convenience.)
Just some small polishing. We also get rid of thread_local in favor of
using regex-automata, mostly just in the name of reducing dependencies.
(We should eventually be able to drop thread_local completely.)
The verbatim literal stuff hasn't been used for a while and I don't
foresee it being used. If it's really needed, it would probably better
to just implement it by looking at the pattern string itself, which
avoids parsing it into an AST altogether.
Now that Rust's regex crate finally supports a CRLF mode, we can remove
this giant hack in ripgrep to enable it. (And assuredly did not work in
all cases.)
The way this works in the regex engine is actually subtly different than
what ripgrep previously did. Namely, --crlf would previously treat
either \r\n or \n as a line terminator. But now it treats \r\n, \n and
\r as line terminators. In effect, it is implemented by treating \r and
\n as line terminators, but ^ and $ will never match at a position
between a \r and a \n.
So basically this means that $ will end up matching in more cases than
it might be intended too, but I don't expect this to be a big problem in
practice.
Note that passing --crlf to ripgrep and enabling CRLF mode in the regex
via the `R` inline flag (e.g., `(?R:$)`) are subtly different. The `R`
flag just controls the regex engine, but --crlf instructs all of ripgrep
to use \r\n as a line terminator. There are likely some inconsistencies
or corner cases that are wrong as a result of this cognitive dissonance,
but we choose to leave well enough alone for now.
Fixing this for real will probably require re-thinking how line
terminators are handled in ripgrep. For example, one "problem" with how
they're handled now is that ripgrep will re-insert its own line
terminators when printing output instead of copying the input. This is
maybe not so great and perhaps unexpected. (ripgrep probably can't get
away with not inserting any line terminators. Users probably expect
files that don't end with a line terminator whose last line matches to
have a line terminator inserted.)
This leaves the grep-regex crate in tatters. Pretty much the entire
thing needs to be re-worked. The upshot is that it should result in some
big simplifications. I hope.
The idea here is to drop down and actually use regex-automata 0.3
instead of the regex crate itself.
memmap2 v0.3.0 introduced a regression when trying to map files larger than 4GB
on 32-bit architectures[1] which was subsequently fixed in v0.3.1[2].
This commit bumps locked version of the memmap2 dependency to the current v0.5.0
and reverts fdfc418be5 to re-enable mmap on 32-bit
architectures as a different approach to fixing [3].
This was tested to report matches from the end of a 5GB file using MinGW and Wine.
Ref #1911, PR #2000
[1] 5e271224c8
[2] 9aa838aed9
[3] https://github.com/BurntSushi/ripgrep/issues/1911
This crate is only a shim over a bunch of other crates. I'm not sure
that there's anything to add to each of the `pub extern` items. So
instead of just writing fluff, I removed the lint.
Fixes#2516
This adds some lesser known extensions.
Notably, it adds php7 and php8, but not php6. Apparently,
php6 was never a thing: https://wiki.php.net/rfc/php6
PR #2263
It turns out that if there are text anchors (that is, \A or \z, or ^/$
when multi-line is disabled), then the "fast" line searching path isn't
quite correct. Since searching without multi-line mode is exceptionally
rare, we just look for the presence of text anchors and specifically
disable the line terminator option in 'grep-regex'. This in turn
inhibits the "fast" line searching path.
Fixes#2260
When a glob pattern ended with a \/, and since we permit backslash
escapes, the glob parser gave a "dangling escape" error. Which is weird,
because the \ is clearly not dangling.
The issue is that the layer above the glob parser, the gitignore parser,
was stripping the trailing / so that it wouldn't be part of the matching
logic. Of course, stripping the trailing / while it is escaped without
removing the backslash escape is wrong. So we do that here.
Fixes#2236