1
0
mirror of https://github.com/BurntSushi/ripgrep.git synced 2025-01-08 13:23:34 +02:00
Commit Graph

43 Commits

Author SHA1 Message Date
Andrew Gallant
082245dadb cli: replace clap with lexopt and supporting code
ripgrep began it's life with docopt for argument parsing. Then it moved
to Clap and stayed there for a number of years. Clap has served ripgrep
well, and it probably could continue to serve ripgrep well, but I ended
up deciding to move off of it.

Why?

The first time I had the thought of moving off of Clap was during the
2->3->4 transition. I thought the 3.x and 4.x releases were great, but
for me, it ended up moving a little too quickly. Since the release of
4.x was telegraphed around when 3.x came out, I decided to just hold off
and wait to migrate to 4.x instead of doing a 3.x migration followed
shortly by another 4.x migration. Of course, I just never ended up doing
the migration at all. I never got around to it and there just wasn't a
compelling reason for me to upgrade. While I never investigated it, I
saw an upgrade as a non-trivial amount of work in part because I didn't
encapsulate the usage of Clap enough.

The above is just what got me started thinking about it. It wasn't
enough to get me to move off of it on its own. What ended up pushing me
over the edge was a combination of factors:

* As mentioned above, I didn't want to run on the migration treadmill.
This has proven to not be much of an issue, but at the time of the
2->3->4 releases, I didn't know how long Clap 4.x would be out before a
5.x would come out.
* The release of lexopt[1] caught my eye. IMO, that crate demonstrates
exactly how something new can arrive on the scene and just thoroughly
solve a problem minimalistically. It has the docs, the reasoning, the
simple API, the tests and good judgment. It gets all the weird corner
cases right that Clap also gets right (and is part of why I was
originally attracted to Clap).
* I have an overall desire to reduce the size of my dependency tree. In
part because a smaller dependency tree tends to correlate with better
compile times, but also in part because it reduces my reliance and trust
on others. It lets me be the "master" of ripgrep's destiny by reducing
the amount of behavior that is the result of someone else's decision
(whether good or bad).
* I perceived that Clap solves a more general problem than what I
actually need solved. Despite the vast number of flags that ripgrep has,
its requirements are actually pretty simple. We just need simple
switches and flags that support one value. No multi-value flags. No
sub-commands. And probably a lot of other functionality that Clap has
that makes it so flexible for so many different use cases. (I'm being
hand wavy on the last point.)

With all that said, perhaps most importantly, the future of ripgrep
possibly demands a more flexible CLI argument parser. In today's world,
I would really like, for example, flags like `--type` and `--type-not`
to be able to accumulate their repeated values into a single sequence
while respecting the order they appear on the CLI. For example, prior
to this migration, `rg regex-automata -Tlock -ttoml` would not return
results in `Cargo.lock` in this repository because the `-Tlock` always
took priority even though `-ttoml` appeared after it. But with this
migration, `-ttoml` now correctly overrides `-Tlock`. We would like to
do similar things for `-g/--glob` and `--iglob` and potentially even
now introduce a `-G/--glob-not` flag instead of requiring users to use
`!` to negate a glob. (Which I had done originally to work-around this
problem.) And some day, I'd like to add some kind of boolean matching to
ripgrep perhaps similar to how `git grep` does it. (Although I haven't
thought too carefully on a design yet.) In order to do that, I perceive
it would be difficult to implement correctly in Clap.

I believe that this last point is possible to implement correctly in
Clap 2.x, although it is awkward to do so. I have not looked closely
enough at the Clap 4.x API to know whether it's still possible there. In
any case, these were enough reasons to move off of Clap and own more of
the argument parsing process myself.

This did require a few things:

* I had to write my own logic for how arguments are combined into one
single state object. Of course, I wanted this. This was part of the
upside. But it's still code I didn't have to write for Clap.
* I had to write my own shell completion generator.
* I had to write my own `-h/--help` output generator.
* I also had to write my own man page generator. Well, I had to do this
with Clap 2.x too, although my understanding is that Clap 4.x supports
this. With that said, without having tried it, my guess is that I
probably wouldn't have liked the output it generated because I
ultimately had to write most of the roff by hand myself to get the man
page I wanted. (This also had the benefit of dropping the build
dependency on asciidoc/asciidoctor.)

While this is definitely a fair bit of extra work, it overall only cost
me a couple days. IMO, that's a good trade off given that this code is
unlikely to change again in any substantial way. And it should also
allow for more flexible semantics going forward.

Fixes #884, Fixes #1648, Fixes #1701, Fixes #1814, Fixes #1966

[1]: https://docs.rs/lexopt/0.3.0/lexopt/index.html
2023-11-20 23:51:53 -05:00
Andrew Gallant
f7ff34fdf9 searcher: simplify 'replace_bytes' routine
I did this in the course of trying to optimize it. I don't believe I
made it any faster, but the refactoring led to code that I think is
more readable.
2023-10-09 20:29:52 -04:00
Andrew Gallant
d53b7310ee searcher: polish
This updates some dependencies and brings code style in line with my
current practice.
2023-10-09 20:29:52 -04:00
Andrew Gallant
3ad7a0d95e crates: remove hard-coded links
And use rustdoc's native intra-crate links. So much nicer.
2023-10-09 20:29:52 -04:00
Andrew Gallant
6cdb99ea61
deps: drop bytecount in favor of memchr_iter(..).count()
As of the memchr 2.6 release, its Iterator::count method is specialized
to only count the number of occurrences instead of finding the offset of
each occurrence. This replaces ripgrep's use of the bytecount crate.
While micro-benchmarks suggest that memchr's method has better
throughput than bytecount, it turned out to be an illusion. Namely, on a
~13GB haystack prior to this change:

    $ time rg-bytecount 'You killed my friend, my best friend, my lifelong friend!' OpenSubtitles2018.raw.en --line-number
    441450441:- You killed my friend, my best friend, my lifelong friend!

    real    1.473
    user    1.186
    sys     0.286
    maxmem  12512 MB
    faults  0

And then after:

    $ time rg 'You killed my friend, my best friend, my lifelong friend!' OpenSubtitles2018.raw.en --line-number
    441450441:- You killed my friend, my best friend, my lifelong friend!

    real    1.532
    user    1.280
    sys     0.250
    maxmem  12512 MB
    faults  0

But perf is just about in the same ballpark. That's good enough for me
at the moment in order to drop the extra dependency.

I did this because the marginal cost of adding the Iterator::count()
specialization to memchr was extremely small.
2023-09-02 12:25:34 -04:00
Edoardo Pirovano
6d95c130d5 cli: add --stop-on-nonmatch flag
This causes ripgrep to stop searching an individual file after it has
found a non-matching line. But this only occurs after it has found a
matching line.

Fixes #1790, Closes #1930
2023-07-08 18:52:42 -04:00
Andrew Gallant
a68db3ac02 deps: drop temporary patch and move to bstr 1.6
Now that regex 1.9 is out, we can depend on it from crates.io.
2023-07-05 14:04:29 -04:00
Adam Reichold
803c447845
searcher: re-enable mmap on 32-bit architectures
memmap2 v0.3.0 introduced a regression when trying to map files larger than 4GB
on 32-bit architectures[1] which was subsequently fixed in v0.3.1[2].

This commit bumps locked version of the memmap2 dependency to the current v0.5.0
and reverts fdfc418be5 to re-enable mmap on 32-bit
architectures as a different approach to fixing [3].

This was tested to report matches from the end of a 5GB file using MinGW and Wine.

Ref #1911, PR #2000 

[1] 5e271224c8
[2] 9aa838aed9
[3] https://github.com/BurntSushi/ripgrep/issues/1911
2023-05-19 08:23:53 -04:00
Andrew Gallant
120e55e7c7
grep-searcher-0.1.11 2023-01-05 09:07:09 -05:00
Andrew Gallant
180c4eaf8b
deps: update to grep-regex 0.1.11 2023-01-05 09:05:39 -05:00
Andrew Gallant
bcc7473a87
deps: update to grep-matcher 0.1.6 2023-01-05 09:02:40 -05:00
Andrew Gallant
ac8fecbbf2
deps: upgrade bstr to 1.1 2023-01-05 08:21:15 -05:00
Andrew Gallant
33b81cac48
grep-searcher-0.1.10 2022-07-15 10:05:46 -04:00
Andrew Gallant
6a13a4f64d
searcher: remove stray 'dbg!' 2022-07-15 10:05:20 -04:00
Andrew Gallant
78a35d4d43
grep-searcher-0.1.9 2022-07-15 10:02:24 -04:00
Andrew Gallant
a933d0bc90
searcher: bump grep-regex dep to 0.1.10 2022-07-15 10:02:06 -04:00
Andrew Gallant
8e57989cd2
regex: fix matching bug when text anchors are used
It turns out that if there are text anchors (that is, \A or \z, or ^/$
when multi-line is disabled), then the "fast" line searching path isn't
quite correct. Since searching without multi-line mode is exceptionally
rare, we just look for the presence of text anchors and specifically
disable the line terminator option in 'grep-regex'. This in turn
inhibits the "fast" line searching path.

Fixes #2260
2022-07-15 09:53:39 -04:00
Alex Touchet
36d03b4101
cargo: use SPDX license format for all crates
This was done for the main crate in d11a3b3377.

See also #987.

PR #2204
2022-05-09 07:52:11 -04:00
Andrew Gallant
ced5b92aa9 deps: bump memmap2 to 0.5
Looking at the memmap2 CHANGELOG, there don't appear to be any breaking
changes that impact us.
2022-03-21 08:59:05 -04:00
Andrew Gallant
5370064f00 warnings: remove/tweak some dead code
It looks like the dead code detector got better, so do a little code
cleanup.
2022-03-21 08:59:05 -04:00
Andrew Gallant
fdfc418be5
searcher: disable mmap searching on non-64 bit
It looks like it's possible for mmap to succeed on 32-bit systems even
when the full file can't be addressed in memory. This used to work prior
to ripgrep 13, but (maybe) something about statically linking vcruntime
has caused this to now fail.

It's no big deal to disable mmap searching on 32-bit, so we just do that
instead of returning incorrect results.

Fixes #1911
2021-06-26 12:53:59 -04:00
Andrew Gallant
dd47582619
grep-searcher-0.1.8 2021-06-12 08:06:58 -04:00
Andrew Gallant
a34df1f690
deps/regex: update minimal versions 2021-06-12 08:05:36 -04:00
Andrew Gallant
6331a7ac18
deps/matcher: update minimal versions 2021-06-12 08:03:47 -04:00
Andrew Gallant
0ee85a89f5
deps: update to memmap2
Looking at the changelog for memmap2, the only breaking change was to
MmapOptions, which we don't use. So no migration is needed.
2021-06-12 07:53:42 -04:00
Andrew Gallant
e824531e38 edition: manual changes
This is mostly just about removing 'extern crate' everywhere and fixing
the fallout.
2021-06-01 21:07:37 -04:00
Andrew Gallant
af54069c51 edition: run 'cargo fix --edition --edition-idioms --all' 2021-06-01 21:07:37 -04:00
Andrew Gallant
77a9e99964 edition: set edition=2018 2021-06-01 21:07:37 -04:00
Andrew Gallant
459a9c5637 edition: initial 'cargo fix --edition' run 2021-06-01 21:07:37 -04:00
Andrew Gallant
efd9cfb2fc grep: fix bugs in handling multi-line look-around
This commit hacks in a bug fix for handling look-around across multiple
lines. The main problem is that by the time the matching lines are sent
to the printer, the surrounding context---which some look-behind or
look-ahead might have matched---could have been dropped if it wasn't
part of the set of matching lines. Therefore, when the printer re-runs
the regex engine in some cases (to do replacements, color matches, etc
etc), it won't be guaranteed to see the same matches that the searcher
found.

Overall, this is a giant clusterfuck and suggests that the way I divided
the abstraction boundary between the printer and the searcher is just
wrong. It's likely that the searcher needs to handle more of the work of
matching and pass that info on to the printer. The tricky part is that
this additional work isn't always needed. Ultimately, this means a
serious re-design of the interface between searching and printing. Sigh.

The way this fix works is to smuggle the underlying buffer used by the
searcher through into the printer. Since these bugs only impact
multi-line search (otherwise, searches are only limited to matches
across a single line), and since multi-line search always requires
having the entire file contents in a single contiguous slice (memory
mapped or on the heap), it follows that the buffer we pass through when
we need it is, in fact, the entire haystack. So this commit refactors
the printer's regex searching to use that buffer instead of the intended
bundle of bytes containing just the relevant matching portions of that
same buffer.

There is one last little hiccup: PCRE2 doesn't seem to have a way to
specify an ending position for a search. So when we re-run the search to
find matches, we can't say, "but don't search past here." Since the
buffer is likely to contain the entire file, we really cannot do
anything here other than specify a fixed upper bound on the number of
bytes to search. So if look-ahead goes more than N bytes beyond the
match, this code will break by simply being unable to find the match. In
practice, this is probably pretty rare. I believe that if we did a
better fix for this bug by fixing the interfaces, then we'd probably try
to have PCRE2 find the pertinent matches up front so that it never needs
to re-discover them.

Fixes #1412
2021-05-31 21:51:18 -04:00
Andrew Gallant
656aa12649 printer: fix multi-line replacement bug
This commit fixes a subtle bug in multi-line replacement of line
terminators.

The problem is that even though ripgrep supports multi-line searches, it
is *still* line oriented. It still needs to print line numbers, for
example. For this reason, there are various parts in the printer that
iterate over lines in order to format them into the desired output.

This turns out to be problematic in some cases. #1311 documents one of
those cases (with line numbers enabled to highlight a point later):

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?
    2:world?

But the desired output is this:

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?world?

At first I had thought that the main problem was that the printer was
taking ownership of writing line terminators, even if the input already
had them. But it's more subtle than that. If we fix that issue, we get
output like this instead:

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?2:world?

Notice how '2:' is printed before 'world?'. The reason it works this way
is because matches are reported to the printer in a line oriented way.
That is, the printer gets a block of lines. The searcher guarantees that
all matches that start or end in any of those lines also end or start in
another line in that same block. As a result, the printer uses this
assumption: once it has processed a block of lines, the next match will
begin on a new and distinct line. Thus, things like '2:' are printed.

This is generally all fine and good, but an impedance mismatch arises
when replacements are used. Because now, the replacement can be used to
change the "block of lines" approach. Now, in terms of the output, the
subsequent match might actually continue the current line since the
replacement might get rid of the concept of lines altogether.

We can sometimes work around this. For example:

    $ printf "hello\nworld\n" | rg -U "\n(.)?" -r '?$1'
    hello?world?

Why does this work? It's because the '(.)' after the '\n' causes the
match to overlap between lines. Thus, the searcher guarantees that the
block sent to the printer contains every line.

And there in lay the solution: all we need to do is tweak the multi-line
searcher so that it combines lines with matches that directly adjacent,
instead of requiring at least one byte of overlap. Fixing that solves
the issue above. It does cause some tests to fail:

* The binary3 test in the searcher crate fails because adjacent line
  matches are now one part of block, and that block is scanned for
  binary data. To preserve the essence of the test, we insert a couple
  dummy lines to split up the blocks.
* The JSON CRLF test. It was testing that we didn't output any messages
  with an empty 'submatches' array. That is indeed still the case. The
  difference is that the messages got combined because of the adjacent
  line merging behavior. This is a slight change to the output, but is
  still correct.

Fixes #1311
2021-05-31 21:51:18 -04:00
Alessandro Menezes
2295061e80 searcher: do UTF-8 BOM sniffing like UTF-16
Previously, we were only looking for the UTF-16 BOM for determining
whether to do transcoding or not. But we should also look for the UTF-8
BOM as well.

Fixes #1638, Closes #1697
2021-05-31 21:51:18 -04:00
Vasili Revelas
3f33a83a5f searcher: remove variable shadowing
The previous variable name was the same as one of the method arguments.
2021-05-31 21:51:18 -04:00
aricha1940
1c3eebefec
searcher: update outdated comment for buffer size
Looks like this was accidentally left set to 8 in commit 46fb77c.

PR #1839
2021-03-31 08:18:38 -04:00
Andrew Gallant
64ac2ebe0f
tests: fix tests for buffer size change
Sadly, there were several tests that are coupled to the size of the
buffer used by ripgrep. Making the tests agnostic to the size is
difficult. And it's annoying to fix the tests. But we rarely change the
buffer size, so ¯\_(ツ)_/¯.
2021-03-23 18:14:18 -04:00
Andrew Gallant
46fb77c20c
searcher: bump buffer size
This increases the initial buffer size from 8KB to 64KB. This actually
leads to a reasonably noticeable improvement in at least one work-load,
and is unlikely to regress in any other case. Also, since Rust programs
(at least on Linux) seem to always use a minimum of 6-8MB of memory,
adding an extra 56KB is negligible.

Before:

    $ hyperfine -i "rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap"
    Benchmark #1: rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap
      Time (mean ± σ):      2.109 s ±  0.012 s    [User: 565.5 ms, System: 1541.6 ms]
      Range (min … max):    2.094 s …  2.128 s    10 runs

After:

    $ hyperfine -i "rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap"
    Benchmark #1: rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap
      Time (mean ± σ):      1.802 s ±  0.006 s    [User: 462.3 ms, System: 1337.9 ms]
      Range (min … max):    1.795 s …  1.814 s    10 runs
2021-03-23 17:45:02 -04:00
Andrew Gallant
3a1780d841
deps: replace memmap with memmap2
memmap is unmaintained at this point and it is being flagged as a
RUSTSEC advisory in ripgrep. This doesn't seem like that big of a deal
to me honestly, but memmap2 looks like a fine choice at this point.

Fixes #1785, Closes #1786
2021-01-17 18:49:51 -05:00
Andrew Gallant
d97fb72d84
doc: update CI links in crate READMEs
I switched to GitHub Actions long ago, which replaces both Travis and
AppVeyor.

Fixes #1732
2020-11-16 19:07:16 -05:00
Alex Touchet
3ca324fda7
doc: update several links to use https
PR #1724
2020-11-03 10:33:36 -05:00
Andrew Gallant
92daa34eb3
ripgrep: release 12.0.0 2020-03-15 21:42:54 -04:00
chip
50d2047ae2
crates: update URLs in Cargo.toml
This corrects an oversight when the repo was re-organized to
have its crates moved into a 'crates' sub-directory.

PR #1505
2020-02-28 20:31:43 -05:00
Andrew Gallant
0874aa115c repo: make ripgrep build with the new organization 2020-02-17 19:24:53 -05:00
Andrew Gallant
fdd8510fdd repo: move all source code in crates directory
The top-level listing was just getting a bit too long for my taste. So
put all of the code in one directory and shrink the large top-level mess
to a small top-level mess.

NOTE: This commit only contains renames. The subsequent commit will
actually make ripgrep build again. We do it this way with the naive hope
that this will make it easier for git history to track the renames.
Sigh.
2020-02-17 19:24:53 -05:00