ripgrep

mirror of https://github.com/BurntSushi/ripgrep.git synced 2024-12-12 19:18:24 +02:00

Author	SHA1	Message	Date
Andrew Gallant	0972c6e7c7	grep-searcher-0.1.6	2019-08-06 09:50:52 -04:00
Andrew Gallant	813c676eca	searcher: fix roll buffer bug This commit fixes a subtle bug in how the line buffer was rolling its contents. Specifically, when ripgrep searches without memory maps, it uses a "roll" buffer for incremental line oriented search without needing to read the entire file into memory at once. The roll buffer works by reading a chunk of bytes from the file into memory, and then searching everything in that buffer up to the last `\n` byte. The bytes after the last `\n` byte are preserved, since they likely correspond to part of the next line. Once ripgrep is done searching the buffer, it "rolls" the buffer such that the start of the next line is at the beginning of the buffer, and then ripgrep reads more data into the buffer starting at the (possibly) partial end of that line. The implication of this strategy, necessarily so, is that a buffer must be big enough to fit a single line in memory. This is because the regex engine needs a contiguous block of memory to search, so there is no way to search anything smaller than a single line. So if a file contains a single line with 7.5 million bytes, then the buffer will grow to be at least that size. (Many files have super long lines like this, but they tend to be binary files, which ripgrep will detect and stop searching unless the user forces it with the `-a/--text` flag. So in practice, they aren't usually a problem. However, in this case, #1335 found a case where a plain text file had a line with 7.5 million bytes.) Now, for performance reasons, ripgrep reuses these buffers across its search. Typically, it will create `N` of these line buffers when it starts (where `N` is the number of threads it is using), and then reuse them without creating any new ones as it searches through files. This means that if you search a file with a very long line, that buffer will expand to be big enough to store that line. 
ripgrep never contracts these buffers, so once it searches the next file, ripgrep will continue to use this large buffer. While it might be prudent to contract these buffers in some circumstances, this isn't otherwise inherently a problem. The memory has already been allocated, and there isn't much cost to using it, other than the fact that ripgrep hangs on to it and never gives it back to the OS. However, the `roll` implementation described above had a really important bug in it that was impacted by the size of the buffer. Specifically, it used the following to "roll" the partial line at the end of the buffer to the beginning: self.buf.copy_within_str(self.pos.., 0); Which means that if the buffer is very large, ripgrep will copy everything from `self.pos` (which might be very small, e.g., for small files) to the end of the buffer, and move it to the beginning of the buffer. This will happen repeatedly each time the buffer is used to search small files, which winds up being quite a large slow down if the line was exceptionally large (say, megabytes). It turns out that copying everything is completely unnecessary. We only need to copy the remainder of the last read to the beginning of the buffer. Everything after the last read in the buffer is just free space that can be filled for the next read. So, all we need to do is copy just those bytes: self.buf.copy_within_str(self.pos..self.end, 0); ... which is typically much much smaller than the rest of the buffer. This was likely also causing small performance losses in other cases as well. For example, when searching a lot of small files, ripgrep would likely do a lot more copying than necessary. Although, given that the default buffer size is 8KB, this extra copying was likely pretty small, and was thus harder to observe. Fixes #1335	2019-08-02 07:23:27 -04:00
Andrew Gallant	785c1f1766	release: globset, grep-cli, grep-printer, grep-searcher	2019-06-26 16:53:30 -04:00
Andrew Gallant	b93762ea7a	bstr: update everything to bstr 0.2	2019-06-26 16:47:33 -04:00
Andrew Gallant	7b9972c308	style: fix deprecations Use `dyn` for trait objects and use `..=` for inclusive ranges.	2019-06-16 18:37:51 -04:00
Andrew Gallant	36d3f235dc	grep-searcher: release 0.1.4	2019-04-15 17:59:22 -04:00
Andrew Gallant	44cd344438	grep-regex: release 0.1.3	2019-04-15 17:56:04 -04:00
Andrew Gallant	e493e54b9b	grep-matcher: release 0.1.2	2019-04-15 17:53:29 -04:00
Andrew Gallant	a7d26c8f14	binary: rejigger ripgrep's handling of binary files This commit attempts to surface binary filtering in a slightly more user friendly way. Namely, before, ripgrep would silently stop searching a file if it detected a NUL byte, even if it had previously printed a match. This can lead to the user quite reasonably assuming that there are no more matches, since a partial search is fairly unintuitive. (ripgrep has this behavior by default because it really wants to NOT search binary files at all, just like it doesn't search gitignored or hidden files.) With this commit, if a match has already been printed and ripgrep detects a NUL byte, then it will print a warning message indicating that the search stopped prematurely. Moreover, this commit adds a new flag, --binary, which causes ripgrep to stop filtering binary files, but in a way that still avoids dumping binary data into terminals. That is, the --binary flag makes ripgrep behave more like grep's default behavior. For files explicitly specified in a search, e.g., `rg foo some-file`, no binary filtering is applied (just like no gitignore and no hidden file filtering is applied). Instead, ripgrep behaves as if you gave the --binary flag for all explicitly given files. This was a fairly invasive change, and potentially increases the UX complexity of ripgrep around binary files. (Before, there were two binary modes, whereas now there are three.) However, ripgrep is now a bit louder with warning messages when binary file detection might otherwise be hiding potential matches, so hopefully this is a net improvement. Finally, the `-uuu` convenience now maps to `--no-ignore --hidden --binary`, since this is closer to the actual intent of the `--unrestricted` flag, i.e., to reduce ripgrep's smart filtering. As a consequence, `rg -uuu foo` should now search roughly the same number of bytes as `grep -r foo`, and `rg -uuua foo` should search roughly the same number of bytes as `grep -ra foo`. 
(The "roughly" weasel word is used because grep's and ripgrep's binary file detection might differ somewhat---perhaps based on buffer sizes---which can impact exactly what is and isn't searched.) See the numerous tests in tests/binary.rs for intended behavior. Fixes #306, Fixes #855	2019-04-14 19:29:27 -04:00
lesnyrumcajs	5962abc465	searcher: add option to disable BOM sniffing This commit adds a new encoding feature where the -E/--encoding flag will now accept a value of 'none'. When given this value, all encoding related machinery is disabled and ripgrep will search the raw bytes of the file, including the BOM if it's present. Closes #1207, Closes #1208	2019-04-06 10:35:08 -04:00
Andrew Gallant	7dcbff9a9b	searcher: partially migrate to bstr This commit causes grep-searcher to use byte strings internally for its line buffer support. We manage to remove a use of `unsafe` by doing this (by pushing it down into `bstr`). We stop short of using byte strings everywhere else because we rely heavily on the `impl ops::Index<[u8]> for grep_matcher::Match` impl, which isn't available for byte strings. (It is premature to make bstr a public dep of a core crate like grep-matcher, but maybe some day.)	2019-04-05 23:24:08 -04:00
Andrew Gallant	d6feeb7ff2	grep-searcher-0.1.3	2019-02-10 07:42:37 -05:00
Andrew Gallant	626ed00c19	searcher: revert big-endian patch This undoes the patch to stop using bytecount on big-endian architectures. In particular, we bump our bytecount dependency to the latest release, which has a fix. This reverts commit `a4868b8835`. Fixes #1144 (again), Closes #1194	2019-02-10 07:40:32 -05:00
Andrew Gallant	fc3cf41247	grep-searcher-0.1.2	2019-02-09 16:13:07 -05:00
Andrew Gallant	a4868b8835	searcher: use naive line counting on big-endian This patches out bytecount's "fast" vectorized algorithm on big-endian machines, where it has been observed to fail. Going forward, bytecount should probably fix this on their end, but for now, we take a small performance hit on big-endian machines. Fixes #1144	2019-02-09 16:13:07 -05:00
Andrew Gallant	9d703110cf	regex: make CRLF hack more robust This commit improves the CRLF hack to be more robust. In particular, in addition to rewriting `$` as `(?:\r??$)`, we now strip `\r` from the end of a match if and only if the regex has an ending line anchor required for a match. This doesn't quite make the hack 100% correct, but should fix most use cases in practice. An example of a regex that will still be incorrect is `foo|bar$`, since the analysis isn't quite sophisticated enough to determine that a `\r` can be safely stripped from any match. Even if we fix that, regexes like `foo\r|bar$` still won't be handled correctly. Alas, more work on this front should really be focused on enabling this in the regex engine itself. The specific cause of this bug was that grep-searcher was sneakily stripping CRLF from matching lines when it really shouldn't have. We remove that code now, and instead rely on better match semantics provided at a lower level. Fixes #1095	2019-01-26 12:34:28 -05:00
Andrew Gallant	276e2c9b9a	searcher: always strip BOM This fixes a bug where a BOM prefix was included. While this was somewhat intentional in order to have a faithful "UTF8 passthru" option, in practice, this causes problems such as breaking patterns like `^` in a really non-obvious way. The actual fix was to add a new API to encoding_rs_io, which this commit brings in. Fixes #1163	2019-01-25 17:18:57 -05:00
Andrew Gallant	1e9ee2cc85	deps: update memmap	2019-01-19 10:44:30 -05:00
Andrew Gallant	968491f8e9	deps: update to bytecount 0.5 bytecount now uses runtime dispatch for enabling SIMD, which means we no longer need the avx-accel features. We remove them from ripgrep since the next release will be a minor version bump, but leave them as no-ops for the crates that previously used them.	2019-01-19 10:44:30 -05:00
Andrew Gallant	63b0f31a22	deps: update various dependencies We also increase the MSRV to 1.32, the current stable release, which sets the stage for migrating to Rust 2018.	2019-01-19 10:44:30 -05:00
Andrew Gallant	dbc8ca9cc1	grep-searcher: add docs for assert_eq_printed Looks like the deny(missing_docs) lint got a bit stronger.	2019-01-11 09:03:00 -05:00
Andrew Gallant	fb62266620	deps: update encoding_rs This commit bumps the version of encoding_rs to use the latest release. This appears to fix a panic in UTF-16 decoding. Fixes #1089	2018-10-22 06:50:35 -04:00
Andrew Gallant	ba533f390e	grep-searcher: update to encoding_rs_io 0.1.3 This update includes a work-around for a presumed bug in encoding_rs that causes a panic: https://github.com/hsivonen/encoding_rs/issues/34 Specifically, to reproduce this in ripgrep, one can run the following: $ curl -LO https://cache.ruby-lang.org/pub/ruby/2.5/ruby-2.5.1.tar.gz $ tar xf ruby-2.5.1.tar.gz $ rg ZZZZZ ruby-2.5.1/test/rexml/data/t63-2.svg thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1' Fixes #1052	2018-09-25 16:56:04 -04:00
Andrew Gallant	d14f0b37d6	deps: update versions for all crates I don't think every change here is needed, but this ensures we're using the latest version of every direct dependency.	2018-09-07 14:00:22 -04:00
Andrew Gallant	3dd4b77dfb	grep-searcher: add Box<...> impl for Sink We initially did not have this impl because the first revision of the Sink trait was much more complicated. In particular, each method was parameterized over a Matcher. But not every Sink impl actually needs a Matcher, and it is just as easy to borrow a Matcher explicitly, so the added parameterization wasn't holding its own. This does permit Sink implementations to be used as trait objects. One key use case here is to reduce compile times, since there is quite a bit of code inside grep-searcher that is parameterized on Sink. Unfortunately, that code is also parameterized on Matcher, and the various printers in grep-printer are also parameterized on Matcher, which means Sink trait objects are necessary but not sufficient for a major reduction in compile times. Unfortunately, the path to making Matcher object safe isn't quite clear. Extension traits maybe? There's also stuff in the Serde ecosystem that might help, but the type shenanigans can get pretty gnarly.	2018-09-07 12:06:05 -04:00
Andrew Gallant	54b3e9eb10	grep-printer: delete unused code	2018-09-07 12:06:05 -04:00
Andrew Gallant	afa06c518a	deps: update libripgrep crate versions This prepares them for an initial 0.1.0 release.	2018-08-20 17:34:45 -04:00
Andrew Gallant	d9ca529356	libripgrep: initial commit introducing libripgrep libripgrep is not any one library, but rather, a collection of libraries that roughly separate the following key distinct phases in a grep implementation: 1. Pattern matching (e.g., by a regex engine). 2. Searching a file using a pattern matcher. 3. Printing results. Ultimately, both (1) and (3) are defined by de-coupled interfaces, of which there may be multiple implementations. Namely, (1) is satisfied by the `Matcher` trait in the `grep-matcher` crate and (3) is satisfied by the `Sink` trait in the `grep2` crate. The searcher (2) ties everything together and finds results using a matcher and reports those results using a `Sink` implementation. Closes #162	2018-08-20 07:10:19 -04:00
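
The three phases described in the initial libripgrep commit (d9ca529356) can be sketched roughly as follows. This is only an illustration of the separation of concerns: the trait shapes, `Substring`, `Collect`, and `search` are hypothetical simplifications and are not the actual `grep-matcher` or `grep-searcher` APIs.

```rust
// Hypothetical, heavily simplified traits; the real crates are far richer.

/// Phase 1: pattern matching.
trait Matcher {
    fn is_match(&self, line: &[u8]) -> bool;
}

/// Phase 3: reporting results.
trait Sink {
    fn matched(&mut self, line_number: u64, line: &[u8]);
}

/// A toy matcher that looks for a literal byte substring.
struct Substring(Vec<u8>);

impl Matcher for Substring {
    fn is_match(&self, line: &[u8]) -> bool {
        let needle = self.0.as_slice();
        // `windows(0)` would panic, so treat an empty needle as always matching.
        needle.is_empty() || line.windows(needle.len()).any(|w| w == needle)
    }
}

/// A toy sink that collects (line number, line) pairs.
struct Collect(Vec<(u64, String)>);

impl Sink for Collect {
    fn matched(&mut self, line_number: u64, line: &[u8]) {
        self.0
            .push((line_number, String::from_utf8_lossy(line).into_owned()));
    }
}

/// Phase 2: the searcher ties a matcher and a sink together.
fn search<M: Matcher, S: Sink>(matcher: &M, haystack: &[u8], sink: &mut S) {
    for (i, line) in haystack.split(|&b| b == b'\n').enumerate() {
        if matcher.is_match(line) {
            sink.matched(i as u64 + 1, line);
        }
    }
}

fn main() {
    let matcher = Substring(b"foo".to_vec());
    let mut sink = Collect(Vec::new());
    search(&matcher, b"foo\nbar\nfoobar\n", &mut sink);
    for (n, line) in &sink.0 {
        println!("{}:{}", n, line);
    }
}
```

Because `search` only depends on the two traits, either side can be swapped out independently, which is the decoupling the commit message describes.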

28 Commits
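
The roll buffer fix in 813c676eca can be illustrated with a minimal sketch. The `LineBuffer` type below is a hypothetical stand-in, not grep-searcher's actual line buffer, and it uses the standard library's `slice::copy_within` in place of the `copy_within_str` call quoted in the commit message. The point is only that rolling `pos..end` copies the leftover partial line, while rolling `pos..` would copy the whole tail of a possibly very large buffer.

```rust
/// Hypothetical, simplified line buffer (an assumption for illustration).
struct LineBuffer {
    buf: Vec<u8>,
    /// Start of the unsearched bytes (the partial last line).
    pos: usize,
    /// End of the bytes filled in by the last read.
    end: usize,
}

impl LineBuffer {
    /// Move only the unsearched bytes `pos..end` to the front of the buffer
    /// and return how many bytes were copied. Everything past `end` is free
    /// space and does not need to be moved.
    fn roll(&mut self) -> usize {
        let len = self.end - self.pos;
        self.buf.copy_within(self.pos..self.end, 0);
        self.pos = 0;
        self.end = len;
        len
    }
}

fn main() {
    // A buffer that previously grew large because of a very long line.
    let mut lb = LineBuffer { buf: vec![0; 1 << 20], pos: 0, end: 0 };

    // Simulate reading a small file whose last line is incomplete.
    let data = b"foo\nbar\nba";
    lb.buf[..data.len()].copy_from_slice(data);
    lb.end = data.len();
    lb.pos = 8; // just past the last `\n`

    let copied = lb.roll();
    // Only the partial line is copied, regardless of the buffer's capacity.
    println!("copied {} of {} buffer bytes", copied, lb.buf.len());
}
```

With the buggy range `self.pos..`, the same call would have moved nearly the entire one-megabyte buffer for every small file searched.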