ripgrep

mirror of https://github.com/BurntSushi/ripgrep.git synced 2025-07-06 06:27:36 +02:00

Author	SHA1	Message	Date
Jakub Wieczorek	b435eaafc8	grep-regex: fix inner literal extraction bug This appears to be another transcription bug from copying this code from the prefix literal detection from inside the regex crate. Namely, when it comes to inner literals, we only want to treat counted repetition as two separate cases: the case when the minimum match is 0 and the case when the minimum match is more than 0. In the former case, we treat `e{0,n}` as `e*` and in the latter we treat `e{m,n}` where `m >= 1` as just `e`. We could definitely do better here. For example, this means regexes like `(foo){10}` will only have `foo` extracted as a literal, where searching for the full literal would likely be faster. The actual bug here was that we were not implementing this logic correctly. Namely, we weren't always "cutting" the literals in the second case to prevent them from being expanded. Fixes #1319, Closes #1367	2020-02-17 17:16:28 -05:00
Andrew Gallant	6a0e0147e0	grep-regex: improve literal detection with -w When the -w/--word-regexp was used, ripgrep would in many cases fail to apply literal optimizations. This occurs specifically when the regex given by the user is an alternation of literals with no common prefixes or suffixes, e.g., rg -w 'foo|bar|baz|quux' In this case, the inner literal detector fails. Normally, this would result in literal prefixes being detected by the regex engine. But because of the -w/--word-regexp flag, the actual regex that we run ends up looking like this: (^|\W)(foo|bar|baz|quux)($|\W) which of course defeats any prefix or suffix literal optimizations in the regex crate's somewhat naive extractor. (A better extractor could still do literal optimizations in the above case.) So this commit fixes this by falling back to prefix or suffix literals when they're available instead of prematurely giving up and assuming the regex engine will do the rest.	2020-02-17 17:16:28 -05:00
Andrew Gallant	ad97e9c93f	grep-regex: improve inner literal detection This fixes an interesting performance bug where the inner literal extractor would sometimes choose a sub-optimal literal. For example, consider the regex: \x20+Sherlock Holmes\x20+ (The `\x20` is the ASCII code for a space character, which we use here to just make it clearer. It otherwise does not matter.) Previously, this would see the initial \x20 and then stop collecting literals after the `+` repetition operator. This was because the inner literal detector was adapted from the prefix literal detector, which had to stop here. Namely, while \x20S would be a valid prefix (for example), \x20\x20S would also be a valid prefix. As would \x20\x20\x20S and so on. So the prefix detector would have to stop at the repetition operator. Otherwise, only searching for \x20S could potentially scan farther than the starting position of the next match. However, for inner literals, this calculus no longer makes sense. We can freely search for, e.g., \x20S without missing matches that start with \x20\x20S precisely because we know this is an inner literal which may not correspond to the start of a match. With this fix, the literal that is now detected is \x20Sherlock Holmes\x20 Which is much better. We achieve this by no longer "cutting" literals after seeing a `+` repetition operator. Instead, we permit literals to continue to be extended. The reason why this is important is because using \x20 as the literal to search for is generally bad juju since it is so common. In fact, we should probably add more logic here to either avoid such things or give up entirely on the inner literal optimization if it detected a literal that we think is very common. But we punt on such things here.	2020-02-17 17:16:28 -05:00
Andrew Gallant	9f15e3b671	regex: fix a perf bug when using -w flag When looking for an inner literal to speed up searches, if only a prefix is found, then we generally give up doing inner literal optimizations since the regex engine will generally handle it for us. Unfortunately, this decision was being made before we wrap the regex in (^|\W)...($|\W) when using the -w/--word-regexp flag, which would then defeat the literal optimizations inside the regex engine. We fix this with a bit of a hack that says, "if we're doing a word regexp, then give me back any literal you find, even if it's a prefix."	2019-04-05 23:24:08 -04:00
Andrew Gallant	ba503eb677	grep-regex: fix inner literal detection It seems the inner literal detector fails spectacularly in cases of concatenations that involve groups. The issue here is that if the prefix of a group inside a concatenation can match the empty string, then any literals generated to that point in the concatenation need to be cut such that they are never extended. The detector isn't really built to handle this case, so we just act conservatively and cut literals whenever we see a sub-group. This may make some regexes slower, but the inner literal detector already misses plenty of cases. Literal detection (including in the regex engine) is a key component that needs to be completely rethought at some point. Fixes #1064	2018-09-25 16:56:04 -04:00
Andrew Gallant	d9ca529356	libripgrep: initial commit introducing libripgrep libripgrep is not any one library, but rather, a collection of libraries that roughly separate the following key distinct phases in a grep implementation: 1. Pattern matching (e.g., by a regex engine). 2. Searching a file using a pattern matcher. 3. Printing results. Ultimately, both (1) and (3) are defined by de-coupled interfaces, of which there may be multiple implementations. Namely, (1) is satisfied by the `Matcher` trait in the `grep-matcher` crate and (3) is satisfied by the `Sink` trait in the `grep2` crate. The searcher (2) ties everything together and finds results using a matcher and reports those results using a `Sink` implementation. Closes #162	2018-08-20 07:10:19 -04:00

6 Commits
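
The counted-repetition rule described in commit b435eaafc8 can be illustrated with a small Rust sketch. This is not the code from grep-regex: the `Piece` type and the `required_literals` and `cut` functions are hypothetical names invented for this illustration, and the toy only handles a flat concatenation of plain literals and counted repetitions of a literal. It treats `e{0,n}` like `e*` (contributes nothing, and cuts the literal built so far) and `e{m,n}` with `m >= 1` like a single `e` followed by a cut, so that the literal is never extended across the repetition. The upper bound is left out because the rule only depends on the minimum.

```rust
/// One element of a toy concatenation. Hypothetical type for illustration,
/// not part of grep-regex or regex-syntax.
#[derive(Debug)]
enum Piece {
    /// A plain literal such as `foo`.
    Lit(&'static str),
    /// A counted repetition `sub{min,..}` of a literal.
    Repeat { sub: &'static str, min: u32 },
}

/// Finish the literal currently being built so that later pieces
/// cannot extend it. Empty literals are dropped.
fn cut(done: &mut Vec<String>, cur: &mut String) {
    if !cur.is_empty() {
        done.push(std::mem::take(cur));
    }
}

/// Returns literals that must each appear somewhere in every match
/// of the concatenation.
fn required_literals(concat: &[Piece]) -> Vec<String> {
    let mut done = Vec::new();
    let mut cur = String::new();
    for piece in concat {
        match piece {
            Piece::Lit(s) => cur.push_str(s),
            // min == 0: treated like `e*`. It may match the empty string,
            // so it contributes nothing and ends the current literal.
            Piece::Repeat { min: 0, .. } => cut(&mut done, &mut cur),
            // min >= 1: treated like a single `e`, then cut so the
            // literal is not extended past the repetition.
            Piece::Repeat { sub, .. } => {
                cur.push_str(sub);
                cut(&mut done, &mut cur);
            }
        }
    }
    cut(&mut done, &mut cur);
    done
}
```

Under these assumptions, `(foo){10}` yields only `foo`, which matches the limitation the commit message points out, and `foo(bar){2,3}baz` yields `foobar` and `baz` rather than the invalid `foobarbaz`.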