1
0
mirror of https://github.com/BurntSushi/ripgrep.git synced 2025-01-13 21:28:13 +02:00

printer: fix multi-line replacement bug

This commit fixes a subtle bug in multi-line replacement of line
terminators.

The problem is that even though ripgrep supports multi-line searches, it
is *still* line oriented. It still needs to print line numbers, for
example. For this reason, there are various parts in the printer that
iterate over lines in order to format them into the desired output.

This turns out to be problematic in some cases. #1311 documents one of
those cases (with line numbers enabled to highlight a point later):

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?
    2:world?

But the desired output is this:

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?world?

At first I had thought that the main problem was that the printer was
taking ownership of writing line terminators, even if the input already
had them. But it's more subtle than that. If we fix that issue, we get
output like this instead:

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?2:world?

Notice how '2:' is printed before 'world?'. The reason it works this way
is because matches are reported to the printer in a line oriented way.
That is, the printer gets a block of lines. The searcher guarantees that
all matches that start or end in any of those lines also end or start in
another line in that same block. As a result, the printer uses this
assumption: once it has processed a block of lines, the next match will
begin on a new and distinct line. Thus, things like '2:' are printed.

This is generally all fine and good, but an impedance mismatch arises
when replacements are used. Because now, the replacement can be used to
change the "block of lines" approach. Now, in terms of the output, the
subsequent match might actually continue the current line since the
replacement might get rid of the concept of lines altogether.

We can sometimes work around this. For example:

    $ printf "hello\nworld\n" | rg -U "\n(.)?" -r '?$1'
    hello?world?

Why does this work? It's because the '(.)' after the '\n' causes the
match to overlap between lines. Thus, the searcher guarantees that the
block sent to the printer contains every line.

And there in lay the solution: all we need to do is tweak the multi-line
searcher so that it combines lines with matches that directly adjacent,
instead of requiring at least one byte of overlap. Fixing that solves
the issue above. It does cause some tests to fail:

* The binary3 test in the searcher crate fails because adjacent line
  matches are now one part of block, and that block is scanned for
  binary data. To preserve the essence of the test, we insert a couple
  dummy lines to split up the blocks.
* The JSON CRLF test. It was testing that we didn't output any messages
  with an empty 'submatches' array. That is indeed still the case. The
  difference is that the messages got combined because of the adjacent
  line merging behavior. This is a slight change to the output, but is
  still correct.

Fixes #1311
This commit is contained in:
Andrew Gallant 2021-05-31 06:10:48 -04:00
parent fc31aedcf3
commit 656aa12649
5 changed files with 108 additions and 17 deletions

View File

@ -62,6 +62,8 @@ Bug fixes:
* [BUG #1277](https://github.com/BurntSushi/ripgrep/issues/1277):
Document cygwin path translation behavior in the FAQ.
* [BUG #1311](https://github.com/BurntSushi/ripgrep/issues/1311):
Fix multi-line bug where a search & replace for `\n` didn't work as expected.
* [BUG #1642](https://github.com/BurntSushi/ripgrep/issues/1642):
Fixes a bug where using `-m` and `-A` printed more matches than the limit.
* [BUG #1703](https://github.com/BurntSushi/ripgrep/issues/1703):

View File

@ -3224,6 +3224,80 @@ Holmeses, success in the province of detective work must always
assert_eq_printed!(expected, got);
}
// This is a somewhat weird test that checks the behavior of attempting
// to replace a line terminator with something else.
//
// See: https://github.com/BurntSushi/ripgrep/issues/1311
#[test]
fn replacement_multi_line() {
let matcher = RegexMatcher::new(r"\n").unwrap();
let mut printer = StandardBuilder::new()
.replacement(Some(b"?".to_vec()))
.build(NoColor::new(vec![]));
SearcherBuilder::new()
.line_number(true)
.multi_line(true)
.build()
.search_reader(
&matcher,
"hello\nworld\n".as_bytes(),
printer.sink(&matcher),
)
.unwrap();
let got = printer_contents(&mut printer);
let expected = "1:hello?world?\n";
assert_eq_printed!(expected, got);
}
#[test]
fn replacement_multi_line_diff_line_term() {
let matcher = RegexMatcherBuilder::new()
.line_terminator(Some(b'\x00'))
.build(r"\n")
.unwrap();
let mut printer = StandardBuilder::new()
.replacement(Some(b"?".to_vec()))
.build(NoColor::new(vec![]));
SearcherBuilder::new()
.line_terminator(LineTerminator::byte(b'\x00'))
.line_number(true)
.multi_line(true)
.build()
.search_reader(
&matcher,
"hello\nworld\n".as_bytes(),
printer.sink(&matcher),
)
.unwrap();
let got = printer_contents(&mut printer);
let expected = "1:hello?world?\x00";
assert_eq_printed!(expected, got);
}
#[test]
fn replacement_multi_line_combine_lines() {
let matcher = RegexMatcher::new(r"\n(.)?").unwrap();
let mut printer = StandardBuilder::new()
.replacement(Some(b"?$1".to_vec()))
.build(NoColor::new(vec![]));
SearcherBuilder::new()
.line_number(true)
.multi_line(true)
.build()
.search_reader(
&matcher,
"hello\nworld\n".as_bytes(),
printer.sink(&matcher),
)
.unwrap();
let got = printer_contents(&mut printer);
let expected = "1:hello?world?\n";
assert_eq_printed!(expected, got);
}
#[test]
fn replacement_max_columns() {
let matcher = RegexMatcher::new(r"Sherlock|Doctor (\w+)").unwrap();

View File

@ -226,10 +226,19 @@ impl<'s, M: Matcher, S: Sink> MultiLine<'s, M, S> {
}
Some(last_match) => {
// If the lines in the previous match overlap with the lines
// in this match, then simply grow the match and move on.
// This happens when the next match begins on the same line
// that the last match ends on.
if last_match.end() > line.start() {
// in this match, then simply grow the match and move on. This
// happens when the next match begins on the same line that the
// last match ends on.
//
// Note that we do not technically require strict overlap here.
// Instead, we only require that the lines are adjacent. This
// provides larger blocks of lines to the printer, and results
// in overall better behavior with respect to how replacements
// are handled.
//
// See: https://github.com/BurntSushi/ripgrep/issues/1311
// And also the associated commit fixing #1311.
if last_match.end() >= line.start() {
self.last_match = Some(last_match.with_end(line.end()));
Ok(true)
} else {
@ -714,21 +723,23 @@ d
haystack.push_str("zzz\n");
}
haystack.push_str("a\n");
haystack.push_str("zzz\n");
haystack.push_str("a\x00a\n");
haystack.push_str("zzz\n");
haystack.push_str("a\n");
// The line buffered searcher has slightly different semantics here.
// Namely, it will *always* detect binary data in the current buffer
// before searching it. Thus, the total number of bytes searched is
// smaller than below.
let exp = "0:a\n\nbyte count:262146\nbinary offset:262149\n";
let exp = "0:a\n\nbyte count:262146\nbinary offset:262153\n";
// In contrast, the slice readers (for multi line as well) will only
// look for binary data in the initial chunk of bytes. After that
// point, it only looks for binary data in matches. Note though that
// the binary offset remains the same. (See the binary4 test for a case
// where the offset is explicitly different.)
let exp_slice =
"0:a\n262146:a\n\nbyte count:262149\nbinary offset:262149\n";
"0:a\n262146:a\n\nbyte count:262153\nbinary offset:262153\n";
SearcherTester::new(&haystack, "a")
.binary_detection(BinaryDetection::quit(0))

View File

@ -323,24 +323,19 @@ rgtest!(r1095_crlf_empty_match, |dir: Dir, mut cmd: TestCommand| {
// Check without --crlf flag.
let msgs = json_decode(&cmd.arg("-U").arg("--json").arg("\n").stdout());
assert_eq!(msgs.len(), 5);
assert_eq!(msgs.len(), 4);
let m = msgs[1].unwrap_match();
assert_eq!(m.lines, Data::text("test\r\n"));
assert_eq!(m.submatches[0].m, Data::text("\n"));
let m = msgs[2].unwrap_match();
assert_eq!(m.lines, Data::text("\n"));
assert_eq!(m.lines, Data::text("test\r\n\n"));
assert_eq!(m.submatches[0].m, Data::text("\n"));
assert_eq!(m.submatches[1].m, Data::text("\n"));
// Now check with --crlf flag.
let msgs = json_decode(&cmd.arg("--crlf").stdout());
assert_eq!(msgs.len(), 4);
let m = msgs[1].unwrap_match();
assert_eq!(m.lines, Data::text("test\r\n"));
assert_eq!(m.submatches[0].m, Data::text("\n"));
let m = msgs[2].unwrap_match();
assert_eq!(m.lines, Data::text("\n"));
assert_eq!(m.lines, Data::text("test\r\n\n"));
assert_eq!(m.submatches[0].m, Data::text("\n"));
assert_eq!(m.submatches[1].m, Data::text("\n"));
});

View File

@ -744,6 +744,15 @@ rgtest!(r1259_drop_last_byte_nonl, |dir: Dir, mut cmd: TestCommand| {
eqnice!("fz\n", cmd.arg("-f").arg("patterns-nl").arg("test").stdout());
});
// See: https://github.com/BurntSushi/ripgrep/issues/1311
rgtest!(r1311_multi_line_term_replace, |dir: Dir, mut cmd: TestCommand| {
dir.create("input", "hello\nworld\n");
eqnice!(
"1:hello?world?\n",
cmd.args(&["-U", "-r?", "-n", "\n", "input"]).stdout()
);
});
// See: https://github.com/BurntSushi/ripgrep/issues/1319
rgtest!(r1319, |dir: Dir, mut cmd: TestCommand| {
dir.create("input", "CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC");