printer: fix multi-line replacement bug

This commit fixes a subtle bug in multi-line replacement of line
terminators.

The problem is that even though ripgrep supports multi-line searches, it
is *still* line oriented. It still needs to print line numbers, for
example. For this reason, there are various parts in the printer that
iterate over lines in order to format them into the desired output.

This turns out to be problematic in some cases. #1311 documents one of
those cases (with line numbers enabled to highlight a point later):

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?
    2:world?

But the desired output is this:

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?world?

At first I had thought that the main problem was that the printer was
taking ownership of writing line terminators, even if the input already
had them. But it's more subtle than that. If we fix that issue, we get
output like this instead:

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?2:world?

Notice how '2:' is printed before 'world?'. It works this way because
matches are reported to the printer in a line oriented way. That is, the
printer gets a block of lines. The searcher guarantees that all matches
that start or end in any of those lines also end or start in another
line in that same block. As a result, the printer assumes that once it
has processed a block of lines, the next match will begin on a new and
distinct line. Thus, things like '2:' are printed.

This is generally all fine and good, but an impedance mismatch arises
when replacements are used, because a replacement can undermine the
"block of lines" approach. In terms of the output, the subsequent match
might actually continue the current line, since the replacement might
get rid of the concept of lines altogether.

We can sometimes work around this. For example:

    $ printf "hello\nworld\n" | rg -U "\n(.)?" -r '?$1'
    hello?world?

Why does this work? It's because the '(.)' after the '\n' causes the
match to overlap between lines. Thus, the searcher guarantees that the
block sent to the printer contains every line.

And therein lies the solution: all we need to do is tweak the multi-line
searcher so that it combines lines with matches that are directly
adjacent, instead of requiring at least one byte of overlap. Fixing that
solves the issue above. It does cause some tests to fail:

* The binary3 test in the searcher crate fails because adjacent line
  matches are now part of one block, and that block is scanned for
  binary data. To preserve the essence of the test, we insert a couple
  of dummy lines to split up the blocks.
* The JSON CRLF test. It was testing that we didn't output any messages
  with an empty 'submatches' array. That is indeed still the case. The
  difference is that the messages got combined because of the adjacent
  line merging behavior. This is a slight change to the output, but is
  still correct.

Fixes #1311
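
To make the overlap vs. adjacency distinction concrete, here is a minimal standalone sketch. It is not the searcher's actual code: `line_range` and `group_lines` are hypothetical helpers written for illustration, and byte ranges stand in for the searcher's match type. It expands each match to the lines containing it, then groups those line ranges into blocks using either strict overlap (`>`, the old check) or adjacency (`>=`, the new check).

```rust
use std::ops::Range;

/// Expand a match to the full lines that contain it (hypothetical helper;
/// assumes `\n` as the line terminator).
fn line_range(haystack: &[u8], m: &Range<usize>) -> Range<usize> {
    // The line starts just after the previous terminator, or at offset 0.
    let start = haystack[..m.start]
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    // The line ends just after the first terminator at or after the last
    // byte of the match, or at the end of the haystack.
    let from = if m.end > m.start { m.end - 1 } else { m.end };
    let end = haystack[from..]
        .iter()
        .position(|&b| b == b'\n')
        .map_or(haystack.len(), |i| from + i + 1);
    start..end
}

/// Group line ranges into blocks (hypothetical helper). With
/// `merge_adjacent` unset, two ranges are combined only when they share at
/// least one byte. With it set, ranges that merely touch are combined too.
fn group_lines(lines: &[Range<usize>], merge_adjacent: bool) -> Vec<Range<usize>> {
    let mut blocks: Vec<Range<usize>> = Vec::new();
    for line in lines {
        match blocks.last_mut() {
            Some(last)
                if last.end > line.start
                    || (merge_adjacent && last.end == line.start) =>
            {
                // Grow the current block to cover this line as well.
                last.end = last.end.max(line.end);
            }
            _ => blocks.push(line.clone()),
        }
    }
    blocks
}

fn main() {
    let haystack = b"hello\nworld\n";
    // The pattern "\n" matches at byte offsets 5..6 and 11..12.
    let matches = vec![5..6, 11..12];
    let lines: Vec<Range<usize>> =
        matches.iter().map(|m| line_range(haystack, m)).collect();

    // Strict overlap: each line is its own block.
    println!("{:?}", group_lines(&lines, false)); // [0..6, 6..12]
    // Adjacency: both lines land in a single block.
    println!("{:?}", group_lines(&lines, true)); // [0..12]
}
```

With strict overlap, the printer would see two separate blocks and start each with its own line number. With adjacency, it sees one block covering both lines, which is what lets the replacement join them.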
2025-07-16 22:42:20 +02:00 · 2021-05-31 06:10:48 -04:00
parent fc31aedcf3
commit 656aa12649
5 changed files with 108 additions and 17 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -62,6 +62,8 @@ Bug fixes:

 * [BUG #1277](https://github.com/BurntSushi/ripgrep/issues/1277):
  Document cygwin path translation behavior in the FAQ.
+* [BUG #1311](https://github.com/BurntSushi/ripgrep/issues/1311):
+  Fix multi-line bug where a search & replace for `\n` didn't work as expected.
 * [BUG #1642](https://github.com/BurntSushi/ripgrep/issues/1642):
  Fixes a bug where using `-m` and `-A` printed more matches than the limit.
 * [BUG #1703](https://github.com/BurntSushi/ripgrep/issues/1703):
--- a/crates/printer/src/standard.rs
+++ b/crates/printer/src/standard.rs
@@ -3224,6 +3224,80 @@ Holmeses, success in the province of detective work must always
        assert_eq_printed!(expected, got);
    }

+    // This is a somewhat weird test that checks the behavior of attempting
+    // to replace a line terminator with something else.
+    //
+    // See: https://github.com/BurntSushi/ripgrep/issues/1311
+    #[test]
+    fn replacement_multi_line() {
+        let matcher = RegexMatcher::new(r"\n").unwrap();
+        let mut printer = StandardBuilder::new()
+            .replacement(Some(b"?".to_vec()))
+            .build(NoColor::new(vec![]));
+        SearcherBuilder::new()
+            .line_number(true)
+            .multi_line(true)
+            .build()
+            .search_reader(
+                &matcher,
+                "hello\nworld\n".as_bytes(),
+                printer.sink(&matcher),
+            )
+            .unwrap();
+
+        let got = printer_contents(&mut printer);
+        let expected = "1:hello?world?\n";
+        assert_eq_printed!(expected, got);
+    }
+
+    #[test]
+    fn replacement_multi_line_diff_line_term() {
+        let matcher = RegexMatcherBuilder::new()
+            .line_terminator(Some(b'\x00'))
+            .build(r"\n")
+            .unwrap();
+        let mut printer = StandardBuilder::new()
+            .replacement(Some(b"?".to_vec()))
+            .build(NoColor::new(vec![]));
+        SearcherBuilder::new()
+            .line_terminator(LineTerminator::byte(b'\x00'))
+            .line_number(true)
+            .multi_line(true)
+            .build()
+            .search_reader(
+                &matcher,
+                "hello\nworld\n".as_bytes(),
+                printer.sink(&matcher),
+            )
+            .unwrap();
+
+        let got = printer_contents(&mut printer);
+        let expected = "1:hello?world?\x00";
+        assert_eq_printed!(expected, got);
+    }
+
+    #[test]
+    fn replacement_multi_line_combine_lines() {
+        let matcher = RegexMatcher::new(r"\n(.)?").unwrap();
+        let mut printer = StandardBuilder::new()
+            .replacement(Some(b"?$1".to_vec()))
+            .build(NoColor::new(vec![]));
+        SearcherBuilder::new()
+            .line_number(true)
+            .multi_line(true)
+            .build()
+            .search_reader(
+                &matcher,
+                "hello\nworld\n".as_bytes(),
+                printer.sink(&matcher),
+            )
+            .unwrap();
+
+        let got = printer_contents(&mut printer);
+        let expected = "1:hello?world?\n";
+        assert_eq_printed!(expected, got);
+    }
+
    #[test]
    fn replacement_max_columns() {
        let matcher = RegexMatcher::new(r"Sherlock|Doctor (\w+)").unwrap();
--- a/crates/searcher/src/searcher/glue.rs
+++ b/crates/searcher/src/searcher/glue.rs
@@ -226,10 +226,19 @@ impl<'s, M: Matcher, S: Sink> MultiLine<'s, M, S> {
            }
            Some(last_match) => {
                // If the lines in the previous match overlap with the lines
-                // in this match, then simply grow the match and move on.
-                // This happens when the next match begins on the same line
-                // that the last match ends on.
-                if last_match.end() > line.start() {
+                // in this match, then simply grow the match and move on. This
+                // happens when the next match begins on the same line that the
+                // last match ends on.
+                //
+                // Note that we do not technically require strict overlap here.
+                // Instead, we only require that the lines are adjacent. This
+                // provides larger blocks of lines to the printer, and results
+                // in overall better behavior with respect to how replacements
+                // are handled.
+                //
+                // See: https://github.com/BurntSushi/ripgrep/issues/1311
+                // And also the associated commit fixing #1311.
+                if last_match.end() >= line.start() {
                    self.last_match = Some(last_match.with_end(line.end()));
                    Ok(true)
                } else {
@@ -714,21 +723,23 @@ d
            haystack.push_str("zzz\n");
        }
        haystack.push_str("a\n");
+        haystack.push_str("zzz\n");
        haystack.push_str("a\x00a\n");
+        haystack.push_str("zzz\n");
        haystack.push_str("a\n");

        // The line buffered searcher has slightly different semantics here.
        // Namely, it will *always* detect binary data in the current buffer
        // before searching it. Thus, the total number of bytes searched is
        // smaller than below.
-        let exp = "0:a\n\nbyte count:262146\nbinary offset:262149\n";
+        let exp = "0:a\n\nbyte count:262146\nbinary offset:262153\n";
        // In contrast, the slice readers (for multi line as well) will only
        // look for binary data in the initial chunk of bytes. After that
        // point, it only looks for binary data in matches. Note though that
        // the binary offset remains the same. (See the binary4 test for a case
        // where the offset is explicitly different.)
        let exp_slice =
-            "0:a\n262146:a\n\nbyte count:262149\nbinary offset:262149\n";
+            "0:a\n262146:a\n\nbyte count:262153\nbinary offset:262153\n";

        SearcherTester::new(&haystack, "a")
            .binary_detection(BinaryDetection::quit(0))
--- a/tests/json.rs
+++ b/tests/json.rs
@@ -323,24 +323,19 @@ rgtest!(r1095_crlf_empty_match, |dir: Dir, mut cmd: TestCommand| {

    // Check without --crlf flag.
    let msgs = json_decode(&cmd.arg("-U").arg("--json").arg("\n").stdout());
-    assert_eq!(msgs.len(), 5);
+    assert_eq!(msgs.len(), 4);

    let m = msgs[1].unwrap_match();
-    assert_eq!(m.lines, Data::text("test\r\n"));
-    assert_eq!(m.submatches[0].m, Data::text("\n"));
-
-    let m = msgs[2].unwrap_match();
-    assert_eq!(m.lines, Data::text("\n"));
+    assert_eq!(m.lines, Data::text("test\r\n\n"));
    assert_eq!(m.submatches[0].m, Data::text("\n"));
+    assert_eq!(m.submatches[1].m, Data::text("\n"));

    // Now check with --crlf flag.
    let msgs = json_decode(&cmd.arg("--crlf").stdout());
+    assert_eq!(msgs.len(), 4);

    let m = msgs[1].unwrap_match();
-    assert_eq!(m.lines, Data::text("test\r\n"));
-    assert_eq!(m.submatches[0].m, Data::text("\n"));
-
-    let m = msgs[2].unwrap_match();
-    assert_eq!(m.lines, Data::text("\n"));
+    assert_eq!(m.lines, Data::text("test\r\n\n"));
    assert_eq!(m.submatches[0].m, Data::text("\n"));
+    assert_eq!(m.submatches[1].m, Data::text("\n"));
 });
--- a/tests/regression.rs
+++ b/tests/regression.rs
@@ -744,6 +744,15 @@ rgtest!(r1259_drop_last_byte_nonl, |dir: Dir, mut cmd: TestCommand| {
    eqnice!("fz\n", cmd.arg("-f").arg("patterns-nl").arg("test").stdout());
 });

+// See: https://github.com/BurntSushi/ripgrep/issues/1311
+rgtest!(r1311_multi_line_term_replace, |dir: Dir, mut cmd: TestCommand| {
+    dir.create("input", "hello\nworld\n");
+    eqnice!(
+        "1:hello?world?\n",
+        cmd.args(&["-U", "-r?", "-n", "\n", "input"]).stdout()
+    );
+});
+
 // See: https://github.com/BurntSushi/ripgrep/issues/1319
 rgtest!(r1319, |dir: Dir, mut cmd: TestCommand| {
    dir.create("input", "CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC");