diff --git a/CHANGELOG.md b/CHANGELOG.md index 4ce6ae7d..bd860458 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,8 @@ TBD === Bug fixes: +* [BUG #1252](https://github.com/BurntSushi/ripgrep/issues/1252): + Add a section on the `--pre` flag to the GUIDE. * [BUG #1339](https://github.com/BurntSushi/ripgrep/issues/1339): Improve error message when a pattern with invalid UTF-8 is provided. * [BUG #1524](https://github.com/BurntSushi/ripgrep/issues/1524): diff --git a/GUIDE.md b/GUIDE.md index 22ad183d..3e3bd583 100644 --- a/GUIDE.md +++ b/GUIDE.md @@ -19,6 +19,7 @@ translatable to any command line shell environment. * [Configuration file](#configuration-file) * [File encoding](#file-encoding) * [Binary data](#binary-data) +* [Preprocessor](#preprocessor) * [Common options](#common-options) @@ -767,6 +768,212 @@ via the `--no-mmap` flag. (The cost will be a small performance regression when searching very large files on some platforms.) +### Preprocessor + +In ripgrep, a preprocessor is any type of command that can be run to transform +the input of every file before ripgrep searches it. This makes it possible to +search virtually any kind of content that can be automatically converted to +text without having to teach ripgrep how to read said content. + +One common example is searching PDFs. PDFs are first and foremost meant to be +displayed to users. But PDFs often have text streams in them that can be useful +to search. In our case, we want to search Bruce Watson's excellent +dissertation, +[Taxonomies and Toolkits of Regular Language Algorithms](https://burntsushi.net/stuff/1995-watson.pdf). +After downloading it, let's try searching it: + +``` +$ rg 'The Commentz-Walter algorithm' 1995-watson.pdf +$ +``` + +Surely, a dissertation on regular language algorithms would mention +Commentz-Walter. Indeed it does, but our search isn't picking it up because +PDFs are a binary format, and the text shown in the PDF may not be encoded as +simple contiguous UTF-8. Namely, even passing the `-a/--text` flag to ripgrep +will not make our search work. + +One way to fix this is to convert the PDF to plain text first. This won't work +well for all PDFs, but does great in a lot of cases. (Note that the tool we +use, `pdftotext`, is part of the [poppler](https://poppler.freedesktop.org) +PDF rendering library.) + +``` +$ pdftotext 1995-watson.pdf > 1995-watson.txt +$ rg 'The Commentz-Walter algorithm' 1995-watson.txt +316:The Commentz-Walter algorithms : : : : : : : : : : : : : : : +7165:4.4 The Commentz-Walter algorithms +10062:in input string S , we obtain the Boyer-Moore algorithm. The Commentz-Walter algorithm +17218:The Commentz-Walter algorithm (and its variants) displayed more interesting behaviour, +17249:Aho-Corasick algorithms are used extensively. The Commentz-Walter algorithms are used +17297: The Commentz-Walter algorithms (CW). In all versions of the CW algorithms, a common program skeleton is used with di erent shift functions. The CW algorithms are +``` + +But having to explicitly convert every file can be a pain, especially when you +have a directory full of PDF files. Instead, we can use ripgrep's preprocessor +feature to search the PDF. ripgrep's `--pre` flag works by taking a single +command name and then executing that command for every file that it searches. +ripgrep passes the file path as the first and only argument to the command and +also sends the contents of the file to stdin. So let's write a simple shell +script that wraps `pdftotext` in a way that conforms to this interface: + +``` +$ cat preprocess +#!/bin/sh + +exec pdftotext - - +``` + +With `preprocess` in the same directory as `1995-watson.pdf`, we can now use it +to search the PDF: + +``` +$ rg --pre ./preprocess 'The Commentz-Walter algorithm' 1995-watson.pdf +316:The Commentz-Walter algorithms : : : : : : : : : : : : : : : +7165:4.4 The Commentz-Walter algorithms +10062:in input string S , we obtain the Boyer-Moore algorithm. The Commentz-Walter algorithm +17218:The Commentz-Walter algorithm (and its variants) displayed more interesting behaviour, +17249:Aho-Corasick algorithms are used extensively. The Commentz-Walter algorithms are used +17297: The Commentz-Walter algorithms (CW). In all versions of the CW algorithms, a common program skeleton is used with di erent shift functions. The CW algorithms are +``` + +Note that `preprocess` must be resolvable to a command that ripgrep can read. +The simplest way to do this is to put your preprocessor command in a directory +that is in your `PATH` (or equivalent), or otherwise use an absolute path. + +As a bonus, this turns out to be quite a bit faster than other specialized PDF +grepping tools: + +``` +$ time rg --pre ./preprocess 'The Commentz-Walter algorithm' 1995-watson.pdf -c +6 + +real 0.697 +user 0.684 +sys 0.007 +maxmem 16 MB +faults 0 + +$ time pdfgrep 'The Commentz-Walter algorithm' 1995-watson.pdf -c +6 + +real 1.336 +user 1.310 +sys 0.023 +maxmem 16 MB +faults 0 +``` + +If you wind up needing to search a lot of PDFs, then ripgrep's parallelism can +make the speed difference even greater. + +#### A more robust preprocessor + +One of the problems with the aforementioned preprocessor is that it will fail +if you try to search a file that isn't a PDF: + +``` +$ echo foo > not-a-pdf +$ rg --pre ./preprocess 'The Commentz-Walter algorithm' not-a-pdf +not-a-pdf: preprocessor command failed: '"./preprocess" "not-a-pdf"': +------------------------------------------------------------------------------- +Syntax Warning: May not be a PDF file (continuing anyway) +Syntax Error: Couldn't find trailer dictionary +Syntax Error: Couldn't find trailer dictionary +Syntax Error: Couldn't read xref table +``` + +To fix this, we can make our preprocessor script a bit more robust by only +running `pdftotext` when we think the input is a non-empty PDF: + +``` +$ cat preprocessor +#!/bin/sh + +case "$1" in +*.pdf) + # The -s flag ensures that the file is non-empty. + if [ -s "$1" ]; then + exec pdftotext - - + else + exec cat + fi + ;; +*) + exec cat + ;; +esac +``` + +We can even extend our preprocessor to search other kinds of files. Sometimes +we don't always know the file type from the file name, so we can use the `file` +utility to "sniff" the type of the file based on its contents: + +``` +$ cat processor +#!/bin/sh + +case "$1" in +*.pdf) + # The -s flag ensures that the file is non-empty. + if [ -s "$1" ]; then + exec pdftotext - - + else + exec cat + fi + ;; +*) + case $(file "$1") in + *Zstandard*) + exec pzstd -cdq + ;; + *) + exec cat + ;; + esac + ;; +esac +``` + +#### Reducing preprocessor overhead + +There is one more problem with the above approach: it requires running a +preprocessor for every single file that ripgrep searches. If every file needs +a preprocessor, then this is OK. But if most don't, then this can substantially +slow down searches because of the overhead of launching new processors. You +can avoid this by telling ripgrep to only invoke the preprocessor when the file +path matches a glob. For example, consider the performance difference even when +searching a repository as small as ripgrep's: + +``` +$ time rg --pre pre-rg 'fn is_empty' -c +crates/globset/src/lib.rs:1 +crates/matcher/src/lib.rs:2 +crates/ignore/src/overrides.rs:1 +crates/ignore/src/gitignore.rs:1 +crates/ignore/src/types.rs:1 + +real 0.138 +user 0.485 +sys 0.209 +maxmem 7 MB +faults 0 + +$ time rg --pre pre-rg --pre-glob '*.pdf' 'fn is_empty' -c +crates/globset/src/lib.rs:1 +crates/ignore/src/types.rs:1 +crates/ignore/src/gitignore.rs:1 +crates/ignore/src/overrides.rs:1 +crates/matcher/src/lib.rs:2 + +real 0.008 +user 0.010 +sys 0.002 +maxmem 7 MB +faults 0 +``` + + ### Common options ripgrep has a lot of flags. Too many to keep in your head at once. This section diff --git a/README.md b/README.md index b58ca90c..b13e27ed 100644 --- a/README.md +++ b/README.md @@ -116,9 +116,10 @@ times are unaffected by the presence or absence of `-n`. specifically specified with the `-E/--encoding` flag.) * ripgrep supports searching files compressed in a common format (brotli, bzip2, gzip, lz4, lzma, xz, or zstandard) with the `-z/--search-zip` flag. -* ripgrep supports arbitrary input preprocessing filters which could be PDF - text extraction, less supported decompression, decrypting, automatic encoding - detection and so on. +* ripgrep supports + [arbitrary input preprocessing filters](GUIDE.md#preprocessor) + which could be PDF text extraction, less supported decompression, decrypting, + automatic encoding detection and so on. In other words, use ripgrep if you like speed, filtering by default, fewer bugs and Unicode support.