mirror of https://github.com/laurent22/joplin.git synced 2024-11-27 08:21:03 +02:00

Docs: Improve explanation for using custom OCR language data (#10910)

Henry Heino 2024-08-21 10:24:05 -07:00 committed by GitHub
parent 9dc29d034b
commit e05d7197bd
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194


@ -32,8 +32,19 @@ This pluggable interface is present in the software but not currently exposed. W
## Custom OCR language data URL
After enabling OCR, Joplin downloads language files from https://cdn.jsdelivr.net/npm/@tesseract.js-data/. This URL can be customized in settings > advanced > "OCR: Language data URL or path". This URL or path should point to a directory with a `.traineddata.gz` file for each language to be used for OCR.
After enabling OCR, Joplin downloads language files from https://cdn.jsdelivr.net/npm/@tesseract.js-data/. This URL can be customized in settings > advanced > "OCR: Language data URL or path". This URL or path should point to a directory with a `.traineddata.gz` file for each language to be used for OCR. After the first download, language data files are cached.
For reference, an example `.traineddata.gz` file can be found [here](https://cdn.jsdelivr.net/npm/@tesseract.js-data/eng/4.0.0_best_int/eng.traineddata.gz).
For example, to use OCR on a computer without internet access:
1. On a computer with internet access, download the `.traineddata.gz` files for the languages that will be OCRed.
   - English: https://cdn.jsdelivr.net/npm/@tesseract.js-data/eng/4.0.0_best_int/eng.traineddata.gz
   - French: https://cdn.jsdelivr.net/npm/@tesseract.js-data/fra/4.0.0_best_int/fra.traineddata.gz
   - In general, trained data can be obtained from `https://cdn.jsdelivr.net/npm/@tesseract.js-data/[language]/4.0.0_best_int/[language].traineddata.gz` where `[language]` should be replaced with `eng`, `fra`, `chi_sim`, `deu`, `spa`, or one of the other supported language codes.
2. Transfer the `.traineddata.gz` files to the offline computer.
3. Move all of the files to the same directory (e.g. `C:\Users\User\Documents\joplin-ocr-data\`).
4. In Joplin, open settings > general > advanced.
5. Set "OCR: Language data URL or path" to the path of the directory that contains the `.traineddata.gz` files.
   - This should be the path to the directory from step 3.
6. Click "Apply".
7. Enable OCR.
To fully replace the cached language data with custom data, it may be necessary to click "Clear cache and re-download language data files".
To replace existing cached language data, click "Clear cache and re-download language data files".
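The download in step 1 can be scripted when several languages are needed. The following is a minimal sketch, not part of Joplin: it assumes the CDN layout and the `4.0.0_best_int` variant shown in the URLs above, and the function names `traineddata_url`, `download_language_data`, and `missing_languages`, as well as the `joplin-ocr-data` directory name, are hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: same CDN base and data variant as the example URLs above.
BASE_URL = "https://cdn.jsdelivr.net/npm/@tesseract.js-data"
VARIANT = "4.0.0_best_int"


def traineddata_url(language: str) -> str:
    """Build the download URL for one language code (e.g. "eng")."""
    return f"{BASE_URL}/{language}/{VARIANT}/{language}.traineddata.gz"


def download_language_data(languages, dest_dir):
    """Download one .traineddata.gz file per language into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for language in languages:
        target = dest / f"{language}.traineddata.gz"
        urlretrieve(traineddata_url(language), str(target))
        saved.append(target)
    return saved


def missing_languages(languages, data_dir):
    """Return the language codes that have no .traineddata.gz file in data_dir."""
    directory = Path(data_dir)
    return [
        language
        for language in languages
        if not (directory / f"{language}.traineddata.gz").is_file()
    ]


# Example usage (requires internet access, so it is left commented out):
# download_language_data(["eng", "fra"], "joplin-ocr-data")
# print(missing_languages(["eng", "fra"], "joplin-ocr-data"))
```

After copying the directory to the offline computer, `missing_languages` can be run there as a quick check that every language to be OCRed has a file in the directory before pointing Joplin at it.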