1
0
mirror of https://github.com/laurent22/joplin.git synced 2025-03-29 21:21:15 +02:00
joplin/docs/gsoc/idea_ocr.html
2019-12-16 17:22:44 +00:00

441 lines
12 KiB
HTML

<!doctype html>
<html>
<!--
!!! WARNING !!!
This file was auto-generated from readme/gsoc/idea_ocr.md and any manual change
made to it will be overwritten. To make a change to this file please modify
the source Markdown file:
https://github.com/laurent22/joplin/blob/master/readme/gsoc/idea_ocr.md
-->
<head>
<title>GSoC: Idea: OCR Support | Joplin</title>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://joplinapp.org/css/bootstrap.min.css">
<link rel="shortcut icon" type="image/x-icon" href="favicon.ico">
<!-- <link rel="stylesheet" href="https://joplinapp.org/css/fontawesome-all.min.css"> -->
<link rel="stylesheet" href="https://joplinapp.org/css/fork-awesome.min.css">
<script src="https://joplinapp.org/js/jquery-3.2.1.slim.min.js"></script>
<style>
body {
background-color: #F1F1F1;
color: #333333;
}
table {
margin-bottom: 1em;
}
td, th {
padding: .8em;
border: 1px solid #ccc;
}
.page-markdown table pre,
.page-markdown table blockquote {
margin-bottom: 0;
}
.page-markdown table pre,
.page-markdown table blockquote {
margin-bottom: 0;
}
.page-markdown table pre {
background-color: rgba(0,0,0,0);
border: none;
margin: 0;
padding: 0;
}
h1, h2 {
border-bottom: 1px solid #eaecef;
padding-bottom: 0.3em;
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol";
font-weight: 600;
font-size: 2em;
margin-bottom: 16px;
}
h2 {
font-size: 1.6em;
}
h3 {
font-size: 1.3em;
}
code {
color: black;
background-color: #eee;
border: 1px solid #ccc;
font-size: .85em;
}
pre code {
border: none;
}
pre {
font-size: .85em;
}
blockquote {
font-size: 1em;
color: #555;
};
#toc ul {
margin-bottom: 10px;
}
#toc {
padding-bottom: 1em;
}
.title {
display: flex;
align-items: center;
}
.title-icon {
display: flex;
height: 1em;
}
.title-text {
display: flex;
font-weight: normal;
margin-bottom: .2em;
margin-left: .5em;
}
.sub-title {
font-weight: normal;
}
.container {
background-color: white;
padding: 0;
box-shadow: 0 10px 20px #888888;
}
table.screenshots {
margin-top: 2em;
margin-bottom: 2em;
}
table.screenshots th {
height: 3em;
text-align: center;
}
table.screenshots th,
table.screenshots td {
border: 1px solid #C2C2C2;
}
img[align="left"] {
margin-right: 10px;
margin-bottom: 10px;
}
.mobile-screenshot {
height: 40em;
padding: 1em;
}
.cli-screenshot-wrapper {
background-color: black;
vertical-align: top;
padding: 1em 2em 1em 1em;
}
.cli-screenshot {
font-family: "Monaco", "Inconsolata", "CONSOLAS", "Deja Vu Sans Mono", "Droid Sans Mono", "Andale Mono", monospace;
background-color: black;
color: white;
border: none;
}
.cli-screenshot .prompt {
color: #48C2F0;
}
.top-screenshot {
margin-top: 2em;
text-align: center;
}
.header {
position: relative;
padding-left: 2em;
padding-right: 2em;
padding-top: 1em;
padding-bottom: 1em;
color: white;
background-color: #2B2B3D;
}
.header a h1 {
color: white;
}
.header a:hover {
text-decoration: none;
}
.content {
padding-left: 2em;
padding-right: 2em;
padding-bottom: 2em;
padding-top: 2em;
}
.forkme {
position: absolute;
right: 0;
top:0;
}
.nav-wrapper {
position: relative;
width: inherit;
}
.nav {
background-color: black;
display: table;
width: inherit;
}
.nav.sticky {
position:fixed;
top: 0;
width: inherit;
box-shadow: 0 0 10px #000000;
}
.nav a {
color: white;
display: inline-block;
padding: .6em .9em .6em .9em;
}
.nav ul {
padding-left: 2em;
margin-bottom: 0;
display: table-cell;
/* min-width: 250px; */
/* For GSoC: */
min-width: 470px;
}
.nav ul li {
display: inline-block;
padding: 0;
}
.nav li.selected {
background-color: #222;
font-weight: bold;
}
.nav-right {
display: table-cell;
width: 100%;
text-align: right;
vertical-align: middle;
line-height: 0;
}
.nav-right .share-btn {
display: none;
}
.nav-right .small-share-btn {
display: none;
}
.footer {
padding-top: 1em;
border-top: 1px solid #d4d4d4;
margin-top: 2em;
color: gray;
font-size: .9em;
}
a.heading-anchor {
display: inline-block;
opacity: 0;
width: 1.3em;
font-size: 0.7em;
margin-left: 0.4em;
line-height: 1em;
text-decoration: none;
transition: opacity 0.3s;
}
a.heading-anchor:hover,
h1:hover a.heading-anchor,
h2:hover a.heading-anchor,
h3:hover a.heading-anchor,
h4:hover a.heading-anchor,
h5:hover a.heading-anchor,
h6:hover a.heading-anchor {
opacity: 1;
}
.bottom-links {
display: flex;
justify-content: center;
border-top: 1px solid #d4d4d4;
margin-top: 30px;
padding-top: 25px;
}
@media all and (min-width: 400px) {
.nav-right .share-btn {
display: inline-block;
}
.nav-right .small-share-btn {
display: none;
}
}
</style>
</head>
<body>
<div class="container page-idea_ocr">
<div class="header">
<a class="forkme" href="https://github.com/laurent22/joplin"><img src="https://joplinapp.org/images/ForkMe.png"/></a>
<a href="https://joplinapp.org"><h1 class="title"><img class="title-icon" src="https://joplinapp.org/images/Icon512.png"><span class="title-text">Joplin</span></h1></a>
<p class="sub-title">An open source note taking and to-do application with synchronisation capabilities</p>
</div>
<div class="nav-wrapper">
<div class="nav">
<ul>
<li class=""><a href="https:&#x2F;&#x2F;joplinapp.org/" title="Home"><i class="fa fa-home"></i></a></li>
<li><a href="https://discourse.joplinapp.org" title="Forum">Forum</a></li>
<li><a class="help" href="#" title="Menu">Menu</a></li>
<li><a class="gsoc" href="https://joplinapp.org/gsoc/" title="Google Summer of Code 2020">GSoC 2020</a></li>
</ul>
<div class="nav-right">
<!--
<iframe class="share-btn" src="https://www.facebook.com/plugins/share_button.php?href=http%3A%2F%2Fjoplinapp.org&layout=button&size=small&mobile_iframe=true&width=60&height=20&appId" width="60" height="20" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>
<iframe class="share-btn" src="https://platform.twitter.com/widgets/tweet_button.html?url=http%3A%2F%2Fjoplinapp.org" width="62" height="20" title="Tweet" style="border: 0; overflow: hidden;"></iframe>
-->
<iframe class="share-btn share-btn-github" src="https://ghbtns.com/github-btn.html?user=laurent22&repo=joplin&type=star&count=true" frameborder="0" scrolling="0" width="100px" height="20px"></iframe>
</div>
</div>
</div>
<div class="content">
<div id="toc"><ul>
<li>
<p>Applications</p>
<ul>
<li><a href="https://joplinapp.org/desktop/">Desktop application</a></li>
<li><a href="https://joplinapp.org/mobile/">Mobile applications</a></li>
<li><a href="https://joplinapp.org/terminal/">Terminal application</a></li>
<li><a href="https://joplinapp.org/clipper/">Web Clipper</a></li>
</ul>
</li>
<li>
<p>Support</p>
<ul>
<li><a href="https://discourse.joplinapp.org">Joplin Forum</a></li>
<li><a href="https://joplinapp.org/markdown/">Markdown Guide</a></li>
<li><a href="https://joplinapp.org/e2ee/">How to enable end-to-end encryption</a></li>
<li><a href="https://joplinapp.org/spec/">End-to-end encryption spec</a></li>
<li><a href="https://joplinapp.org/debugging/">How to enable debug mode</a></li>
<li><a href="https://joplinapp.org/api/">API documentation</a></li>
<li><a href="https://joplinapp.org/faq/">FAQ</a></li>
</ul>
</li>
<li>
<p>About</p>
<ul>
<li><a href="https://joplinapp.org/changelog/">Changelog (Desktop App)</a></li>
<li><a href="https://joplinapp.org/changelog_cli/">Changelog (CLI App)</a></li>
<li><a href="https://joplinapp.org/stats/">Stats</a></li>
<li><a href="https://joplinapp.org/donate/">Donate</a></li>
</ul>
</li>
</ul>
</div>
<h1>GSoC: OCR<a name="gsoc-ocr" href="#gsoc-ocr" class="heading-anchor">🔗</a></h1>
<p>It seems possible to add support for OCR content in Joplin via the [<a href="http://tesseract.projectnaptha.com/">http://tesseract.projectnaptha.com/</a>](Tesseract library).</p>
<p>A first step would be to assess the feasability of this project by integrating the lib in the desktop app and trying to OCR an image.</p>
<ul>
<li>Is the image correctly OCRed?</li>
<li>Does it work with non-English text?</li>
<li>How slow/fast is it? Test with a very large image to be sure. It should not freeze the app while processing an image.</li>
</ul>
<p>If everything works well, we can add the feature to the app.</p>
<h2>Specification<a name="specification" href="#specification" class="heading-anchor">🔗</a></h2>
<ul>
<li>On desktop app: Create service that runs in the background and process the resources that need to be OCRed.</li>
<li>When a document is OCRed: Append block to end of note that contains the extracted plain text</li>
<li>When attaching resource, ask what user wants to do:
<ul>
<li>Always OCR all files</li>
<li>Never OCR any file</li>
<li>Always OCR files with extension &quot;.ext&quot;</li>
<li>Never OCR files with extension &quot;.ext&quot;</li>
</ul>
</li>
<li>Can be changed in settings</li>
<li>Right-click resource, or note, to OCR content</li>
<li>Add resource ocr_status on resource table: Can be: none, todo, processing, done</li>
<li>Add ocr_text to resource: must include detailed coordinates, and a way to get plain text back</li>
</ul>
<p><strong>Advantage of it doing that way:</strong></p>
<ul>
<li>Search engine just works - no need for special indexing of OCR content since it is inside the note directly</li>
<li>Will work with all clients (mobile, desktop, terminal)</li>
<li>When a note is exported to Markdown, it will include the OCR content</li>
</ul>
<p><strong>Format of OCR text block</strong></p>
<pre><code>&lt;!-- autogen-ocr :resource.id --&gt;
* * *
**:resource.title**
:resource.ocr_text
&lt;!-- autogen-ocr :resourceId --&gt;
</code></pre>
<p>For example, for a resource called &quot;TrainTicket.png&quot;:</p>
<pre><code>&lt;!-- autogen-ocr 2ee4eec909734f7197654a9a040dfba7 --&gt;
* * *
**TrainTicket.png**
From: London
To: Paris
Date: 01/12/2019
Time: 15:00
...etc.
&lt;!-- autogen-ocr :resourceId --&gt;
</code></pre>
<p>The advantage of this format is that it will render nicely in the viewer, and it will still be clearly identified as OCR content, which means later we can identify these blocks, update them, remove them, etc.</p>
<p><strong>Later</strong></p>
<ul>
<li>Support PDF files - for example by converting each page to an image first, then passing it to Tesseract.</li>
<li>Make ocr_text searchable</li>
<li>Display search results directly on document. i.e. if it's an image, highlight the parts of the image that contain the search text.</li>
</ul>
<h2>See also<a name="see-also" href="#see-also" class="heading-anchor">🔗</a></h2>
<ul>
<li>GitHub issue: <a href="https://github.com/laurent22/joplin/issues/807">https://github.com/laurent22/joplin/issues/807</a></li>
</ul>
<div class="bottom-links">
<a href="https://github.com/laurent22/joplin/blob/master/readme/gsoc/idea_ocr.md">
<i class="fa fa-github"></i> Improve this doc
</a>
</div>
<script>
function stickyHeader() {
return; // Disabled
if ($(window).scrollTop() > 179) {
$('.nav').addClass('sticky');
} else {
$('.nav').removeClass('sticky');
}
}
$('#toc').hide();
$('.help').click(function(event) {
event.preventDefault();
$('#toc').show();
});
$(window).scroll(function() {
stickyHeader();
});
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-103586105-1', 'auto');
ga('send', 'pageview');
</script>
<div class="footer">
Copyright (c) 2016-2019 Laurent Cozic
</div>
</body>
</html>