tutorials
Entries tagged: tutorials
March 29, 2020.
OCR at your fingertips (part 3)
tags: tutorials, mac
In the real world, you’ll encounter text that isn’t very machine-readable. You’ll ask OCR to extract text, and it will fail miserably.
Using tesseract out of the box like
... | tesseract stdin stdout | ...
got me very okay results, but certainly not the near-perfect results I showed off in part 1. There are three different solutions to this, and they can be combined. Briefly, they are: preprocessing, postprocessing, and tesseract config. (Skip to the bottom for the resulting command.)
Preprocessing
The best step you can take towards improving OCR accuracy is not giving it a noisy image in the first place.
One preprocessing “trick” is just having more detailed (read: bigger) images. Apparently tesseract data is trained on 600 dpi (highly detailed) images, while Mac retina displays hover at around 200-300 dpi. In practical terms: if the OCR fails, just zoom in and take a bigger (and therefore less pixelated looking) screenshot.
Otherwise, there are a huge number of tools for OCR image preprocessing. My favorites involve imagemagick, specifically the textcleaner script by Fred Weinhaus. My preprocessing consists of the following line:
... | textcleaner -g -e stretch -f 75 -o 10 -t 50 -s 1 - png:- | ...
I never remember what these flags are (though they are all explained well in the link above). What matters is that these flags have the following result:
The left image has a little bit of noisiness to it due to the white-outlined text and fuzzy font. OCR has less trouble with the right image, which is sharpened and black/white. The right image gets scanned without error.
So doing some preprocessing (before scanning the image with OCR) makes a huge difference in the result.
Postprocessing
Even with preprocessing, the output of OCR can be noisy, out of order, bizarrely spaced, etc.
One error that’s pretty common, in my experience, is strange spacing. I solve this by removing leading and trailing spaces on each line, which means inserting the following sed command into the pipeline from the last post.
... | sed -Ee 's/^[[:space:]]+|[[:space:]]+$//g' | pbcopy
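To see what that sed expression does in isolation, here is a small sketch that runs it on a made-up sample. The trim_lines name and the sample text are my own placeholders, not part of the pipeline above.

```shell
#!/bin/sh
# trim_lines: hypothetical helper name wrapping the sed command from the post.
# Strips leading and trailing whitespace from every line read on stdin.
trim_lines() {
  sed -Ee 's/^[[:space:]]+|[[:space:]]+$//g'
}

# Made-up sample resembling oddly spaced OCR output (spaces and tabs).
printf '   Hello world  \n\tsecond line\t\n' | trim_lines
```

This prints the two lines with the surrounding spaces and tabs removed. Spacing inside a line is left alone, so it only fixes the edges.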
But most errors do not have a simple programmable fix like this. They often warrant manual corrections: misspellings, random breaks in words, text out of order, smart quotes when you don’t want them, etc.
Because of this, I often postprocess the OCR text output manually. You can see some examples of this manual cleanup in part 1.
Changing the flags to tesseract
By default, tesseract means tesseract -l eng --psm 3 --oem 3 --dpi 300, which affects:
- -l: trained dataset(s) to use (e.g. -l eng+fra), see part 2 for more on language datasets
- --psm: what kind of text to expect from the image, see tesseract --help-psm
- --oem: underlying engine for tesseract, see tesseract --help-oem
- --dpi: density in dpi of the input image
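If you don’t know your display’s density for the --dpi flag, a rough estimate is the screen diagonal in pixels divided by the diagonal in inches. Here is a minimal sketch of that arithmetic. The estimate_dpi name is made up, and the 3072x1920 panel on a 16-inch diagonal is only an assumed example, so plug in your own numbers.

```shell
#!/bin/sh
# estimate_dpi WIDTH_PX HEIGHT_PX DIAGONAL_INCHES
# Hypothetical helper: density = diagonal in pixels / diagonal in inches,
# rounded to a whole number.
estimate_dpi() {
  awk -v w="$1" -v h="$2" -v d="$3" \
    'BEGIN { printf "%.0f\n", sqrt(w * w + h * h) / d }'
}

# Assumed example panel: 3072x1920 pixels on a 16-inch diagonal.
estimate_dpi 3072 1920 16
```

For those assumed numbers this prints 226. This only approximates the density of a screenshot taken at native resolution, so treat it as a starting point rather than an exact value.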
Typically I add --psm 6 --dpi 226, for “6: Assume a single uniform block of text” and since my MacBook display has a density of 226 dpi.
All together now
Using all of the above techniques for enhancing OCR results, you might arrive at something like this (assuming you’ve installed imagemagick and textcleaner):
PATH="/usr/local/bin/:$PATH"
pngpaste - \
  | textcleaner -g -e stretch -f 75 -o 10 -t 50 -s 1 - png:- \
  | tesseract --psm 6 --dpi 226 stdin stdout \
  | sed -Ee 's/^[[:space:]]+|[[:space:]]+$//g' \
  | pbcopy
osascript -e 'tell application "System Events" to keystroke "v" using {command down}'
This is what I’d bind to ctrl cmd v, as described back in part 2.
March 28, 2020.
OCR at your fingertips (part 2)
tags: tutorials, mac
Here’s a five minute tutorial on how to bind OCR capabilities to a shortcut on Mac. It’s really that simple!
High-level overview
The key insight is that the tasks from part 1 become trivial if I can take a screenshot and magically paste it as text. Let’s develop this magic. The plan is:
- You hit ctrl shift cmd 4 to save a portion of the screen in pasteboard
- You hit a special paste shortcut that we create: ctrl cmd v
- Image in pasteboard is sent through OCR
- Resulting text is pasted onto the frontmost app
This all turns out to be doable using two open-source packages, pngpaste and tesseract.
Installing dependencies
With Homebrew:
brew install pngpaste tesseract
or manually install pngpaste and tesseract. Ensure that they are installed by running:
pngpaste -v
and
tesseract --version
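If you’d rather check both tools in one go, something like the following sketch works. The missing_tools name is my own invention. It only asks the shell whether each command can be found on the current PATH, and doesn’t check versions.

```shell
#!/bin/sh
# missing_tools: hypothetical helper that prints each named command
# that cannot be found on the current PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing="$(missing_tools pngpaste tesseract)"
if [ -n "$missing" ]; then
  printf 'Not found on PATH:\n%s\n' "$missing"
else
  echo "pngpaste and tesseract are both installed"
fi
```

If one of them shows up as missing even though you installed it, the install location is probably not on your PATH.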
New toys
Try it out! Hit ctrl shift cmd 4 and grab a screenshot of some text. Then in a terminal, type
pngpaste - | tesseract stdin stdout | pbcopy
This uses pngpaste to send the image to our OCR tool, tesseract. The result is sent to pbcopy, which places the resulting scanned text into our pasteboard, ready to be pasted. Try selecting this paragraph, running the above, and pasting the result into this text box:
Hopefully the results are acceptable! Hint: The next article will deal with refining our OCR results.
If nothing happened, try running
pngpaste - | tesseract stdin stdout
in a terminal, with an image in your pasteboard. Most likely, you need to set up Tesseract with language data.
The rise of automation
To truly get OCR to our fingertips, we’d like to run
pngpaste - | tesseract stdin stdout | pbcopy
at the touch of a button.
Open up Automator.app (preinstalled on all Macs) and create a Service. Then drag in “Run Shell Script” from the left, and enter what we had above:
PATH="/usr/local/bin/:$PATH"
pngpaste - | tesseract stdin stdout | pbcopy
(I prepended /usr/local/bin/ to PATH, since that is where Homebrew installs pngpaste and tesseract for me.)
Add this to the end if you want to automatically paste the result afterwards!
osascript -e 'tell application "System Events" to keystroke "v" using {command down}'
Important: At the top, select “Service receives no input in any application.” We use pngpaste for input, so Automator would otherwise complain about input.
Here’s what you should end up with:
Save and give this service a name, like “Run OCR”.
Binding to a shortcut
After saving this service with a name (say, Run OCR), open up System Preferences > Keyboard > Shortcuts > Services, and scroll all the way down to find Run OCR. All that’s left to do is click on Run OCR to bind a shortcut. I use ctrl cmd v.
Usage: Try taking a screenshot of this paragraph with ctrl shift cmd 4, and use your shortcut in the text box below, as if you were pasting text. (That’s why I chose ctrl cmd v: it’s almost like pasting.)
Feeling powerful yet?
Different languages
Something I do with OCR is translate comics that aren’t in English. For tesseract, this means you’d need to download language data to recognize languages that are not English. You can find this language data here. If you want to have tesseract recognize Korean as well as English, for example, then download and move kor.traineddata into the $TESSDATA_PREFIX directory. Then change the tesseract command like so:
tesseract -l eng+kor stdin stdout
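If you switch language sets often, you can build the -l value from a plain list instead of typing the plus signs by hand. This is only a sketch, and join_langs is a made-up helper name. It echoes the resulting command instead of running it.

```shell
#!/bin/sh
# join_langs: hypothetical helper that joins its arguments with "+",
# producing the value tesseract expects after -l.
join_langs() {
  # "$*" joins arguments using the first character of IFS;
  # the subshell keeps the IFS change local.
  (IFS='+'; printf '%s\n' "$*")
}

langs="$(join_langs eng kor)"
echo "tesseract -l $langs stdin stdout"
```

This prints tesseract -l eng+kor stdin stdout. Running tesseract --list-langs shows which language files tesseract can actually find, which is worth checking before you add a language to the list.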
You can make the list as long as you want, like eng+chi_sim+chi_tra+jpn+kor. Be warned that runtime becomes noticeably long after more than three languages, at least in my experience.
Wrap-up
With very little code, we’ve bound OCR to a hotkey. In the next post we’ll explore ways to get more accurate results with OCR.
March 26, 2020.
OCR at your fingertips (part 1)
tags: tutorials, mac
What is OCR? Why should I care?
If you’re a student like me, taking online classes, then OCR is a lifesaver.
OCR (Optical Character Recognition) is a fancy way of saying “turn a picture of text, into normal text that you can copy and paste.” Sounds useful? It is! It makes notetaking from live slideshow lectures a breeze.
I find it really helpful to have OCR just as an everyday tool. Here are some situations where OCR saves me a lot of typing:
- Grabbing text from videos, slideshows, streams, video lectures:
- Grabbing text from a game application
- Translating images
- Copying one column from a web table
- Copying a numbered list including its numbers:
- Selecting code from tutorials without selecting prefixes, prompts, line numbers:
- Copying text in cases where I can’t select: (here, the text is on a badly designed button)
So even when I don’t really need it, OCR really saves me time and cognitive load.
Today, the role of OCR technology is mostly to scan documents in bulk, like turning books into ebooks. This means most OCR apps require quite a bit of setup and work. But with the automation capabilities on Mac, it’s really quick and easy to bring OCR technology to your fingertips, which you can see above! See this next post for a (very) short tutorial on how to actually accomplish this.