OCR at your fingertips (part 3)
March 29, 2020

In the real world, you’ll encounter text that isn’t very machine-readable. You’ll ask OCR to extract text, and it will fail miserably.
Using tesseract out of the box like ... | tesseract stdin stdout | ...
got me very okay results, but certainly not the near-perfect results I showed off in part 1. There are three different solutions to this, and they can be combined. Briefly, they are: preprocessing, postprocessing, and tesseract config. (Skip to the bottom for the resulting command.)
Preprocessing
The best step you can take towards improving OCR accuracy is not giving it a noisy image in the first place.
One preprocessing “trick” is just having more detailed (read: bigger) images. Apparently tesseract data is trained on 600 dpi (highly detailed) images, while Mac retina displays hover at around 200-300 dpi. In practical terms: if the OCR fails, just zoom in and take a bigger (and therefore less pixelated-looking) screenshot.
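If retaking the screenshot isn’t an option, you can also upscale the image you already have. Here’s a minimal sketch of the idea (the 600 dpi target comes from the training data mentioned above; the file names and the use of imagemagick’s convert are my own assumptions, not part of the original pipeline):

```shell
# Upscaling a 226 dpi screenshot by 600/226 ≈ 265% gives tesseract a pixel
# density closer to what its models were trained on.
scale=$(awk 'BEGIN { printf "%d", 600 / 226 * 100 }')
echo "${scale}%"   # → 265%

# Then, assuming imagemagick is installed (file names are hypothetical):
#   convert screenshot.png -resize "${scale}%" upscaled.png
```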
Otherwise, there are a huge number of tools for OCR image preprocessing. My favorites involve imagemagick, specifically the textcleaner script by Fred Weinhaus. My preprocessing consists of the following line:
... | textcleaner -g -e stretch -f 75 -o 10 -t 50 -s 1 - png:- | ...
I never remember what these flags are (though they are all explained well in the link above). What matters is that these flags have the following result:
The left image has a little bit of noisiness to it due to the white-outlined text and fuzzy font. OCR has less trouble with the right image, which is sharpened and black/white. The right image gets scanned without error.
So doing some preprocessing (before scanning the image with OCR) makes a huge difference in the result.
Postprocessing
Even with preprocessing, the output of OCR can be noisy, out of order, bizarrely spaced, etc.
One error that’s pretty common, in my experience, is strange spacing. I solve this by removing leading and trailing spaces from each line, which can be done by inserting the following sed command into the pipeline from the last post.
... | sed -Ee 's/^[[:space:]]+|[[:space:]]+$//g' | pbcopy
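To see what that sed expression does, you can feed it a padded line by hand:

```shell
# Strip leading and trailing whitespace from each line.
out=$(printf '   indented line   \n' | sed -Ee 's/^[[:space:]]+|[[:space:]]+$//g')
echo "$out"   # → indented line
```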
But most errors do not have a simple programmable fix like this. They often warrant manual corrections. Misspellings, random breaks in words, text out of order, smart quotes when you don’t want them, etc.
Because of this, I often postprocess the OCR text output manually. You can see some examples of this manual cleanup in part 1.
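That said, some of the recurring annoyances do admit a scripted fix. For instance, smart quotes can be straightened with another sed step (this particular expression is my own sketch, not part of the original pipeline):

```shell
# Replace curly double and single quotes with their ASCII equivalents.
straighten() {
  sed -e 's/“/"/g' -e 's/”/"/g' -e "s/‘/'/g" -e "s/’/'/g"
}

printf '“smart” quotes aren’t always wanted\n' | straighten
# → "smart" quotes aren't always wanted
```

A function like this could be dropped into the pipeline right before pbcopy, alongside the whitespace-trimming sed above.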
Changing the flags to tesseract
By default, tesseract means tesseract -l eng --psm 3 --oem 3 --dpi 300, which affects:

- -l: trained dataset(s) to use (e.g. -l eng+fra), see part 2 for more on language datasets
- --psm: what kind of text to expect from the image, see tesseract --help-psm
- --oem: underlying engine for tesseract, see tesseract --help-oem
- --dpi: density in dpi of the input image
Typically I add --psm 6 --dpi 226, for “6: Assume a single uniform block of text” and since my MacBook display has a density of 226 dpi.
All together now
Using all of the above techniques for enhancing OCR results, you might arrive at something like this (assuming you’ve installed imagemagick and textcleaner):
PATH="/usr/local/bin/:$PATH"
pngpaste - \
| textcleaner -g -e stretch -f 75 -o 10 -t 50 -s 1 - png:- \
| tesseract --psm 6 --dpi 226 stdin stdout \
| sed -Ee 's/^[[:space:]]+|[[:space:]]+$//g' \
| pbcopy
osascript -e 'tell application "System Events" to keystroke "v" using {command down}'
This is what I’d bind to my keyboard shortcut.