
OCR at your fingertips (part 3)


In the real world, you’ll encounter text that isn’t very machine-readable. You’ll ask OCR to extract text, and it will fail miserably.

Using tesseract out of the box like ... | tesseract stdin stdout | ... got me very okay results, but certainly not the near-perfect results I showed off in part 1. There are three different solutions to this, and they can be combined. Briefly, they are: preprocessing, postprocessing, and tesseract config. (Skip to the bottom for the resulting command.)

Preprocessing

The best step you can take towards improving OCR accuracy is not giving it a noisy image in the first place.

One preprocessing “trick” is just having more detailed (read: bigger) images. Apparently tesseract data is trained on 600 dpi (highly detailed) images, while Mac Retina displays hover at around 200-300 dpi. In practical terms: if the OCR fails, just zoom in and take a bigger (and therefore less pixelated-looking) screenshot.
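If re-taking the screenshot isn’t convenient, you can approximate the same effect by upscaling inside the pipeline with imagemagick. A sketch (the 300% factor is my guess, and worth tuning):

... | convert - -resize 300% png:- | ...

Upscaling won’t add real detail the way a bigger capture does, but it gives tesseract more pixels per glyph to work with.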

Otherwise, there are a huge number of tools for OCR image preprocessing. My favorites involve imagemagick, specifically the textcleaner script by Fred Weinhaus. My preprocessing consists of the following line:

... | textcleaner -g -e stretch -f 75 -o 10 -t 50 -s 1 - png:- | ...

I never remember what these flags are (though they are all explained well in the link above). What matters is that these flags have the following result:

Left: original. Right: after processing.

The left image has a little bit of noisiness to it due to the white-outlined text and fuzzy font. OCR has less trouble with the right image, which is sharpened and black/white. The right image gets scanned without error.

So doing some preprocessing (before scanning the image with OCR) makes a huge difference in the result.

Postprocessing

Even with preprocessing, the output of OCR can be noisy, out of order, bizarrely spaced, etc.

One error that’s pretty common, in my experience, is strange spacing. I fix it by stripping leading and trailing whitespace from each line, using the following sed command inserted into the pipeline from the last post.

... | sed -Ee 's/^[[:space:]]+|[[:space:]]+$//g' | pbcopy

But most errors do not have a simple programmable fix like this. They often warrant manual corrections. Misspellings, random breaks in words, text out of order, smart quotes when you don’t want them, etc.
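Smart quotes are one exception that does have a mechanical fix. A sketch using literal substitutions (each s-command matches the raw UTF-8 character, so it works regardless of locale; extend the list for dashes, ellipses, etc. as needed):

... | sed -e "s/’/'/g" -e "s/‘/'/g" -e 's/“/"/g' -e 's/”/"/g' | ...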

Because of this, I often postprocess the OCR text output manually. You can see some examples of this manual cleanup in part 1.

Changing the flags to tesseract

By default, tesseract means tesseract -l eng --psm 3 --oem 3 --dpi 300, which sets the recognition language, the page segmentation mode, the OCR engine mode, and the assumed image resolution.

Typically I add --psm 6 --dpi 226, for “6: Assume a single uniform block of text” and because my MacBook display has a density of 226 dpi.
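If you don’t remember which segmentation mode number you want, tesseract can list them itself:

tesseract --help-psm

That prints all the modes (0-13) with one-line descriptions, so you can pick the one matching your screenshot’s layout.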

All together now

Using all of the above techniques for enhancing OCR results, you might arrive at something like this (assuming you’ve installed imagemagick and textcleaner):

PATH="/usr/local/bin/:$PATH"
pngpaste - \
  | textcleaner -g -e stretch -f 75 -o 10 -t 50 -s 1 - png:- \
  | tesseract --psm 6 --dpi 226 stdin stdout \
  | sed -Ee 's/^[[:space:]]+|[[:space:]]+$//g' \
  | pbcopy
osascript -e 'tell application "System Events" to keystroke "v" using {command down}'

This is what I’d bind to my cmd-ctrl-V hotkey as described back in part 2.

tagged: tutorials, mac