tutorials

Entries tagged: tutorials

March 29, 2020. OCR at your fingertips (part 3) tags: tutorials, mac
In the real world, you’ll encounter text that isn’t very machine-readable. You’ll ask OCR to extract text, and it will fail miserably.

Using tesseract out of the box like ... | tesseract stdin stdout | ... got me very okay results, but certainly not the near-perfect results I showed off in part 1. There are three different solutions to this, and they can be combined. Briefly, they are: preprocessing, postprocessing, and tesseract config. (Skip to the bottom for the resulting command.)

Preprocessing

The best step you can take towards improving OCR accuracy is not giving it a noisy image in the first place.

One preprocessing “trick” is simply to use more detailed (read: bigger) images. Apparently tesseract data is trained on 600 dpi (highly detailed) images, while Mac retina displays hover at around 200-300 dpi. In practical terms: if the OCR fails, zoom in and take a bigger (and therefore less pixelated) screenshot.

Otherwise, there are a huge number of tools for OCR image preprocessing. My favorites involve imagemagick, specifically the textcleaner script by Fred Weinhaus. My preprocessing consists of the following line:
```
... | textcleaner -g -e stretch -f 75 -o 10 -t 50 -s 1 - png:- | ...
```
I never remember what these flags are (though they are all explained well in the link above). What matters is that these flags have the following result:

Left: original. Right: after processing.

The left image has a little bit of noisiness to it due to the white-outlined text and fuzzy font. OCR has less trouble with the right image, which is sharpened and black/white. The right image gets scanned without error.

So doing some preprocessing (before scanning the image with OCR) makes a huge difference in the result.

Postprocessing

Even with preprocessing, the output of OCR can be noisy, out of order, bizarrely spaced, etc.

One error that’s pretty common, in my experience, is strange spacing. I fix it by stripping leading and trailing whitespace from each line, which takes a single sed command inserted into the pipeline from the last post.
```
... | sed -Ee 's/^[[:space:]]+|[[:space:]]+$//g' | pbcopy
```
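To see what that sed expression does in isolation, here is a minimal sketch that runs it on made-up sample text rather than real OCR output. The trim_lines name is a hypothetical wrapper used only for illustration.

```shell
# Hypothetical helper: strip leading and trailing whitespace from every line.
trim_lines() {
  sed -Ee 's/^[[:space:]]+|[[:space:]]+$//g'
}

# Made-up sample text standing in for OCR output.
printf '   Hello,  world   \n\tsecond line \n' | trim_lines
```

Spacing inside a line is left alone; only the edges of each line are trimmed.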
But most errors do not have a simple programmable fix like this. They often warrant manual corrections: misspellings, random breaks in words, text out of order, smart quotes when you don’t want them, etc.

Because of this, I often postprocess the OCR text output manually. You can see some examples of this manual cleanup in part 1.

Changing the flags to tesseract

By default, running tesseract is equivalent to running tesseract -l eng --psm 3 --oem 3 --dpi 300. These flags control:
- -l: trained dataset(s) to use (e.g. -l eng+fra), see part 2 for more on language datasets
- --psm: what kind of text to expect from the image, see tesseract --help-psm
- --oem: underlying engine for tesseract, see tesseract --help-oem
- --dpi: density in dpi of the input image
Typically I add --psm 6 --dpi 226: page segmentation mode 6 means “Assume a single uniform block of text”, and my MacBook display has a density of 226 dpi.
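If you don’t know your display’s density, it can be estimated from the pixel resolution and the diagonal size. Here is a small sketch of that arithmetic. The screen_dpi function name and the 3072x1920, 16-inch figures are assumptions chosen for illustration (the post doesn’t say which MacBook is involved), so plug in your own numbers.

```shell
# Hypothetical helper: pixels per inch from resolution and diagonal size.
# Usage: screen_dpi WIDTH_PX HEIGHT_PX DIAGONAL_INCHES
screen_dpi() {
  awk -v w="$1" -v h="$2" -v d="$3" \
    'BEGIN { printf "%.0f\n", sqrt(w * w + h * h) / d }'
}

# Assumed example panel: 3072x1920 pixels on a 16-inch diagonal.
screen_dpi 3072 1920 16   # prints 226
```

The rounded result is what would go after --dpi.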

All together now

Using all of the above techniques for enhancing OCR results, you might arrive at something like this (assuming you’ve installed imagemagick and textcleaner):
```
PATH="/usr/local/bin/:$PATH"
pngpaste - \
  | textcleaner -g -e stretch -f 75 -o 10 -t 50 -s 1 - png:- \
  | tesseract --psm 6 --dpi 226 stdin stdout \
  | sed -Ee 's/^[[:space:]]+|[[:space:]]+$//g' \
  | pbcopy
osascript -e 'tell application "System Events" to keystroke "v" using {command down}'
```
This is what I’d bind to ctrl+cmd+v, as described back in part 2.
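Nothing in the script above stops the paste keystroke from firing when the pasteboard holds no image or the scan comes back empty. Here is a minimal sketch of a guard, assuming the same tools and flags as above (with the textcleaner step left out for brevity). The function names ocr_clipboard and ocr_and_paste are hypothetical.

```shell
# Hypothetical helper: run the OCR pipeline and print the cleaned text.
ocr_clipboard() {
  pngpaste - \
    | tesseract --psm 6 --dpi 226 stdin stdout \
    | sed -Ee 's/^[[:space:]]+|[[:space:]]+$//g'
}

# Hypothetical helper: copy and paste only when OCR produced some text.
ocr_and_paste() {
  ocr_text="$(ocr_clipboard)"
  if [ -z "$ocr_text" ]; then
    echo "OCR produced no text; leaving the pasteboard alone" >&2
    return 1
  fi
  printf '%s\n' "$ocr_text" | pbcopy
  osascript -e 'tell application "System Events" to keystroke "v" using {command down}'
}
```

Calling ocr_and_paste at the end of the script would then replace the unconditional pbcopy and osascript lines.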

March 28, 2020. OCR at your fingertips (part 2) tags: tutorials, mac
Here’s a five-minute tutorial on how to bind OCR capabilities to a shortcut on Mac. It’s really that simple!

High-level overview

The key insight is that those tasks would become trivial if I could take a screenshot and magically paste it as text. Let’s develop this magic. The plan is:
- You hit ctrl+shift+cmd+4 to save a portion of the screen in pasteboard
- You hit a special paste shortcut that we create: ctrl+cmd+v
- Image in pasteboard is sent through OCR
- Resulting text is pasted onto the frontmost app
which all turn out to be doable using the following open-source packages:
- pngpaste: Mac utility for handling image data on the pasteboard
- tesseract: open source OCR

Installing dependencies

With Homebrew:
```
brew install pngpaste tesseract
```
or manually install pngpaste and tesseract. Ensure that they are installed by running:
```
pngpaste -v
```
and
```
tesseract --version
```
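If you would rather have that check in script form, command -v reports whether a program can be found on PATH. A minimal sketch follows; check_tools is a hypothetical helper name.

```shell
# Hypothetical helper: report any of the named tools that are not on PATH.
# Returns 0 when all are found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_tools pngpaste tesseract
```

A script could call this first and bail out early instead of failing halfway through a pipeline.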
New toys

Try it out! Hit ctrl+shift+cmd+4 and grab a screenshot of some text. Then in a terminal, type
```
pngpaste - | tesseract stdin stdout | pbcopy
```
This uses pngpaste to send the image to our OCR tool, tesseract. The result is sent to pbcopy, which places the scanned text into our pasteboard, ready to be pasted. Try taking a screenshot of this paragraph, running the above, and pasting the result into any text field.

Hopefully the results are acceptable! Hint: The next article will deal with refining our OCR results.

If nothing happened, try running pngpaste - | tesseract stdin stdout in a terminal, with an image in your pasteboard. Most likely, you need to set up Tesseract with language data.

The rise of automation

To truly get OCR to our fingertips, we’d like to run pngpaste - | tesseract stdin stdout | pbcopy at the touch of a button.

Open up Automator.app (preinstalled on all Macs) and create a Service. Then drag in “Run Shell Script” from the left, and enter what we had above:
```
PATH="/usr/local/bin/:$PATH"
pngpaste - | tesseract stdin stdout | pbcopy
```
(I prepended /usr/local/bin/ to PATH, since that is where Homebrew installs pngpaste and tesseract for me.)

Add this to the end if you want to automatically paste the result afterwards!
```
osascript -e 'tell application "System Events" to keystroke "v" using {command down}'
```
Important: At the top, select “Service receives no input in any application.” We use pngpaste for input, so Automator would otherwise complain about input.

Here’s what you should end up with:

Save and give this service a name, like “Run OCR”.

Binding to a shortcut: ctrl+cmd+v

After saving this service with a name (say, Run OCR), open up System Preferences > Keyboard > Shortcuts > Services, and scroll all the way down to find Run OCR. All that’s left to do is click on Run OCR to bind a shortcut. I use ctrl+cmd+v.

Usage: Try taking a screenshot of this paragraph with ctrl+shift+cmd+4, then use your shortcut in any text field, as if you were pasting text. (That’s why I chose ctrl+cmd+v: it’s almost like pasting.)

Feeling powerful yet?

Different languages

Something I do with OCR is translate comics that aren’t in English. For tesseract, this means you’d need to download language data to recognize languages that are not English. You can find this language data here. If you want to have tesseract recognize Korean as well as English, for example, then download and move kor.traineddata into the $TESSDATA_PREFIX directory. Then change the tesseract command like so:
```
tesseract -l eng+kor stdin stdout
```
You can make the list as long as you want, like eng+chi_sim+chi_tra+jpn+kor. Be warned that runtime gets noticeably longer with more than three languages, at least in my experience.
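Before adding a language to the -l list, it can help to confirm that the matching .traineddata file is actually in place. Here is a rough sketch, assuming the files sit directly in the $TESSDATA_PREFIX directory as described above. missing_langs is a hypothetical helper, not a tesseract feature.

```shell
# Hypothetical helper: print each language in a spec like "eng+kor" that has
# no <lang>.traineddata file in the given directory.
# Usage: missing_langs LANG_SPEC TESSDATA_DIR
missing_langs() {
  spec="$1"
  dir="$2"
  # Turn "eng+kor" into "eng kor" and check each language in turn.
  for lang in $(printf '%s' "$spec" | tr '+' ' '); do
    [ -f "$dir/$lang.traineddata" ] || printf '%s\n' "$lang"
  done
}

# Example: missing_langs eng+kor "$TESSDATA_PREFIX"
```

No output means every language in the spec has a data file in that directory.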

Wrap-up

With very little code, we’ve bound OCR to a hotkey. In the next post we’ll explore ways to get more accurate results with OCR.

March 26, 2020. OCR at your fingertips (part 1) tags: tutorials, mac
What is OCR? Why should I care?

If you’re a student like me, taking online classes, then OCR is a lifesaver.

OCR (Optical Character Recognition) is a fancy way of saying “turn a picture of text into normal text that you can copy and paste.” Sounds useful? It is! It makes notetaking from live slideshow lectures a breeze.

I find it really helpful to have OCR just as an everyday tool. Here are some situations where OCR saves me a lot of typing:
1. Grabbing text from videos, slideshows, streams, video lectures
2. Grabbing text from a game application
3. Translating images
4. Copying one column from a web table
5. Copying a numbered list including its numbers
6. Selecting code from tutorials without selecting prefixes, prompts, line numbers
7. Copying text in cases where I can’t select it (for example, text on a badly designed button)
So even when I don’t really need it, OCR really saves me time and cognitive load.

Today, the role of OCR technology is mostly to scan documents in bulk, like turning books into ebooks. This means most OCR apps require quite a bit of setup and work. But with the automation capabilities on Mac, it’s really quick and easy to bring OCR technology to your fingertips, which you can see above! See this next post for a (very) short tutorial on how to actually accomplish this.
