Secrets of the paperless office: optimizing OCR

Joe Kissell | July 11, 2013
OCR software converts document scans into searchable PDFs. But what settings and software will get you the most accurate results while using the least hard-disk space? Joe Kissell's results may surprise you.


Since I started using a document scanner about seven years ago, I've scanned many thousands of pages and used OCR (optical character recognition) software to convert those scans into searchable PDFs. I've also written extensively about the paperless office. But when you try to reduce the amount of paper you use, you inevitably increase the amount of hard-drive space you use. I began to wonder what combinations of scanner settings and software would get the best quality scan results while using the least hard-disk space.

What sparked my investigation was a claim that some OCR apps increase the file sizes of scanned images dramatically, whereas others (Acrobat Pro in particular) shrink them. When you plan to store and read scanned documents on an iOS device, compactness is especially important. Unfortunately, Adobe's $499 Acrobat Pro XI can no longer be driven externally by AppleScript, which means it requires tedious manual clicking to perform OCR. Were other OCR apps really inflating file sizes, and was there any way around this problem without resorting to Acrobat?

Hundreds of experiments later, I came up with some surprising results. Read on for all the details or skip to the "So, where's the sweet spot?" section for the bottom line.

The ins and outs of OCR
When you initially save a scanned document as a PDF file, you get nothing more than a bitmapped image in a PDF wrapper. Your scanner's software most likely has settings to determine the resolution of the scans in dpi (dots per inch), the color mode (black and white, grayscale, or color), and the amount of compression applied to the scanned image. All those settings affect not only the appearance of the scan but also the quality of information the OCR engine has to work with. Once OCR software recognizes the text in a PDF, it saves that text in an invisible layer along with the image so you can see what the document originally looked like, but can also search, select, and copy its text.
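To get a feel for how much resolution and color mode matter before any compression is applied, here is a rough back-of-the-envelope sketch. It assumes a US Letter page (8.5 by 11 inches) and the usual bit depths of 1, 8, and 24 bits per pixel for black and white, grayscale, and color; the function name and defaults are illustrative only, and real scanner output will be smaller once compression kicks in.

```python
# Rough, uncompressed size of one scanned page.
# Assumptions (not from the article): US Letter page, 1/8/24 bits per
# pixel for black-and-white/grayscale/color, no row padding, no compression.
BITS_PER_PIXEL = {"bw": 1, "gray": 8, "color": 24}


def raw_scan_bytes(dpi, mode, width_in=8.5, height_in=11.0):
    """Return the approximate uncompressed bitmap size in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * BITS_PER_PIXEL[mode] // 8


if __name__ == "__main__":
    for dpi in (150, 300, 600):
        for mode in ("bw", "gray", "color"):
            megabytes = raw_scan_bytes(dpi, mode) / 1_000_000
            print(f"{dpi:>3} dpi, {mode:<5}: {megabytes:8.2f} MB")
```

Under these assumptions, a 300-dpi grayscale page holds about 8.4 million pixels, and the same page in color needs three times as many bytes, which is why the scanner's compression setting has such a large effect on the final PDF.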

Besides recognizing the text, OCR software may downsample the image (decrease its resolution, so that it takes up less space) or change the compression used. Sometimes these features are user-configurable; in other cases, they're hardwired. Acrobat Pro has yet another option: a feature called ClearScan that replaces all the bitmapped text with a custom font (which takes up much less space), and then swaps out the original image for one with a much lower resolution. ClearScan nearly always results in the smallest possible PDF, but it may not be the best choice if you want to be sure your scanned image looks exactly like the original, even when printed. In addition, using ClearScan means settling for Acrobat's OCR engine, about which I'll say more in a moment.
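Downsampling saves space because pixel count scales with the square of the resolution. The short sketch below illustrates that arithmetic only; the function name is made up for this example, and it says nothing about how any particular OCR app actually resamples or recompresses images.

```python
# Fraction of the original pixels that remain after downsampling.
# Illustrative arithmetic only; it ignores compression entirely.
def downsample_ratio(source_dpi, target_dpi):
    """Return the ratio of remaining pixels to original pixels."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("dpi values must be positive")
    return (target_dpi / source_dpi) ** 2


if __name__ == "__main__":
    ratio = downsample_ratio(600, 300)
    print(f"600 -> 300 dpi keeps {ratio:.0%} of the pixels")
```

Halving the resolution leaves a quarter of the pixels, so even a modest drop in dpi can shrink the image layer considerably before compression is taken into account.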

