Secrets of the paperless office: optimizing OCR

Joe Kissell | July 11, 2013
OCR software converts document scans into searchable PDFs. But what settings and software will get you the most accurate results while using the least hard-disk space? Joe Kissell's results may surprise you.

Although the stand-alone ABBYY FineReader Express doesn't let you modify its settings for image recompression, the version built into DEVONthink Pro Office does let you enable downsampling to the resolution of your choice as well as set the level of compression used on graphics. So, with that version of FineReader I was able to get file sizes closer to what PDFpenPro and Acrobat Pro produced.

The settings that affect OCR accuracy
Depending on resolution, color mode, and compression, the one-page scanned letter I tested ranged in size from 77KB to 2.2MB before OCR. But if OCR accuracy suffers with smaller file sizes, that may not be a good tradeoff. So my next question was: Which combination of settings and OCR engine produces the most accurate result?

To test accuracy, I opened a PDF in Preview, selected all the text, and copied it into a BBEdit document; then I used BBEdit's Compare feature to highlight the differences between a given scan and a corrected model document. I counted errors as best I could; in many cases, such as when only spacing was different or when many words were run together, the number of errors was largely a matter of interpretation. Still, the overall trends were clear.
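For readers who would rather script this comparison than eyeball it, here is a minimal Python sketch of the same idea: compare the text copied from an OCRed PDF against a corrected reference, word by word, and count the mismatches. It uses the standard library's difflib rather than BBEdit, and the function name count_word_errors and its counting rule are my own assumptions, not the method used for the tests above. As with the manual count, run-together words and spacing problems remain a matter of interpretation; this sketch simply charges one error per affected word.

```python
import difflib


def count_word_errors(ocr_text: str, reference_text: str) -> int:
    """Count word-level differences between OCR output and a corrected reference.

    Hypothetical helper for illustration. Each non-matching region counts as
    the larger of the number of reference words or OCR words it spans, so two
    words run together into one count as two errors.
    """
    ref_words = reference_text.split()
    ocr_words = ocr_text.split()

    # autojunk=False keeps difflib from ignoring very common words in long texts.
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words, autojunk=False)

    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors += max(i2 - i1, j2 - j1)
    return errors


if __name__ == "__main__":
    reference = "The quick brown fox jumps"
    ocr = "The qu1ck brown fox jumps"
    print(count_word_errors(ocr, reference))
```

Because the texts are split on whitespace before comparison, differences that consist only of extra spaces or line breaks are not counted at all, which is one reasonable way to settle the ambiguity described above, though not the only one.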

Resolution: At the lowest resolution I tested (150 dpi for grayscale and color; 300 dpi for black and white), OCR errors were so numerous in all the tested engines that it would have been almost as efficient for me to retype the documents as to correct the errors. Accuracy generally improved with increased resolution, but not linearly. For example, whereas a 300-dpi scan was far more accurate than a 150-dpi scan, the difference between a 300- and 600-dpi scan's accuracy was tiny.

Color mode: Black-and-white images yielded the worst OCR accuracy by far. Grayscale images were superior to black-and-white ones at every resolution, and 300-dpi grayscale scans yielded much better results than 600- or even 1200-dpi black-and-white scans. Color scans produced roughly the same accuracy, on average, as grayscale scans, except at very low resolutions, where color scans were considerably worse than grayscale.
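To see how resolution and color mode interact where file size is concerned, it can help to work out how much raw image data each combination represents before any compression is applied. The sketch below is only back-of-the-envelope arithmetic under stated assumptions: a US Letter page (8.5 by 11 inches) and the common bit depths of 1 bit per pixel for black and white, 8 for grayscale, and 24 for color. The function name raw_scan_bytes is hypothetical, and the figures it produces are uncompressed sizes, far larger than the compressed PDFs measured in these tests.

```python
# Assumed bit depths; a particular scanner or file format may differ.
BITS_PER_PIXEL = {"bw": 1, "gray": 8, "color": 24}


def raw_scan_bytes(dpi: int, mode: str,
                   width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Estimate the uncompressed size in bytes of one scanned page.

    Hypothetical helper: pixel count is (width x dpi) times (height x dpi),
    multiplied by the assumed bits per pixel and converted to bytes.
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * BITS_PER_PIXEL[mode] // 8


if __name__ == "__main__":
    for dpi, mode in [(300, "bw"), (600, "bw"), (300, "gray"), (300, "color")]:
        size_mb = raw_scan_bytes(dpi, mode) / 1_000_000
        print(f"{dpi} dpi {mode}: about {size_mb:.1f} MB uncompressed")
```

Doubling the resolution quadruples the pixel count, while moving from black and white to grayscale multiplies the data per pixel by eight under these assumptions. How much of that survives compression depends on the settings and software used.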

Compression: The amount of compression applied to the image had relatively little bearing on OCR accuracy, especially at 300 dpi and higher. What I did see at the highest levels of compression was more noise, in the form of fuzzy text and speckled line art. Even the noisiest scans were entirely legible, but I felt a medium level of compression was more pleasant to look at, with only a modest increase in file size.

Engines: Of the tools I tested, FineReader (in either stand-alone or embedded form) was far more accurate than either Acrobat Pro or PDFpenPro, and in most tests, Acrobat Pro was the least accurate. Even though Acrobat Pro was capable of producing the smallest files, I felt the amount of editing its output required outweighed the benefit of the smaller file size.


