Secrets of the paperless office: optimizing OCR

Joe Kissell | July 11, 2013
OCR software converts document scans into searchable PDFs. But what settings and software will get you the most accurate results while using the least hard-disk space? Joe Kissell's results may surprise you.

I wanted to have some solid statistics to work with, so I scanned a couple of documents dozens of times each, with many combinations of resolution, color mode, and compression. Then I ran various raw scans through four different OCR engines: ABBYY's $100 FineReader Express, Acrobat Pro X, Smile's $100 PDFpenPro, and the version of ABBYY FineReader built into Devon Technologies' DEVONthink Pro Office. The four engines I tested are a small subset of the OCR tools available on the Mac, but they're among the most popular. I examined the results for file size, OCR accuracy, and image fidelity.

How OCR affects file size
Most desktop document scanners have an optical resolution of 600 dpi, but let you scan at a lower resolution if you prefer. For my tests, I used a Fujitsu ScanSnap iX500, which is 600 dpi natively but offers up to 1200 dpi through software interpolation. Discounting compression, doubling the number of dots per inch quadruples the file size, because the pixel count doubles in both width and height. Scanning at higher resolutions can also take much longer. So the trick is to find the lowest resolution that will meet your needs.
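To make the quadrupling concrete, here is a minimal Python sketch of the arithmetic for an uncompressed page. The function name `raw_scan_bytes`, the US Letter page size, and the 8-bit grayscale depth are assumptions chosen for illustration; they are not settings taken from the tests, and real scans are compressed, so actual files will be far smaller.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page, in bytes (illustrative helper)."""
    # Pixel count grows with dpi in both dimensions, hence the square law.
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed example: US Letter page (8.5 x 11 inches), 8-bit grayscale.
for dpi in (150, 300, 600):
    size = raw_scan_bytes(8.5, 11, dpi, 8)
    print(f"{dpi} dpi: {size / 1_000_000:.1f} MB uncompressed")
```

Each doubling of resolution multiplies the raw size by exactly four. Passing 1 or 24 for `bits_per_pixel` gives the corresponding figures for black-and-white or color scans.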

Although many variables come into play, my results showed that for documents consisting mainly of black text on a white background, a 300-dpi grayscale scan can run anywhere from about 250KB to 1MB per page (depending on the level of compression) before applying OCR. It probably goes without saying that black-and-white images are smallest and color images largest, with grayscale in between. Likewise, increasing resolution always increases file size, while increasing compression decreases file size. (Files with the lightest compression tended to be about three to five times larger than those with the heaviest compression.) None of that is surprising, but what did surprise me was how OCR software changes the original sizes.

In every case, PDFpenPro did exactly what I expected, which was to increase the original file size only slightly. That is, it left the image alone and simply added the text. Acrobat Pro, given its default settings (that is, using neither ClearScan nor downsampling), behaved roughly the same as PDFpenPro with color and grayscale images; for the most part, it increased the sizes by a bit less than PDFpenPro did. But with black-and-white images, Acrobat Pro applied its own compression, which shrank the files, sometimes by as much as 90 percent.

On the other hand, FineReader Express, which also compressed the images again, produced entirely different results. Black-and-white images grew, sometimes profoundly—for example, a 77KB file became 343KB and a 2.7MB file ballooned to 13.2MB. With grayscale and color images, the results were inconsistent; some files grew while others shrank.
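The two examples above work out to very similar growth factors, which a short Python sketch makes easy to check. The helper name `percent_change` is an assumption introduced for illustration; the sizes are the ones quoted in the text.

```python
def percent_change(before, after):
    """Relative size change in percent; negative values mean the file shrank."""
    return (after - before) / before * 100


# Black-and-white examples quoted above (units cancel within each pair).
examples = [("77KB -> 343KB", 77, 343), ("2.7MB -> 13.2MB", 2.7, 13.2)]
for label, before, after in examples:
    factor = after / before
    print(f"{label}: {factor:.1f}x ({percent_change(before, after):+.0f}%)")
# 77KB -> 343KB: 4.5x (+345%)
# 2.7MB -> 13.2MB: 4.9x (+389%)
```

By the same measure, a file that shrinks by 90 percent ends up at one-tenth of its original size.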


