Find PDF Files That Need OCR Processing

Scanning paper documents to PDF files lets you archive important (and not so important) documents without filling up cabinets.

Optical Character Recognition (OCR) makes these scanned documents much more useful than their paper originals. Once a scan has been processed by OCR, the PDF file contains both an image of the document and an invisible text version. The text can then be searched using HoudahSpot.

Unfortunately, you will find that not all of your PDF files have text content. You may have forgotten to run them through OCR, or you may have received a scanned document from someone else.

How can you find these files and rectify this?

With a little trick, HoudahSpot can find PDF files that lack text content. It is safe to assume that any real text contains at least one space or period. Thus, we will be looking for any PDF file whose text content contains neither a space nor a period.

This translates to the following search:

Find PDF files that lack OCR text. The first search field contains a “space” character
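
For readers who want to see the logic spelled out, the same heuristic can be sketched in a few lines of Python. This is only an illustration of the "neither a space nor a period" test, not how HoudahSpot works internally. The names lacks_text and files_needing_ocr are made up for this example, and extract_text is a hypothetical callable you would supply yourself to return the text content of a PDF.

```python
from typing import Callable, Iterable, List


def lacks_text(extracted_text: str) -> bool:
    """Return True when the text contains neither a space nor a period."""
    return " " not in extracted_text and "." not in extracted_text


def files_needing_ocr(
    paths: Iterable[str],
    extract_text: Callable[[str], str],
) -> List[str]:
    """Keep only the paths whose extracted text fails the heuristic.

    extract_text is a hypothetical helper (an assumption of this sketch):
    it takes a file path and returns whatever text the PDF contains,
    or an empty string when there is none.
    """
    return [path for path in paths if lacks_text(extract_text(path))]


# Stand-in data instead of real PDF files: file name -> extracted text.
sample = {
    "scan.pdf": "",
    "report.pdf": "Quarterly results. Revenue grew.",
}

# Only the file without any text content is reported.
print(files_needing_ocr(sample, lambda path: sample[path]))
```

With the stand-in data above, only scan.pdf is reported as a candidate for OCR. Requiring that both characters be absent keeps false positives low: a PDF with even a single sentence of text will almost always contain at least one of them.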

