Find PDF Files That Need OCR Processing

There is an updated version of this post.

Scanning paper documents to PDF files lets you archive important (and not so important) documents without filling up cabinets.

Optical Character Recognition (OCR) makes these scanned documents much more useful than their paper originals. Once a scan has been processed by OCR, the PDF file contains both an image of the document and an invisible text version. The text can then be searched using HoudahSpot.

Unfortunately, you will find that not all of your PDF files have text content. You may have forgotten to run them through OCR. Or you may have received the scanned document from someone else.

How can you find these files and rectify this?

With a little trick, HoudahSpot can find PDF files that lack text content. It is safe to assume that any real text contains a space or a period. So we look for PDF files whose text content contains neither a space nor a period.
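To make the heuristic concrete, here is a minimal Python sketch of the same test applied to text that has already been extracted from each PDF. This is not how HoudahSpot works internally. The function names and the sample file names are made up for illustration, and how you extract the text is left open.

```python
def lacks_text(extracted: str) -> bool:
    """Return True when the text contains neither a space nor a period."""
    return " " not in extracted and "." not in extracted


def image_only_pdfs(text_by_file: dict[str, str]) -> list[str]:
    """Return the sorted names of files whose text fails the heuristic."""
    return sorted(name for name, text in text_by_file.items() if lacks_text(text))


# Hypothetical sample data: file name -> text already extracted from the PDF.
samples = {
    "invoice-ocr.pdf": "Invoice no. 42 due on receipt.",
    "scan-0001.pdf": "",
    "receipt.pdf": "Total 12 EUR",
}

print(image_only_pdfs(samples))  # ['scan-0001.pdf']
```

A file counts as text-free only when both characters are absent, which mirrors the two criteria of the search.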

This translates to the following search:

[Screenshot: HoudahSpot search that finds PDF files lacking OCR text. The first search field contains a single space character.]

You don’t have to set up the search yourself: Simply download the “Image PDFs” search template to find PDF files that lack OCR text.

Once you have found all PDF files lacking text information, you’ll need to process them using OCR. This will make their text machine-readable and searchable.

You can use PDFpen from Smile Software to do so. Since PDFpen is scriptable, you can automate the task.

  1. Download the “OCR PDF Document” Automator workflow
  2. Open the workflow in Automator
  3. Automator will offer to install the workflow as a Service. Accept.
  4. If you use PDFpen Pro instead of PDFpen, edit the script within the workflow to replace “PDFpen” with “PDFpen Pro”, then save your changes.
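The edit in step 4 is a plain text substitution on the application name. As a rough illustration, the Python sketch below performs it on a script held in a string. The function name is hypothetical, and the sample script is a stand-in with its body elided, not the actual script shipped in the workflow.

```python
def retarget_script(source: str, app_name: str = "PDFpen Pro") -> str:
    """Replace the quoted application name "PDFpen" with app_name.

    Matching the surrounding quotes keeps the replacement from touching
    a script that already targets "PDFpen Pro".
    """
    return source.replace('"PDFpen"', f'"{app_name}"')


# Stand-in script: only the application name matters here.
script = 'tell application "PDFpen"\n\t-- script body elided\nend tell\n'

print(retarget_script(script))
```

In practice, making the change by hand in Automator is just as quick. The sketch only shows that nothing else in the script needs to change.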

Back in HoudahSpot:

  1. Select the PDF files you want to process
  2. Select HoudahSpot > Services > OCR PDF Document from the menu
  3. PDFpen will launch in the background, process your files, and quit
  4. Once a file has been processed and text content has been found, it disappears from your HoudahSpot search, because it no longer matches the criteria for files without text content.

Note: Turn off the “Prompt for OCR when opening a scanned document” option in PDFpen preferences. Otherwise PDFpen shows a “This document appears to be scanned” prompt each time the script sends it a file.

Credit: The “OCR PDF Document” Automator workflow uses a modified version of an AppleScript created by Greg Scown. Feel free to modify and redistribute the Automator workflow or the script it contains.