Archiving paper documents as PDF files is a great way to save shelf space and preserve essential records.
However, simply scanning the documents is not enough. You should also process the scans with Optical Character Recognition (OCR). Once OCR has processed a PDF scan, the file contains an invisible text version in addition to the scanned image of the document. macOS Spotlight can then index the content, and you can use HoudahSpot to search your document archive.
But what if some of your PDF files lack OCR text?
Locate the files that need fixing
You can use HoudahSpot to locate image-only PDF files and pass these on to OCR software for processing.
Set up a HoudahSpot search for files for which Spotlight has not indexed any text content:
- Start with a blank search: a new search window created from the factory default.
- In the Refine section of your search, configure a criterion to say “Text content” “is set”.
- Next, negate this criterion. Press and hold the criteria group button at the right edge of the criterion row. Select Nest in “None” group from the menu that pops up. Your search now matches files that lack text content.
- Finally, restrict your search to files where “Content Kind” “is” “PDF”.
The finished search has two criteria: a “None” group containing “Text content” “is set”, and “Content Kind” “is” “PDF”.
Rather than set up the search yourself, you can download the “Image-only PDFs” search template to find PDF files that lack OCR text.
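For readers who like to see the logic spelled out, here is a minimal Python sketch of what the search criteria express. It is only a model, not how HoudahSpot or Spotlight work internally: the `FileRecord` class, the `none_of` helper, and the sample file names are all made up for illustration, and the sketch assumes that “is set” means “Spotlight has indexed some non-empty text”.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FileRecord:
    """Hypothetical stand-in for what Spotlight knows about one file."""
    path: str
    content_kind: str             # e.g. "PDF" or "JPEG image"
    text_content: Optional[str]   # None when no text has been indexed


Criterion = Callable[[FileRecord], bool]


def text_content_is_set(f: FileRecord) -> bool:
    # Assumption: "is set" means some non-empty text was indexed.
    return bool(f.text_content)


def content_kind_is_pdf(f: FileRecord) -> bool:
    return f.content_kind == "PDF"


def none_of(*criteria: Criterion) -> Criterion:
    """Model of a "None" group: matches only if no nested criterion matches."""
    return lambda f: not any(c(f) for c in criteria)


def is_image_only_pdf(f: FileRecord) -> bool:
    # None("Text content" "is set") AND "Content Kind" "is" "PDF"
    return none_of(text_content_is_set)(f) and content_kind_is_pdf(f)


# Made-up sample data
files = [
    FileRecord("invoice-ocr.pdf", "PDF", "Invoice no. 1234"),
    FileRecord("scan-2021.pdf", "PDF", None),
    FileRecord("photo.jpg", "JPEG image", None),
]

matches = [f.path for f in files if is_image_only_pdf(f)]
print(matches)
```

In this model, only the PDF without indexed text matches. Once a file gains text content, the “None” group stops matching it, which is why processed files drop out of the search results.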
Add Searchable Text using PDFify
PDFify is a free macOS application by the developer behind the Receipts (*) application. Its OCR feature adds searchable text to your PDF files.
To open a PDF file in PDFify, drag and drop the file from the search results to the PDFify icon.
Click the OCR button to start optical character recognition. Save the file after processing completes.
Automate OCR using Nitro PDF Pro
Nitro PDF Pro (*) is a full suite of PDF tools. It is available from the Nitro Software website and through a Setapp subscription.
Since Nitro PDF Pro is scriptable, you can automate the task.
- Download the “OCR PDF Document using Nitro PDF Pro” Automator workflow.
- Open the workflow in Automator.
- Automator will suggest installing the workflow as a Service. Agree.
Back in HoudahSpot:
- Select the PDF files you want to process.
- Select HoudahSpot > Services > OCR PDF Document using Nitro PDF Pro from the menu.
- Nitro PDF Pro will launch in the background, process your files, and quit.
- Once Nitro PDF Pro has added OCR text to the files, they will disappear from your HoudahSpot search. They no longer match the criteria you set to find files without text content.
Note: Turn off the “Prompt for OCR when opening a scanned document” option in the Nitro PDF Pro preferences to avoid the “This document appears to be scanned” prompt while the script sends files to Nitro PDF Pro.
Credit: The “OCR PDF Document using Nitro PDF Pro” Automator workflow uses a modified version of an AppleScript created by Greg Scown. Feel free to modify and redistribute the Automator workflow or the script included within.
* : This post contains affiliate links to the Setapp subscription service.