Find and Fix PDF Files That Lack Searchable Text

Archiving paper documents as PDF files is a great way to save shelf space and preserve essential records.

However, scanning alone is not enough. You should also process the scans with Optical Character Recognition (OCR). Once OCR has processed a scanned PDF, the file contains an invisible text layer in addition to the scanned image of the document. macOS Spotlight can then index the content, and you can use HoudahSpot to search your document archive.

But what if some of your PDF files lack OCR text?

With a little trick, HoudahSpot can find PDF files that lack text content. It is safe to assume that any text contains either a space or a period. Thus, we will look for any PDF file whose text content contains neither a space nor a period.

This translates to the following search:

Find PDF files that lack OCR text. The first search field contains a “space” character
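
For readers who want to see the logic spelled out, here is a rough Python sketch of the same heuristic. It only illustrates the "neither a space nor a period" test on text that has already been extracted. The function names and the sample data are made up for this example, the text extraction step is left out, and none of this reflects how HoudahSpot or Spotlight run the search internally.

```python
def lacks_searchable_text(extracted_text: str) -> bool:
    """Heuristic from the article: real text contains a space or a period.

    Returns True when the extracted text contains neither, which suggests
    the PDF has no OCR text layer.
    """
    return " " not in extracted_text and "." not in extracted_text


def files_lacking_text(text_by_file: dict[str, str]) -> list[str]:
    """Return the sorted names of files whose extracted text fails the heuristic.

    `text_by_file` is a hypothetical mapping of file name to extracted text;
    producing it (for example with a PDF library) is outside this sketch.
    """
    return sorted(
        name for name, text in text_by_file.items()
        if lacks_searchable_text(text)
    )


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = {
        "invoice.pdf": "Invoice no. 42",
        "scan_0001.pdf": "",
        "scan_0002.pdf": "\n",
    }
    print(files_lacking_text(sample))
```

Like the HoudahSpot search, this can produce false negatives: a scan whose extracted text is stray noise that happens to include a space or a period will not be flagged.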

