Find and Fix PDF Files That Lack Searchable Text

Archiving paper documents as PDF files is a great way to save shelf space and preserve essential records.

However, simply scanning the documents is not enough. You should also process the scans with Optical Character Recognition (OCR). Once OCR has processed a scan, the PDF file contains an invisible text version in addition to the scanned image of the document. macOS Spotlight can then index the content, and you can use HoudahSpot to search your document archive.

But what if some of your PDF files lack OCR text?


Find PDF Files That Need OCR Processing

There is an updated version of this post.

Scanning paper documents to PDF files lets you archive important (and not so important) documents without filling up cabinets.

Optical Character Recognition (OCR) makes these scanned documents much more useful than their paper originals. Once a scan has been processed by OCR, the PDF file contains both an image of the document and an invisible text version. The text can then be searched using HoudahSpot.

Unfortunately, you will find that not all of your PDF files have text content. You may have forgotten to run them through OCR. Or you may have received the scanned document from someone else.

How can you find these files and rectify this?

With a little trick, HoudahSpot can find PDF files that lack text content. It is safe to assume that any real text contains a space or a period. Thus, we will look for any PDF file whose text content contains neither a space nor a period.

This translates to the following search:

Find PDF files that lack OCR text. The first search field contains a “space” character
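
The same heuristic can be sketched in a few lines of Python for anyone who wants to experiment with it outside HoudahSpot. This is only an illustration of the "neither a space nor a period" rule. It is not how HoudahSpot or Spotlight work internally. The function names `lacks_searchable_text` and `files_needing_ocr` are made up for this example, and the sketch assumes you have already extracted each PDF's text with a tool of your choice.

```python
from typing import Dict, List, Optional


def lacks_searchable_text(text: Optional[str]) -> bool:
    """Return True when the text contains neither a space nor a period.

    Hypothetical helper: `text` is whatever your PDF text extractor
    returned for one file (None or "" when nothing was extracted).
    """
    if not text:
        # No text at all: the file certainly needs OCR.
        return True
    # Real text almost always contains a space or a period.
    return " " not in text and "." not in text


def files_needing_ocr(extracted: Dict[str, Optional[str]]) -> List[str]:
    """Return the sorted names of files that fail the heuristic.

    `extracted` maps a file name to its extracted text (an assumption
    of this sketch; the extraction step itself is not shown).
    """
    return sorted(
        name for name, text in extracted.items() if lacks_searchable_text(text)
    )


if __name__ == "__main__":
    # Made-up sample data standing in for real extraction results.
    sample = {
        "invoice.pdf": "Invoice No. 42 for services rendered.",
        "scan-001.pdf": "",
        "scan-002.pdf": None,
    }
    for name in files_needing_ocr(sample):
        print(name)
```

As with the HoudahSpot search, this is a rule of thumb. A PDF whose text layer holds only a single word without punctuation would also be flagged.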
