Thursday, Jul 11, 2019

How to make a PDF document searchable

I had a large PDF file that I wanted to be able to search quickly, but it contained only images of text, not the text itself, so every search in it failed.

I use Ubuntu, so I reasoned that there should be some solution available for me out on the Web.  There was.  This is the one I chose:

Install pdfsandwich:

sudo apt-get install pdfsandwich

Run pdfsandwich on the file you want to become searchable:

pdfsandwich test.pdf -o test-searchable.pdf -nthreads 12 -first_page 5 -last_page 290

Here I use the -nthreads option to prevent pdfsandwich from locking up my computer by using all of its 16 processors.  I also use the -first_page and -last_page options to exclude the table of contents and the index from the searchable area (these just produce unnecessary duplicate search results).  The -o option specifies the name of the output file.
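The thread count above was chosen by hand, but it could also be derived from the machine's processor count. The sketch below assumes a hypothetical helper named pick_threads (not part of pdfsandwich) and a reserve of 4 processors, which simply mirrors the 12-of-16 split used above; adjust both to taste.

```shell
# Hypothetical helper: print how many threads to give pdfsandwich,
# leaving some processors free for the rest of the system.
#   $1 = total number of processors
#   $2 = number of processors to keep in reserve
pick_threads() {
    total=$1
    reserve=$2
    threads=$(( total - reserve ))
    # Never drop below one thread, even on a small machine.
    if [ "$threads" -lt 1 ]; then
        threads=1
    fi
    echo "$threads"
}

# Example use (nproc, from GNU coreutils, reports the processor count):
# pdfsandwich test.pdf -o test-searchable.pdf -nthreads "$(pick_threads "$(nproc)" 4)"
```

On a 16-processor machine with a reserve of 4, this gives the same 12 threads as the command above.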

The resulting file was less than a tenth the size of the original (20 MB instead of 350 MB), so the text and images were grainier, but they were still easily readable. And being able to search it quickly makes the document much more useful to me than the original was.
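One way to confirm that the output really has a text layer is to try extracting text from it. This is only a sketch: it assumes pdftotext is available (on Ubuntu it comes from the poppler-utils package, which is not part of the recipe above), and has_text_layer is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if any letters or digits can be
# extracted from the PDF given as $1, and fails otherwise.
# Assumes pdftotext (poppler-utils) is installed; the "-" argument
# sends the extracted text to standard output instead of a file.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Example use:
# has_text_layer test-searchable.pdf && echo "searchable" || echo "images only"
```

Run against the original image-only file, the same check would be expected to report "images only", since there is no text to extract.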
