Samuel Plumppu

Automated Text Extraction from PDF Images with OCRmyPDF

PDF, Data Pipeline, OCR

When extracting text content from PDF files, you occasionally find pages that consist of embedded images without any text nodes. For small PDFs this can usually be solved manually, but it's not feasible to re-type text from a large number of pages, especially not as part of a data pipeline processing many thousands of documents.

Luckily, there are ways to automate the text extraction using so-called Optical Character Recognition (OCR) software.

One great example is the open source program OCRmyPDF, which is built on top of Tesseract. The best thing about this tool compared to cloud-based OCR services is that it runs completely locally on your computer, which allows you to keep sensitive data private. Since it's a command-line tool, it's also easy to automate and to process many files in parallel.
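As a rough sketch of what that automation could look like, the function below runs one ocrmypdf process per PDF in a folder, a few at a time. The name ocr_dir, the folder layout and the default of 4 parallel jobs are assumptions for illustration, and it relies on the -print0, -0 and -P options found in GNU and BSD versions of find and xargs.

```shell
# ocr_dir IN_DIR OUT_DIR [JOBS]
# Hypothetical helper: OCR every PDF directly inside IN_DIR and write the
# results, under the same file names, to OUT_DIR.
ocr_dir() {
  in_dir="$1"
  out_dir="$2"
  n_jobs="${3:-4}"   # assumed default: 4 files at a time
  mkdir -p "$out_dir"
  # NUL-separated file names keep paths with spaces intact.
  find "$in_dir" -maxdepth 1 -name '*.pdf' -print0 |
    xargs -0 -P "$n_jobs" -I{} sh -c 'ocrmypdf "$1" "$2/$(basename "$1")"' _ {} "$out_dir"
}
```

You could then call it as ocr_dir scans results 2 to process the folder scans with two files in flight at once.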

ocrmypdf can usually be installed with a single command. As always, make sure you install it from a reputable source, or build the program yourself from the source code.
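The install commands in the comments below are assumptions that hold for common setups (the package is called ocrmypdf on Debian/Ubuntu and in Homebrew); other platforms may differ. The small helper, with the made-up name have_ocr_tools, only checks that the programs ended up on your PATH.

```shell
# Typical install commands (assumptions; check your own package manager):
#   Debian/Ubuntu:    sudo apt install ocrmypdf
#   macOS (Homebrew): brew install ocrmypdf

# have_ocr_tools is a hypothetical helper: it succeeds only when both
# ocrmypdf and the tesseract binary it relies on are found on the PATH.
have_ocr_tools() {
  command -v ocrmypdf >/dev/null 2>&1 && command -v tesseract >/dev/null 2>&1
}
```

For example, have_ocr_tools && ocrmypdf --version prints the installed version only when both tools are available.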

Using ocrmypdf to extract text from PDFs

If you only need to extract text from PDF files with English content, you can use the default English language pack, which usually comes preinstalled.

Here's how to perform OCR on a PDF with English content:
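A minimal sketch, assuming a scanned file called input.pdf (a placeholder name): the plain invocation is ocrmypdf input.pdf output.pdf, where the second argument is the new PDF that gets a text layer. Below, the same call is wrapped in a small function with the made-up name ocr_pdf so it can be reused in scripts.

```shell
# ocr_pdf INPUT OUTPUT
# Hypothetical wrapper around the basic call; with no language option,
# ocrmypdf falls back to its default language, English.
ocr_pdf() {
  ocrmypdf "$1" "$2"
}
```

Usage: ocr_pdf input.pdf output.pdf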

If some pages have text content already, you can skip them with --skip-text:
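A sketch of that variant, again with placeholder file names and a made-up function name (ocr_skip_text). The plain command would be ocrmypdf --skip-text input.pdf output.pdf.

```shell
# ocr_skip_text INPUT OUTPUT
# Hypothetical wrapper: pages that already contain text are left as they
# are, and only the image-only pages go through OCR.
ocr_skip_text() {
  ocrmypdf --skip-text "$1" "$2"
}
```

Usage: ocr_skip_text input.pdf output.pdf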

Using specific languages

If you need support for other languages, you can install additional language packs. If, for example, you want to use German, you would install the deu language pack and then use it like this:
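A sketch under assumptions: the package names in the comment are the ones commonly used on Debian/Ubuntu and Homebrew and may differ on your system, and ocr_lang is a made-up function name. The plain command for German would be ocrmypdf -l deu input.pdf output.pdf.

```shell
# Language packs are Tesseract data files. Typical install commands
# (assumptions; verify against your package manager):
#   Debian/Ubuntu:    sudo apt install tesseract-ocr-deu
#   macOS (Homebrew): brew install tesseract-lang

# ocr_lang LANG INPUT OUTPUT
# Hypothetical wrapper: run OCR with one specific language pack,
# for example "deu" for German.
ocr_lang() {
  ocrmypdf -l "$1" "$2" "$3"
}
```

Usage: ocr_lang deu input.pdf output.pdf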

If you want both German and English, you can enable multiple language packs:
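Language codes are combined with a plus sign, so the plain command would be ocrmypdf -l deu+eng input.pdf output.pdf. The sketch below wraps this in a function with the made-up name ocr_langs, which joins any number of language codes for you; the file names are placeholders.

```shell
# ocr_langs INPUT OUTPUT LANG [LANG...]
# Hypothetical wrapper: run OCR with several language packs enabled.
ocr_langs() {
  input="$1"
  output="$2"
  shift 2
  # Join the remaining arguments with "+", e.g. "deu eng" becomes "deu+eng".
  langs=$(IFS=+; echo "$*")
  ocrmypdf -l "$langs" "$input" "$output"
}
```

Usage: ocr_langs input.pdf output.pdf deu eng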

Conclusion

These commands handle most cases for me with really good results. The output from ocrmypdf is not always perfect, but it is a much better starting point for manually reviewing the PDF texts when a conversion has to be 100% correct.

There are also plenty of options to explore with ocrmypdf to improve your results. If you find cases where it doesn't work, both ocrmypdf and tesseract are open source projects that could become even better with your contributions. Otherwise, there are other OCR tools available, many of which are libre software. However, I've not needed them so far.

Thank you for reading! 🌱
