Automated Text Extraction from PDF Images with OCRmyPDF
When extracting text from PDF files, you occasionally find pages that are just embedded images with no text nodes. For a small PDF you can usually retype the text by hand, but that isn't feasible for a large number of pages, and especially not in a data pipeline that processes many thousands of documents.
Luckily, you can automate the text extraction with so-called Optical Character Recognition (OCR) software.
One great example is the open-source program OCRmyPDF, which is built on top of Tesseract. A big advantage of this tool is that it runs completely locally on your computer, which lets you keep sensitive data private. Since it's a command-line tool, it's also easy to automate and to process many files in parallel.
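As a rough sketch of that kind of automation, the shell function below runs OCR over every PDF in a directory, a few files at a time. The function name, the directory arguments and the batch size of 4 are all hypothetical choices for illustration. `--jobs 1` limits each ocrmypdf process to one worker, on the assumption that the loop itself provides the parallelism.

```shell
# Batch-OCR every PDF in a directory, a few files at a time.
# ocr_directory, its arguments and the default batch size are
# illustrative placeholders, not part of ocrmypdf itself.
ocr_directory() {
    in_dir="$1"
    out_dir="$2"
    batch_size="${3:-4}"

    mkdir -p "$out_dir"
    running=0

    for pdf in "$in_dir"/*.pdf; do
        [ -e "$pdf" ] || continue    # skip when the glob matches nothing

        # Start one OCR process in the background per input file.
        ocrmypdf --jobs 1 "$pdf" "$out_dir/$(basename "$pdf")" &
        running=$((running + 1))

        if [ "$running" -ge "$batch_size" ]; then
            wait                     # let the current batch finish first
            running=0
        fi
    done

    wait                             # wait for the last, possibly partial, batch
}

# Usage:
#   ocr_directory ./scans ./searchable 4
```

Waiting for a whole batch is simpler than a real job queue, at the cost of some idle time when one file in the batch is much larger than the others.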
ocrmypdf can usually be installed with a single command. As always, though, make sure you install it from a reputable source, or build it yourself from the source code.
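Which command that is depends on your system. The package names in the comments below are the ones I'd expect from the common package managers, but treat them as assumptions and check your distribution's own repository before running anything. The last lines only check whether the command ended up on your PATH.

```shell
# Pick the line that matches your system (package names are assumptions,
# verify them against your package manager first):
#   sudo apt install ocrmypdf    # Debian / Ubuntu
#   brew install ocrmypdf        # macOS with Homebrew
#   pip install ocrmypdf         # Python package; Tesseract and Ghostscript
#                                # must then be installed separately

# Check whether the installation worked.
if command -v ocrmypdf >/dev/null 2>&1; then
    ocrmypdf_status="installed: $(ocrmypdf --version)"
else
    ocrmypdf_status="ocrmypdf not found on PATH"
fi
echo "$ocrmypdf_status"
```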
Using ocrmypdf to extract text from PDFs
If you only need to extract text from PDF files with English content, you can use the default English language pack, which usually comes preinstalled.
Here's how to perform OCR on a PDF with English content:
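A minimal sketch of that command, wrapped in small shell functions so it can be reused in scripts. The function names and the file names in the usage lines are placeholders. The second function goes slightly beyond the basic case: it assumes you also want the recognized text as a plain text file, which is what the `--sidecar` option is for.

```shell
# Run OCR on a scanned PDF and write a searchable copy.
#   $1 = input PDF, $2 = output PDF
ocr_pdf() {
    ocrmypdf "$1" "$2"
}

# Same, but also write the recognized text to a separate file.
#   $1 = input PDF, $2 = output PDF, $3 = output text file
ocr_pdf_with_text() {
    ocrmypdf --sidecar "$3" "$1" "$2"
}

# Usage:
#   ocr_pdf scanned.pdf searchable.pdf
#   ocr_pdf_with_text scanned.pdf searchable.pdf scanned.txt
```

The input file is left untouched; the output PDF keeps the page images and adds a text layer on top of them.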
If some pages already have text content, you can skip them with --skip-text:
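For example, as a small sketch with a placeholder function name and placeholder file names:

```shell
# OCR only the pages that have no text yet.
#   $1 = input PDF, $2 = output PDF
ocr_skip_text() {
    # --skip-text leaves pages that already contain text as they are
    ocrmypdf --skip-text "$1" "$2"
}

# Usage:
#   ocr_skip_text mixed.pdf searchable.pdf
```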
Using specific languages
If you need support for other languages, you can install additional language packs. For example, to use German, you would install the deu language pack and then use it like this:
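A sketch of both steps. The package names in the comments are assumptions about how Tesseract language data is typically packaged, so check what your system actually calls them. The function name and file names are placeholders.

```shell
# Install the German language pack first (package names are assumptions):
#   sudo apt install tesseract-ocr-deu    # Debian / Ubuntu
#   brew install tesseract-lang           # macOS with Homebrew, all languages
# List the languages Tesseract can currently see:
#   tesseract --list-langs

# OCR a German document.
#   $1 = input PDF, $2 = output PDF
ocr_german() {
    # -l selects the language pack to use for recognition
    ocrmypdf -l deu "$1" "$2"
}

# Usage:
#   ocr_german brief.pdf brief-searchable.pdf
```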
If you want both German and English, you can enable multiple language packs:
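Language codes are joined with a plus sign. Again a sketch with a placeholder function name and placeholder file names, and it assumes both language packs are already installed:

```shell
# OCR a document that mixes German and English text.
#   $1 = input PDF, $2 = output PDF
ocr_german_english() {
    # deu+eng enables both language packs for the same run
    ocrmypdf -l deu+eng "$1" "$2"
}

# Usage:
#   ocr_german_english report.pdf report-searchable.pdf
```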
Conclusion
These commands handle most cases for me with really good results. Even though the output from ocrmypdf isn't always perfect, it's a much better starting point for manual review when the converted text has to be 100% correct.
There are also plenty of options to explore in ocrmypdf to improve your results. If you find cases where it doesn't work, both ocrmypdf and tesseract are open-source projects that could become even better with your contributions. Failing that, there are other OCR tools available, many of which are libre software. However, I haven't needed them so far.