Ocropus vs Tesseract

Recently I had to extract text from PDF files. Most of these PDFs were scans of rather poor quality, so I decided to use OCR (optical character recognition) software. First, I used ImageMagick (see 1) to convert the PDFs to images, and then I ran Ocropus 3.1 as the OCR engine (see 2). When Ocropus failed to give me usable results (I got almost empty HTML files with just a few cryptic sentences), I switched to Tesseract 2.0 (see 3).

Tesseract 2.0 gives very good results for plain text, even when the quality of the TIFF files is low. Next I plan to take a look at the Tesseract 3 alpha version.

1. Convert the PDF to images (after superuser.com)

convert file_name.pdf file_name.png

This command generates one PNG file for every page of the PDF (file_name-0.png, file_name-1.png, and so on). However, the resulting images are of very low quality. Therefore, following this discussion on stackoverflow.com, I used Ghostscript instead:

gs \
-sDEVICE=jpeg \
-o output/page_%03d.jpeg \
-r600 \
-dJPEGQ=95 \
file_name.pdf

This command renders each page at 600 dpi (-r600) with JPEG quality 95 (-dJPEGQ=95), which gives images of very high quality.
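
As far as I know, Ghostscript does not create the output directory on its own, so the command above fails when output/ is missing. Below is a minimal sketch of a wrapper that creates the directory first. The function name pdf_to_jpegs is my own invention (it is not part of Ghostscript), and I assume that gs is on the PATH.

```shell
#!/bin/sh
# pdf_to_jpegs is a hypothetical helper name, not a Ghostscript command.
# Usage: pdf_to_jpegs input.pdf output_dir
pdf_to_jpegs() {
    pdf=$1
    out_dir=$2
    # Make sure the target directory exists before Ghostscript writes to it.
    mkdir -p "$out_dir" || return 1
    # Same options as the command above: JPEG device, 600 dpi, quality 95.
    # %03d is expanded by Ghostscript to the zero-padded page number.
    gs -sDEVICE=jpeg -o "$out_dir/page_%03d.jpeg" -r600 -dJPEGQ=95 "$pdf"
}
```

For example, pdf_to_jpegs file_name.pdf output should leave page_001.jpeg, page_002.jpeg, and so on in the output directory.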

2. Next, I tried Ocropus, installed from the Xubuntu package repository, to convert the page images to HTML.

ocroscript recognize file_name-0.png > file_name.html

Unfortunately, for most of the PDFs I got a blank HTML page with a few cryptic characters. According to this post, Ocropus is less stable than Tesseract alone, which Ocropus uses internally.

3. Switch to Tesseract 2 (a stable version 2 can be found in the Xubuntu package repository).

First we need to generate TIFF files instead of JPEGs. Ghostscript can do this as well:

gs \
-sDEVICE=tiffg4 \
-o output_dir/page_%03d.tif \
-r600 \
file_name.pdf

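Now we can run Tesseract on a single page. The second argument is the output base name, so the recognized text ends up in page_001.txt:

tesseract page_001.tif page_001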
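
Tesseract handles one image per call, so a multi-page PDF needs a loop. Here is a minimal sketch that runs Tesseract over every page and joins the text files in page order. The function name ocr_pages is hypothetical, and I assume the page_%03d.tif naming from the Ghostscript command above, so that the shell glob sorts the pages correctly.

```shell
#!/bin/sh
# ocr_pages is a hypothetical helper name, not part of Tesseract.
# Usage: ocr_pages tif_dir combined.txt
ocr_pages() {
    tif_dir=$1
    combined=$2
    : > "$combined"                  # start with an empty result file
    for tif in "$tif_dir"/page_*.tif; do
        [ -e "$tif" ] || continue    # the glob matched nothing
        base=${tif%.tif}             # Tesseract appends .txt itself
        tesseract "$tif" "$base" || return 1
        cat "$base.txt" >> "$combined"
    done
}
```

Calling ocr_pages output_dir file_name.txt should leave one .txt file per page next to the images, plus the combined text in file_name.txt.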

Finally I could see some results. Unfortunately, Tesseract does not recover table structure and has problems recognizing text inside tables.


