Archive for December, 2010

Ocropus vs Tesseract

Recently I had to extract text from PDF files. Most of these PDFs were scans of rather poor quality, so I decided to use OCR (optical character recognition) software. First, I used ImageMagick (see 1) to convert the PDFs to images, and then I ran Ocropus 3.1 as the OCR engine (see 2). When Ocropus failed to give me usable results (I got almost empty HTML files with just a few cryptic sentences), I switched to Tesseract 2.0 (see 3).

Tesseract 2.0 gives very good results for text, even when the quality of the TIFF files is low. Next I plan to take a look at the Tesseract 3 alpha version.

1. Convert the PDF to images with ImageMagick:

convert file_name.pdf file_name.png

This command generates a PNG file for every page of the PDF file. However, with its default settings it produces images of very low quality. Therefore, following an online discussion, I used Ghostscript instead:

gs \
-sDEVICE=jpeg \
-o output/page_%03d.jpeg \
-r600 \
-dJPEGQ=95 \
file_name.pdf

This command creates images with very high quality.
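
The Ghostscript call above can be wrapped in a small shell function so that the resolution and the output directory are easy to change. This is only a sketch: the function name render_pdf and its arguments are my own invention, and it assumes gs is on the PATH. It creates the output directory first, since as far as I know Ghostscript will not create it for you.

```shell
# Hypothetical helper (not part of Ghostscript): render each page of a PDF to a JPEG.
# Usage: render_pdf file.pdf [output_dir] [dpi]
render_pdf() {
    pdf=$1
    outdir=${2:-output}     # default output directory (assumption)
    dpi=${3:-600}           # default resolution, as in the command above
    mkdir -p "$outdir" || return 1
    gs -sDEVICE=jpeg \
       -o "$outdir/page_%03d.jpeg" \
       -r"$dpi" \
       -dJPEGQ=95 \
       "$pdf"
}
```

For example, render_pdf file_name.pdf output 300 would write output/page_001.jpeg, output/page_002.jpeg and so on at 300 dpi, which gives smaller files and may be enough for clean scans.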

2. Next, I tried Ocropus, installed from the Xubuntu package repository, to convert the page images to HTML:

ocroscript recognize file_name-0.png > file_name.html

Unfortunately, for most of the PDFs I got a blank HTML page with some cryptic characters. According to a post I read, Ocropus is less stable than Tesseract alone, even though Ocropus uses Tesseract internally.

3. Switching to Tesseract 2 (a stable version 2 is available in the Xubuntu package repository).

First we need to generate TIFF files instead of JPEGs. Ghostscript can do this as well:

gs \
-sDEVICE=tiffg4 \
-o output_dir/page_%03d.tif \
-r600 \
file_name.pdf

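Now we can run Tesseract on each page. The second argument is the output base name, so the recognized text ends up in page_001.txt:

tesseract output_dir/page_001.tif page_001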

Finally, I got some results. Unfortunately, Tesseract fails to extract tables and has problems with the text inside tables.
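
Since Ghostscript produces one TIFF per page, Tesseract has to be run once per file. A minimal sketch of such a loop follows. The name ocr_pages is hypothetical, and the sketch assumes that all pages live in one directory with a .tif extension and that tesseract writes its result to the given base name plus .txt.

```shell
# Hypothetical helper: OCR every .tif page in a directory and merge the text.
# Usage: ocr_pages pages_dir merged.txt
ocr_pages() {
    dir=$1
    out=$2
    : > "$out"                          # start with an empty result file
    for tif in "$dir"/*.tif; do
        [ -e "$tif" ] || continue       # the glob matched nothing
        base=${tif%.tif}                # tesseract appends .txt to this name
        if tesseract "$tif" "$base" && [ -f "$base.txt" ]; then
            cat "$base.txt" >> "$out"   # pages are appended in sorted file order
        else
            echo "OCR failed for $tif" >&2
        fi
    done
}
```

Something like ocr_pages output_dir file_name.txt would then leave one text file per page next to the TIFFs, plus the merged text in file_name.txt.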



Install Ubuntu onto VMware Player with OCRopus

1. Download the latest VMware Player (free).
2. Download the latest Ubuntu Server .iso file.
3. Create a new virtual machine in VMware Player.
4. Point it at the Ubuntu .iso and follow the installation instructions.
5. If a web server is needed, install LAMP by following the instructions below.

LAMP (Linux, Apache, MySQL and PHP) is an open source web development platform that uses Linux as the operating system, Apache as the web server, MySQL as the relational database management system and PHP as the object-oriented scripting language.

We showed in a previous post how to install LAMP on Ubuntu 10.04 with a single command using tasksel. tasksel is a software installation application that is an integral part of the Debian installer and also works under Ubuntu. It groups packages by task and offers an easy way to install all the packages for a task, providing the same functionality as conventional meta-packages. In Maverick (Ubuntu 10.10) tasksel is not installed by default, so we need to install it before performing the LAMP installation.

Open a terminal and install tasksel first:

sudo apt-get install tasksel

Now, to install LAMP, run the tasksel command in the terminal:

sudo tasksel

Select LAMP server from the list.

During the installation you will be asked to set the MySQL root password.

Now check that PHP is working:

$ sudo vi /var/www/info.php

and add a minimal test page, for example:

<?php phpinfo(); ?>

Save and exit.

Restart apache2:

$ sudo /etc/init.d/apache2 restart

Now open a browser and go to:

http://ip/info.php or http://localhost/info.php

If the page is displayed, PHP is installed and working.
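
The same check can be done from the command line instead of a browser. The sketch below assumes that curl is installed and that info.php calls phpinfo(), whose output contains the string "PHP Version". The name check_php is made up for this example.

```shell
# Hypothetical helper: fetch the test page and look for the phpinfo() banner.
# Usage: check_php [url]
check_php() {
    url=${1:-http://localhost/info.php}
    # -f: fail on HTTP errors, -sS: silent but still report errors
    if curl -fsS "$url" | grep -q 'PHP Version'; then
        echo "PHP is working"
    else
        echo "PHP does not seem to be working" >&2
        return 1
    fi
}
```

If Apache serves the raw PHP source instead of executing it, the string is not found and the function returns a non-zero status.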

To manage your LAMP server's databases, install phpMyAdmin:

sudo apt-get install phpmyadmin

To log in to phpMyAdmin, open a browser and go to:

http://ip/phpmyadmin or http://localhost/phpmyadmin

6. To install OCRopus, follow the download and install instructions on the OCRopus project page. The OpenFst link given there is no longer available, so OpenFst has to be downloaded separately.

If make fails in ocroswig with an error saying that arrayobject.h has no such file or directory, simply install numpy with sudo apt-get install python-numpy.

How to use ocroscript:
$ ocroscript recognize /path/to/file.png > /path/to/output.html
$ ocroscript recognize --tessLanguage=eng --output-mode=text ScanPagesPSLulu.jpg

7. Ubuntu sometimes has trouble reaching the internet from inside VMware. Disabling IPv6 may help. With sysctl you can disable IPv6 on the running system without rebooting:

sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1

To disable it permanently, add "net.ipv6.conf.all.disable_ipv6=1" to /etc/sysctl.conf.

Then run sudo sysctl -p to apply it.
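
When this step is scripted, it is easy to append the same line to /etc/sysctl.conf several times. Here is a small sketch that appends the setting only if it is missing. persist_sysctl is a hypothetical helper, and the config path defaults to /etc/sysctl.conf, which needs root privileges to modify.

```shell
# Hypothetical helper: add a sysctl setting to a config file exactly once.
# Usage: persist_sysctl key=value [config_file]
persist_sysctl() {
    setting=$1
    conf=${2:-/etc/sysctl.conf}
    # -x: match the whole line, -F: treat the setting as a fixed string
    grep -qxF "$setting" "$conf" 2>/dev/null || printf '%s\n' "$setting" >> "$conf"
}
```

Run in a root shell, persist_sysctl "net.ipv6.conf.all.disable_ipv6=1" followed by sysctl -p should have the same effect as editing the file by hand.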


MySQL basic commands


Installing phpmyadmin and PHP 5.2.* on a Centos 5.2 Server (updated)


Upgrading MYSQL from v5.0 to v5.1 in CentOS


Configuring REPOSITORY in CentOS


How to install Yum in CentOS

