To scan and recognize Chinese pages, I was looking for an OCR tool for my Linux distribution.
After several tests, I settled on Tesseract, which was originally developed… at HP Labs! (It is now released under the Apache license.)
Unfortunately, the available Ubuntu package was built from version 2.x, which does not support Chinese (that requires version 3.x)…
So I installed version 3.0 from source.
This is not too difficult, but there are a few steps to follow, and I will guide you through them…
I assume you already have ImageMagick installed. It is not strictly mandatory, but it will save you a lot of trouble when manipulating images…
Step 1: install minimum compilation environment
If you do not have one already, install a build environment so that you can compile packages:
sudo apt-get install gcc g++ automake
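A quick way to see what is still missing before going further is sketched below. It is only a convenience check, not part of the official procedure: the helper name missing_tools is hypothetical, and the tool list is the one used in this guide (gcc, g++, automake, plus make, which the later steps call).

```shell
# Print the name of every tool that cannot be found in PATH.
# missing_tools is a hypothetical helper name, not a standard command.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools needed by the build steps of this guide.
missing=$(missing_tools gcc g++ automake make)
if [ -n "$missing" ]; then
    echo "Missing build tools:" $missing
else
    echo "Build environment looks complete"
fi
```

If anything is reported as missing, install the corresponding package with apt-get before continuing.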
Step 2: install graphic libraries
Tesseract depends on the development packages of several image libraries, as well as on Leptonica (see below).
Leptonica also depends on these image libraries at run time.
To satisfy these dependencies, install:
sudo apt-get install libpng12-0 libpng12-dev libjpeg62 libjpeg62-dev libtiff4 libtiff4-dev libzip1
Notes:
- The Tesseract documentation mentions a dependency on zlibg-dev. I could not find that package (on Ubuntu the zlib development package is named zlib1g-dev), and I did not notice any problem without it…
- On my Ubuntu release, libpng12-dev and libjpeg62-dev were already installed, as were all the runtime library packages, so the only command I actually needed was:
sudo apt-get install libtiff4-dev
Step 3: install Leptonica
This library is required to manipulate images (home page).
You can get it with apt-get, but the packaged version is old (1.62), so I decided to fetch the latest version and compile it as well.
Check the download page to get the latest one.
The next steps are the usual ones:
tar zxvf leptonica-1.68.tar.gz
cd leptonica-1.68/
./configure
make
sudo make install
The libraries are now installed, by default, in /usr/local/lib.
If this directory is not in your system's default library search path, I suggest creating a file under /etc/ld.so.conf.d (you could name it “local.conf”) containing the single line “/usr/local/lib”.
Then:
sudo ldconfig
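The library path registration described above can be sketched as a small helper that only appends the directory when it is not already listed. The function name add_lib_dir is hypothetical, and the file name local.conf is just the suggestion made above.

```shell
# Add a directory to an ld.so configuration file unless it is already listed.
# add_lib_dir is a hypothetical helper name.
# Usage: add_lib_dir DIR CONF_FILE
add_lib_dir() {
    dir=$1
    conf=$2
    if [ -f "$conf" ] && grep -qxF "$dir" "$conf"; then
        echo "$dir already listed in $conf"
    else
        echo "$dir" >> "$conf"
        echo "added $dir to $conf"
    fi
}

# Typical use, from a root shell since /etc is not writable by normal users:
#   add_lib_dir /usr/local/lib /etc/ld.so.conf.d/local.conf
#   ldconfig
```

Running it twice is harmless: the second call finds the line and leaves the file unchanged.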
You are done with Leptonica.
Step 4: install Tesseract
Download the latest version from the web site (here).
The next steps are the usual ones:
tar zxvf tesseract-3.00.tar.gz
cd tesseract-3.00/
./runautoconf
./configure
make
sudo make install
Same remark as above concerning /usr/local/lib: if this directory is not in your system's default library search path, add it through a file under /etc/ld.so.conf.d, as described in step 3.
Then:
sudo ldconfig
Step 5: Install learning repositories
Tesseract needs some “learning packages” to be able to recognize your language.
They are available on the web site, in the downloads section.
Their naming convention is xxx.traineddata.gz (with xxx being the code of the language).
Download each one you need:
cd /usr/local/share/tessdata
for i in chi_sim.traineddata eng.traineddata fra.traineddata
do
wget -O - http://tesseract-ocr.googlecode.com/files/${i}.gz | gunzip > /tmp/$i
sudo cp /tmp/$i .
done
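Because the loop above pipes wget into gunzip, a failed download still leaves an empty file behind. A small check like the following sketch lists the languages whose data file is missing or empty. The helper name missing_langs is hypothetical, and the directory is the one used above.

```shell
# List the language codes whose .traineddata file is missing or empty.
# missing_langs is a hypothetical helper name.
# Usage: missing_langs TESSDATA_DIR LANG...
missing_langs() {
    dir=$1
    shift
    for lang in "$@"; do
        # -s is true only if the file exists and has a size greater than zero
        [ -s "$dir/$lang.traineddata" ] || echo "$lang"
    done
}

# Prints nothing when all three language files are in place.
missing_langs /usr/local/share/tessdata chi_sim eng fra
```

Any language printed here should be downloaded again before running tesseract with it.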
Now, use it!
I went to my usual printer corner to scan a Chinese page. I received it in my mailbox in PDF format (that is a built-in feature of our printers here).
I attach it here, in case you want to test too.
First, I need to convert this PDF file into a TIFF file, using ImageMagick convert:
convert -density 300 chinois.pdf -depth 8 chinois.tif
I used the same resolution (300 dpi) as for the scan.
I also set the bit depth to 8, as this seems to matter for tesseract (from what I understood).
The next step is to OCR this TIFF file:
tesseract chinois.tif chinois -l chi_sim
Don’t forget to specify the language with the “-l” option!
This creates a text file (chinois.txt) that looks very much like the original (the remaining mistakes can easily be corrected by hand)!
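If you have several scans to process, the two commands above can be wrapped in a small function. This is only a sketch: the name ocr_pdf is hypothetical, and the 300 dpi density and 8-bit depth are simply the values used for my scan, so adjust them to yours.

```shell
# Convert a scanned PDF to TIFF, then OCR it in the given language.
# ocr_pdf is a hypothetical helper name.
# Usage: ocr_pdf FILE.pdf LANG
ocr_pdf() {
    pdf=$1
    lang=$2
    base=${pdf%.pdf}    # strip the .pdf extension
    # Run tesseract only if the conversion succeeded.
    convert -density 300 "$pdf" -depth 8 "$base.tif" &&
    tesseract "$base.tif" "$base" -l "$lang"
}

# Example: ocr_pdf chinois.pdf chi_sim
```

With the example call, the result lands in chinois.txt next to the PDF, exactly as with the manual commands.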