Article : Tess4J - Java Wrapper for Tesseract OCR API

Tess4J

DESCRIPTION

Tess4J is a JNA wrapper for the Tesseract OCR API. It provides character recognition support for common image formats and for multi-page images. The library has been developed and tested on Windows and Linux.

Tess4J is released and distributed under the Apache License, v2.0. Its official homepage is at http://tess4j.sourceforge.net.

SOFTWARE REQUIREMENTS

Java Runtime Environment 6.0 32-bit, JNA, and JAI-ImageIO are required. Apache Ant and JUnit are used for program building and unit testing.

INSTRUCTIONS

The Tesseract OCR DLL file, language data for English, and sample images are bundled with the library. Language data packs for Tesseract should be decompressed and placed into the tessdata folder.

The Linux shared object library (libtesseract.so), the equivalent of the DLL, is available in Tesseract 3.02. It can be built from source by following the instructions given in the Tesseract Wiki.

PDF support is possible via GPL Ghostscript (GS). After installing GS, ensure that its shared library (gsdll32.dll) is in the search path by setting the appropriate environment variable. On Windows, append the following to the Path value (accessible through Control Panel > System > Advanced > Environment Variables) for GS version 9.10:

  ;C:\Program Files\gs\gs9.10\bin
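To confirm that the Path change took effect, a quick check can be run from Java. The following is only a minimal sketch and is not part of Tess4J: the class name GhostscriptPathCheck and the method findOnPath are hypothetical, and the sketch only looks for the file name in each directory listed in the environment variable, which is an approximation of how the native library is actually located at run time.

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tess4J): looks for a file in the
// directories listed in a PATH-style value.
class GhostscriptPathCheck {

    // Returns the first matching file found in the listed directories, or null.
    static File findOnPath(String pathValue, String fileName) {
        if (pathValue == null) {
            return null;
        }
        String separator = Pattern.quote(File.pathSeparator);
        for (String dir : pathValue.split(separator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as a trailing separator
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // On Windows the variable is displayed as "Path"; the lookup there is
        // case-insensitive, so "PATH" works on both Windows and Linux.
        File dll = findOnPath(System.getenv("PATH"), "gsdll32.dll");
        if (dll == null) {
            System.out.println("gsdll32.dll was not found on the search path");
        } else {
            System.out.println("Found " + dll.getAbsolutePath());
        }
    }
}
```

A new command prompt usually has to be opened after editing the environment variable before the change becomes visible to a Java process.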

To run the unit tests, execute the following at the command line:

  ant test

Images to be OCRed should be scanned at a resolution of 200 to 400 DPI (dots per inch) in monochrome (black and white) or grayscale. Scanning at higher resolutions will not necessarily result in better recognition accuracy; the actual success rate depends greatly on the quality of the scanned image. Typical scan settings are 300 DPI at 1 bpp (bit per pixel) black and white or 8 bpp grayscale, saved in uncompressed TIFF or PNG format. PNG files are usually smaller than those of other image formats and still keep high quality because PNG uses lossless compression; TIFF has the advantage that a single file can contain multiple images (pages).
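If an image was captured in color, it can be converted to 8 bpp grayscale before OCR using only the standard Java imaging classes. The following is a minimal sketch under that assumption, not part of Tess4J: the class name GrayscaleSketch and its methods are hypothetical. The helper pixelsFor also illustrates the arithmetic behind the DPI figures above (pixels = inches x DPI).

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper (not part of Tess4J): prepares an image as 8 bpp grayscale.
class GrayscaleSketch {

    // Redraws the source image onto an 8-bit grayscale canvas of the same size.
    static BufferedImage toGrayscale(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    // Number of pixels needed to cover a length in inches at a given DPI.
    static int pixelsFor(double inches, int dpi) {
        return (int) Math.round(inches * dpi);
    }

    public static void main(String[] args) {
        // A 4 x 6 inch original scanned at 300 DPI is 1200 x 1800 pixels.
        BufferedImage color = new BufferedImage(
                pixelsFor(4.0, 300), pixelsFor(6.0, 300), BufferedImage.TYPE_INT_RGB);
        BufferedImage gray = toGrayscale(color);
        System.out.println(gray.getWidth() + " x " + gray.getHeight() + ", "
                + gray.getColorModel().getPixelSize() + " bpp");
        // prints "1200 x 1800, 8 bpp"
    }
}
```

The resulting image can be written out with javax.imageio.ImageIO.write(gray, "png", file), which keeps the lossless PNG format recommended above.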

Several built-in functions are also provided for merging multiple images or PDF files into a single file for convenient OCR operations, and for splitting a PDF file into smaller ones when it is large enough to cause out-of-memory exceptions.
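When splitting a large PDF, the page ranges for each piece have to be chosen first. The following sketch shows one way to plan them; it deliberately does not call the library's built-in functions. The class name PdfChunkPlanner, the method pageRanges, and the chunk size of 4 pages are all hypothetical and serve only as an illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tess4J): plans page ranges for splitting
// a large document into smaller pieces.
class PdfChunkPlanner {

    // Returns 1-based, inclusive {firstPage, lastPage} pairs covering all pages.
    static List<int[]> pageRanges(int totalPages, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<int[]> ranges = new ArrayList<int[]>();
        for (int first = 1; first <= totalPages; first += chunkSize) {
            int last = Math.min(first + chunkSize - 1, totalPages);
            ranges.add(new int[] {first, last});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 10-page document split into pieces of at most 4 pages.
        for (int[] range : pageRanges(10, 4)) {
            System.out.println("pages " + range[0] + "-" + range[1]);
        }
        // prints "pages 1-4", "pages 5-8", "pages 9-10"
    }
}
```

Each range can then be passed to whichever split function is used, and the resulting smaller files can be OCRed one at a time to keep memory use bounded.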

CODE EXAMPLES

The following code example shows common usage of the library. Make sure that libtesseract302.dll and the tessdata folder are in the same directory and that the .jar files are on the classpath.

  package net.sourceforge.tess4j.example;

  import java.io.File;
  import net.sourceforge.tess4j.*;

  public class TesseractExample {

      public static void main(String[] args) {
          // Sample image bundled with the library
          File imageFile = new File("eurotext.tif");
          Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
          // Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping

          try {
              // Run OCR on the image and print the recognized text
              String result = instance.doOCR(imageFile);
              System.out.println(result);
          } catch (TesseractException e) {
              System.err.println(e.getMessage());
          }
      }
  }