OCR in PHP: Read Text from Images with Tesseract

Optical Character Recognition (OCR) is the process of converting printed text into a digital representation. It has all sorts of practical applications — from digitizing printed books, creating electronic records of receipts, to number-plate recognition and even circumventing image-based CAPTCHAs.

Robotic eye

Tesseract is an open source program for performing OCR. You can run it on *Nix systems, Mac OSX and Windows, but using a library we can utilize it in PHP applications. This tutorial is designed to show you how.

Installation

Preparation

To keep things simple and consistent, we’ll use a Virtual Machine to run the application, which we’ll provision using Vagrant. This will take care of installing PHP and Nginx, though we’ll install Tesseract separately to demonstrate the process.

If you want to install Tesseract on your own, existing Debian-based system you can skip this next part — or alternatively visit the README for installation instructions on other *nix systems, Mac OSX (hint — use MacPorts!) or Windows.

Vagrant Setup

To set up Vagrant so that you can follow along with the tutorial, complete the following steps. Alternatively, you can simply grab the code from Github.

Enter the following command to download the Homestead Improved Vagrant configuration to a directory named ocr:

git clone https://github.com/Swader/homestead_improved ocr

We’re not going to be using Laravel, so change the Nginx configuration in Homestead.yml from:

sites:
    - map: homestead.app
      to: /home/vagrant/Code/Laravel/public

…to…

sites:
    - map: homestead.app
      to: /home/vagrant/Code/public

You’ll also need to add the following to your hosts file:

192.168.10.10       homestead.app

Installing the Tesseract Binary

The next step is to install the Tesseract binary.

Because Homestead Improved uses a Debian-based distribution of Linux, we can use apt-get to install it after logging into the VM with vagrant ssh. It’s as simple as running the following command:

sudo apt-get install tesseract-ocr

As I mentioned above, there are instructions for other operating systems in the README.

Continue reading %OCR in PHP: Read Text from Images with Tesseract%


Source: Sitepoint