
Thaana OCR using Machine Learning.

Thaana alphabet

According to Wikipedia, “The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem on applications where clear imaging is available such as scanning of printed documents.” However, this mostly applies to typewritten documents: a sufficiently high-quality 2D bitmap of rectangles, each containing an identified Latin, Greek, Han (Chinese), or Kana (Japanese) character in one of a set of well-behaved, prespecified fonts. When it comes to eccentric unknown fonts, noisy scans, or Asian characters, it is still a complex problem to be solved.

Document Image Analysis (DIA) consists of three main modules. The first major step is to segment the textual information from a given scanned page: the layout analysis module extracts text lines, words, or characters from the page. The OCR module then does the actual text recognition. Finally, errors induced by these two modules are corrected in a post-processing step using language models or dictionary correction.

Here we will focus only on the OCR (for the Thaana script). This work was done as part of my machine learning experiments and is in no way claimed to be a fully functional Thaana OCR system; it is only a proof of concept for utilising machine learning for Thaana OCR.

This work attempts to implement a dataset structure, a synthetic data generator for producing realistic training data, and ultimately a deep-neural-net-based classifier capable of Thaana text recognition.

Language models or recognition dictionaries are usually considered an essential component of an OCR system. ThaanaOCR instead relies on a neural net alone, as deep learning classifiers can operate well even in the absence of a dedicated, detached, and specialised dictionary or language model. The original design of the classifier, as well as some of the training and evaluation infrastructure, is based on an example provided by the open source Keras project.

The system is developed in the Python programming language using the Keras deep learning library. Keras is compatible with both TensorFlow and Theano; however, this system was developed and tested with TensorFlow, and it may not work with Theano (due to the use of CTC loss calculation).

Presently the system is designed to recognise a single line of Thaana text. The primary input is limited to a single image/bitmap with a resolution of 512 x 64 pixels, the same input size used by the original Keras example. However, with minor modifications to the code, this can easily be changed to work with variable sizes. No pre-processing of the input data is done beyond ensuring the structure is compatible with the input of the neural net classifier.
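To make the input structure concrete, here is a minimal sketch of shaping a raw grayscale scan into the classifier's expected input tensor. The (width, height, channels) ordering follows the Keras image OCR example this system is based on; the exact ordering in the ThaanaOCR code may differ.

```python
import numpy as np

def prepare_input(bitmap):
    """bitmap: uint8 array of shape (64, 512), rows = image height."""
    x = bitmap.astype(np.float32) / 255.0   # scale intensities 0..255 -> 0..1
    x = x.T                                  # (64, 512) -> (512, 64): width first
    return x[:, :, np.newaxis]               # add channel dim -> (512, 64, 1)

scan = np.random.randint(0, 256, size=(64, 512), dtype=np.uint8)
print(prepare_input(scan).shape)  # -> (512, 64, 1)
```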

The deep net classifier is implemented entirely using the native built-in layers of the Keras library. Besides these, the model also includes auxiliary layers: a reshape and a repeat vector. The architecture is mostly intact from the original example implementation, with the exception of the auxiliary input and changes to the settings of the pooling layers.

The model’s complete architecture is shown below:

The method is based on the Keras example, which uses a recurrent neural network. The input data is first passed through a convolutional neural network and features are extracted (convolutional kernels can easily learn things like edge detection). Finally, the Connectionist Temporal Classification (CTC) loss function is applied to complete the model training. Presently, the way CTC is modelled in Keras requires implementing the loss function as a layer.

Convolutional Neural Networks (CNNs) share parameters across the spatial dimension, usually on image data. Convolution preserves the spatial relationship between pixels by learning image features from small squares of input data.

Every image is a matrix of pixel values. Source [6]

CNNs are well suited to learning what essentially replaces both the “preprocessing” and “feature extraction” steps of classic OCR. This helps the classifier deal with problems like a variable degree of contrast between background and foreground.

Both of the convolutional layers implemented maintain the native input resolution (512×64), each operating with 16 convolutional kernels, and both use the “ReLU” activation function with border mode set to “same”. The convolutional stage of the classifier has been left mostly as set by the Keras example source code.

Unlike traditional OCR systems, which commonly take a segmentation-based or segmentation-free approach, the method used here is a sequence learning approach. In sequence learning (or sequence labeling), supervised learning is carried out on full sequences instead of individual components. Both inputs and targets are sequences, and the job of the classifier is sequence-to-sequence mapping. A specialized algorithm known as Connectionist Temporal Classification (CTC), which is similar to the HMM forward-backward algorithm, is used to align the output activations of the neural network with the target labels. These labels are the characters.
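At inference time, the simplest way to read a label sequence out of CTC-style activations is greedy (best-path) decoding: pick the most likely symbol at each timestep, collapse consecutive repeats, and drop the blank symbol. The sketch below illustrates the idea with a hypothetical two-letter alphabet; it is not the decoder used in the Keras example.

```python
def ctc_greedy_decode(activations, alphabet, blank=0):
    """activations: per-timestep probability lists over (blank + alphabet)."""
    # Best path: the most likely symbol index at every timestep.
    path = [max(range(len(t)), key=t.__getitem__) for t in activations]
    decoded = []
    prev = None
    for idx in path:
        if idx != prev and idx != blank:   # collapse repeats, skip blanks
            decoded.append(alphabet[idx - 1])
        prev = idx
    return "".join(decoded)

# Five timesteps reading "a a <blank> b b" collapse to the label "ab".
probs = [
    [0.1, 0.8, 0.1],    # 'a'
    [0.1, 0.7, 0.2],    # 'a' (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank (separator)
    [0.1, 0.2, 0.7],    # 'b'
    [0.2, 0.1, 0.7],    # 'b' (repeat, collapsed)
]
print(ctc_greedy_decode(probs, alphabet="ab"))  # -> ab
```

The blank symbol is what lets CTC emit genuinely repeated characters: a blank between two identical letters prevents them from being collapsed into one.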

The image is a high-level illustration of how the NN is used for the OCR task. In the sequence learning paradigm, the NN is trained in a supervised manner using full sequences (text-line images) and the corresponding ground-truth sequences. The output activations are aligned with the truth using the forward-backward algorithm (CTC).

The convolutional layer we use represents a set of convolutional filters that each apply a fixed kernel at each point of the input, as described in the following equation:

$$x_{ij}^{\ell} = \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} \omega_{ab}\, y_{(i+a)(j+b)}^{\ell - 1}$$
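A direct NumPy sketch of this equation, applying an m×m kernel w at each point of the previous layer's output y (the example vertical-edge kernel is purely illustrative):

```python
import numpy as np

def convolve2d_valid(y, w):
    """Valid 2D convolution: x[i,j] = sum_{a,b} w[a,b] * y[i+a, j+b]."""
    m = w.shape[0]
    h, k = y.shape[0] - m + 1, y.shape[1] - m + 1
    x = np.zeros((h, k))
    for i in range(h):
        for j in range(k):
            x[i, j] = np.sum(w * y[i:i + m, j:j + m])
    return x

# A vertical-edge kernel responds where intensity changes horizontally.
y = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
w = np.array([[-1, 1],
              [-1, 1]], dtype=float)
print(convolve2d_valid(y, w))  # peaks at the 0 -> 1 edge in the middle column
```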

The convolutional stage of the classifier consists of two convolutional layers in series, each paired with a two-dimensional max-pooling layer. The max-pooling layers are quite simple and do no learning themselves: they simply take some k×k region and output a single value, the maximum in that region. In our case they serve as a “smart” way of decreasing the dimensionality of the information passing through. To disable pooling, set the pooling factor of both layers to 1; this might increase the practical visual accuracy, as finer-grained visual data is allowed further down into the classifier, towards the recurrent stage.
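The pooling operation itself can be sketched in a few lines of NumPy: each k×k region is reduced to its maximum, and a pooling factor of 1 leaves the input unchanged (pooling effectively disabled).

```python
import numpy as np

def max_pool2d(x, k):
    """Non-overlapping k-by-k max pooling; k=1 is the identity."""
    h, w = x.shape[0] // k, x.shape[1] // k
    return x[:h * k, :w * k].reshape(h, k, w, k).max(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [4, 5, 6, 7]], dtype=float)
print(max_pool2d(x, 2))          # -> [[4. 8.] [9. 7.]]
print(max_pool2d(x, 1).shape)    # pooling factor 1: shape unchanged, (4, 4)
```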

The recurrent stage of the classifier has been left mostly as it was in the original Keras example. The most crucial elements of this later part of the classifier are the two successive bidirectional GRU (Gated Recurrent Unit) layers. The main difference between GRU and LSTM layers is that GRU layers omit the internal memory cell. The greatest value of GRU and LSTM layers is their ability to maintain both a short- and a long-term memory of the data sample being classified as it passes through the layer. The recurrent stage of the network does all of the recognition, post-processing, and correction work.

There are several auxiliary layers between the convolutional and recurrent stages. The output of the convolutional stage is cut down in dimensionality by a dense layer, with a necessary reshape on entry into the dense layer. The last layer in the classifier is a dense layer that transforms the output of the recurrent stage into the eventual character activations. This output is in the form of a tensor; a softmax activation function is then applied to select the prevailing classification.
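The final softmax step can be illustrated in NumPy: each timestep's raw character activations become a probability distribution, and the argmax picks the prevailing character. The three-class scores here are a stand-in for the full Thaana character set.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 0.5, 0.1],   # one row of raw activations per timestep
                   [0.2, 3.0, 0.4]])
probs = softmax(scores)
print(probs.sum(axis=-1))        # each row sums to 1
print(probs.argmax(axis=-1))     # -> [0 1]: prevailing class per timestep
```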

Most machine learning problems are, in the end, optimization problems; it can basically be said that Machine Learning = Representation + Evaluation + Optimization. The original Keras example used the Keras implementation of the SGD optimizer. We have replaced SGD with the Adam optimizer, which generally promises better results for the training time required. You can use either SGD or Adam as the optimizer.
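The difference between the two update rules can be sketched on a single parameter: plain SGD steps proportionally to the raw gradient, while Adam rescales the step by running estimates of the gradient's mean and variance. This is an illustrative sketch with Adam's usual default hyperparameters, not the Keras implementation.

```python
import math

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2   # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)           # bias correction for the warm-up steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimize f(w) = w^2 (gradient 2w) for 100 steps with each optimizer.
w_sgd, w_adam, m, v = 5.0, 5.0, 0.0, 0.0
for t in range(1, 101):
    w_sgd = sgd_step(w_sgd, 2 * w_sgd)
    w_adam, m, v = adam_step(w_adam, 2 * w_adam, m, v, t)
print(w_sgd, w_adam)  # both have moved toward the minimum at 0
```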

The quality and quantity of the training dataset has a critical effect on the capability of the resulting classifier. The system uses a plain-text dataset of real Thaana words, distinguishing between monogram and bigram structures. This data is used to synthesise bitmap samples using the Cairo library, which are then further augmented with speckle noise and deformations.
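One simple form of the speckle-noise augmentation step is multiplicative Gaussian noise, clipped back to the valid intensity range. This is a sketch of the idea, not the exact generator used for ThaanaOCR; the severity parameter is a hypothetical knob.

```python
import numpy as np

def add_speckle(img, severity=0.1, seed=None):
    """Multiply each pixel by (1 + Gaussian noise), then clip to [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = img * (1.0 + severity * rng.standard_normal(img.shape))
    return np.clip(noisy, 0.0, 1.0)

clean = np.full((64, 512), 0.5)          # a flat grey 512x64 text-line bitmap
noisy = add_speckle(clean, severity=0.2, seed=0)
print(noisy.shape, noisy.min() >= 0.0, noisy.max() <= 1.0)
```

Training on many noisy variants of each synthesised word is what lets the classifier tolerate real-world scan artefacts it never saw in clean form.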

Generally, to improve the accuracy of the system there are a few things that can be done, starting with having more data. Presently the system is trained on 16,000 words per epoch. Increasing this to a figure like 32,000 is expected to improve accuracy; however, it will take more time for training.

More data does not always help, but when it comes to ML it is usually better to have a bigger dataset for training.

The batch size defines how much data is used to compute each gradient and how often the weights are updated. An epoch is one pass of the entire training data through the network, batch by batch: each time the NN has seen all samples in the dataset, an epoch is completed. I have not been able to test whether more than 25 epochs will yield better results; if you have more resources you can increase this and test whether the NN produces better results.
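The arithmetic relating these quantities is worth making explicit: updates per epoch is the dataset size divided by the batch size. The batch size of 32 below is a hypothetical choice, not a value stated for this system.

```python
words_per_epoch = 16000   # dataset size used here
batch_size = 32           # hypothetical; any divisor of the dataset size works
epochs = 25

updates_per_epoch = words_per_epoch // batch_size
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)  # -> 500 12500
```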

Though the system is trained on typed fonts, I have tested the model on some handwritten Thaana text as well, and the results are promising. This means that if you train the system with more fonts and a larger dataset, it is likely to produce better results on handwritten text.

Below are some of the output results achieved during and after training the model.

The source code for the system: https://github.com/Sofwath/thaanaOCR

Full details are available from the link below:

Source URL: Medium
