Performance Analysis of Random Forests with SVM and KNN in Classification of Ancient Kannada Scripts

Ancient inscriptions, which reveal details of bygone years, are difficult for modern readers to interpret, and efforts are being made to automate the task of deciphering such historical records. The Kannada script, which is used to write the Kannada language, has gradually evolved from the ancient script known as Brahmi. It has traveled a long way from the earlier Brahmi model and has undergone a number of changes during the regimes of the Ashoka, Shatavahana, Kadamba, Ganga, Rashtrakuta, Chalukya, Hoysala, Vijayanagara and Wodeyar dynasties. In this paper we discuss the classification of ancient Kannada scripts from three different periods: Ashoka, Kadamba and Satavahana. A reconstructed grayscale image of an ancient Kannada epigraph is taken as input and binarized using Otsu's method. Normalized Central Moment and Zernike Moment features are extracted for classification. The designed Random Forest (RF) classifier is tested on handwritten base characters belonging to the Ashoka, Satavahana and Kadamba dynasties. For each dynasty, 105 handwritten samples with 35 base characters are considered. The classification rates for the training and testing base characters from the Satavahana period are determined for varying numbers of trees and thresholds of the RF. Finally, a comparative analysis of the classification rates of the designed RF against the SVM and k-NN classifiers is made for the ancient Kannada base characters from the three eras: Ashoka, Kadamba and Satavahana.


INTRODUCTION
Epigraphy is the study of ancient inscriptions and is a primary tool of archaeology when dealing with literate cultures. The disciplines of ancient history and archaeology play an important role in preserving these historical records and thus contribute to upholding the culture and heritage of the past. India is a multilingual country possessing a rich collection of ancient written scripts. Three important varieties of script that were prevalent in ancient India are the Indus Valley script, the Brahmi script and the Kharosthi script. The scripts of modern Indian languages have evolved from one of these scripts over the centuries. In India there are currently 13 scripts and 23 official languages for communication at the state level. Apart from these, there are many languages and dialects used by a number of people. The Kannada script is used to write the Kannada language, which is the official language of Karnataka state.

Properties of Kannada Script and Language
Kannada is one of the most enriched languages in India, with a long historical heritage, and the modern script has gradually evolved from the ancient script known as Brahmi. The Kannada language has evolved to its present form from the earliest written records, dating from about the third century B.C. In fact, Indian linguists have divided the whole of this evolutionary process into four broad phases. The Kannada script has traveled a long way from the earlier Brahmi model and has undergone a number of changes during the regimes of the Shatavahana, Kadamba, Ganga, Rashtrakuta, Chalukya, Hoysala, Vijayanagara and Wodeyar dynasties, as shown in Figure 1. Modern readers find it difficult to interpret these ancient scripts and need much time to decipher the ancient written records. Many researchers have been working on Indian script recognition for more than three decades, and there are many conventional OCR systems for the modern Kannada script, but very little work on ancient Kannada epigraphical scripts has been reported. Hence, there is a need to automate the deciphering of ancient epigraphical scripts, which is of much relevance in knowing the past. The results of our earlier works [1][2] on dating a given epigraph to the corresponding period were found to be satisfactory. The work on dating inscriptions using a Support Vector Machine for ancient Kannada scripts [1] used only a few unique characters from the character bank of a script for classification. Later, in [2], a Random Forest classifier was designed, and the system dates an input ancient Kannada epigraph considering the complete character set pertaining to a specific era. The classifier reads reconstructed images of epigraphs and categorizes each into one of six periods: Ashoka, Satavahana, Kadamba, Chalukya, Rastrakuta and Hoysala.
The proposed work performs a comparative study of the performance characteristics of the designed RF against the SVM and k-NN classifiers provided by the OpenCV library. This paper is organized as follows: a few works on OCR of Indian and non-Indian scripts are reported in Section 2. The system architecture of the proposed work and a description of the approaches are covered in Section 3. The structure chart and methodology of the proposed system are detailed in Section 4. Experimental results and analysis are presented in Section 5, and concluding remarks are provided in Section 6.

LITERATURE SURVEY
In this section, some of the contributions to Optical Character Recognition of ancient and modern documents, and to performance analysis, are covered. The RF classifier has been used on the Persian language [3] to classify handwritten Persian characters with Loci features; a classification rate of up to 87% was achieved. The performance of the RF classifier for handwritten digit recognition is reported in [4]. A feature extraction technique based on a grayscale multi-resolution pyramid was chosen to explore how the RF parameters affect the recognition accuracy, and classification rates of 85%-93% are reported. [5] describes a method for recognizing ancient Tamil scripts from temple wall inscriptions. It uses Fourier and wavelet features to describe the characters and the k-means algorithm for character recognition, and claims a maximum recognition accuracy of 84%. An effective system for the classification of ancient handwritten documents according to writing style is reported in [6]. It employs a set of features extracted from the contours of the handwritten images. These features are based on direction and curvature histograms that are extracted at a global level from local contour observations. Two writings are then compared by computing the distance between their respective histograms. An identification rate of 94% is obtained. A method for dating the content of Greek inscriptions [7] uses a "platonic" realization of the alphabet symbols for the specific inscription, with various geometric characteristics as features, and classifies the period according to statistical criteria. A study on the recognition of ancient middle Persian documents [8] chooses a set of invariant moments as features and uses minimum mean distance, k-Nearest Neighbours (KNN) and Parzen classifiers. A classification rate of 90.5%-95% was achieved. Characterization of Arabic and Latin ancient document images is explained in [9]. 
Regions of images having the same size are extracted from the heterogeneous base, and the fractal dimension method is used to discriminate between ancient Arabic and Latin scripts. It achieves 95.87% accuracy in discriminating between Arabic and Latin ancient document collections. It claims that the advantage of the approach is that it can be easily adapted for the identification of other ancient document collections. [10] proposes a texture-based approach for text recognition in ancient documents. It copes with challenges such as degradation, staining, fluctuating text lines and superimposition of text. The approach is applied to three different manuscripts, namely Glagolitic manuscripts of the 11th century, a Latin manuscript and a composite Latin-German manuscript, the latter two originating from the 14th century. [11] presents an approach for the detection of elements such as initials, headlines and text regions, focused on ancient manuscripts. SIFT descriptors are used to detect the regions of interest, and the scale of the interest points is used for localization. It gives a detection rate of 57% for initials and headlines, and 74% for regular text. An approach for transcribing historical documents [12] divides a text-line image into frames and constructs a graph from the framed image. Dijkstra's algorithm is then applied to find the line transcription. A character accuracy of 79.3% is reported in its experiments. An efficient technique for multi-script identification at the connected component level using a convolutional neural network is described in [13]. Suitable script identification features are automatically extracted and learned as convolution kernels from the raw input. It is tested on a dataset of ancient Greek-Latin mixed-script document images; an accuracy of 96.37% is achieved on a test dataset at the connected component level, improved to 98.40% by using the class majority in the left-right neighboring area. 
A method to efficiently create the ground truth needed to train and test different classifiers is described in [14]. Since manual labelling of the data is a tedious process, the data is represented at different abstraction levels and clustered in an unsupervised manner. The resulting clusters are labeled by human experts. With this method, less than 0.5% of the data is manually labeled, and recognition rates of 86.21% and 94.81% are achieved for two different sets of scripts. [15] describes age identification of ancient Kannada scripts. A hybrid neural network classifier is used, in which the first phase incorporates an Artificial Neural Network for identifying the base character and the second phase uses a Probabilistic Neural Network model designed to identify the age pertaining to the base character. A system to classify both printed and handwritten Kannada numerals is discussed in [16]. An average recognition rate of 97% is obtained using SVM with Fourier descriptors and chain codes.
The problem of recognizing accented and non-accented characters in French handwriting is reported in [17]. The performance of the SVM is degraded by the presence of accents. An accented character is therefore segmented into two parts: the root character or letter, and the accent. These two parts are recognized separately, and the results are combined to rebuild the accented character. This approach avoids combining characters and accents, which would increase the number of classes to be considered, and yields higher recognition accuracy. An OCR system for handwritten text documents in Kannada using a Support Vector Machine and Zernike Moment features is described in [18]. The recognition is independent of the size of the handwritten text, and the system achieves a recognition rate of around 94%. The paper [19] proposes a new OCR technique using Gabor filters and Support Vector Machines (SVM). The proposed model is trained and validated for two languages, English and Tamil, and works for the entire character set in both languages, including symbols and numerals. In addition, the model can recognise the characters of six different fonts in English and twelve different fonts in Tamil. The average recognition accuracy is 97% for English and 84% for Tamil, achieved in just three iterations of training. The work [20] deals with reconstructing handwritten scanned images into text. Using the supervised learning algorithm SVM for classification, classes are mapped onto Unicode for recognition. Finally, the text is reconstructed using Unicode fonts, producing readable and editable documents. A simple method for converting ancient Tamil handwritten scripts into text format is proposed in [21]. There are thousands of Tamil palm manuscripts that are yet to be digitized, and the aim of that paper is to convert the palm manuscript image into digitized text format. 
In the paper [22], ancient Tamil handwritten characters are recognized by applying a genetic algorithm technique based on the basic features of handwritten characters, namely loop, line, and the location of the loop and line connection. The system generates a 66-bit string chromosome to represent a handwritten character and then uses this chromosome to identify each handwritten character. An Automatic License Plate Recognition (ALPR) system using Python and OpenCV [23] addresses the complex characteristics arising from light and speed. The ALPR system is implemented using open source tools and software, including Python and the Open Computer Vision Library. A character recognition rate of 94.3% is achieved.

PROPOSED METHOD AND BACKGROUND OF THE APPROACHES
The proposed system classifies the characters of ancient Kannada scripts from three different periods: Ashoka, Kadamba and Satavahana. The methods and techniques used in the proposed work are described in this section. Classification: the RF classifier is designed as described in [2]. The designed RF is trained using the features stored in a text file, and the trained classifier is then used to classify the ancient Kannada characters.

The System Architecture
Similarly, the SVM and k-NN classifiers provided in OpenCV are used in the classification of the ancient characters.

Classification Methods Used In The Proposed Work
Many text classifiers have been proposed in the literature using machine learning techniques, probabilistic models, etc. They often differ in the approach adopted: decision trees, naïve Bayes, rule induction, neural networks, nearest neighbors, and support vector machines. Although many approaches have been proposed, automated text classification is still a challenging area of research. Hence, in the current work, a comparative study of the designed RF classifier against the SVM and k-NN classifiers supported in the OpenCV library is made for the classification of characters of the ancient Kannada script.

Random Forest (RF)
In this work, we use the random forest (RF) classifier for document image classification. The RF is an ensemble-based learning algorithm that constructs a set of tree-based classifiers and then classifies new data points by taking a vote over the predictions of the individual classifiers. In an RF the classifiers are decision trees: a series of classification trees is constructed and then used to classify a new example. To reduce the variance of the estimated prediction function of a forest, a technique known as bagging, or bootstrap aggregation, is used. The idea is to construct multiple decision trees, each of which uses a subset of attributes randomly selected from the original set of attributes. There are multiple reasons to select the RF over other classifiers for this problem. The RF has been shown to work well when many features (on the order of thousands) are available. It does not over-fit as the number of features increases, and it increases diversity among the classifiers by resampling the data and by changing the feature sets over the different classifiers (trees). Random selection of the features used to split each node makes it more robust to noisy data.
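The three ingredients described above (bootstrap resampling, random attribute subsets and voting) can be illustrated with a short Python sketch. It is not the classifier designed in [2]; the function names, the seed and the toy labels are hypothetical, and the tree-growing step itself is left out.

```python
import random
from collections import Counter

def bootstrap_sample(samples, rng):
    """Bagging: draw len(samples) items with replacement."""
    return [rng.choice(samples) for _ in samples]

def random_feature_subset(n_features, subset_size, rng):
    """Pick the attribute indices that one tree is allowed to use."""
    return sorted(rng.sample(range(n_features), subset_size))

def majority_vote(tree_predictions):
    """Combine the per-tree predictions into the forest's answer."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)            # fixed seed, only for repeatability
data = list(range(10))            # stand-in for 10 training samples
print(bootstrap_sample(data, rng))        # drawn with replacement, so repeats are possible
print(random_feature_subset(8, 3, rng))   # 3 of 8 hypothetical features
print(majority_vote(["ka", "ga", "ka"]))  # ka
```

Each tree in a forest would be fitted on its own bootstrap sample and feature subset, and `majority_vote` would then be applied to the labels returned by all trees for a test character.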

Support Vector Machine (SVM)
The objective of any machine capable of learning is to achieve good generalization performance, given a finite amount of training data, by striking a balance between the goodness of fit attained on a given training dataset and the ability of the machine to achieve error-free recognition on other datasets. With this concept as the basis, support vector machines have been shown to achieve good generalization performance with no prior knowledge of the data [25]. The principle of an SVM is to map the input data onto a higher dimensional feature space, nonlinearly related to the input space, and to determine a separating hyperplane with maximum margin between the two classes in that feature space. A support vector machine is thus a maximal margin hyperplane in feature space, built by using a kernel function in the input space. This results in a nonlinear boundary in the input space. The optimal separating hyperplane can be determined without any computations in the higher dimensional feature space by using kernel functions in the input space.
An SVM in its elementary form can be used for binary classification. It may, however, be extended to multiclass problems using either the one-against-the-rest approach or the one-against-one approach. We begin our experiment with SVMs that use the linear kernel because they are simple and can be computed quickly. No kernel parameters need to be chosen to create a linear SVM, but it is necessary to choose a value for the soft-margin parameter (C) in advance.
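To make the difference between the two multiclass extensions concrete, the following Python sketch enumerates the binary problems each one creates. It assumes, purely for illustration, that each of the 35 base characters is treated as one class; the function names are hypothetical.

```python
from itertools import combinations

def one_vs_rest_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]

def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

classes = list(range(35))  # assumption: one class per base character
print(len(one_vs_rest_tasks(classes)))  # 35
print(len(one_vs_one_tasks(classes)))   # 595
```

One-against-the-rest trains fewer classifiers, each on all of the data, whereas one-against-one trains k(k-1)/2 classifiers, each on the samples of only two classes.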
A classification task usually involves training and testing data, each consisting of a number of data instances. Each instance in the training set contains one "target value" (class label) and several "attributes" (features). The goal of an SVM is to produce a model that predicts the target values of the instances in the testing set, given only their attributes. Given a training set of instance-label pairs $(x_i, y_i)$, $i = 1, \ldots, l$, where $x_i \in R^n$ and $y \in \{1, -1\}^l$, the support vector machine (SVM) requires the solution of the following optimization problem:

$$\min_{\omega,\, b,\, \xi} \;\; \frac{1}{2}\,\omega^T \omega + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i\left(\omega^T \phi(x_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0,$$

where $\xi$ is an $l$-dimensional vector, and $\omega$ is a vector in the same feature space as the $x_i$. The values $\omega$ and $b$ determine a hyperplane in the original feature space, giving a linear classifier. Here the training vectors $x_i$ are mapped into a higher (possibly infinite) dimensional space by the function $\phi$. The SVM then finds a linear separating hyperplane with the maximal margin in this higher dimensional space. $C > 0$ is the penalty parameter of the error term. Furthermore, $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$ is called the kernel function.
Kernel functions allow linear classification techniques to be applied to non-linear classification problems, based on the concept of decision planes that define decision boundaries. Commonly used kernels include:
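For reference, four kernels that commonly appear in the SVM literature are written out below. The parameter names $\gamma$, $r$ and $d$ follow conventional notation and are assumptions here, not symbols defined in this work; only the linear kernel is used in the experiments reported later.

```latex
\begin{aligned}
\text{Linear:} \quad & K(x_i, x_j) = x_i^T x_j \\
\text{Polynomial:} \quad & K(x_i, x_j) = \left(\gamma\, x_i^T x_j + r\right)^d, \quad \gamma > 0 \\
\text{Radial basis function (RBF):} \quad & K(x_i, x_j) = \exp\left(-\gamma\, \lVert x_i - x_j \rVert^2\right), \quad \gamma > 0 \\
\text{Sigmoid:} \quad & K(x_i, x_j) = \tanh\left(\gamma\, x_i^T x_j + r\right)
\end{aligned}
```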

k-Nearest Neighbors (k-NN)
Another simple classifier is a K-nearest neighbor classifier with a Euclidean distance measure between input images [26].
The algorithm caches all of the training samples and predicts the response for a new sample by analyzing a certain number (K) of the nearest neighbors of the sample (using voting, calculating a weighted sum, etc.). The method is sometimes referred to as "learning by example", because for prediction it looks for the feature vector with a known response that is closest to the given vector. This classifier has the advantage that no training time, and no complex processing on the part of the designer, are required. However, the memory requirement and recognition time are large: the complete set of training images must be available at run time.
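A minimal Python sketch of this voting scheme, using the Euclidean distance, is given below. It is an illustration only and not the OpenCV implementation used in the experiments; the function name and the toy vectors are hypothetical, and ties between labels are broken arbitrarily.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=5):
    """Label `query` by majority vote among its k nearest training vectors."""
    # Euclidean distance from the query to every cached training vector.
    distances = [
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train, labels, (0.2, 0.2), k=3))  # a
print(knn_predict(train, labels, (5.5, 5.5), k=3))  # b
```

The sketch makes the trade-off stated above visible: nothing is computed at training time, but every prediction scans the whole training set.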

SYSTEM STRUCTURE CHART AND METHODOLOGY
The structure chart of the system designed for classification of ancient Kannada characters using RF is shown in Figure 3. The steps towards the classification are as follows:
Step 1: [Preprocessing]: A reconstructed image of an epigraph is read in grayscale format, converted to a binary image using Otsu's approach, and segmented into characters using connected components labeling.
Step 2: [Feature Extraction]: The Normalized Central Moments and Normalized Zernike Moments are computed, and the computed feature vectors are written to a text file.

Step 3: [Random Forest Classification]
Step 3a: [Load Text]: Reads the feature vectors from the text file and saves them in two arrays, one holding the classes and the other holding the feature vectors of the corresponding classes.
Step 3b: [Fit Forest]: Trains the trees in the RF, which can then be used to classify the ancient Kannada characters.
Step 3c: [Fit Tree]: Takes a random subset of the training data from the sub-module Fit Forest as input and builds a single classification tree for that subset.
Step 3d: [Get Gini Impurity]: Determines the impurity index of a subset of classes and the corresponding data at a node, so that the best split and the best threshold value for a feature can be determined.
Step 3e: [Predict]: Given data consisting of feature vectors, predicts the class of the given test characters using the trained RF classifier.
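Steps 1, 2 and 3d can be sketched in plain Python as follows. This is a minimal illustration and not the implementation from [2]: the function names are hypothetical, the image is assumed to be a flat list of grey levels for Otsu's method and rows of 0/1 values for the moments, and `n_thresholds` is interpreted, as an assumption, as the number of evenly spaced candidate split values tried per feature. Connected components labeling and Zernike moments are omitted for brevity.

```python
from collections import Counter

def otsu_threshold(pixels):
    """Step 1: grey level (0-255) that maximises the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their grey levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0, mean1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t  # binarize with: pixel > best_t

def normalized_central_moment(image, p, q):
    """Step 2: eta_pq = mu_pq / mu_00 ** (1 + (p + q) / 2) for a binary image."""
    points = [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v]
    m00 = len(points)
    cx = sum(x for x, _ in points) / m00
    cy = sum(y for _, y in points) / m00
    mu_pq = sum((x - cx) ** p * (y - cy) ** q for x, y in points)
    return mu_pq / m00 ** (1 + (p + q) / 2)

def gini_impurity(labels):
    """Step 3d: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels, n_thresholds=10):
    """Step 3d: candidate threshold with the lowest weighted Gini impurity."""
    lo, hi = min(values), max(values)
    best = (None, float("inf"))
    for i in range(1, n_thresholds + 1):
        t = lo + (hi - lo) * i / (n_thresholds + 1)
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

print(otsu_threshold([50] * 10 + [200] * 10))  # 50
print(gini_impurity(["a", "a", "b", "b"]))     # 0.5
```

Under this reading, a larger number of thresholds means more candidate splits are evaluated per node during training, while prediction only compares a feature value with the stored threshold at each node.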

EXPERIMENTAL RESULTS AND ANALYSIS
The experimental results and analysis of the designed RF [2] for classifying ancient Kannada epigraphical characters are discussed in this section.
The SVM and KNN classifiers provided by the OpenCV library are used to analyze the performance characteristics. The parameters used for the classifiers are as follows. SVM classifier: kernel type = linear; SVM type = C-SVM (type 1). KNN classifier: k = 5.

Experimental Results and Evaluation Metrics
The dataset consists of reconstructed images of ancient Kannada epigraphs. The RF classifier is tested on handwritten base characters belonging to the Ashoka, Satavahana and Kadamba dynasties. For each dynasty, 105 handwritten samples with 35 base characters are considered. Two-thirds of the data is used for training and the remaining one-third is used for testing the classifiers. Figure 4 shows the GUI of the system with the epigraph image selected, and Figure 5 depicts the feature vectors provided for training the classifiers.
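The two-thirds / one-third split can be sketched as below. The sketch assumes three samples per base character (105 samples over 35 characters) and splits within each character so that every class appears in both sets; whether the split in this work was made per class is an assumption, and the function name and placeholder features are hypothetical.

```python
import random
from collections import defaultdict

def split_per_class(samples, train_fraction=2 / 3, seed=0):
    """Split (features, label) pairs into train and test sets within each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append((features, label))
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# 35 characters x 3 samples each, with placeholder feature vectors.
samples = [([0.0], character) for character in range(35) for _ in range(3)]
train, test = split_per_class(samples)
print(len(train), len(test))  # 70 35
```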

Evaluation Metrics
Metrics are measures that facilitate the quantification of particular characteristics. The metrics used to evaluate the proposed system are the following. Classification rate: this metric is used to determine the accuracy of the classifier, and is given by the number of correct classifications out of the total number of samples considered. Classification time: the time taken to predict the class labels for the given set of inputs.
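Both metrics are straightforward to compute; a minimal Python sketch follows. The function names are hypothetical, and `predict_fn` stands for any trained classifier's single-sample prediction function.

```python
import time

def classification_rate(predicted, actual):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def timed_predict(predict_fn, inputs):
    """Return the predictions and the wall-clock seconds spent predicting."""
    start = time.perf_counter()
    predictions = [predict_fn(x) for x in inputs]
    elapsed = time.perf_counter() - start
    return predictions, elapsed

print(classification_rate(["a", "b", "c", "d"], ["a", "b", "x", "d"]))  # 0.75
```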

Classification Rate of RF on the Characters from Satavahana Period
The accuracy of the RF in classifying characters from the trained data set of the Satavahana period, for the threshold value 10 and a varying number of trees, is tabulated in Table 1. The plot in Figure 7 illustrates the same results on the trained data. The classification rates for new characters are tabulated in Table 2 and shown in Figure 8.

Times for training and testing of RF on the characters from Satavahana period
The times (in seconds) for training on characters from the Satavahana period are tabulated in Table 3 and plotted in Figure 9. As the number of trees in the forest increases, the time taken for training also increases proportionately. The classification times (in seconds) for new characters are tabulated in Table 4. Each column indicates the number of trees used and the rows indicate the threshold values. The plot of the classification times for different parameters is shown in Figure 10. The designed RF classifier is tested on 105 handwritten samples with 35 base characters belonging to each of the dynasties: Ashoka, Kadamba and Satavahana.
These handwritten characters are also tested with the SVM and KNN classifiers provided by the OpenCV library. The parameters used for the classifiers are as follows. SVM classifier: kernel type = linear; SVM type = C-SVM (type 1). KNN classifier: k = 5. RF classifier: number of trees = 30; number of thresholds = 10.

Classification of characters from Ashoka period
The classification rates for handwritten ancient Kannada characters from the Ashoka period are tabulated in Table 5. The number of samples taken for training is 70, and 35 samples are taken for testing. The classification accuracy of the designed RF is compared with that of SVM and k-NN. Figure 11 shows the corresponding plot of the classification rates using RF, SVM and k-NN for characters from the Ashoka period.

Classification of characters from Kadamba period
The classification accuracy of the designed RF is compared with that of SVM and k-NN for handwritten ancient Kannada characters from the Kadamba period, and the results are tabulated in Table 6. The number of samples taken for training is 70, and 35 samples are taken for testing. Figure 12 shows the corresponding plot of the classification rates using RF, SVM and k-NN for characters from the Kadamba period.

Classification of characters from Satavahana period
The classification rates for handwritten ancient Kannada characters from the Satavahana period are tabulated in Table 7. The number of samples taken for training is 70, and 35 samples are taken for testing. The classification accuracy of the designed RF is compared with that of SVM and k-NN. Figure 13 shows the corresponding plot of the accuracy in classifying characters from the Satavahana period using RF, SVM and k-NN.

Figure 13. Classification Rates Of Handwritten Characters From Satavahana Period
The following inferences are drawn from the performance analysis:


The accuracy in classification of the trained data is at least 1.08 times greater than the classification rate of new characters for any classifier.


There is a linear increase of classification rate as the number of trees in the forest is increased, but no significant changes when the number of thresholds is increased.


The training time is directly proportional to the number of classification trees and the number of thresholds.


The classification time is directly proportional to the number of classification trees. It does not depend on the number of thresholds, since thresholds are used only when growing the trees.


The training time of the RF classifier is about 200 times more than the classification time. This is because most of the time is spent for the calculation of Gini index during training. Classification involves only a comparison at each node till it reaches the leaf.


The RF classifier works better in classifying the trained characters. Hence a well-trained RF classifier will perform comparatively better than SVM and k-NN in the classification of new characters.


The SVM classifier works well in the classification of new characters. Hence SVM performs comparatively better than RF and k-NN when the classifier is not trained with sufficient samples.


The capability of the KNN classifier lies between that of RF and SVM in the classification of both trained and test data.

CONCLUSION
An RF classifier is designed and tested in the classification of ancient Kannada epigraphical characters prevailing during the regimes of the Ashoka, Satavahana and Kadamba dynasties. The performance characteristics of the RF, in terms of accuracy and time, are observed for the classification of trained and test characters from the Satavahana dynasty. The performance analysis of the RF indicates that fixing the number of thresholds at 10 is a good tradeoff between training time and classification rate. Finally, a comparative study of the performance of the designed RF against SVM and k-NN is carried out for the classification of ancient characters from the Ashoka, Satavahana and Kadamba dynasties. The strengths and weaknesses of these classification methods are also discussed. The current work can be extended to the recognition of characters from ancient scripts. It thus finds scope in the development of an automatic recognition system for deciphering ancient epigraphical records, which is of great relevance to archaeologists exploring the details of the past.