The Automated VSMs to Categorize Arabic Text Data Sets

Text categorization is one of the most important tasks in information retrieval and data mining. This paper investigates different variations of vector space models (VSMs) using the KNN algorithm. We used 242 Arabic abstract documents that were used by Hmeidi & Kanaan (1997). The bases of our comparison are the most popular text evaluation measures: Recall, Precision, and F1. The experimental results against the Saudi data sets reveal that Cosine outperformed the Dice and Jaccard coefficients.


Introduction
Text categorisation (TC) is one of the most important tasks in information retrieval (IR) and data mining (Sebastiani 2005). This is because of the significance of natural language text and the large amount of text stored on the internet, in addition to the available information libraries and document collections. Further, the importance of TC rises because it is concerned with processing and classifying natural language text using different techniques and procedures, which makes retrieval and other text manipulation processes easier to execute.
Many TC approaches from data mining and machine learning exist, such as decision trees (Quinlan, 1993), Support Vector Machines (SVM) (Joachims, 1998), rule induction (Moulinier et al., 1996), and Neural Networks (Wiener et al., 1995). The goal of this paper is to present and compare results obtained against Arabic text collections using the K-Nearest Neighbour (KNN) algorithm. In particular, three experimental runs of the KNN algorithm are performed on the Arabic data sets we consider, using three different VSMs (Cosine, Dice, Jaccard).
Generally, TC based on text similarity goes through two steps: similarity measurement and classification assignment. Term weighting is one of the well-known concepts in TC. It can be defined as a factor given to a term to reflect the importance of that term. There are many term weighting methods, including inverse document frequency (IDF), weighted inverse document frequency (WIDF) and inverse term frequency (ITF) (Tokunaga and Iwayama, 1994). In this paper, we compare different variations of VSMs with the KNN algorithm (Yang, 1999) using IDF. The bases of our comparison between the different implementations of KNN are the F1, Recall and Precision measures. In other words, we want to determine the best VSM, that is, the one which produces good F1, Precision and Recall results when merged with KNN. To the best of the authors' knowledge, no comparisons have been performed against The Saudi Newspapers (SNP) using VSM.
The paper is organized as follows: Section 2 reviews the related literature. Section 3 describes the TC problem. Experimental results are discussed and explained in Section 4, and finally conclusions and future work are given in Section 5.

Related Works
As Syiam et al. (2006) pointed out, there are over 320 million native Arabic speakers in 22 countries located in Asia and Africa. Due to its enormous energy resources, the Arab world has been developing rapidly in almost every sector, especially in economics. As a result, a massive number of Arabic text documents have been accumulating in the public and private sectors, and such documents contain useful information that can be utilised in decision making. Therefore, there is a need to investigate new intelligent methods in order to discover useful hidden information in these Arabic text collections.
A review of the existing related works shows that several methods have been proposed by researchers for Arabic text classification. The N-Gram Frequency Statistics technique for classifying Arabic text sources is investigated by Khreisat (2006). This method is based on both the Dice similarity and Manhattan distance measures in classifying an Arabic corpus. For this research, the Arabic corpus was obtained from various online Arabic newspapers. The data is associated with four categories. After carrying out several pre-processing steps on the data, and experimentation, the results indicated that the "Sport" category outperformed the other categories with respect to the recall evaluation measure. The worst-performing category was "Economy", with around 40% recall. In general, the N-gram Dice similarity measure outperformed the Manhattan distance measure.
In (Thabtah et al., 2008), the authors investigated different variations of the Vector Space Model using the KNN algorithm. They considered the following variations: Cosine modulus, Dice modulus and Jaccard modulus, using different term weighting methods. The average F1 results obtained against six Arabic data sets indicated that Dice based TF.IDF and Jaccard based TF.IDF outperformed Cosine based TF.IDF, Cosine based WIDF, Cosine based ITF, Cosine based log(1+tf), Dice based WIDF, Dice based ITF, Dice based log(1+tf), Jaccard based WIDF, Jaccard based ITF, and Jaccard based log(1+tf).
In Guo, Shao & Hua (2010), active cooperation trees were used to classify Arabic documents automatically. The documents of the corpus were gathered from Arabic web sites. The corpus consists of 6825 articles, varying in length and divided into seven categories: politics, economy, sports, medicine, science and technology, law, and religion. SVM and naive Bayes were used for comparison with the active cooperation trees. It was found that SVM and naive Bayes had the best performance among the algorithms with respect to accuracy. This is because of the method of increasing the strength of a batch that enhances the performance of C4.5. The disadvantage, however, was the decision tree algorithm itself. Another study investigated the naive Bayesian method, SVM, and a classification algorithm based on association rules for Arabic data classification. The data set consists of 5121 Arabic documents of different lengths that belong to seven categories. The experimental results showed that the classification algorithm based on association rules outperformed the naive Bayesian method and SVM with regard to the F1, precision and recall measures.
Saleh Alsaleem (2011) used NB and SVM for the automatic classification of Arabic text. He collected a large data set of 5121 documents divided into classes. It would be interesting to see the impact of extending the data set to more than 7 classes. The results show that the SVM algorithm outperformed NB.
March 31, 2014

Text Categorisation Problem
TC is the task in which texts are classified into one of a set of predefined categories based on their contents. If the texts (the data of the study) are newspaper articles, the categories could be, for example, economics, politics, sports, and so on. This task has various applications, such as automatic email classification and web-page categorization. Those applications are becoming increasingly important in today's information-oriented society.
The TC problem can be defined according to Sebastiani (2005) as follows: the documents are divided into two data sets, for training and testing. Let the training data set be {d1, d2, …, dg}, where the g documents are used as examples for the classifier and must contain a good number of positive examples for all the categories involved. The testing data set {dg+1, dg+2, …, dn} is used to test the classifier's effectiveness. The matrix shown in Table 1 represents the data splitting into training and testing. A document dy is considered a positive example of Ck if Cky = 1 and a negative example if Cky = 0.
Generally, the TC task goes through three main steps: data pre-processing, text classification and evaluation. The data pre-processing phase makes the text documents suitable for training the classifier. Then, the text classifier is constructed and tuned using a text learning approach against the training data set. Finally, the text classifier is evaluated by some evaluation measures, e.g. recall, precision, etc. The next two sub-sections discuss the main phases of the TC problem related to the data we utilised in this paper.

Data Pre-Processing on Arabic Data
The data used in our experiments are from The Saudi Newspapers (SNP) (Al-Harbi, 2008). The data set consists of 5121 Arabic documents of different lengths that belong to 7 categories: Culture "الثقافية", Economics "الإقتصادية", General "العامة", Information Technology "تكنولوجيا المعلومات", Politics "السياسية", Social "الأجتماعية", and Sport "الرياضة". Table 2 represents the number of documents for each category.
Arabic text is different from English text. Arabic is a highly inflectional and derivational language, which makes morphological analysis a complex task. Furthermore, in Arabic script, some of the vowels are represented by diacritics, which are usually left out of the text. Moreover, Arabic does not use capitalisation for proper nouns, which creates ambiguity in the text (Thabtah et al., 2008; Hammo et al., 2002). In the Arabic data set we are using, each document was saved in a separate file within the corresponding category's directory.
Moreover, we converted the Arabic data set to a form that is suitable for the classification algorithm. In this phase, we applied the following steps:
1. Each article in the Arabic data set is processed to remove the digits and punctuation marks.
2. We have followed (Samir et al., 2005) in the normalization of some Arabic letters, such as the normalization of hamza (إ or أ) in all its forms to alef (ا).
3. All the non-Arabic texts were filtered.
4. Arabic function words were removed. The Arabic function words (stop words) are the words that are not useful in IR systems.
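The four pre-processing steps above can be sketched roughly as follows. This is only an illustrative sketch in Python, not the Java implementation used in the experiments: the function names `normalize` and `preprocess`, the tiny stop-word list, the inclusion of alef with madda (آ) among the normalised forms, and the choice of Unicode range used to decide what counts as Arabic are all assumptions made for the example.

```python
import re

def normalize(text):
    # Step 2: map hamza-carrying alef forms (U+0625, U+0623, and, as an
    # assumption, U+0622) to bare alef (U+0627).
    return re.sub("[\u0625\u0623\u0622]", "\u0627", text)

# Hypothetical stop-word list for illustration only; the paper does not list
# its function words. Entries are normalised so they match normalised text.
STOP_WORDS = {normalize(w) for w in ["في", "من", "على", "إلى", "عن"]}

def preprocess(text, stop_words=STOP_WORDS):
    text = normalize(text)
    # Steps 1 and 3: keep Arabic letters and diacritics (U+0621 to U+0652)
    # and whitespace; digits, punctuation and non-Arabic text become spaces.
    text = re.sub(r"[^\u0621-\u0652\s]", " ", text)
    # Step 4: split on whitespace and drop function words.
    return [tok for tok in text.split() if tok not in stop_words]
```

Normalising the stop-word list with the same function keeps step 4 consistent with step 2, since a word such as "إلى" no longer appears in its original spelling once the text has been normalised.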

Classification Assignment
There are many approaches to assigning categories to incoming text, such as SVM (Joachims, 1998), Neural Networks (Wiener et al., 1995) and k-nearest neighbour (KNN) (Yang 1999). In our paper, we implemented text-to-text comparison (TTC), which is also known as KNN (Yang 1999). KNN is a statistical classification approach that has been studied in pattern recognition for over four decades. KNN has been successfully applied to the TC problem, e.g. (Yang and Liu, 1999; Yang, 1999), and has shown promising results compared with other statistical approaches such as Bayesian-based networks.

KNN Algorithm
The algorithm has proved its effectiveness in the supervised classification of textual data. The learning phase consists of storing the labelled examples. A new text is classified by calculating the distance between the vector representing the new document and each stored instance of the data set. The K nearest instances are selected and the document is assigned the majority class (the vote of each neighbour may be weighted according to its distance). In order to make a comparative study, and because the similarity measure plays a crucial role in the method, we used the three similarity measures (Cosine, Dice and Jaccard).
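The classification step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation used in the experiments: documents are assumed to be sparse `{term: weight}` dictionaries, the names `dot` and `knn_classify` are hypothetical, the similarity function is passed in as a parameter, and each neighbour's vote is weighted by its similarity score (one possible reading of the weighting mentioned in the text).

```python
from collections import defaultdict

def dot(a, b):
    # Inner product of two sparse vectors; a stand-in similarity for the demo.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def knn_classify(query, training, similarity=dot, k=9):
    """Label `query` by a similarity-weighted vote of its k nearest neighbours.

    training: list of (vector, label) pairs stored during the learning phase.
    """
    scored = sorted(
        ((similarity(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for score, label in scored[:k]:
        votes[label] += score  # more similar neighbours count for more
    return max(votes, key=votes.get)

training = [
    ({"goal": 1.0, "match": 1.0}, "Sport"),
    ({"goal": 1.0}, "Sport"),
    ({"bank": 1.0, "market": 1.0}, "Economics"),
]
label = knn_classify({"goal": 1.0, "bank": 0.5}, training, k=3)
```

Passing the similarity as a parameter mirrors the comparison in this paper: the same KNN procedure can be run with a Cosine, Dice or Jaccard function without changing anything else.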

Vector Space Model
The vector space model uses non-binary weights that are assigned to the index terms of documents and queries (Salton, 1968). This allows partial-match retrieval instead of strict relevant / non-relevant matching. The non-binary weights assigned to both the queries and the documents are ultimately used to measure the degree of similarity between each of the documents stored in the system and the user query. Hence, the vector model also takes into consideration documents which match the query terms only partially.
The vector model uses t-dimensional vectors to represent both documents and queries. For a document $d_j$ (where j is the document number) and a query q, the t-dimensional representations are as follows. The query q is represented as

$$\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$$

and the document $d_j$ is represented as

$$\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$$

where $w_{i,q} \ge 0$ and t is the total number of index terms in the system.
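The vector model proposes to evaluate the degree of similarity of the document $d_j$ with regard to the query q as the correlation between the vectors $\vec{d_j}$ and $\vec{q}$. This correlation can be quantified, for example, by the cosine of the angle between these two vectors (Salton, 1968). That is,

$$sim(d_j, q) = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}$$
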
Note that |d| is the number of terms in document d.
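For illustration, the three similarity measures compared in this paper can be sketched over sparse weight vectors as below. The cosine follows the definition given in the text. The Dice and Jaccard forms shown are the commonly used weighted (extended) versions and are an assumption here, since their exact formulas are not given in this section; the function names and the `{term: weight}` representation are likewise only illustrative.

```python
import math

def _dot(x, y):
    # Inner product of two sparse {term: weight} vectors.
    return sum(w * y.get(t, 0.0) for t, w in x.items())

def cosine(x, y):
    denom = math.sqrt(_dot(x, x)) * math.sqrt(_dot(y, y))
    return _dot(x, y) / denom if denom else 0.0

def dice(x, y):
    # Assumed extended Dice: 2 * sum(x*y) / (sum(x^2) + sum(y^2))
    denom = _dot(x, x) + _dot(y, y)
    return 2 * _dot(x, y) / denom if denom else 0.0

def jaccard(x, y):
    # Assumed extended Jaccard: sum(x*y) / (sum(x^2) + sum(y^2) - sum(x*y))
    xy = _dot(x, y)
    denom = _dot(x, x) + _dot(y, y) - xy
    return xy / denom if denom else 0.0
```

All three share the same numerator, the inner product of the two weight vectors, and differ only in how that overlap is normalised, which is why they can rank the same neighbours differently inside KNN.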
Index term weights can be calculated in many different ways; the most popular are (Salton & McGill, 1983):
1. Binary term weights
2. Term Frequency-Inverse Document Frequency (tf-idf) term weights, which are given by the following formula:

$$w_{i,j} = tf_{i,j} \times idf_i$$
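Let N be the total number of documents in the system and $n_i$ be the number of documents in which the index term $k_i$ appears. Let $freq_{i,j}$ be the raw frequency of term $k_i$ in the document $d_j$ (i.e., the number of times the term $k_i$ is mentioned in the text of the document $d_j$). Then, the normalized frequency $f_{i,j}$ of term $k_i$ in document $d_j$ is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max_{l} freq_{l,j}}$$

where the maximum is taken over all terms mentioned in the text of the document $d_j$. The inverse document frequency of $k_i$ is $idf_i = \log(N / n_i)$, so that $w_{i,j} = f_{i,j} \times idf_i$. For the query term weights,

$$w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_{l} freq_{l,q}}\right) \times \log\frac{N}{n_i}$$

where $freq_{i,q}$ is the raw frequency of the term $k_i$ in the text of the information request q.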
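As a rough illustration of the weighting just described, the sketch below computes tf-idf document vectors from tokenised documents. It assumes the standard forms: term frequency normalised by the most frequent term in the document, and idf = log(N / n_i) with the natural logarithm. The function name `tfidf_vectors` and the list-of-token-lists input are hypothetical choices for this example, not the implementation used in the experiments.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)                                        # N
    doc_freq = Counter(t for doc in docs for t in set(doc))   # n_i per term
    vectors = []
    for doc in docs:
        freq = Counter(doc)                                   # raw freq_{i,j}
        max_freq = max(freq.values(), default=1)
        vectors.append({
            # normalised tf times idf (natural log is an assumption)
            t: (f / max_freq) * math.log(n_docs / doc_freq[t])
            for t, f in freq.items()
        })
    return vectors

vectors = tfidf_vectors([["a", "a", "b"], ["a", "c"]])
```

In this toy input the term "a" occurs in every document, so its idf and therefore its weight is zero, while "b" and "c" each occur in a single document and receive positive weights.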

Experiment Results
Arabic text is different from English, since Arabic is a highly inflectional and derivational language, which makes morphological analysis a complex task. Also, in Arabic script, some of the vowels are represented by diacritics, which are usually left out of the text, and Arabic does not use capitalisation for proper nouns, which creates ambiguity in the text (Hammo et al., 2002).
Three TC techniques based on vector model similarity (Cosine, Jaccard, and Dice) have been compared in terms of the F1 measure, which is shown in equation (1). These methods use the same strategy to classify incoming text, i.e. KNN. There are many options for constructing a text classification method; we compared the techniques using the IDF term weighting method. All of the experiments were implemented using Java on 3 Pentium IV machine with 1GB RAM.
The F1 measure is computed using the following equation:

$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (1)$$

Table 4 gives the F1 results generated by the three algorithms (Cosine, Dice and Jaccard) against seven Arabic data sets, where in each data set we randomly select 70% of the documents for training and 30% for testing. The K parameter in the KNN algorithm was set to 9.
After analysing Table 4, we found that the Cosine-based categorizer outperformed the Dice and Jaccard algorithms on all measures (F1, Precision and Recall).
In particular, Cosine outperformed Dice and Jaccard on 6 and 5 data sets, respectively, with regard to the F1 results. The Recall results show that Cosine outperformed Dice and Jaccard on 5 and 6 data sets, respectively, and the Precision results show that Cosine outperformed Dice and Jaccard on 6 and 6 data sets, respectively.
The average of the three measures obtained against the seven Arabic data sets indicates that Cosine dominates Dice and Jaccard.
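The Precision, Recall and F1 measures used in this comparison can be computed for a single category as sketched below. This is an illustrative Python sketch only: the function name `precision_recall_f1` is hypothetical, and computing the measures per category from lists of true and predicted labels is an assumption, since the averaging scheme is not spelled out in this section.

```python
def precision_recall_f1(true_labels, predicted_labels, category):
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

true = ["Sport", "Sport", "Economics", "Sport"]
pred = ["Sport", "Economics", "Sport", "Sport"]
p, r, f1 = precision_recall_f1(true, pred, "Sport")
```

Because F1 is a harmonic mean, a classifier cannot score well on it by trading one of the two measures away for the other, which is why it is commonly reported alongside Precision and Recall.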

Conclusions and Future Works
This study set out to develop a classifier for Arabic text. We investigated different variations of the Vector Space Model using the KNN algorithm. These variations are the Cosine, Dice and Jaccard coefficients. We also used the IDF term weighting method. The average of the three measures obtained against the seven Arabic data sets indicated that Cosine is more dominant than Dice and Jaccard.