Outlier Analysis of Categorical Data Using Infrequency

Anomalies are objects that behave differently and do not conform to the remaining records in a database. Detecting anomalies is an important problem in many fields. Although many methods are available for detecting anomalies in numerical datasets, only a few exist for categorical datasets. In this work, a new method is proposed that finds anomalies based on the infrequent itemsets contained in each record; these itemsets are generated by applying the Apriori property to the values of each record. Previous methods may not distinguish different records that have the same value frequencies, and therefore assign them the same score. For each record, a score is generated from its infrequent itemsets, called the MAD score in this paper. The algorithm exploits the frequency of each value in the dataset. The FPOF method uses the concept of frequent itemsets and the Otey method uses infrequent itemsets, but neither can distinguish records perfectly. The proposed algorithm has been applied to the Nursery and Bank datasets taken from the "UCI Machine Learning Repository"; numerical attributes were excluded from the datasets for this analysis. The experimental results show that the method is efficient for outlier detection in categorical datasets.


INTRODUCTION
Outlier analysis is an important research topic in many areas, such as networking, medicine and business decision-making. It focuses on detecting infrequent records in a dataset. Most existing systems concentrate on numerical or ordinal attributes; sometimes categorical values are converted into ordinal values, but this conversion is not always appropriate. This paper presents a novel method for finding anomalies in categorical data. The mechanism of the previous methods based on frequent itemsets is as follows: they compute the frequency of each value in each record and compare it with a threshold to decide whether that value is frequent. They then form all combinations of itemsets, which are subsets of the records, and find their frequencies by scanning the dataset. Based on the infrequent itemsets, a score is generated for each record, and the top k outliers are selected as the records with the least k scores. The parameters used in this method are "k", the number of outliers, and a threshold value "σ" used to decide the frequent itemsets in each data object [1].
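The subset-enumeration step described above can be illustrated with a minimal Python sketch (not the authors' code; the dataset and attribute=value encoding are invented for illustration). Each record is treated as a set of attribute=value items, and every non-empty subset is counted across the dataset:

```python
from collections import Counter
from itertools import combinations

# Toy categorical dataset: each record is a set of attribute=value items.
records = [
    {"color=red", "size=S"},
    {"color=red", "size=M"},
    {"color=blue", "size=S"},
]

def itemset_supports(records):
    """Count how many records contain each non-empty itemset
    (i.e. each combination of a record's values)."""
    counts = Counter()
    for rec in records:
        items = sorted(rec)
        for r in range(1, len(items) + 1):
            for combo in combinations(items, r):
                counts[frozenset(combo)] += 1
    return counts

supports = itemset_supports(records)
```

Comparing each count in `supports` against the threshold "σ" then decides which itemsets are frequent. Note that the number of subsets grows exponentially with the number of attributes, which is the complexity problem raised later in the paper.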

Some of the Existing Approaches for Categorical Datasets Based on Item Frequency

Frequent Pattern Outlier Factor (FPOF) algorithm
This algorithm utilizes the Apriori algorithm as a first step to find all frequent itemsets. It needs a user-defined threshold called the "minimum support" "σ" as input. Using this threshold, it forms all combinations of values in each record and compares the frequency of each combination with the threshold to decide whether it is frequent; all these combinations are subsets of records. Finding the frequency of each combination requires one scan of the dataset. Given a dataset D = {A1, A2, ..., Am}, minimum support "σ" and number of outliers "k", the score used in this algorithm is

FPOF(x_i) = Σ_{F ∈ FS(x_i)} support(F) / |FPS(D, σ)|

where FPS(D, σ) is the set of frequent itemsets that satisfy the minimum support and FS(x_i) is the set of all frequent itemsets that are subsets of the record x_i. This model finds the FPOF score for each record and selects the k records with the least scores as outliers. If a record contains no frequent itemset at all, identifying its score is a problem in this method.
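The FPOF scoring described above can be sketched in a few lines of Python (an illustrative sketch, not the authors' implementation; the toy records and min_support value are invented for the example). A record's score sums the supports of its frequent subsets, normalised by the total number of frequent itemsets:

```python
from collections import Counter
from itertools import combinations

def frequent_subsets(records, min_support):
    """All itemsets whose support (fraction of records containing
    them) meets the minimum-support threshold sigma."""
    n = len(records)
    counts = Counter()
    for rec in records:
        for r in range(1, len(rec) + 1):
            for combo in combinations(sorted(rec), r):
                counts[frozenset(combo)] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

def fpof_score(record, fps):
    """FPOF score: summed support of the record's frequent subsets,
    normalised by the number of frequent itemsets overall."""
    if not fps:
        return 0.0
    total = sum(sup for s, sup in fps.items() if s <= set(record))
    return total / len(fps)

records = [
    {"a=1", "b=1"},
    {"a=1", "b=1"},
    {"a=1", "b=2"},
    {"a=2", "b=3"},   # rare record: no frequent subsets at all
]
fps = frequent_subsets(records, min_support=0.5)
scores = [fpof_score(r, fps) for r in records]
```

The last record contains no frequent itemset, so its score is 0 regardless of how rare it is, which is exactly the scoring problem the paragraph above points out.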

a) Fast Distributed Outlier Detection (FDOD) (Otey) Algorithm
This algorithm also uses the frequent-pattern concept. It relies on the inverse Apriori property, which says that every superset of an infrequent itemset is again infrequent, so the model reduces the number of scans of the dataset. It first forms all subsets of each record and checks their support against the threshold "σ". It considers the infrequent itemsets and finds the FDOD score of each record as

FDOD(x_i) = Σ_{d ∈ IFS(x_i)} 1 / |d|

where IFS(x_i) is the set of all infrequent itemsets (those that do not satisfy the minimum support) that are subsets of the record x_i, and |d| is the number of items in the itemset d. This model finds the FDOD score for each record and selects the k records with the top scores as outliers. If a record contains no infrequent itemset at all, identifying its score is again a problem in this method.
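The FDOD scoring can likewise be sketched in Python (an illustrative sketch, not the authors' code; it assumes Otey's weighting in which each infrequent subset contributes the inverse of its length, and the toy records are invented for the example):

```python
from collections import Counter
from itertools import combinations

def subset_counts(records):
    """Count occurrences of every non-empty subset across records."""
    counts = Counter()
    for rec in records:
        for r in range(1, len(rec) + 1):
            for combo in combinations(sorted(rec), r):
                counts[frozenset(combo)] += 1
    return counts

def fdod_score(record, counts, n, min_support):
    """Sum over the record's infrequent subsets; each contributes
    1/|subset|, so the top scores flag outliers."""
    score = 0.0
    for r in range(1, len(record) + 1):
        for combo in combinations(sorted(record), r):
            if counts[frozenset(combo)] / n < min_support:
                score += 1.0 / len(combo)
    return score

records = [
    {"a=1", "b=1"},
    {"a=1", "b=1"},
    {"a=1", "b=2"},
    {"a=2", "b=3"},   # rare record: every subset is infrequent
]
counts = subset_counts(records)
n = len(records)
scores = [fdod_score(rec, counts, n, min_support=0.5) for rec in records]
```

Here the first two records have only frequent subsets and score 0, mirroring the problem noted above: a record with no infrequent itemset cannot be ranked.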

b) Attribute Value Frequency (AVF) algorithm
This algorithm offers a simple way to find a score for each object. It is a faster approach to detect outliers that minimizes the scans over the data and needs neither extra space nor a search over combinations of attribute values or itemsets. An outlier point x_i is defined based on the AVF score below:

AVFscore(x_i) = (1/m) Σ_{j=1}^{m} f(x_ij)

where f(x_ij) is the frequency of each value involved in each record, "m" is the number of attributes, "n" is the dataset size, and x_ij is the cell value in the i-th record and j-th attribute. This model finds the AVF score for each record and selects the k records with the least scores as outliers. When the above algorithms were applied to the sample data below with threshold value σ = 5, one problem was identified.
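The AVF formula above requires only per-attribute value counts and a single pass to score records, which is why it avoids the subset-enumeration cost of FPOF and FDOD. A minimal Python sketch (illustrative only, with invented toy records):

```python
from collections import Counter

def avf_scores(records):
    """AVF score of record x_i: (1/m) * sum_j f(x_ij), the mean
    frequency of its attribute values; the least scores are outliers."""
    m = len(records[0])                      # number of attributes
    freq = [Counter(rec[j] for rec in records) for j in range(m)]
    return [sum(freq[j][rec[j]] for j in range(m)) / m for rec in records]

records = [
    ("red", "S"),
    ("red", "M"),
    ("red", "S"),
    ("blue", "XL"),   # rare values in both attributes -> least score
]
scores = avf_scores(records)
```

Because the score depends only on individual value frequencies, two different records built from equally frequent values receive identical scores, which is the distinguishing problem that motivates the MAD score below.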

c) Proposed Model (MAD Score)
The proposed algorithm also uses the infrequent itemsets generated by the Apriori concept. The proposed model finds a score for each record, which we call the MAD score [10]. The comparison of results with FPOF, FDOD and AVF on the trial data is given below. Table 3. Comparison of MAD score with FPOF, FDOD and AVF.

Experimental Results
When the experiments were conducted on the bank data (45,212 records) with the proposed model, it achieved better maximum classifier accuracy than AVF. The experimental results are compared with the AVF results because, in previous research, AVF gave good results compared with FPOF and FDOD. The bank data contains 7 variables and 46 values. The bank sample was partitioned into two parts using the Clementine tool: records with the "yes" class label (5,299 records) and records with the "no" class label (39,922 records). The "yes" records are considered outliers in this experiment. 50% of the outliers (2,645 records) were selected randomly and mixed with the "no" records; the resulting mixed set (42,567 records) was used for the experiments. Both the AVF and MAD models were applied to the same mixed set to delete the top 100, 200, 300, 400, 500, 600, 700 and 800 outliers. After these outliers were deleted, different classifiers were tested; the results are given below. From these results it is concluded that classifiers built on the records after outliers are deleted by the MAD score algorithm give better results than with the AVF score. Classifiers were also applied to the original data without deleting outliers; these classifiers achieved only 88.302% accuracy.

Conclusion and Future work
To sum up, the proposed method finds distinct scores for distinguishable records, whereas the previous methods may not assign different scores to different records. The model also yields reliable records, as the classifiers achieve maximum accuracy compared with the old models. Forming the combinations of itemsets and scanning the dataset for the frequency of every itemset remains a big problem in these models, and the complexity grows as the number of attributes increases. One possibility for addressing this problem is parallel computing.