A two level approach to discretize cosmetic data using Rough set theory

Discrete values play a prominent role in knowledge extraction. Most machine learning algorithms operate on discrete values, and rules discovered from discrete values tend to be shorter and more precise, which generally improves predictive accuracy. The cosmetic industry extracts features from facial images of customers to analyze their skin problems, and these feature values are continuous in nature. A highly accurate predictive model is required to determine the cosmetic problems of the customers and to suggest suitable cosmetics. Traditional discretization techniques are not sufficient for continuous-valued cosmetic data, because a discretizer has to balance the information loss intrinsic to the process against generating a reasonable number of cut points, that is, a reasonable search space. This paper proposes a two level discretization method that combines the traditional k means clustering technique with rough set theory to discretize the continuous features of cosmetic data.


INTRODUCTION
The cosmetic industry holds huge volumes of data, which are used not only to analyze the problems of customers but also to develop new products based on those problems. Data mining algorithms help to extract from this data the information needed for decision making. However, many data mining and machine learning algorithms cannot be applied to it directly because the values are continuous in nature. Numeric attributes take a large number of values compared with discrete attributes, so the rules discovered from them look complex and give lower predictive accuracy. Discrete attributes are represented by simple interval numbers, which makes them understandable and easier to use. Rules over discrete attributes are usually shorter and easier to understand, and hence tend to increase the accuracy of predictions. Therefore, it is essential to have good discretization techniques [1] to transform continuous-valued features into discrete-valued features. This not only speeds up the mining process but also helps in developing a better model. This paper deals with a two level discretization technique for cosmetic data which first applies the traditional kmeans algorithm and then applies rough set theory to discretize the data at attribute level.

K means algorithm
Kmeans is a simple unsupervised clustering technique [2]. It follows simple steps to form the clusters. First, the number of clusters to be formed is determined. The algorithm then proceeds in three steps: initialization, expectation and maximization. In the initialization step, k centers are created, where k is the predetermined number of clusters. In the expectation step, each data point is assigned to the center closest to it. In the maximization step, each center is recomputed from the data points assigned to it. The expectation and maximization steps are repeated until the centers no longer change. This clustering technique aims at minimizing an objective function, in this case a squared error function. The objective function used is
$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \| x_i^{(j)} - c_j \|^2$,
where $\| x_i^{(j)} - c_j \|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$. J is an indicator of the distance of the n data points from their respective cluster centres [3].

Application of k means algorithm to cosmetic data discretization
Because kmeans is unsupervised, it is applied first to the sample cosmetic data to form the clusters [4]. This completes the basic discretization step, which discretizes the data into a specified number of intervals. The results are then passed to the second phase, which uses rough set theory [5].
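The initialization, expectation and maximization steps can be illustrated with a minimal sketch for a single numeric attribute. This is not the implementation used in the paper: the function names `kmeans_1d` and `objective`, the evenly spread initial centres, and the sample values are all assumptions made for illustration.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd-style k-means on a list of numbers; returns (centres, labels)."""
    pts = sorted(values)
    # Initialization (an assumption): spread the k centres evenly over the sorted data.
    centres = [pts[(2 * j + 1) * len(pts) // (2 * k)] for j in range(k)]
    labels = []
    for _ in range(max_iter):
        # Expectation: assign each point to its closest centre.
        labels = [min(range(k), key=lambda j: (x - centres[j]) ** 2) for x in values]
        # Maximization: move each centre to the mean of the points assigned to it.
        new_centres = []
        for j in range(k):
            members = [x for x, lab in zip(values, labels) if lab == j]
            new_centres.append(sum(members) / len(members) if members else centres[j])
        if new_centres == centres:  # stop when the centres no longer change
            break
        centres = new_centres
    return centres, labels


def objective(values, centres, labels):
    """Squared error J: summed squared distance of each point to its centre."""
    return sum((x - centres[lab]) ** 2 for x, lab in zip(values, labels))


# Hypothetical feature values for one attribute.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
centres, labels = kmeans_1d(values, k=3)
J = objective(values, centres, labels)
```

Each label identifies the interval a value falls into after this first discretization level; J gives a quick check of how tightly the values sit around their centres.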

Rough Set Theory
Rough set theory was proposed by Professor Pawlak (Pawlak, 1982, 1991; Skowron, 1990) [6]. The main goal of rough set analysis is the induction (learning) of approximations of concepts. It offers mathematical tools to discover patterns hidden in data. The basic concepts of rough set theory are described below.
Approximation Space: An approximation space is a pair (U, B), where U is a nonempty finite set called the universe and B is an equivalence relation defined on U.
Information System: An information system is a pair S = (U, A), where U is a nonempty finite set called the universe and A is a nonempty finite set of attributes, i.e., a: U → V_a for each a ∈ A, where V_a is called the domain of a.
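Decision Table (Data Table): A decision table is a special case of an information system, S = (U, A = C ∪ {d}), where the attributes in C are called condition attributes and d is a designated attribute called the decision attribute.
Lower and Upper Approximations: Let B ⊆ A and X ⊆ U, and let $[x]_B$ denote the equivalence class of x under the indiscernibility relation induced by B. The lower approximation of X in S is defined as
$\underline{B}X = \{ x \in U : [x]_B \subseteq X \}$
The upper approximation of X in S is defined as
$\overline{B}X = \{ x \in U : [x]_B \cap X \neq \emptyset \}$
Rough Membership Function: The rough membership function quantifies the degree of relative overlap between X and the equivalence class to which x belongs. It is therefore also a measure of the significance of B ⊆ A in describing X, and is defined by [7],
$\mu_X^B(x) = \dfrac{|[x]_B \cap X|}{|[x]_B|}$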
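The lower approximation, upper approximation and rough membership function can be made concrete with a small sketch. It uses the standard rough set definitions; the toy table, the attribute names `oiliness` and `pores`, and the target set X are hypothetical and not taken from the cosmetic dataset.

```python
from collections import defaultdict


def equivalence_classes(table, attrs):
    """Group object ids that agree on every attribute in attrs."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())


def lower_approximation(table, attrs, X):
    """Union of the equivalence classes fully contained in X."""
    return set().union(*[c for c in equivalence_classes(table, attrs) if c <= X])


def upper_approximation(table, attrs, X):
    """Union of the equivalence classes that intersect X."""
    return set().union(*[c for c in equivalence_classes(table, attrs) if c & X])


def rough_membership(table, attrs, X, x):
    """Share of x's equivalence class that lies inside X."""
    cls = next(c for c in equivalence_classes(table, attrs) if x in c)
    return len(cls & X) / len(cls)


# Hypothetical information system with already-discretized attributes.
table = {
    1: {"oiliness": "high", "pores": "large"},
    2: {"oiliness": "high", "pores": "large"},
    3: {"oiliness": "low", "pores": "small"},
    4: {"oiliness": "low", "pores": "large"},
    5: {"oiliness": "high", "pores": "small"},
    6: {"oiliness": "high", "pores": "small"},
}
B = ["oiliness", "pores"]
X = {1, 2, 5}  # hypothetical concept: objects sharing one decision value

lower = lower_approximation(table, B, X)   # objects certainly in X
upper = upper_approximation(table, B, X)   # objects possibly in X
mu_5 = rough_membership(table, B, X, 5)    # object 5 shares its class with object 6
```

Objects 5 and 6 are indiscernible under B but only object 5 belongs to X, so they fall in the upper approximation and not in the lower one.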

Application of rough set theory to refine the cut points generated by Kmeans algorithm
The traits of the clusters formed by the kmeans algorithm vary, and discretization by clustering alone is not sufficient to generate cut points with minimum information loss. Hence the cut points are refined using rough set theory concepts [8]. The main aim in splitting a cluster is to refine the discretized interval, and the refinement is intended to enhance the significance of the attribute. In rough set theory the significance of an attribute ai is measured through POSai(D), the positive region of the decision D with respect to ai. Hence maximizing POSai(D) maximizes the significance of the attribute. To maximize POSai(D), the clusters formed through kmeans are refined further to generate new intervals or cut points. The refinement proceeds in such a way that the maximum number of objects is classified by each interval of attribute ai just as they are classified by D [9]. This is done by a rough membership function applied to each interval of the attribute ai with respect to the clusters formed through kmeans, which are treated as class labels. Maximizing f(ai, cp, I) in turn maximizes POSai(D) [10]. To achieve this, each cluster generated by kmeans is examined carefully and, if necessary, is split into two or merged with the neighbouring cluster. The splitting process uses the rough set membership function so that it maximizes POSai(D). In this way the intervals are refined.

The refinement takes place as follows. Three predetermined parameters are used. Max_size is the maximum number of values that may fall in a cluster, Min_size is the minimum number of values needed to form a cluster, and Range is the maximum length of a cluster. These parameters decide whether a cluster is retained or refined further. Refinement takes place if the cluster is large or small. A cluster is large if its cardinality is greater than Max_size or its length is greater than Range. A cluster is small if its cardinality is less than Min_size. A large cluster is split into two, and a small cluster is merged with other small clusters, thereby generating new cut points or intervals. This process is repeated until there is no change in the cut points or intervals.
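The split and merge loop driven by Max_size, Min_size and Range can be sketched as follows. This is only an illustration of the control flow under stated assumptions: each cluster is a sorted list of values, a large cluster is split at its middle element as a placeholder for the rough-membership-based choice of cut point, a small cluster is merged into its left neighbour, and a cluster is split only when both halves would reach Min_size. The function names are hypothetical.

```python
def is_large(cluster, max_size, value_range):
    """Large: too many values, or the interval is longer than value_range."""
    return len(cluster) > max_size or (cluster[-1] - cluster[0]) > value_range


def is_small(cluster, min_size):
    """Small: fewer values than min_size."""
    return len(cluster) < min_size


def refine(clusters, max_size, min_size, value_range):
    """Split large clusters and merge small ones until nothing changes."""
    clusters = [sorted(c) for c in clusters]
    changed = True
    while changed:
        changed = False
        out = []
        for c in clusters:
            if is_large(c, max_size, value_range) and len(c) >= 2 * min_size:
                mid = len(c) // 2  # placeholder split point (assumption)
                out.extend([c[:mid], c[mid:]])
                changed = True
            elif is_small(c, min_size) and out:
                out[-1] = out[-1] + c  # merge with the left neighbour
                changed = True
            else:
                out.append(c)
        clusters = out
    return clusters


def cut_points(clusters):
    """Place each cut point midway between two adjacent clusters."""
    return [(a[-1] + b[0]) / 2 for a, b in zip(clusters, clusters[1:])]


# Hypothetical clusters, as if produced by the first (kmeans) level.
initial = [[1, 2, 3, 4, 5, 6, 7, 8], [10], [20, 21, 22]]
refined = refine(initial, max_size=4, min_size=2, value_range=10)
cuts = cut_points(refined)
```

Requiring at least 2 * Min_size values before a split keeps the loop from creating new small clusters, which is one simple way to guarantee that the process reaches a fixed point.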

Algorithm for the proposed method
Step 1: Consider each attribute in the data set, select its distinct values and sort them.
Step 2: Apply the kmeans algorithm to the sorted values to generate clusters.
Step 3: From the generated clusters, determine the class labels as well as the intervals.
Step 4: Refine the intervals and add the new intervals to the interval set.
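The bookkeeping around these steps can be sketched for one attribute: selecting and sorting the distinct values, deriving cut points from cluster labels, and mapping the original column onto interval numbers. The cluster labels below are hypothetical stand-ins for the clustering output, the function names are assumptions, and placing a cut point midway between neighbouring values is one possible convention.

```python
from bisect import bisect_right


def distinct_sorted(column):
    """Select the distinct values of an attribute and sort them."""
    return sorted(set(column))


def cuts_from_labels(values, labels):
    """values are sorted ascending and labels[i] is the cluster of values[i].
    A cut point is placed midway wherever the cluster label changes."""
    cuts = []
    for i in range(1, len(values)):
        if labels[i] != labels[i - 1]:
            cuts.append((values[i - 1] + values[i]) / 2)
    return cuts


def discretize(column, cuts):
    """Replace each value by the index of the interval it falls into."""
    return [bisect_right(cuts, v) for v in column]


# Hypothetical feature column with repeated values.
column = [0.31, 0.29, 0.70, 0.31, 0.72, 0.30]
values = distinct_sorted(column)
labels = [0, 0, 0, 1, 1]  # hypothetical cluster label per distinct value
cuts = cuts_from_labels(values, labels)
codes = discretize(column, cuts)
```

Because the cut points are kept sorted, `bisect_right` finds the interval of a value by binary search rather than by scanning every interval.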

Results
Cosmetic data has been collected from customers of different age groups. The facial images of the customers are captured in a controlled environment and the features are then extracted. The features are numeric in nature. Mining tools are applied to analyse the collected data, and as a preprocessing step the numeric values are discretized. To show the experimental results, a dataset of 33 samples with 17 numeric features is used. The results of applying the proposed algorithm are shown in Table 1.

Complexity
Before the kmeans algorithm is applied, the distinct values are identified and sorted. The complexity of this step is O(N log N), where N is the number of objects in the dataset. The complexity of kmeans reaches its worst case for the above algorithm when the attribute values of all objects are distinct. The complexity of the Refine function is bounded by k * N/2, where k is the number of intervals of an attribute, and the running time of the function Cut_Point is bounded by N/2. If n is the number of attributes, then the total complexity of the algorithm is bounded by
$n \cdot (N \log N + N \log N + k \cdot N/2 + N) \approx n \cdot (N \log N)$
The number of attributes n is normally small in comparison to N. Preprocessing the dataset to select the relevant attributes reduces n further relative to N. Therefore, the running time of the proposed algorithm for labeled data is bounded by N log N.

Conclusion
The proposed method obtains the natural intervals of the values of the continuous attributes, maximizing the mutual class-attribute interdependency. The method also generates what is possibly the minimum number of intervals.
Although the computational effort of the cut point search has been reduced to half of N, the size of the dataset, implementing a binary search for the cut point could reduce the complexity of the search step further.