A New Method on Data Clustering Based on Hybrid K-Harmonic Means and Imperialist Competitive Algorithm

Data clustering is one of the most common data mining techniques. The K-means algorithm is one of the most well-known clustering algorithms and is increasingly popular due to its simplicity of implementation and speed of operation. However, its performance can be affected by its sensitivity to initialization and its tendency to get stuck in local optima. The K-harmonic means (KHM) clustering method manages the issue of sensitivity to initialization, but the local optima issue still compromises the algorithm. The Particle Swarm Optimization (PSO) algorithm is a stochastic global optimization technique that offers a good solution to the above-mentioned problems. PSOKHM is a hybrid algorithm that draws upon the advantages of both algorithms and strives to overcome not only the issue of local optima in KHM but also the slow convergence speed of PSO. In the present article, the proposed GSOKHM method, which combines PSO with the evolutionary genetic algorithm within PSOKHM, is posited to enhance the PSO operation. To carry out the experiments, four real datasets have been employed; the results indicate that GSOKHM outperforms PSOKHM.


Introduction
Data clustering is one of the most essential methods in data control and management; it partitions data into classes according to their similar features. Data clustering is a process in which sets of data objects are divided into separate groups of classes (clusters) in such a way that objects in the same cluster are similar to each other while they are dissimilar to objects of other classes. Clustering has multiple applications in various spheres of activity such as pattern recognition, machine learning, data mining, information retrieval and bioinformatics. K-means (KM) is one of the techniques that are extensively used in clustering.
The principal objective in KM clustering is that the total dissimilarity among objects within a cluster be less than their dissimilarity to the centers of neighboring clusters. The most significant shortcoming of KM is that the clustering results are sensitive to the initial choice of cluster centers and may converge to local optima (1,5). K-harmonic means (KHM), which was proposed in 2002, aims at minimizing the harmonic mean of the distances of all points in a dataset from the cluster centers. Although KHM solves the initialization problem, it still wrestles with the issue of getting stuck at local optima. Therefore, to arrive at a better clustering algorithm, we need a solution that avoids getting stuck in local optima. Particle swarm optimization (PSO) is a population-based optimization technique inspired by the collective and cooperative behavior of bird flocks and fish schools. This technique can help KHM evade the local optima trap. PSOKHM attempts to benefit from both methods in order to improve the clustering process.
Our proposed method combines PSO with the evolutionary genetic algorithm within PSOKHM to improve the PSO operation. Moreover, to examine the efficiency of the proposed algorithm, four real datasets have been employed. The rest of the article is organized as follows: in section 2, the PSOKHM algorithm is discussed, and PSO and KHM are briefly reviewed. In section 3, the proposed GSOKHM method is introduced. Section 4 deals with the results of the proposed method on four real datasets, and a comparison is drawn between these results and those of its precursors. Finally, section 5 presents a summary of what has been done in this study.

The hybrid PSOKHM clustering
In order to explain the above-mentioned hybrid algorithm, the PSO and KHM algorithms will be briefly discussed first, followed by a discussion of PSOKHM.

K-harmonic means algorithm
KM clustering is a simple and rapid method that is widely used due to its ease of implementation and small number of iterations. To find the cluster centers (c1, c2, ..., ck), the KM algorithm minimizes the sum of squared distances of each point xi from the nearest cluster center cj. The efficiency of KM depends on the initialization of the centers, which is one of the major shortcomings of this algorithm. A strong association is established between data points and the nearest cluster centers, which prevents the centers from departing from the boundaries of the local density of the data. The KHM method solves this problem by replacing the minimum distance of a point from the centers, as used in KM, with the harmonic mean of the distances of each point from all centers. The harmonic mean gives each data point a weight according to its proximity to each center, which is a characteristic feature of the harmonic mean.
The following notation is used to formulate the KHM algorithm:

The data to be clustered: X = \{x_1, x_2, \ldots, x_n\}
The set of cluster centers: C = \{c_1, c_2, \ldots, c_k\}
The membership function m(c_j | x_i), which defines the share of data point x_i belonging to center c_j
The weight function w(x_i), which defines the influence of x_i on the recalculation of the center parameters in the next iteration

The basic algorithm for KHM clustering is as follows:

1) Initialize the algorithm with k estimated centers C (random selection of centers).

2) Calculate the value of the objective function:

KHM(X, C) = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} 1 / \|x_i - c_j\|^p}    (1)

where p is an input parameter with the value p ≥ 2.

3) For every data point x_i, calculate the membership function m(c_j | x_i) for each center c_j:

m(c_j | x_i) = \frac{\|x_i - c_j\|^{-p-2}}{\sum_{j'=1}^{k} \|x_i - c_{j'}\|^{-p-2}}    (2)

4) For every data point x_i, calculate its weight w(x_i):

w(x_i) = \frac{\sum_{j=1}^{k} \|x_i - c_j\|^{-p-2}}{\left( \sum_{j=1}^{k} \|x_i - c_j\|^{-p} \right)^2}    (3)

5) For every center c_j, recompute its position from all points x_i according to their membership values and weights:

c_j = \frac{\sum_{i=1}^{n} m(c_j | x_i)\, w(x_i)\, x_i}{\sum_{i=1}^{n} m(c_j | x_i)\, w(x_i)}    (4)

6) Repeat steps 2 through 5 until either the predefined number of iterations is reached or KHM(X, C) stops changing to a considerable extent.

7) Assign each point x_i to the cluster j with the largest m(c_j | x_i).

This algorithm indicates that KHM is not sensitive to the initialization of the centers, but the tendency to converge to local optima remains (1,3,12).
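The steps above can be sketched in code; the following is a minimal illustrative implementation (the function name, parameter defaults and the optional `init` argument are our own assumptions, not from the article):

```python
import numpy as np

def khm(X, k, p=3.5, iters=50, seed=0, init=None):
    """K-harmonic means sketch: returns centers, labels and the KHM(X, C) objective.

    X: (n, d) data array; k: number of clusters; p: distance exponent (p >= 2).
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    # Step 1: estimated initial centers (random selection from the data by default)
    if init is not None:
        C = np.asarray(init, dtype=float).copy()
    else:
        C = X[rng.choice(n, size=k, replace=False)].copy()
    eps = 1e-12  # floor on distances to avoid division by zero
    for _ in range(iters):
        # Pairwise distances ||x_i - c_j||, an (n, k) matrix
        dist = np.maximum(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), eps)
        d_pm2 = dist ** (-p - 2)
        # Step 3: membership m(c_j | x_i), eq. (2)
        m = d_pm2 / d_pm2.sum(axis=1, keepdims=True)
        # Step 4: weight w(x_i), eq. (3)
        w = d_pm2.sum(axis=1) / (dist ** (-p)).sum(axis=1) ** 2
        # Step 5: recompute each center c_j, eq. (4)
        mw = m * w[:, None]
        C = (mw.T @ X) / mw.sum(axis=0)[:, None]
    dist = np.maximum(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), eps)
    # Step 2: the objective KHM(X, C), eq. (1)
    obj = float((k / (1.0 / dist ** p).sum(axis=1)).sum())
    # Step 7: assign each point to the cluster with the largest membership
    labels = (dist ** (-p - 2)).argmax(axis=1)
    return C, labels, obj
```

On well-separated data the centers converge after a few iterations largely regardless of where they start, which illustrates the insensitivity to initialization discussed above.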

Particle Swarm Optimization (PSO)
Particle Swarm Optimization (PSO) was first developed by Kennedy and Eberhart in 1995 and has since been successfully employed in several scientific and applied fields. PSO is a population-based optimization algorithm in which an individual is considered as a particle, and every population consists of a number of these particles. In PSO the solution space is regarded as a search space, and every position in this search space is a candidate solution to the problem. In this population, the particles, working in collaboration, try to find the best position (the best solution) in the search space (solution space).
Moreover, every particle travels according to its velocity. At each iteration, the movement of every particle is calculated using the following formulas:

x_i(t+1) = x_i(t) + v_i(t+1)    (5)

v_i(t+1) = \omega\, v_i(t) + c_1\, rand_1 (Pbest_i(t) - x_i(t)) + c_2\, rand_2 (Gbest(t) - x_i(t))    (6)

In equations (5) and (6), x_i(t) is the position of the i-th particle at time t and v_i(t) is its velocity at time t. Pbest_i(t) is the best position found by the i-th particle so far, and Gbest(t) is the best position found by the whole population so far. ω is the inertia weight, which determines what proportion of the previous velocity is retained, and c_1, c_2 are the acceleration constants that determine the influence of the particle's best position and the global best position. In addition, rand_1 and rand_2 are random variables uniformly distributed in [0, 1]. The procedure of the PSO algorithm is shown in Figure 1.
Initialize a population of particles with random positions and velocities in the search space.
While (termination conditions are not met) {
    For each particle i do {
        Update the velocity of particle i according to equation (6).
        Update the position of particle i according to equation (5).
        Map the position of particle i into the solution space and evaluate its fitness value according to the fitness function.
        Update Pbest_i(t) and Gbest(t) if the new position is better.
    }
}
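Equations (5) and (6), together with the loop of Figure 1, can be sketched as a generic PSO minimizer (the function name, default parameter values and the box-shaped search space are our own assumptions, not from the article):

```python
import random

def pso(fitness, dim, bounds, psize=20, iters=100, w=0.72, c1=1.49, c2=1.49, seed=0):
    """Minimize `fitness` over the box `bounds` = (lo, hi) using eqs. (5)-(6)."""
    rnd = random.Random(seed)
    lo, hi = bounds
    # Random initial positions and zero initial velocities
    X = [[rnd.uniform(lo, hi) for _ in range(dim)] for _ in range(psize)]
    V = [[0.0] * dim for _ in range(psize)]
    pbest = [x[:] for x in X]                 # personal best positions
    pbest_f = [fitness(x) for x in X]
    g = pbest_f.index(min(pbest_f))
    gbest, gbest_f = pbest[g][:], pbest_f[g]  # global best position
    for _ in range(iters):
        for i in range(psize):
            for d in range(dim):
                r1, r2 = rnd.random(), rnd.random()
                # eq. (6): velocity update
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (pbest[i][d] - X[i][d])
                           + c2 * r2 * (gbest[d] - X[i][d]))
                # eq. (5): position update
                X[i][d] += V[i][d]
            f = fitness(X[i])
            if f < pbest_f[i]:                # update the personal best
                pbest[i], pbest_f[i] = X[i][:], f
                if f < gbest_f:               # update the global best
                    gbest, gbest_f = X[i][:], f
    return gbest, gbest_f
```

On a simple convex test function such as the sphere function, this sketch converges to the neighborhood of the optimum within a few dozen iterations.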

PSOKHM algorithm
KHM tends to converge faster than PSO since it requires fewer function evaluations. However, due to its greedy nature, it can get stuck in local optima. The PSOKHM hybrid clustering algorithm attempts to take advantage of both methods by combining PSO and KHM. This hybrid algorithm runs KHM for four iterations in each generation and employs eight generations to improve the particles within the population; furthermore, the PSO algorithm is run for eight iterations in each generation.
Each particle is a vector of real numbers with K*D dimensions, where K is the number of clusters and D is the dimension of the data to be clustered. A sample particle from the population is shown in Figure 2.
The result of its evaluation is the KHM objective function. A summary of the PSOKHM algorithm is illustrated in Figure 3. As the figure shows, in each generation PSO performs its number of iterations on the particles; subsequently, the KHM algorithm is applied to the results of the PSO iterations.

Figure 2: a representation of a particle

Step 1: Set the initial parameters, including the maximum iteration count IterCount, the population size Psize, ω, c1 and c2.
Step 2: Initialize a population of size Psize.
Step 5: (PSO method)
Step 5.1: Apply the PSO operator to update the Psize particles.
Step 6: (KHM method) For each particle i do:
Step 6.1: Take the position of particle i as the initial centers of the KHM algorithm.
Step 6.2: Recalculate each cluster center using the KHM algorithm.
Step 8: Assign each data point x_i to the cluster j with the largest m(c_j | x_i).
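Under the particle encoding of Figure 2, the algorithm alternates between moving flat K*D vectors with the PSO operator and refining the decoded centers with a few KHM iterations. A minimal sketch of the decode/refine/evaluate pieces (the function names and default values are ours, not the article's):

```python
import numpy as np

def decode(particle, k, d):
    """Map a flat K*D particle vector (Figure 2) to a (k, d) array of centers."""
    return np.asarray(particle, dtype=float).reshape(k, d)

def khm_objective(X, C, p=3.5):
    """KHM(X, C) from eq. (1), used as the PSO fitness of a decoded particle."""
    dist = np.maximum(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), 1e-12)
    return float((C.shape[0] / (1.0 / dist ** p).sum(axis=1)).sum())

def khm_refine(X, C, p=3.5, iters=4):
    """Step 6: a few KHM iterations (eqs. (2)-(4)) starting from a particle's centers."""
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), 1e-12)
        d_pm2 = dist ** (-p - 2)
        m = d_pm2 / d_pm2.sum(axis=1, keepdims=True)             # eq. (2)
        w = d_pm2.sum(axis=1) / (dist ** (-p)).sum(axis=1) ** 2  # eq. (3)
        mw = m * w[:, None]
        C = (mw.T @ X) / mw.sum(axis=0)[:, None]                 # eq. (4)
    return C
```

Each generation would then apply the PSO operator to the flat vectors, decode each particle with `decode`, improve it with `khm_refine`, flatten it back, and score it with `khm_objective`.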

The proposed GSOKHM method
In order to improve the efficiency of the PSO algorithm within PSOKHM, we combine PSO with another evolutionary algorithm, the genetic algorithm (GA), so that more efficient data clustering results. The genetic algorithm is a randomized algorithm that draws on selection, crossover and mutation operators.
It is one of the most well-known evolutionary algorithms and is widely used in solving optimization problems. The genetic algorithm can be very efficient in escaping the local optima of KHM and improving the efficiency of the PSO algorithm. To combine PSO and GA for this specific application, the GSO algorithm is used in the way shown in Figure 4.

Figure 4: GSO hybrid algorithm
As is evident in the figure above, the members of the population are partitioned into two equal classes at each iteration; the PSO and GA operators are applied directly to their respective classes, which are then recombined to evaluate the changes. This procedure continues until a favorable condition is reached. Furthermore, roulette-wheel selection is employed in the GA, and crossover is carried out as depicted in Figure 5. To perform mutation, components of randomly chosen particles in each generation are selected at random and replaced by another random value.
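The GA half of this loop, as described above (roulette-wheel selection, one-point crossover as in Figure 5, and random replacement of selected components for mutation), might be sketched as follows (the function names, the fitness-to-weight mapping and the mutation rate are our own assumptions):

```python
import random

def roulette_select(pop, fitness_vals, rnd):
    """Roulette-wheel selection for a minimization problem:
    a lower objective value gets a larger slice of the wheel."""
    weights = [1.0 / (1.0 + f) for f in fitness_vals]
    r = rnd.uniform(0, sum(weights))
    acc = 0.0
    for ind, wgt in zip(pop, weights):
        acc += wgt
        if acc >= r:
            return ind
    return pop[-1]

def ga_step(pop, fitness, bounds, pm=0.05, seed=0):
    """One GA generation: roulette selection, one-point crossover (Figure 5),
    and random-reset mutation of randomly chosen vector components."""
    rnd = random.Random(seed)
    fvals = [fitness(ind) for ind in pop]
    lo, hi = bounds
    children = []
    while len(children) < len(pop):
        a = roulette_select(pop, fvals, rnd)
        b = roulette_select(pop, fvals, rnd)
        cut = rnd.randrange(1, len(a))          # one-point crossover
        child = a[:cut] + b[cut:]
        for d in range(len(child)):             # random-reset mutation
            if rnd.random() < pm:
                child[d] = rnd.uniform(lo, hi)
        children.append(child)
    return children
```

In GSOKHM, one half of the population would be updated by the PSO operator and the other half by a step like this, after which the two halves are merged and re-evaluated.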

Experiments and Results
Four real datasets are employed to evaluate the proposed method: Iris, Wine, Glass and Contraceptive Method Choice (CMC), with small, medium and large dimensions. These datasets are described in (15). Table 1 shows a summary of the features of these datasets. Additionally, Table 2 lists the parameter values employed in the algorithms.

Results
In this section, the efficiency of the KHM, PSOKHM and GSOKHM methods is evaluated and compared with respect to the KHM objective function. The clustering quality is investigated using the two criteria below. The first is the sum over all data points of the harmonic average of the distances from a data point to all the centers, as shown in equation (1); it is evident that the smaller the value of this function, the better the clustering quality.
The second is the F-measure criterion, which employs the precision and recall measures from information retrieval.
Every class i, as given by the class labels in the evaluated datasets, is considered as the set of n_i items desired by a query. Every cluster j, generated by the algorithm, is regarded as the set of n_j items retrieved by a query. n_{ij} denotes the number of objects of class i within cluster j. Precision and recall for every class i and cluster j are defined as follows:

p(i, j) = n_{ij} / n_j    (7)

r(i, j) = n_{ij} / n_i    (8)

The corresponding F-measure value is calculated as follows:

F(i, j) = \frac{(b^2 + 1)\, p(i, j)\, r(i, j)}{b^2\, p(i, j) + r(i, j)}    (9)

We consider b = 1 to obtain an equal trade-off between p(i, j) and r(i, j). The global F-measure value for a dataset of size n is:

F = \sum_{i} \frac{n_i}{n} \max_{j} \{ F(i, j) \}    (10)

It is clear that the higher the F-measure value, the better the clustering quality. The reported results are averages over independent runs of the program. The proposed algorithms are implemented in MATLAB 7.6.0 (R2008a) on a Windows Vista Home Premium system with a 2.4 GHz CPU and 6 GB of RAM. The experiments carried out on the KHM algorithm so far indicate that p is a key parameter for the values of the objective function.
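The F-measure of equations (7)-(10) can be computed directly from label counts; a small sketch (the function name is our own):

```python
from collections import Counter

def f_measure(true_labels, cluster_labels, b=1.0):
    """Global F-measure, eqs. (7)-(10): for each class take the best-matching
    cluster's F(i, j) and weight it by the class size n_i / n."""
    n = len(true_labels)
    n_i = Counter(true_labels)                       # class sizes
    n_j = Counter(cluster_labels)                    # cluster sizes
    n_ij = Counter(zip(true_labels, cluster_labels)) # contingency counts
    total = 0.0
    for i, ni in n_i.items():
        best = 0.0
        for j, nj in n_j.items():
            nij = n_ij.get((i, j), 0)
            if nij == 0:
                continue
            p = nij / nj                             # precision, eq. (7)
            r = nij / ni                             # recall, eq. (8)
            f = (b * b + 1) * p * r / (b * b * p + r)  # eq. (9)
            best = max(best, f)
        total += (ni / n) * best                     # eq. (10)
    return total
```

A perfect clustering (each class mapped one-to-one onto a cluster) yields an F-measure of 1, regardless of how the cluster indices are permuted.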
To this end, our experiments were carried out for a variety of p values, and the results are presented in tables for comparison. These tables report the values of the objective function KHM(X, C) for the different values p = 2, p = 2.5 and p = 3. Moreover, in addition to the objective function KHM(X, C) and the F-measure, the runtime of the proposed algorithms was also measured and added to the tables. Finally, as the main results of the evaluation, the averages over independent runs of the algorithms are presented and compared in the tables. The results demonstrate that for all p values the mean of the KHM(X, C) function for the proposed GSOKHM is smaller than that of KHM and PSOKHM, indicating better-optimized clusters. On the other hand, we observed that, except in the case of the CMC dataset, the F-measure value of GSOKHM is higher than that of the other two algorithms, which likewise indicates greater efficiency. From the runtime perspective, GSOKHM demands much more time than KHM, but its runtime is comparable to that of the PSOKHM hybrid algorithm.
Finally, given the considerable reduction in the value of the KHM(X, C) function and the increase in the F-measure, it can be concluded that the GSOKHM algorithm produces better clustering quality than its precursors.

Summary
This article examined the hybrid algorithm PSOKHM, which is based on the advantages of both the PSO and KHM algorithms. In fact, this combination not only improves the convergence speed of the PSO algorithm but also prevents KHM from falling into local optima traps. The method proposed in the present article, GSOKHM, combines the evolutionary genetic algorithm with PSO within the PSOKHM hybrid algorithm. Four real datasets were employed to carry out the experiments. These algorithms calculate the cluster centers through the sum over all data points of the harmonic mean of a point's distance from all centers. The proposed method achieved better results than KHM and PSOKHM, and from the perspective of the F-measure criterion it also produced much more favorable results. Although this algorithm is very efficient in clustering, it demands more runtime than KHM; therefore, the method is less suitable for systems in which time is a critical factor.