Simulated Annealing Algorithm For Feature Selection

In physical annealing, a solid is heated until all of its particles arrange themselves randomly in the liquid state; a slow cooling process is then used to crystallize the liquid. Simulated annealing is a stochastic computational technique, inspired by this process, that searches for globally optimal solutions to optimization problems. The main idea is to give the algorithm more time to explore the search space by accepting moves that may degrade the solution quality, with a probability that depends on a parameter called the temperature. In this discussion, the simulated annealing algorithm is applied to a pest and weather data set for feature selection, reducing the dimensionality of the attribute set over a specified number of iterations.


1.INTRODUCTION
Simulated annealing is a general-purpose stochastic search algorithm inspired by a process used in metallurgy. The heating and slow cooling technique of annealing allows the initially excited and disorganized atoms of a metal to find strong, stable configurations. Likewise, simulated annealing seeks solutions to optimization problems by initially manipulating the solution at random (high temperature), and then slowly increasing the proportion of greedy improvements taken (cooling) until no further improvements are found. Feature selection is a process that selects a subset of the original features, where the optimality of a feature subset is measured by an evaluation criterion. As the dimensionality of a domain expands, the number of features N increases. A typical feature selection process consists of four basic steps: subset generation, subset evaluation, stopping criterion, and result validation. Subset generation is a search procedure that produces candidate feature subsets for evaluation based on a certain search strategy. Each candidate subset is evaluated and compared with the previous best one according to a certain evaluation criterion; if the new subset turns out to be better, it replaces the previous best subset. The process of subset generation and evaluation is repeated until a given stopping criterion is satisfied. The selected best subset then usually needs to be validated by prior knowledge or by different tests on synthetic and real-world data sets. Feature selection is used in many areas of data mining, such as classification, clustering, association rules, and regression.
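The four-step process described above can be sketched as a generic loop. The evaluation criterion and random subset generator below are toy stand-ins, not part of the original discussion:

```python
import random

def feature_selection(features, evaluate, generate_subset, max_iters=100):
    """Generic feature selection loop: subset generation, subset
    evaluation, and a stopping criterion (here, an iteration budget).
    Result validation is done separately on held-out data."""
    best_subset, best_score = None, float("-inf")
    for _ in range(max_iters):
        candidate = generate_subset(features)   # subset generation
        score = evaluate(candidate)             # subset evaluation
        if score > best_score:                  # keep the better subset
            best_subset, best_score = candidate, score
    return best_subset

# Toy example: a hypothetical criterion that rewards two "relevant"
# features and lightly penalizes subset size.
random.seed(0)
features = ["max_temp", "min_temp", "RH1", "RH2", "rainfall"]
relevant = {"max_temp", "RH1"}

def evaluate(subset):
    return len(relevant & set(subset)) - 0.1 * len(subset)

def random_subset(feats):
    return [f for f in feats if random.random() < 0.5]

best = feature_selection(features, evaluate, random_subset)
```

With this toy criterion, the loop settles on a subset containing the two relevant features; any real use would substitute a proper evaluation function such as a model's accuracy or mean squared error.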

2.HISTORY AND MOTIVATION
Simulated annealing is so named because of its analogy to the process of physical annealing of solids, in which a crystalline solid is heated and then allowed to cool very slowly until it achieves its most regular possible crystal lattice configuration (i.e., its minimum lattice energy state), and thus is free of crystal defects. If the cooling schedule is sufficiently slow, the final configuration is a solid with superior structural integrity. Simulated annealing establishes a connection between this thermodynamic behavior and the search for global minima of a discrete optimization problem, and it provides an algorithmic means for exploiting this connection. At each iteration of a simulated annealing algorithm applied to a discrete optimization problem, the objective function values of two solutions (the current solution and a newly selected solution) are compared. Improving solutions are always accepted, while a fraction of non-improving (inferior) solutions are accepted in the hope of escaping local optima in search of a global optimum. The probability of accepting non-improving solutions depends on a temperature parameter, which is typically non-increasing with each iteration of the algorithm. The key algorithmic feature of simulated annealing is that it provides a means to escape local optima by allowing hill-climbing moves (i.e., moves which worsen the objective function value). As the temperature parameter is decreased to zero, hill-climbing moves occur less frequently, and the solution distribution associated with the inhomogeneous Markov chain that models the behavior of the algorithm converges to a form in which all the probability is concentrated on the set of globally optimal solutions.

3.LITERATURE REVIEW
SA has been extensively applied to deterministic optimization problems, and the theoretical basis of the algorithm for this application has been known for a number of years. Many instances of practical and difficult problems have been successfully solved by simulated annealing. The effectiveness of SA is attributed to its ability to explore the design space by means of a neighborhood structure and to escape from local minima by probabilistically allowing uphill moves controlled by a temperature parameter. Kirkpatrick (1983) recognized the similarity between combinatorial optimization problems and the physical process of annealing, and simulated annealing became one of the more popular optimization algorithms. Sullivan and Jacobson (2001) studied generalized hill-climbing algorithms and their performance, extending necessary and sufficient convergence conditions for simulated annealing. Azizi and Zolfaghari (2004) addressed adaptive temperature control by comparing two variations of the SA method in which temperature changes are based on the number of consecutive improving moves. Rosen and Harmonosky (2005) proposed a simulated annealing based simulation optimization method, an asynchronous, team-type heuristic, which improved the performance of simulated annealing for discrete-variable simulation optimization; with the conventional cooling schedule, the probability of transition decreases from the beginning of the search to the end. Ameur (2004) found a simple algorithm to compute the temperature in SA that is compatible with a given acceptance ratio of bad moves; he also observed a convex function at low temperatures and a concave function at high temperatures based on a geometric schedule. The first theoretical analysis of simulated annealing applied to discrete stochastic optimization problems was given by Gelfand and Mitter (1989).
They showed that if the noise in the estimated objective function values in each iteration is normally distributed with zero mean and positive variance, then their procedure converges in probability to the set of globally optimal solutions, provided that the sequence is chosen properly.
Gutjahr and Pflug (1996) generalized a classical convergence result for the simulated annealing algorithm to the case where cost function observations are disturbed by random noise. Saul B. Gelfand and Sanjoy K. Mitter (1988) examined the effect of using noisy or imprecise measurements of the energy differences on tracking the minimum energy state visited by the modified algorithms. Charon and Hudry (1993) suggested adding noise to the SA algorithm; their approach adds random noise initially and then gradually reduces the noise to zero in order to perturb the solution space. Mahmoud H. Alrefaei and Sigrún Andradóttir (1999) presented a modified simulated annealing algorithm designed for solving discrete stochastic optimization problems. Charon and Hudry (2001) extended their noising method: the algorithm perturbs the solution space by adding random noise to the problem's objective function values, and a stopping criterion is introduced together with a precise way of gradually reducing the noise rate. Prudius and Andradóttir (2005) proposed two cooling schedule approaches for controlling the probability of moving to seemingly inferior points and used the state with the highest estimated objective function value obtained from all previous observations. Ling Wang and Liang Zhang (2005) proposed SA combined with hypothesis testing for stochastic discrete optimization problems and demonstrated the effectiveness of the approach with simulation results on stochastic numerical optimization problems.

4.SIMULATED ANNEALING [SA]
To apply simulated annealing, we must specify three parameters. First is an annealing schedule, which consists of an initial and final temperature, T0 and Tfinal, along with an annealing (cooling) constant ∆T; together these govern how the search proceeds and when it stops. The second parameter is a function used to evaluate potential solutions (feature subsets); the goal of simulated annealing is to optimize this function, and for this discussion the mean squared error is used to estimate it. The final parameter is a neighbor function, which takes the current solution and temperature as input and returns a new, nearby solution. The role of the temperature is to govern the size of the neighborhood: at high temperature the neighborhood should be large, allowing the algorithm to explore broadly; at low temperature, the neighborhood should be small, forcing the algorithm to explore locally. For example, we can represent the set of available features as a bit vector in which each bit indicates the presence or absence of a particular feature. The algorithm attempts to iteratively improve a randomly generated initial solution. On each iteration, the algorithm generates a neighboring solution and computes the difference in quality (energy, by analogy to the metallurgical process) between the current and candidate solutions. If the new solution is better, it is retained; otherwise, the new solution is retained with a probability that depends on the quality difference, ∆E, and the temperature. The temperature is then reduced for the next iteration. Success in simulated annealing depends heavily on the choice of the annealing schedule. One obvious criterion is to accept a new solution when it has a lower error than the previous solution; a worse solution is accepted with probability exp(−∆E / (kT)), where ∆E is the difference between the solution error after it has been perturbed and before it was perturbed, T is the current temperature, and k is a suitable constant.
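The bit-vector representation and temperature-scaled neighborhood described above can be sketched as follows. The specific rule mapping temperature to the number of bits flipped is an illustrative assumption, not taken from the text:

```python
import random

def neighbor(solution, temperature, t_max):
    """Return a nearby feature subset encoded as a bit vector.
    The number of bits flipped shrinks as the temperature falls;
    the linear scaling rule used here is an illustrative assumption."""
    n_flips = max(1, int(len(solution) * temperature / (2 * t_max)))
    new = solution[:]
    for i in random.sample(range(len(new)), n_flips):
        new[i] = 1 - new[i]   # toggle presence/absence of one feature
    return new

random.seed(1)
current = [1, 1, 1, 0, 0, 1, 0, 1]                       # 1 = feature included
hot = neighbor(current, temperature=0.95, t_max=0.95)    # broad move: 4 bits flip
cold = neighbor(current, temperature=0.01, t_max=0.95)   # local move: 1 bit flips
```

At the maximum temperature the candidate differs from the current solution in several positions, while near zero temperature only a single feature is toggled, matching the broad-then-local exploration described above.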
From the Metropolis criterion it can be observed that when ∆E is negative, the new solution is always accepted. The algorithm may also accept a new solution whose error is not smaller than the previous one (a positive ∆E), and the probability of doing so decreases as the temperature decreases or as ∆E increases. In this discussion, a new solution with positive ∆E is accepted when its acceptance probability is sufficiently high (roughly 0.7 or greater); otherwise it is rejected.
An estimate of the mean squared error, represented by ∆E, can be computed as ∆E = σ²/n. The initial temperature T0 is taken as 0.95, and k is a random number between 0 and 1. In successive iterations the temperature is updated as Tk+1 = α × Tk, 0 < α < 1, where α = 0.5.
In the context of feature selection, relevant evaluation functions include the accuracy of a given learning algorithm using the current feature subset (yielding a wrapper algorithm) or a variety of statistical scores (yielding a filter algorithm). If ∆T is too large (near one), the temperature decreases slowly, resulting in slow convergence. If ∆T is too small (near zero), the temperature decreases quickly and the search will likely converge to a local extremum. Moreover, the range of temperatures used for an application of simulated annealing must be scaled to control the probability of accepting a low-quality candidate solution.
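Putting the pieces together, a minimal sketch of the annealing loop combines the Metropolis acceptance rule with the geometric cooling schedule Tk+1 = α × Tk described above. The error function and constants below are illustrative assumptions standing in for the mean squared error of a chosen model:

```python
import math
import random

def simulated_annealing(n_features, error, t0=0.95, t_final=1e-3,
                        alpha=0.5, k=1.0, chain_length=2):
    """SA feature selection over bit vectors; `error` returns the
    estimated mean squared error of a subset (lower is better)."""
    current = [random.randint(0, 1) for _ in range(n_features)]
    best, t = current[:], t0
    while t > t_final:
        for _ in range(chain_length):              # moves per temperature
            cand = current[:]
            i = random.randrange(n_features)
            cand[i] = 1 - cand[i]                  # flip one feature bit
            delta_e = error(cand) - error(current)
            # Metropolis rule: accept improvements outright, and worse
            # moves with probability exp(-delta_e / (k * t)).
            if delta_e < 0 or random.random() < math.exp(-delta_e / (k * t)):
                current = cand
            if error(current) < error(best):
                best = current[:]
        t *= alpha                                 # geometric cooling
    return best

# Toy error function (hypothetical): the error is minimized by keeping
# exactly features 0 and 2.
random.seed(2)
target = [1, 0, 1, 0, 0]
mse = lambda bits: sum((b - w) ** 2 for b, w in zip(bits, target))
selected = simulated_annealing(5, mse, chain_length=10)
```

The chain length controls how many moves are attempted at each temperature before cooling, mirroring the fixed chain length used in the experiment below.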

6.EXPERIMENTAL RESULTS AND ANALYSIS
The pest and weather data set for the cotton plant was obtained from the cotton research station in Coimbatore and is given in Table 1. The data consists of several attributes: crop, location, pest, observation, week, pest value, max temp, min temp, RH1, RH2, rainfall, wind speed, sunshine hours, and evaporation. Variance and mean squared error are computed from this table, and the feature subset is constructed based on the mean squared error values of the attributes.

Table 1: Pest and weather data set for cotton plant
The experiment is carried out on the pest and weather data set for cotton to reduce the dimensionality of the attributes using simulated annealing. The attributes max temp, min temp, RH1, RH2, rainfall, wind speed, sunshine, and evaporation are taken into consideration, and the mean squared error for these attributes is calculated in the table below. The initial set of attributes is {max temp, min temp, RH1, RH2, rainfall, wind speed, sunshine, evaporation}. The neighborhood solution is found from the previous attribute set by randomly selecting the attributes max temp, min temp, RH1, RH2, rainfall, wind speed, and sunshine. The mean squared error for these attributes is calculated in the following table.

∆E = 9.976
Since ∆E > 0, the probability of acceptance is computed; its value is Pa = 0.9501. Because this probability exceeds the acceptance threshold of 0.7, the new solution is accepted as the current solution. This procedure is followed at the same temperature for a maximum chain length (say 2 in this example), and then the temperature is reduced for the next iteration with a suitable k value. This iterative process of computing the mean squared error between the current and candidate sets of attributes is repeated until the final temperature (or zero) is reached. The subset selected by this simulated annealing process after three iterations on the cotton data set is {max temp, min temp, RH1, RH2, sunshine, wind speed, evaporation}, and the values of the attributes are given in the table below.
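For illustration, the acceptance probability of a worse move can be computed directly from the Metropolis rule. The ∆E, T, and k values below are hypothetical and are not taken from the table above:

```python
import math

# Illustrative acceptance computation: a worse candidate (delta_e > 0)
# is accepted with probability exp(-delta_e / (k * t)). The inputs here
# are hypothetical example values.
delta_e, t, k = 0.05, 0.95, 1.0
p_accept = math.exp(-delta_e / (k * t))   # close to 1 for small delta_e
```

A small quality loss at a high temperature yields an acceptance probability near 1, which is why early iterations readily accept worse subsets; as t shrinks, the same ∆E produces a much smaller probability.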

Text Categorization
Text categorization is the problem of automatically assigning predefined categories to free-text documents. This problem is of great practical importance given the massive volume of online text available through the World Wide Web, email, and digital libraries. A major characteristic, and difficulty, of text categorization problems is the high dimensionality of the feature space: the original feature space consists of the unique terms that occur in the documents, and the number of terms can reach hundreds of thousands even for a moderate-sized text collection. This is prohibitively high for many mining algorithms, so it is highly desirable to reduce the original feature space without sacrificing categorization accuracy. Different feature selection methods have been evaluated and compared for reducing a high-dimensional feature space in text categorization problems; it is reported that the methods under evaluation can effectively remove 50%–90% of the terms while maintaining categorization accuracy.

Image Retrieval
Feature selection has also been applied to content-based image retrieval. Recent years have seen a rapid increase in the size and number of image collections from both civilian and military equipment. However, we cannot access or make use of this information unless it is organized to allow efficient browsing, searching, and retrieval. Content-based image retrieval has been proposed to handle large image collections effectively: instead of being manually annotated with text-based keywords, images are indexed by their own visual contents (features), such as color, texture, and shape. One of the biggest obstacles to making content-based image retrieval truly scalable to large image collections is still the "curse of dimensionality", and dimensionality reduction is a promising approach to this problem. One proposed image retrieval system uses the theory of optimal projection to achieve optimal feature selection; the relevant features are then used to index images for efficient retrieval.

Customer Relationship Management
A case of feature selection for customer relationship management has also been presented. In a context where each customer represents large revenue and the loss of one will likely trigger a significant segment to defect, it is imperative to have a team of highly experienced experts monitor each customer's intentions and movements based on massively collected data. A set of key indicators is used by the team and has proven useful in predicting potential defectors. The problem is that it is difficult to find new indicators describing the dynamically changing business environment among many possible indicators (features); the machine-recorded data is simply too enormous for any human expert to browse and gain insight from. Feature selection is therefore employed to search for new potential indicators in a dynamically changing environment, and these are later presented to experts for scrutiny and adoption. This approach considerably improves the team's efficiency in finding new indicators.

Intrusion Detection
As network-based computer systems play increasingly vital roles in modern society, they have become the targets of our enemies and criminals. The security of a computer system is compromised when an intrusion takes place. Intrusion detection is often used as one way to protect computer systems. Lee, Stolfo, and Mok proposed a systematic data mining framework for analyzing audit data and constructing intrusion detection models. Under this framework, a large amount of audit data is first analyzed using data mining algorithms in order to obtain the frequent activity patterns. These patterns are then used to guide the selection of system features as well as the construction of additional temporal and statistical features for another phase of automated learning. Classifiers based on these selected features are then inductively learned using the appropriately formatted audit data. These classifiers can be used as intrusion detection models since they can classify whether an observed system activity is "legitimate" or "intrusive". Feature selection plays an important role in building classification models for intrusion detection.

Genomic analysis
Structural and functional data from analysis of the human genome have increased manyfold in recent years, presenting enormous opportunities and challenges for data mining [91,96]. In particular, gene expression microarray is a rapidly maturing technology that provides the opportunity to assay the expression levels of thousands or tens of thousands of genes in a single experiment. These assays provide the input to a wide variety of data mining tasks, including classification and clustering. However, the number of instances in these experiments is often severely limited.

9.DISADVANTAGES OF SIMULATED ANNEALING
- Repeated annealing with a 1/log k schedule is very slow, especially if the cost function is expensive to compute.
- For problems where the energy landscape is smooth or there are few local minima, SA is overkill; simpler, faster methods (e.g., gradient descent) will work better. But we generally do not know what the energy landscape looks like for a particular problem.
- Heuristic methods, which are problem-specific or take advantage of extra information about the system, will often be better than general methods, although SA is often comparable to heuristics.
- The method cannot tell whether it has found an optimal solution; some other complementary method (e.g., branch and bound) is required to do this.
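The slowness of the 1/log k schedule noted above can be seen numerically. The sketch below compares it with a geometric schedule; the constants are chosen purely for illustration:

```python
import math

t0 = 1.0
# Logarithmic schedule t0 / log(k + 1): the theoretically convergent
# but very slow schedule noted above.
log_temp = lambda k: t0 / math.log(k + 1)
# Geometric schedule t0 * alpha**k: the fast schedule common in practice.
geo_temp = lambda k, alpha=0.5: t0 * alpha ** k

t_log, t_geo = log_temp(20), geo_temp(20)   # temperatures after 20 iterations
```

After only 20 iterations the geometric temperature is already below a millionth of its starting value, while the logarithmic temperature has barely dropped below a third of it, which is why logarithmic annealing is impractical when each cost evaluation is expensive.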

10.RELATED CHALLENGES FOR FEATURE SELECTION
Feature selection is a preprocessing step for very large databases collected from Internet, business, scientific, and government applications. Novel feature selection applications will be found wherever creative data reduction must be conducted, since our ability to capture and store data has far outpaced our ability to process and utilize it. Feature selection can help focus on the relevant parts of the data and improve our ability to process it, and new applications arise as data mining techniques evolve. Scaling data mining algorithms to large databases is a pressing issue: as feature selection is one step in data preprocessing, changes need to be made to classic algorithms that require multiple database scans and random access to data, and research is required to overcome the limitations imposed when it is costly to visit large data sets multiple times or to access instances at random, as in data streams. Feature selection can also be extended to instance selection for scaling down data, a sister issue of scaling up algorithms; in addition to sampling methods, a suite of methods has been developed to search for representative instances so that data mining is performed in a focused and direct way. Feature selection is a dynamic field closely connected to data mining and other data processing techniques. This paper attempts to survey this fast-developing field, show some effective applications, and point out interesting trends and challenges. It is hoped that the continued, speedy development of feature selection, working with other related techniques, will help evolve data mining into solutions for insights.

11.CONCLUSION
As with genetic algorithms, a major advantage of simulated annealing is its flexibility and robustness as a global search method. It is a "weak method" that does not use gradient information and makes relatively few assumptions about the problem being solved. It can deal with highly nonlinear problems and non-differentiable functions, as well as functions with multiple local optima, and it is amenable to parallel implementation. Simulated annealing is a powerful and important tool in a variety of disciplines.


- SA is based on neighborhood search and allows uphill moves.
- It has a strong analogy to the simulation of the cooling of a material.
- Uphill moves are allowed with a temperature-dependent probability.
- Generic and problem-specific decisions have to be taken at implementation.