THE IMPORTANCE OF NORMALIZATION METHODS FOR MINING MEDICAL DATA

Over the past decades, the field of medical informatics has grown rapidly and has drawn the attention of many researchers. The digitization of medical information, including medical history records, research papers, medical images, and laboratory analyses and reports, has generated large amounts of data that need to be handled. Because the rate of data acquisition exceeds the rate of data interpretation, new computational technologies are needed to manage the resulting repositories of medical data and to extract relevant knowledge from them. Such methods are provided by data mining techniques, which are used to discover meaningful patterns and trends within the data and help improve various aspects of health informatics. Before data mining techniques can be applied, the data needs to be cleansed and transformed, and normalization is one of the most important pre-processing methods for this purpose. This paper presents the impact of applying different data normalization methods on the performance obtained with the K-Nearest Neighbour algorithm on medical data sets.


INTRODUCTION
Nowadays, the amount of data that is collected and needs to be processed is growing exponentially. In the medical field, researchers and companies are mining huge amounts of data in order to reach sound conclusions for their case studies.
Physicians and other medical practitioners provide treatment recommendations to patients based on their medical history, laboratory results, medical images etc. Health informatics can help them by providing faster access to the relevant information that needs to be reviewed, thus allowing them to make optimal decisions. Moreover, by integrating data mining techniques into health informatics solutions, physicians will be able to use analytical and predictive instruments to query all the relevant information that is useful within their decision process.
Data mining techniques are used for their ability to provide the methods and tools needed to extract meaningful patterns and trends from the data gathered within the databases under consideration. Before data mining algorithms can be applied, data needs to be validated, cleaned and transformed. This is achieved in a preliminary step of the data mining process: pre-processing. One of the core methods used during pre-processing is data normalization.
In the process of developing informatics solutions for the medical field, it is very important to pre-process all the required medical data sets before applying any data mining algorithm. This is because such data is mostly characterized by noise, discrepancies, outliers, missing values and lack of exactness. In order to run a successful analysis on medical data sets, such as one based on the k-NN algorithm, different pre-processing methods should be carried out first.
K-Nearest Neighbour (k-NN) is a classification algorithm that classifies an input instance based on the class labels of the k closest points in the training dataset [1]. Closeness is most often measured with the Euclidean distance, although other distance measures can be used as well. The concept of closeness between points is also influenced by the accuracy of the dataset. Thus, normalization needs to be applied in order to reduce redundant data, to transform the initial data set into a more consistent and noise-free one and, overall, to ensure that good-quality results are obtained at the end of the analysis.
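The classification rule described above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiment; the function names (`euclidean_distance`, `knn_classify`) and the toy points are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair the distance from each training point to the query with its label.
    distances = [
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k nearest neighbours (smallest distances first).
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the class labels of the neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical values, not taken from the Diabetes dataset).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["negative", "negative", "negative", "positive", "positive", "positive"]

print(knn_classify(points, labels, (1.1, 1.0), k=3))
print(knn_classify(points, labels, (5.1, 5.0), k=3))
```

With an odd k and two classes the vote cannot tie; for other settings a tie-breaking rule would have to be chosen, which this sketch leaves to the ordering returned by `Counter.most_common`.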

NORMALIZATION: THEORETICAL FRAMEWORK
Data mining is the core step of the Knowledge Discovery in Databases (KDD) process. In order to apply data mining techniques to the data, a crucial issue that is part of the KDD process must be addressed: data pre-processing. The KDD process flow and its main components are shown in the accompanying figure. According to the KDD process flow, after the raw data is selected from the sources and loaded as target data into the mining database, it has to be cleansed and transformed so that data mining techniques can be applied to it to obtain significant patterns and trends. These patterns and trends are then evaluated and become valuable knowledge extracted from the data.
Data pre-processing involves various methods divided into four major categories [2]:
- data cleaning: attempts to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data;
- data integration: implies merging the data from multiple data stores, to reduce and avoid redundancies and inconsistencies;
- data reduction: is applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data;
- data transformation: the data are transformed or consolidated into forms appropriate for applying data mining techniques.
The main data transformation methods used in data mining are smoothing, attribute construction, normalization, aggregation and discretization.
June 02, 2015
The major data pre-processing categories, along with the methods used for data transformation and the main data normalization methods, are shown in Figure-2:

Fig 2: Data pre-processing methods
Data normalization is used to rescale the values of an attribute of the dataset to a smaller range, for example 0.0 to 1.0.
Normalization standardizes all the features of the dataset according to a predefined criterion, so that redundant or noisy objects can be eliminated and valid, reliable data can be used, which can affect and improve the accuracy of the result [3].
There are several data normalization methods, the most important ones being min-max normalization, z-score normalization and normalization by decimal scaling [2]:
1. Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, vi, of A to vi′ in the range [new_minA, new_maxA] by computing (1):

$$v_i' = \frac{v_i - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A \tag{1}$$

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.
2. In z-score normalization (or zero-mean normalization), the values for an attribute, A, are normalized based on the mean (i.e. average) and standard deviation of A. A value, vi, of A is normalized to vi′ by computing (2):

$$v_i' = \frac{v_i - \bar{A}}{\sigma_A} \tag{2}$$

where $\bar{A}$ and $\sigma_A$ are the mean and the standard deviation of attribute A.
3. Normalization by decimal scaling normalizes by moving the decimal point of the values of attribute A. A value, vi, of A is normalized to vi′ by computing (3):

$$v_i' = \frac{v_i}{10^{j}} \tag{3}$$

where:
- vi′ = new value of attribute A
- vi = current value of attribute A
- j = the smallest integer such that max(|vi′|) < 1

Data normalization is very useful in data mining for classification algorithms involving neural networks or distance measurements, such as nearest-neighbour classification (k-NN).
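The three normalization methods can be sketched in a few lines of Python. This is an illustrative sketch, not the RapidMiner operators used in the case study. The function names are hypothetical, the handling of constant attributes is an assumption, and the z-score uses the population standard deviation (dividing by n), which is also an assumption since the text does not specify the variant.

```python
import math


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # Assumption: a constant attribute is mapped to new_min.
        return [new_min for _ in values]
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


def z_score_normalize(values):
    """Centre values on the mean and scale by the standard deviation."""
    mean = sum(values) / len(values)
    # Population standard deviation (assumption; a sample version divides by n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        # Assumption: a constant attribute is mapped to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def decimal_scaling_normalize(values):
    """Divide by 10^j, with j the smallest integer such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Hypothetical attribute values used only to exercise the functions.
sample = [200, 300, 400, 600, 1000]
print(min_max_normalize(sample))
print(z_score_normalize(sample))
print(decimal_scaling_normalize([-986, 120, 917]))
```

Note that in this sketch the minimum, maximum, mean and standard deviation are computed from the values passed in; when new cases arrive later, the parameters learned from the original data would normally be reused, which is where the out-of-bounds situation for min-max normalization arises.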

CASE STUDY
In this section we describe the results obtained after applying the steps of the k-NN algorithm to each dataset that resulted from transforming the initial Diabetes dataset with one of the normalization methods.
The dataset used for this case study consists of instances associated with patients investigated for diabetes. The data was collected by the National Institute of Diabetes and Digestive and Kidney Diseases from patients of Pima Indian heritage. The dataset was obtained from the UCI Machine Learning Repository [4] and consists of 768 instances. In terms of structure, the dataset has 8 attributes and a class, illustrated in Table-1. The class label can take two values, "tested positive" and "tested negative", and represents the diagnosis set for each patient investigated for diabetes. During our experimental study, the k-NN algorithm was applied, using RapidMiner [5], to the dataset described above, normalized using the methods previously presented.
RapidMiner is a software platform, developed by the company of the same name, that provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics.
In terms of data normalization, the RapidMiner documentation defines this pre-processing technique as a tool used to rescale attribute values to fit in a specific range. Normalization of the data is very important when dealing with attributes of different units and scales, especially for data mining techniques that use the Euclidean distance. All attributes should therefore have the same scale for a fair comparison between them. In other words, normalization is a technique used to level the playing field when looking at attributes that vary widely in size as a result of the units selected for representation. The RapidMiner normalization operators perform normalization of selected attributes. Four normalization methods are provided by RapidMiner: range_transformation, proportion_transformation, z_transformation and interquartile_range [5].
Our experiment consisted of defining a mining process in RapidMiner for applying the k-NN algorithm to the Diabetes dataset after applying each normalization method, and obtaining performance indicators for each execution of the process.
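The point made above about attributes with different units and scales can be illustrated with a small Python sketch. The two attributes, their assumed ranges and the three points are hypothetical values, not taken from the Diabetes dataset; `rescale` is an illustrative helper, not a RapidMiner operator.

```python
import math


def rescale(point, mins, maxs):
    """Min-max rescale each coordinate to [0, 1] using the given attribute ranges."""
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, mins, maxs))


# Hypothetical attributes: the first spans roughly 50..200, the second 0..1.
mins, maxs = (50.0, 0.0), (200.0, 1.0)

query = (150.0, 0.2)
point_a = (152.0, 0.9)   # close on the large-scale attribute, far on the small one
point_b = (160.0, 0.2)   # identical on the small-scale attribute

# Raw Euclidean distances are dominated by the large-scale attribute.
raw_a = math.dist(query, point_a)
raw_b = math.dist(query, point_b)

# After rescaling, both attributes contribute on comparable terms.
norm_a = math.dist(rescale(query, mins, maxs), rescale(point_a, mins, maxs))
norm_b = math.dist(rescale(query, mins, maxs), rescale(point_b, mins, maxs))

print("raw:       ", raw_a, raw_b)
print("normalized:", norm_a, norm_b)
```

On the raw values `point_a` is the nearer neighbour of the query, whereas after rescaling `point_b` is nearer, so the choice of normalization can change which neighbours k-NN selects.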
The focus of this process is to obtain a comparative analysis of the different methods available for normalization.
The process flow for applying the k-NN classification algorithm is shown in the accompanying figure. The process we defined for testing the performance of k-NN has the following six steps:
1. Read ARFF: this operator reads an ARFF file, in our case the file diabetes.arff, for processing purposes.
2. Normalize: this operator normalizes the values of the selected attributes of the dataset. RapidMiner supports applying all three normalization methods described in this paper. During our experiment we applied the following RapidMiner methods: for min-max normalization we used range_transformation, for z-score normalization we used z_transformation and for normalization by decimal scaling we used proportion_transformation. All normalization parameters other than the method parameter are for selection of attributes on which normalization is to be applied.
3. Set Role: this operator is used to change the role of one or more attributes. The role of an attribute reflects the part played by that attribute in the dataset. Changing the role of an attribute may change the part played by that attribute in a process.
4. k-NN: this operator generates a k-Nearest Neighbour model from the input dataset. This model can be a classification or regression model, depending on the input dataset. The basic k-NN algorithm is composed of two steps: find the k training examples that are closest to the unseen example, and take the most commonly occurring classification for these k examples.
5. Apply Model: this operator applies an already learnt or trained model, in our case k-NN, to a dataset. A model is first trained on a dataset; information related to the dataset is learnt by the model. That model can then be applied to another dataset, usually for prediction.
6. Performance: this operator is used for performance evaluation. It delivers a list of performance criteria values. These performance criteria are automatically determined in order to fit the learning task type, for our experiment k-NN.
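The normalize, train, apply and evaluate steps of this process can be approximated outside RapidMiner with a short Python sketch. It is only an illustration under stated assumptions: it does not read diabetes.arff, the rows are hypothetical toy values, the split into training and test rows is an assumption (the text does not describe how the data was partitioned), and all function names are invented for this example.

```python
import math
from collections import Counter


def fit_min_max(rows):
    """Learn the per-attribute minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Rescale each attribute to [0, 1] using previously learned bounds."""
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, mins, maxs))
        for row in rows
    ]


def knn_predict(train_x, train_y, query, k):
    """Majority label among the k training rows closest to query."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def run_process(train_x, train_y, test_x, test_y, k=3):
    # Normalize: bounds are learned on the training rows and reused for the test rows.
    mins, maxs = fit_min_max(train_x)
    train_n = apply_min_max(train_x, mins, maxs)
    test_n = apply_min_max(test_x, mins, maxs)
    # k-NN and Apply Model: the "model" is simply the stored, normalized training set.
    predictions = [knn_predict(train_n, train_y, q, k) for q in test_n]
    # Performance: fraction of test rows whose predicted label matches the true label.
    correct = sum(p == t for p, t in zip(predictions, test_y))
    return predictions, correct / len(test_y)


# Hypothetical two-attribute rows with very different scales.
train_x = [(85.0, 0.20), (90.0, 0.25), (95.0, 0.30),
           (160.0, 0.80), (170.0, 0.90), (180.0, 0.85)]
train_y = ["tested negative"] * 3 + ["tested positive"] * 3
test_x = [(88.0, 0.22), (175.0, 0.88)]
test_y = ["tested negative", "tested positive"]

predictions, acc = run_process(train_x, train_y, test_x, test_y, k=3)
print(predictions, acc)
```

Swapping `fit_min_max` and `apply_min_max` for another pair of functions is enough to repeat the run with a different normalization method, which mirrors how the process is re-executed once per method.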
The process described above was executed for each normalization method on the Diabetes dataset. The results show the performance of the k-NN algorithm on the dataset as transformed by each normalization method.
The effect of training the k-NN algorithm on the described dataset can be measured with a set of statistical indicators. For this study, we have taken into consideration the accuracy and the root mean square error (RMSE), which are representative in terms of describing an algorithm's efficiency, in order to evaluate and compare the impact that each normalization method has on the results.
[Equations (4) and (5): definitions of accuracy and root mean square error]
Data transformation methods such as normalization are meant to increase the efficiency and accuracy of data mining techniques such as the k-NN algorithm. As part of our experiment we applied all three normalization methods and compared them against each other. The purpose of the experiment was to see the effect of the three normalization techniques on the accuracy and root mean square error of k-NN.
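The two indicators can be computed as in the following Python sketch. The definitions used are the standard ones and are an assumption here, since they may differ in detail from the formulation used in the experiment. For the RMSE of a classifier, the sketch compares a predicted score for the positive class against a 0/1 target, which is one common convention and not necessarily the one applied by RapidMiner. The labels and scores are hypothetical.

```python
import math


def accuracy(true_labels, predicted_labels):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def root_mean_square_error(targets, predictions):
    """Square root of the mean squared difference between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical classifier output for four instances.
true_labels = ["tested positive", "tested negative", "tested positive", "tested negative"]
predicted = ["tested positive", "tested negative", "tested negative", "tested negative"]

# 0/1 targets for the positive class and hypothetical predicted scores.
targets = [1, 0, 1, 0]
scores = [0.8, 0.2, 0.4, 0.0]

print(accuracy(true_labels, predicted))
print(root_mean_square_error(targets, scores))
```

A higher accuracy and a lower RMSE both indicate better agreement between the predictions and the true class labels.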
In order to establish the best normalization technique based on our experiments, the root mean square error should be as low as possible and, at the same time, the accuracy of the classification process should be as high as possible. Both expectations are met by the min-max normalization method, as shown in the results table. The table shows that min-max normalization produces the lowest RMSE of the three methods. It is followed by Z-score normalization, where the RMSE is 0.318. The highest RMSE is obtained with normalization by decimal scaling.
With regard to prediction accuracy, min-max normalization reaches 85.81%, followed by Z-score normalization with 85.16% and normalization by decimal scaling with 82.94%.
Our experiment compares the quality of the classification obtained by applying the k-NN algorithm to the Diabetes dataset with the min-max normalization method against the two other normalization methods, Z-score normalization and normalization by decimal scaling. Based on the previous assertions, all three normalization procedures produce almost