PERFORMANCE EVALUATION SYSTEM FOR DECISION TREE ALGORITHMS

In the machine learning process, classification is a supervised learning task. Classification techniques can represent structures that reflect knowledge of the domain being classified. Industry, education, business and many other domains require such knowledge for growth. Some of the common classification algorithms used in data mining and decision support systems are neural networks, logistic regression and decision trees. The decision regarding the most suitable data mining algorithm cannot be made spontaneously. Selecting an appropriate data mining algorithm for a business domain requires comparative analysis of different algorithms based on several input parameters, such as accuracy, build time and memory usage. To carry out this comparative study, popular algorithms were implemented on the basis of a literature survey and the frequency with which each algorithm is used at present. The performance of the algorithms is evaluated again after applying boosting to the trees. We selected datasets of numerical and nominal types and applied the algorithms to them. Comparative analysis is performed on the results obtained by the system. A new dataset is then applied to the constructed models in order to generate prediction outcomes.


INTRODUCTION
Extraction of knowledge from data in a human-understandable structure is the main goal of data mining. The process of data mining consists of three stages: exploration, model building and deployment. In the exploration stage, data preparation mainly includes cleaning data, transforming data and selecting subsets of records; for large datasets with many features it also requires feature selection. In model building and validation, we consider a variety of models and select the best one based on predictive performance, i.e. the one that produces good results on the given samples. The final stage involves taking the best model selected in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome. However, the output of mining depends on the dataset and the algorithm used. Sometimes data is not classified as the application needs because the algorithms are not well suited to the given dataset.
Some of the common classification algorithms used in data mining and decision support systems are neural networks, logistic regression and decision trees. A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.
In this project we concentrate on a real-world problem: prediction using analysis of previous data. We select three different data formats and apply our algorithms to answer the following questions.
1. Does the size of the dataset affect the performance of the decision tree?
2. What is the effect of data types on the performance of the decision tree?
3. Which algorithm performs better with the nominal dataset?
4. Which algorithm is the best fit for classification of the numerical dataset?

INTRODUCTION TO DECISION TREE ALGORITHM
A decision tree is a tree structure which classifies an input sample into one of its possible classes. Decision trees are used to extract knowledge by inferring decision-making rules from the huge amount of available information. A decision tree is a useful tool in classification. A decision tree classifier has a simple form which can be compactly stored and which efficiently classifies new data. Decision tree classifiers can perform automatic feature selection and complexity reduction, while the tree structure gives easily understandable and interpretable information regarding the predictive or generalization ability of the data. A decision tree recursively partitions a dataset into smaller subdivisions on the basis of tests applied to one or more features at each node of the tree.
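As a minimal sketch of this idea, a trained decision tree amounts to nested tests on a sample's features; the feature names, thresholds and classes below are hypothetical, chosen only for illustration:

```python
# Minimal sketch of how a decision tree classifies a sample: each internal
# node tests one feature, each leaf returns a class. Feature names and
# thresholds are invented for illustration.

def classify(sample):
    # Root node: test on a numeric feature.
    if sample["income"] <= 30000:
        # Internal node: test on a nominal feature.
        if sample["owns_home"] == "yes":
            return "low_risk"
        return "high_risk"
    return "low_risk"

print(classify({"income": 25000, "owns_home": "no"}))   # high_risk
print(classify({"income": 50000, "owns_home": "no"}))   # low_risk
```

Reading the branches from root to leaf yields exactly the kind of if-then rules mentioned throughout this paper.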
Because of their tree structure and their ability to generate rules easily, decision trees are the favored technique for building understandable models. Because of this clarity, they also allow more complex profit and ROI models to be added easily on top of the predictive model. For instance, once a customer population is found with a high predicted likelihood to attrite, a variety of cost models can be used to decide whether an expensive marketing intervention should be used because the customers are highly valuable, or a less expensive intervention should be used because the revenue from this subpopulation of customers is marginal.
Decision trees are a data mining technology that has existed in a form very similar to today's for almost twenty years, and early versions of the algorithms date back to the 1960s. Often these techniques were originally developed for statisticians, to automate the process of determining which fields in their database were actually useful or correlated with the particular problem they were trying to understand. Partly because of this history, decision tree algorithms tend to automate the entire process of hypothesis generation and validation much more completely, and in a much more integrated way, than other data mining techniques. They are also particularly adept at handling raw data with little or no pre-processing. Perhaps also because they were originally developed to mimic the way an analyst interactively performs data mining, they provide an easy-to-understand predictive model based on rules (such as "90% of the time, credit card customers of less than 3 months who max out their credit limit are going to default on their credit card loan").

OVERVIEW OF CLASSIFICATION ALGORITHM USING DECISION TREES
This section presents an overview of various decision tree algorithms developed so far. One of the advantages of using classification trees is their ability to provide easy-to-understand classification rules. Each node of a classification tree represents a rule.

Classification and Regression Tree (CART)
CART is a recursive partitioning method used both for regression and classification. CART is constructed by splitting subsets of the data set using all predictor variables to create two child nodes repeatedly. The best predictor is chosen using a variety of impurity or diversity measures. The goal is to produce subsets of the data which are as homogeneous as possible with respect to the target variable.
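The CART splitting step can be sketched as follows: for one numeric attribute, every candidate threshold is tried and the split with the lowest weighted Gini impurity is kept. The toy data and helper names here are invented for illustration:

```python
# Sketch of CART's binary-split search on a single numeric attribute.
# pairs: list of (attribute value, class label) tuples (toy data).

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(pairs):
    values = sorted({v for v, _ in pairs})
    best = None
    for threshold in values[:-1]:          # candidate yes/no questions: v <= threshold?
        left  = [c for v, c in pairs if v <= threshold]
        right = [c for v, c in pairs if v > threshold]
        # Weighted impurity of the two child nodes; lower is better.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

pairs = [(1, "a"), (2, "a"), (3, "b"), (4, "b")]
print(best_binary_split(pairs))  # (2, 0.0): splitting at <= 2 separates the classes perfectly
```

In the full algorithm this search runs over all predictor variables at every node, and the winning question becomes the yes/no test at that decision node.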
Quick, Unbiased, Efficient Statistical Tree (QUEST) is a binary-split decision tree algorithm. It can be used with univariate or linear-combination splits, and its attribute selection method has negligible bias: if all the attributes are uninformative with respect to the class attribute, then each has approximately the same chance of being selected to split a node.
Key characteristics of CART are:
1. Because simple if-then rules can be read right off the tree, models are easy to grasp and easy to apply to new data.
2. CART uses strictly binary, or two-way, splits that divide each parent node into exactly two child nodes by posing questions with yes/no answers at each decision node.
3. CART is unique among decision-tree tools. Its proven methodology is characterized by:
   a. a reliable pruning strategy: CART's developers determined definitively that no stopping rule could be relied on to discover the optimal tree;
   b. a powerful binary-split search approach: CART binary decision trees are more sparing with data and detect more structure before too little data is left for learning;
   c. automatic self-validation procedures: in the search for patterns in databases it is essential to avoid the trap of overfitting, and the testing and selection of the optimal tree are an integral part of the CART algorithm;
   d. automated handling of missing values, in which surrogate splitters intelligently stand in for missing attributes;
   e. multiple-tree, committee-of-expert methods that increase the precision of results.
The goodness of a split on an attribute A is evaluated with the Gini index of the split, ginisplit(S) = Σ (|Si|/|S|) · gini(Si), where S1, S2, …, Sm are the partitions induced by attribute A in S.

SLIQ Algorithm
SLIQ is a decision tree classifier that can handle both numerical and categorical attributes and builds compact, accurate trees. It uses a pre-sorting technique in the tree-growing phase and an inexpensive pruning algorithm. It is suitable for classification of large disk-resident datasets, independently of the number of classes, attributes and records.

Partition(Data S)
    if (all points in S are in the same class) then
        return;
    Evaluate splits for each attribute A;
    Use the best split to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);

The Gini index is used to evaluate the "goodness" of the alternative splits for an attribute. The first technique implemented by SLIQ is a scheme that eliminates the need to sort the data at each node: it creates a separate, pre-sorted list for each attribute of the training data. A further list, called the class list, is created for the class labels attached to the examples. SLIQ requires that the class list and (only) one attribute list be kept in memory at any time.
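A minimal sketch of this data layout, with toy records and attribute names invented for illustration: each attribute gets its own list of (value, record id) pairs, sorted once up front, while a single class list maps record ids to labels.

```python
# Sketch of SLIQ's pre-sorting scheme: one sorted attribute list per
# attribute plus a single class list indexed by record id (toy data).

records = [
    {"age": 40, "salary": 60, "label": "yes"},
    {"age": 25, "salary": 30, "label": "no"},
    {"age": 35, "salary": 50, "label": "yes"},
]

# Class list: record id -> class label (kept in memory throughout).
class_list = {rid: r["label"] for rid, r in enumerate(records)}

# One attribute list per attribute, sorted once by value, so no
# re-sorting is needed when evaluating splits at each tree node.
attribute_lists = {
    attr: sorted((r[attr], rid) for rid, r in enumerate(records))
    for attr in ("age", "salary")
}

print(attribute_lists["age"])  # [(25, 1), (35, 2), (40, 0)]
print(class_list[1])           # 'no'
```

Because split evaluation scans one sorted attribute list at a time and looks labels up through the class list, only that one list plus the class list need to reside in memory, which is what makes SLIQ suitable for disk-resident data.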

Boosting Algorithm
This algorithm was introduced by R. Schapire in [3]. In bagging, the base models are generated, at least logically, independently and in parallel; boosting, on the other hand, is a sequential procedure mainly applied to classification, where the performance of a preceding model is used when generating all subsequent models. The main principle is that difficult training instances are assigned a higher weight, making the base models focus on these instances. More specifically, each training instance is initially assigned the same weight, but after training one model, the instances incorrectly classified have their weights increased, while those correctly classified have their weights decreased. The weights are either used as part of the score function or to prioritize instances with higher weights when bootstrapping.

Data Set Selection:
This step involves selecting the datasets that contain the information from which we construct the models for evaluation. We collected data of different sizes and different types, using both nominal data (car evaluation) and numerical data (auto imports) to evaluate the results.
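The boosting weight update described above can be sketched as follows. This is an AdaBoost-style rule with invented numbers; the text does not specify the exact variant used:

```python
# Sketch of a boosting round's weight update: all instances start with
# equal weight; misclassified instances are up-weighted, correctly
# classified ones down-weighted, then weights are renormalized.
import math

def update_weights(weights, correct, error):
    # weights: current instance weights; correct: per-instance booleans;
    # error: weighted error rate of the model just trained (0 < error < 0.5).
    alpha = 0.5 * math.log((1 - error) / error)
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]   # renormalize so weights sum to 1

weights = [0.25, 0.25, 0.25, 0.25]   # equal initial weights
correct = [True, True, True, False]  # one misclassified instance
weights = update_weights(weights, correct, error=0.25)
print(weights)  # the misclassified instance now carries half the total weight
```

The next base model is then trained with these weights, so it concentrates on the instances the previous model got wrong.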

Data Analysis Models:
We use the three most popular decision tree models, namely C4.5, CART and SLIQ. The tree-forming process depends upon the data supplied to build the model and the techniques used by the different algorithms.
Build Model: This is the tree-building process, in which the data is parsed and the system generates the tree structure from it.
Parameter Evaluation: In this phase the data model is prepared and the evaluation process is started. The constructed model is evaluated using cross-validation: we randomly select data, supply it to the built model, the model predicts its output values, and we compare the predicted values to the real values. From these predictions we compute accuracy and error rate. Graphs are plotted to evaluate accuracy, build time, search time and memory usage.
Boosting: This phase involves a performance enhancement technique by which we re-adjust our built model; using this technique we improve the accuracy of the models.
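The cross-validation loop used in the parameter evaluation phase can be sketched as below. The "model" here is a stand-in majority-class classifier on invented data, not one of the actual tree algorithms:

```python
# Sketch of k-fold cross-validation: split the data into k folds, train
# on k-1 folds, predict on the held-out fold, and average the accuracy.
import random

def cross_validate(data, k=5, seed=0):
    # data: list of (features, label) pairs
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)                     # random selection of folds
    folds = [shuffled[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        # Stand-in "model": always predict the most common training label.
        labels = [lbl for _, lbl in train]
        prediction = max(set(labels), key=labels.count)
        # Compare predicted values against the real values.
        hits = sum(1 for _, lbl in test if lbl == prediction)
        accuracies.append(hits / len(test))
    return sum(accuracies) / k

data = [((x,), "pos" if x % 3 else "neg") for x in range(30)]
print(cross_validate(data))  # mean accuracy of the stand-in model
```

In the real system, each fold's training split is fed to the build-model phase of C4.5, CART or SLIQ, and the same averaging yields the accuracy figures plotted in the graphs below.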

Prediction:
Here we use our constructed model for prediction on the data entered by the user.

Graph Representation between C4.5, CART and SLIQ in Context of Accuracy
The graph below shows the accuracy of the C4.5, CART and SLIQ algorithms. All three algorithms share one characteristic: when the dataset is small, accuracy is high, and when the dataset is large, accuracy is reduced. From the graph we can see that initially, when the dataset is very large, SLIQ and CART show similar accuracy, and as the size reduces, the accuracy patterns of C4.5 and CART become much more similar.

Graph Representation between C4.5, CART and SLIQ in Context of Accuracy with Boosting
The results derived above are for plain C4.5, SLIQ and CART; here we describe the change after boosting these algorithms. After boosting, we can see that the accuracy of all the algorithms increases and all the algorithms show similar results. In the graph below, all three lines have similar ups and downs as the size of the dataset increases or decreases.

Graph Representation between C4.5, CART and SLIQ in Context of Build Time
Here we show the comparison between the three algorithms in terms of build time, both with and without boosting. To differentiate them more clearly, C4.5 is represented by blue lines, CART by pink lines and SLIQ by yellow lines.

Graph Representation between C4.5, CART and SLIQ in Context of Build Time with Boosting
The graph below shows the build time of all three algorithms after boosting; their build times are very similar. Looking more closely, when the dataset is small the build times of all three algorithms are quite similar, and after a point they show different behavior.

Graph Representation between C4.5, CART and SLIQ in Context of Accuracy
The graph below shows the accuracy of the C4.5, CART and SLIQ algorithms. With the numeric dataset, SLIQ performs better than C4.5 and CART; one reason is that SLIQ first sorts the data and then builds the model.

Graph Representation between C4.5, CART and SLIQ in Context of Accuracy with Boosting
The graph below shows the accuracy of the C4.5, CART and SLIQ algorithms with boosting. As shown in the graph, after boosting all the algorithms show similar results.

Graph Representation between C4.5, CART and SLIQ in Context of Build Time
The graph below shows that with a small dataset all the algorithms take the same time, while with increased data size SLIQ takes more time than C4.5 and CART.

Graph Representation between C4.5, CART and SLIQ in Context of Build Time with Boosting
The graph below shows that with a small dataset all the algorithms take the same time, but with increasing data size SLIQ takes more time than C4.5 and CART.

CONCLUSION
In this paper we have compared the performance and usefulness of different decision tree algorithms for classifying data in knowledge based system.
We analyzed the results using various parameters such as accuracy, memory usage and build time. From the complete implementation of the proposed work, we found the results listed below. We analyzed the effect of data size on the selected algorithms and found that parameters such as accuracy and build time change with the data size. We also found that accuracy improves after applying boosting to the decision tree algorithms.
We have made the conclusions shown in Table 6.1 on the basis of the results obtained. As we can see from the results, for time-critical applications SLIQ performs best with the nominal data type, e.g. for applications like polar molecular surface area, while CART is best for the numerical data type and can be used by applications like the share market. C4.5 performed better with the nominal data type, so it can be implemented in applications like student performance and online shopping.

FUTURE WORK
The algorithms could be made dynamic so that the tree changes automatically as data changes.
The proposed system compares performance in terms of parameters like accuracy, time and memory. More parameters could be evaluated to compare the performance.