Data Mining Algorithms: An Overview

Research on data mining has yielded numerous tools, algorithms, methods and approaches for handling large amounts of data for purposeful use and problem solving. Data mining has become an integral part of many application domains such as data warehousing, predictive analytics, business intelligence, bioinformatics and decision support systems. The prime objective of data mining is to effectively handle large-scale data, extract actionable patterns, and gain insightful knowledge. Data mining is an integral part of the knowledge discovery in databases (KDD) process. Success and improved decision making normally depend on how quickly one can discover insights from data. These insights can be used to drive better actions in operational processes and even to predict future behaviour. This paper presents an overview of various algorithms for handling large data sets. These algorithms define various structures and methods implemented to handle big data. The review also discusses the general strengths and limitations of these algorithms. This paper can quickly guide, or be an eye-opener to, data mining researchers on which algorithm(s) to select and apply when solving the problems they investigate.


INTRODUCTION
A number of definitions of data mining have been put forth by various researchers. Some have defined data mining as a process of discovering useful or actionable knowledge in large-scale data [1,4]. According to Zaki and Meira [8], data mining is the process of discovering insightful, interesting, and novel patterns, as well as descriptive, understandable, and predictive models, from large-scale data. Another definition of data mining, as coined by Ozer [2] and Garcia et al. [26], is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
Data mining also means knowledge discovery from data, which describes the typical process of extracting useful information from raw data [1,3]. As pointed out by Kamruzzaman, Haider and Hasan [11], many people treat data mining as a synonym for another popularly used term, Knowledge Discovery in Databases, or KDD. This observation is quite true if one looks closely at the interpretations of data mining made by several researchers such as [1,2,3,10]. However, as pointed out by Kamruzzaman, Haider and Hasan [11], data mining is also treated simply as an essential step in the process of knowledge discovery in databases.

SCOPE AND OBJECTIVE
Based on document analysis, this paper reviews and summarizes information on the data mining concept and some of the data mining algorithms. It gives a general overview of, and background on, data mining and the algorithms which are usually used to mine data, describes the categories of these algorithms, and then discusses the strengths and limitations of these data mining algorithms. The paper is intended to give an understanding to researchers, scholarly peers, learners, data miners, companies and anyone who wishes to stay abreast of data mining and the algorithms which are commonly used in data mining.

DATA MINING ALGORITHMS
A data mining algorithm is a set of heuristics and calculations that creates a data mining model from data [26]. It can be a challenge to choose the appropriate or best-suited algorithm to apply to a certain problem. Even though one can use different algorithms to perform the same task, each algorithm yields a different set of results, and some algorithms can even produce more than one type of result. Some algorithms perform classification, that is, they predict one or more discrete variables based on the other attributes in the data set. Some algorithms serve regression purposes: they predict one or more continuous variables based on the other attributes in the data set. As pointed out by Microsoft [26], some algorithms perform segmentation: they divide data into groups, or clusters, of items that have similar properties. While some algorithms are associative, finding correlations between different attributes in a data set, others can be used for sequence analysis, that is, to summarise sequences or episodes in data, such as a web path flow [26]. However, all of the aforementioned types of algorithms can be categorized into two large categories: supervised learning and unsupervised learning algorithms.
The following sub-sections briefly discuss the two categories: supervised and unsupervised learning. Several examples of the aforementioned algorithms in each category are summarized in Tables 1 and 2, which also give a general discussion of some of the strengths and limitations of these algorithms.

SUPERVISED LEARNING
The supervised learning algorithms are those for which the class attribute values for the dataset are known before running the algorithm. This data is called labelled data or training data [1]. Instances in this set are tuples in the format (x, y), where x is a vector and y is the class attribute, commonly a scalar. Supervised learning builds a model that maps x to y; the task is to find a mapping m(.) such that m(x) = y. An unlabelled dataset, or test dataset, is also provided, in which instances are in the form (x, ?) and the y values are unknown. Given m(.) learned from the training data and the x of an unlabelled instance, m(x) can be computed, which results in the prediction of the label for the unlabelled instance [5]. Supervised learning can be divided into i) classification and ii) regression. When the class attribute is discrete, it is called classification; when the class attribute is continuous, it is regression [1,5,6].
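To make the mapping m(x) = y concrete, a minimal Python sketch is given below. It is only an illustration under assumed conditions: the scikit-learn library, the synthetic dataset and the choice of a decision tree classifier are this review's own assumptions, not prescriptions from the cited sources.

```python
# A minimal sketch of supervised learning: learn a mapping m(x) = y from
# labelled training data, then compute m(x) for unlabelled test instances.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labelled data: each instance is a tuple (x, y).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0)  # the mapping m(.) to be learned
model.fit(X_train, y_train)                     # learn m from the (x, y) pairs
print(model.predict(X_test[:5]))                # predicted labels m(x) for unlabelled x
```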

i) CLASSIFICATION
In classification, the class attribute values are discrete. Given a set of data elements, classification maps each data element to one of a set of pre-determined classes, based on the differences among data elements belonging to different classes. The goal is to discover rules that define whether an item belongs to a particular subset or class of data [6]. For example, when trying to determine which households will respond to a direct mail campaign, one looks for rules that separate the "probables" from the "non-probables". IF-THEN rules in a tree-like structure can then be used to represent the predictions and classify the set of data items.
Classification Process: According to Han, Kamber and Pei [7], classification is a two-step process. The first step is model construction, which describes a set of predetermined classes: each tuple is assumed to belong to a predefined class, as determined by the class label attribute, and the set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulae. In the second step, the model is used for classifying future or unknown objects [1].
Examples of classification methods include decision tree learning, the naive Bayes classifier, the K-nearest neighbour classifier, and classification with network information. See Table 1.

ii) REGRESSION
In regression, the class attribute values are real numbers. For instance, suppose we wish to predict the stock market value (class attribute) of a company given information about the company (features). The stock market value is continuous; therefore, regression must be used to predict it. The input to a regression method is a dataset where the attributes are represented using x1, x2, …, xm (also known as regressors) and the class attribute is represented using Y (also known as the dependent variable); the class attribute is a real number, and regression finds the relation between Y and the vector X = (x1, x2, …, xm) [5]. Examples of regression methods include linear regression and logistic regression. Table 1 shows a summary of examples of these supervised learning algorithms.

Naive Bayes Classifiers (NB)
-A Bayesian network is a model that encodes probabilistic relationships among variables of interest [13].
-In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features [12]. They are particularly suited when the dimensionality of the inputs is high.
-The naïve Bayesian technique is generally used for intrusion detection in combination with statistical schemes.

Strengths:
-Easy to use.
-Effective if the training set is large enough.
-Ability to learn: as the training set gets larger, the results get more and more accurate [5].
-Despite its simplicity, naive Bayes can often outperform more sophisticated classification methods.
-If the NB conditional independence assumption actually holds, a naive Bayes classifier will converge quicker than discriminative models like logistic regression, so less training data is needed.
-Naïve Bayesian classifiers simplify the computations and exhibit high accuracy and speed when applied to large databases.
-Bayesian classifiers give satisfactory results because the focus is on identifying the classes for the instances, not the exact probabilities [12].

Limitations:
-Does not consider the sequence of words (non-relevant word features).
-Cannot learn interactions between features.
-Considerably higher computational effort is required [14].
-Assumes that the data attributes are conditionally independent [13], which is not always so.
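A minimal naive Bayes sketch follows; the scikit-learn library and the Iris data set are illustrative assumptions of this review, not drawn from the cited sources.

```python
# A minimal naive Bayes classifier sketch (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Applies Bayes' theorem, assuming the features are conditionally
# independent given the class.
nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))  # classification accuracy on held-out data
```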

K-Nearest Neighbour
-K-nearest neighbours (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition as a non-parametric technique.
-A case is classified by a majority vote of its neighbours, with the case being assigned to the class most common amongst its K nearest neighbours as measured by a distance function. If K = 1, then the case is simply assigned to the class of its nearest neighbour [17].
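A minimal KNN sketch is given below; the data set, the choice K = 5 and the Euclidean metric are illustrative assumptions of this review.

```python
# A minimal k-nearest neighbour sketch (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new case is assigned to the class most common amongst its
# K = 5 nearest neighbours under the Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```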
Support Vector Machines (SVM)
-SVMs are used in text classification problems where very high-dimensional spaces are the norm [12].
-A disadvantage of SVMs as a classifier is their high algorithmic complexity and extensive memory requirements [19], which consequently makes the speed of both training and testing slow.

Linear Regression
Linear Regression is a statistical procedure for predicting the value of a dependent variable from an independent variable when the relationship between the variables can be described with a linear model.
Strengths:
-Implements a statistical model that shows optimal results when the relationships between the independent variables and the dependent variable are almost linear.
-Useful for data with linear relations or for applications for which a first-order approximation is adequate [21].

Limitations:
-Often used to model non-linear relationships, for which it is not well suited [20].
-Limited to predicting numeric output.
-A lack of explanation about what has been learned can be a problem.
-Does not work well for data with non-continuous (e.g., binary) outcomes.
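A minimal linear regression sketch follows; the synthetic data and the underlying coefficients (y ≈ 3x + 2) are illustrative assumptions of this review.

```python
# A minimal linear regression sketch: fit a first-order (linear) model
# to noisy synthetic data and predict a continuous output.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * x[:, 0] + 2.0 + rng.normal(scale=1.0, size=100)  # true relation y = 3x + 2

model = LinearRegression()
model.fit(x, y)
print(model.coef_, model.intercept_)  # recovered slope and intercept
print(model.predict([[5.0]]))         # continuous prediction at x = 5
```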

Logistic Regression
Logistic regression is the classification analogue of linear regression. It is preferable to trees in the same situations in which linear regression is preferable to regression trees: when effects are small and when predictors contribute additively (no interactions).
Strengths:
-There are many ways to regularize the model, and one does not have to worry about the features being correlated.
-One can easily update the model to take in new data (using an online gradient descent method).
-Provides a probabilistic framework (e.g., to easily adjust classification thresholds, to say when the model is unsure, or to get confidence intervals), which is also useful if one expects to receive more training data in the future that must be quickly incorporated into the model [5].
-Intrinsically simple; it has low variance and so is less prone to over-fitting.
-Faster and less complex than many alternatives, and easier to inspect; commonly applied to classic problems such as text classification.

Limitations:
-Unstable when one predictor could almost explain the response variable, because the coefficient of this variable will be boosted as high as possible; in such cases people turn to discriminant analysis.
-Requires more assumptions and is sensitive to outliers [25].
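A minimal logistic regression sketch is given below; the synthetic data and the 0.7 decision threshold are illustrative assumptions of this review. It shows the probabilistic framework mentioned above: predicted probabilities can be thresholded at any chosen cut-off.

```python
# A minimal logistic regression sketch (illustrative only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)          # class-membership probabilities
labels = (probs[:, 1] > 0.7).astype(int)   # adjust the classification threshold
print(clf.score(X_test, y_test))
```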

UNSUPERVISED LEARNING
Unsupervised learning is the division of instances into groups of similar objects. When discussing unsupervised learning, most researchers focus on clustering [1,5]. In clustering, the data is unlabelled; thus, the label for each instance is not known to the clustering algorithm. This is the main difference between supervised and unsupervised learning. Any clustering algorithm requires a distance measure: instances are put into different clusters based on their distance to other instances [9]. The most popular distance measure for continuous features is the Euclidean distance; for two instances x = (x1, x2, …, xm) and y = (y1, y2, …, ym),

d(x, y) = √((x1 − y1)² + (x2 − y2)² + … + (xm − ym)²).

Table 2 shows some examples of the unsupervised learning algorithms, with a discussion of their strengths and limitations when handling data in general.
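As a small illustration of the Euclidean distance above, the following pure-Python sketch computes d(x, y) for two instances (the example points are this review's own):

```python
# Euclidean distance between two instances with m continuous features.
import math

def euclidean_distance(x, y):
    """d(x, y) = sqrt((x1 - y1)^2 + ... + (xm - ym)^2)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance((1.0, 2.0), (4.0, 6.0)))  # prints 5.0
```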

Apriori
-The Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean association rules [25].
-Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time.

Strengths:
-Allows the pruning of many candidate associations.
-Uses the large item set property.
-Easily parallelized.
-Easy to implement [25].

Limitations:
-More search space is needed and the I/O cost will increase [25].
-The number of database scans is increased, thus candidate generation will increase, hence an increase in computational cost.
-Normally discovers a huge quantity of rules, some of which are redundant.
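A minimal, pure-Python Apriori sketch follows; the toy transactions and the support threshold are illustrative assumptions of this review. Frequent k-itemsets are extended one item at a time ("bottom up"), and by the Apriori property only joins of frequent itemsets become candidates.

```python
# A minimal Apriori sketch for mining frequent item sets (illustrative only).
from itertools import chain

transactions = [{"bread", "milk"}, {"bread", "beer"},
                {"bread", "milk", "beer"}, {"milk", "beer"}]
min_support = 2   # minimum number of transactions containing an itemset

def support(itemset):
    return sum(itemset <= t for t in transactions)

items = set(chain.from_iterable(transactions))
frequent = [{frozenset([i]) for i in items if support({i}) >= min_support}]
k = 1
while frequent[-1]:
    # Join step: extend frequent k-itemsets by one item, then keep
    # only those candidates that meet the minimum support.
    candidates = {a | b for a in frequent[-1] for b in frequent[-1]
                  if len(a | b) == k + 1}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for itemset in chain.from_iterable(frequent):
    print(set(itemset), support(itemset))
```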
K-Means
-The K-Means method is numerical, non-deterministic and iterative. The algorithm is repeated a number of times to obtain an optimal clustering solution, each time starting with a random set of initial clusters [20].

Strengths:
-With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).
-May produce tighter clusters than hierarchical clustering, especially if the clusters are globular [23,26].
-Well suited to generating globular clusters.

Limitations:
-Difficulty in comparing the quality of the clusters produced (e.g., different initial partitions or values of K affect the outcome).
-The fixed number of clusters can make it difficult to predict what K should be.
-Does not work well with non-globular clusters [23].
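A minimal K-Means sketch follows; the scikit-learn library, the synthetic globular data and K = 3 are illustrative assumptions of this review. The n_init parameter repeats the algorithm from several random initial clusters, as described above.

```python
# A minimal K-Means clustering sketch (illustrative only).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)    # cluster index assigned to each instance
print(km.cluster_centers_)    # coordinates of the three centroids
```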

CONCLUSION
Due to the increase in the amount of data coming from everywhere (online blogging, social media, databases, etc.), it has become difficult to handle the data, to find associations and patterns, and to analyse the large data sets. Consequently, a large number of technologies are being developed for the extraction of meaningful information from huge data collections using different mining techniques. The different tools, algorithms and methods used to mine and analyse data perform differently on different data collections, as indicated in the review made in this paper. Choosing the best algorithm to use for a specific analytical task can be a challenge: while one can use different algorithms to perform the same business task, each algorithm produces a different result, and some algorithms can produce more than one type of result.

ACKNOWLEDGEMENTS
Special thanks go to BIUST for funding this research.