A Fuzzy Logic based Privacy Preservation Clustering method for achieving K-Anonymity using EMD in dLink Model

Privacy-preserving data mining applies data mining techniques to databases without violating the privacy of individuals. A sensitive attribute is selected from the numerical data and altered by a data modification technique. The modified data can then be released to any agency. If the agency applies data mining techniques such as clustering or classification for data analysis, the modification should not affect the result. In the proposed privacy preservation technique, the sensitive data is converted into modified data using an S-shaped fuzzy membership function. K-means clustering is applied to both the original and the modified data to obtain the clusters. t-closeness requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table. The Earth Mover's Distance (EMD) is used to measure the distance between the two distributions, which should be no more than a threshold t. Hence privacy is preserved and the accuracy of the data is maintained.


Overview of data mining
Data mining is a recently emerging field, connecting the three worlds of Databases, Artificial Intelligence and Statistics (Sairam et al. 2011). The information age has enabled many organizations to gather large volumes of data (Y. Li et al. 2009). However, the usefulness of this data is negligible if "meaningful information" or "knowledge" cannot be extracted from it. Data mining, otherwise known as knowledge discovery, attempts to answer this need (Agarwal et al. 2002). In contrast to standard statistical methods, data mining techniques search for interesting information without demanding a priori hypotheses. As a field, it has introduced new concepts and algorithms such as association rule learning (Fienberg et al. 2005). It has also applied known machine-learning algorithms such as inductive-rule learning (e.g., by decision trees) to settings where very large databases are involved. Data mining techniques are used in business and research and are becoming more and more popular with time.

Confidentiality issues in data mining
A key problem that arises in any en masse collection of data is that of confidentiality. The need for privacy is sometimes due to law (e.g., for medical databases) or can be motivated by business interests (Muralidhar et al. 2006). However, there are situations where the sharing of data can lead to mutual gain. A key use of large databases today is research, whether scientific or economic and market oriented. Thus, for example, the medical field has much to gain by pooling data for research, as do even competing businesses with mutual interests. Despite the potential gain, this is often not possible due to the confidentiality issues which arise.

Applications of Privacy-Preserving Data Mining
The problem of privacy-preserving data mining has numerous applications in homeland security, medical database mining, and customer transaction analysis. Some of these applications, such as those involving bio-terrorism and medical database mining, may intersect in scope (Agarwal et al. 2002). In this section, we will discuss a number of different applications of privacy-preserving data mining methods.
a. Medical Databases
b. Bioterrorism Applications
c. Homeland Security Applications
ISSN 2321-807X, Volume 12 Number 12, Journal of Advances in Chemistry, 4602 | Page, October 2016, www.cirworld.com

Data Mining Methods
The main reason for applying data mining methods to text document collections is to structure them. A structure can significantly simplify access to a document collection for a user. Well-known access structures are library catalogues or book indexes (Muralidhar et al. 2006). However, the problem with manually designed indexes is the time required to maintain them. Therefore, they are very often not up to date and thus not usable for recent publications or frequently changing information sources like the World Wide Web. The existing methods for structuring collections either try to assign keywords to documents based on a given keyword set (classification or categorization methods) or automatically structure document collections to find groups of similar documents (clustering methods). In the following we first describe both of these approaches. The proposed methods automatically extract useful information patterns from text document collections.

Clustering
Clustering methods can be used to find groups of documents with similar content. The result of clustering is typically a partition (also called a clustering) P, that is, a set of clusters. Each cluster consists of a number of documents d. The objects of a cluster, in this case documents, should be similar to each other and dissimilar to the documents of other clusters. Usually the quality of a clustering is considered better if the contents of the documents within one cluster are more similar and the contents of documents in different clusters are more dissimilar (R.K. Ahuja et al. 1993). Clustering methods group the documents only by considering their distribution in document space (for example, an n-dimensional space if we use the vector space model for text documents).
Clustering algorithms compute the clusters based on the attributes of the data and measures of (dis)similarity. However, the idea of what an ideal clustering result should look like varies between applications and might be even different between users. One can exert influence on the results of a clustering algorithm by using only subsets of attributes or by adapting the used similarity measures and thus control the clustering process. To which extent the result of the cluster algorithm coincides with the ideas of the user can be assessed by evaluation measures.

Anonymization Approach
Data anonymization is a type of information sanitization whose intent is to ensure privacy protection. It is the process of either encrypting or removing personally identifiable information from data sets, so that the people whom the data describe remain anonymous.

K-anonymity
A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release.

L-diversity
This model is an extension of k-anonymity. It is a form of group-based anonymization that is used to preserve privacy in data sets by reducing the granularity of a data representation. An equivalence class is said to have l-diversity if there are at least l "well-represented" values for the sensitive attribute. Distinct l-diversity, entropy l-diversity and recursive (c, l)-diversity are the different types of l-diversity models.
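The two definitions above can be checked mechanically once the quasi-identifier columns are known. The following is a minimal sketch, not the authors' implementation: the function names, the column names and the toy release are all hypothetical, and only the "distinct" variant of l-diversity is covered.

```python
from collections import defaultdict

def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in quasi_identifiers)
        classes[key].append(row)
    return classes

def is_k_anonymous(rows, quasi_identifiers, k):
    """Every equivalence class must contain at least k rows."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return all(len(group) >= k for group in classes.values())

def is_distinct_l_diverse(rows, quasi_identifiers, sensitive, l):
    """Every equivalence class must contain at least l distinct sensitive values."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return all(len({r[sensitive] for r in group}) >= l
               for group in classes.values())

# Hypothetical, already generalized release (illustrative values only).
release = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "cancer"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
]

print(is_k_anonymous(release, ["age", "zip"], 2))
print(is_distinct_l_diverse(release, ["age", "zip"], "disease", 2))
```

In this toy release every class has two rows, so it is 2-anonymous, but the second class carries a single disease value, so it is not distinct 2-diverse.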

t-closeness
t-closeness is a further refinement of l-diversity. Like l-diversity, it is a group-based anonymization that is used to preserve privacy in data sets by reducing the granularity of a data representation. An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. Because it constrains the whole distribution of the sensitive attribute rather than only the number of distinct values, t-closeness anonymization is more effective than many other privacy-preserving data mining methods.
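As an illustration of the definition, the sketch below checks t-closeness for a numerical sensitive attribute. It assumes the ordered-distance form of the EMD, in which the ground distance between the i-th and j-th values of an m-value domain is |i - j| / (m - 1); under that assumption the EMD reduces to a normalized sum of absolute cumulative differences. The function names and the salary data are hypothetical.

```python
from collections import Counter

def distribution(values, domain):
    """Relative frequency of each domain value, in domain order."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in domain]

def emd_ordered(p, q):
    """EMD between two distributions over the same ordered domain,
    assuming ground distance |i - j| / (m - 1)."""
    m = len(p)
    if m < 2:
        return 0.0
    running = 0.0   # cumulative sum of (p_i - q_i)
    total = 0.0     # sum of absolute cumulative sums
    for pi, qi in zip(p, q):
        running += pi - qi
        total += abs(running)
    return total / (m - 1)

def satisfies_t_closeness(classes, domain, t):
    """True if every class is within EMD t of the overall distribution."""
    overall = distribution([v for c in classes for v in c], domain)
    return all(emd_ordered(distribution(c, domain), overall) <= t
               for c in classes)

# Hypothetical salaries (in thousands) split into three equivalence classes.
domain = list(range(3, 12))
classes = [[3, 4, 5], [6, 8, 11], [7, 9, 10]]

overall = distribution([v for c in classes for v in c], domain)
for c in classes:
    print(c, round(emd_ordered(distribution(c, domain), overall), 3))
print(satisfies_t_closeness(classes, domain, 0.4))
```

The class holding only the three lowest salaries is the furthest from the overall distribution, which matches the intuition that it leaks the most about its members.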

RELATED WORKS
In recent years a lot of research work has been carried out to preserve data privacy before releasing data for various research purposes. This work adopts techniques such as data auditing, data modification, cryptographic methods and k-anonymity (Ren et al. 2012).
In modification-based techniques, a number of methods have been developed for data mining tasks such as classification, association rule discovery and clustering (Weng et al. 2015). On the premise that selective data modification or sanitization is an NP-hard problem, alteration techniques can be used to address the complexity issues, such as swapping values between records or replacing the original values.
In cryptographic methods, data is encrypted using protocols such as secure multiparty computation (SMC) (Bayardo et al. 2005). Cryptography is the study of mathematical techniques related to aspects of information security such as confidentiality, data integrity, entity authentication and data origin authentication, and it is shaping the way that information is safely and securely transmitted over the internet. The range of sensitive information is wide and includes credit card information, security numbers, private correspondence, military statements and bank account information.
Reconstruction-based techniques are techniques where the original distribution of the data is reconstructed from the randomized data.

PRIVACY PROTECTED DATA PUBLISHING TECHNIQUES
This section analyzes how rule-based slicing can provide membership disclosure protection.

Bucketization
This subsection examines how an adversary can infer membership information from bucketization. Because bucketization releases each tuple's combination of QI values in their original form, and most individuals can be uniquely identified using the QI values, the adversary can determine the membership of an individual in the original data by examining whether the individual's combination of QI values occurs in the released data (Duncan et al. 2001).

Rule Based Slicing
Slicing offers protection against membership disclosure because QI attributes are partitioned into different columns and correlations among different columns within each bucket are broken (Lambert et al. 1986).
Two quantitative measures are proposed for the degree of membership protection offered by rule-based slicing, which take into account the background knowledge about the data.
The first is the fake-original ratio (FOR), which is defined as the number of fake tuples divided by the number of original tuples. Intuitively, the larger the FOR, the more membership protection is provided.
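The ratio can be illustrated on a toy sliced table. The sketch below is only an illustration under an explicit assumption that is not spelled out in the text: a "fake" tuple is taken to be any combination of column values within a bucket that does not occur among that bucket's original tuples. The function name and the data are hypothetical.

```python
from itertools import product

def fake_original_ratio(buckets):
    """buckets: list of buckets. Each bucket is a list of original tuples,
    and each tuple is already split into column pieces,
    e.g. (("22", "M"), ("flu",)).
    Assumption: every cross-column combination inside a bucket that is
    not an original tuple counts as a fake tuple."""
    fake = 0
    original = 0
    for bucket in buckets:
        originals = set(bucket)
        n_columns = len(bucket[0])
        column_values = [{t[c] for t in bucket} for c in range(n_columns)]
        candidates = set(product(*column_values))
        fake += len(candidates - originals)
        original += len(bucket)
    return fake / original

# Hypothetical sliced table: column 1 = (age, sex), column 2 = (disease,).
buckets = [
    [(("22", "M"), ("flu",)), (("22", "F"), ("cancer",))],
    [(("52", "F"), ("flu",)), (("54", "M"), ("flu",))],
]

print(fake_original_ratio(buckets))
```

The first bucket admits two extra combinations, while the second admits none because both of its tuples share the same disease value, giving a ratio of 2 fake tuples to 4 original tuples.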

Generalization
By generalizing attribute values into "less-specific but semantically consistent values", generalization offers some protection against membership disclosure.
It has been shown that generalization alone (e.g., used with k-anonymity) may leak membership information if the target individual is the only possible match for a generalized record (Givens et al. 1984). The intuition is similar to our rationale for fake tuples. If a generalized tuple does not introduce fake tuples (i.e., none of the other combinations of values are reasonable), there will be only one original tuple that matches the generalized tuple, and the membership information can still be inferred. Also, the protection against membership disclosure depends on the choice of the background table. Therefore, with careful anonymization, generalization can offer some level of membership disclosure protection.

K-anonymity
K-anonymity is a popular measure of privacy for data publishing. It measures the risk of identity disclosure for individuals whose personal information is released in the form of published data for statistical analysis and data mining purposes (e.g., census data) (Iyengar et al. 2002). Higher values of k denote a higher level of privacy (smaller risk of disclosure).
In many applications, the data records are made available by simply removing key identifiers such as the name and social-security numbers from personal records. However, other kinds of attributes (known as pseudo-identifiers) can be used in order to accurately identify the records. For example, attributes such as age, zip-code and sex are available in public records such as census rolls. When these attributes are also available in a given data set, they can be used to infer the identity of the corresponding individual. A combination of these attributes can be very powerful, since they can be used to narrow down the possibilities to a small number of individuals.
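The linking step described above can be sketched as a join between the released table and a public record on the pseudo-identifiers. This is a toy illustration only: the names, attribute values and the `link` helper are hypothetical.

```python
def link(released, public, quasi_identifiers, sensitive):
    """Return {name: sensitive value} for every public record whose
    pseudo-identifier values match exactly one released row."""
    index = {}
    for row in released:
        key = tuple(row[a] for a in quasi_identifiers)
        index.setdefault(key, []).append(row)
    reidentified = {}
    for person in public:
        key = tuple(person[a] for a in quasi_identifiers)
        matches = index.get(key, [])
        if len(matches) == 1:          # unique match: identity is exposed
            reidentified[person["name"]] = matches[0][sensitive]
    return reidentified

# Released data with names removed (hypothetical values).
released = [
    {"age": 34, "zip": "13053", "sex": "F", "disease": "cancer"},
    {"age": 34, "zip": "13053", "sex": "M", "disease": "flu"},
    {"age": 51, "zip": "14850", "sex": "M", "disease": "flu"},
    {"age": 51, "zip": "14850", "sex": "M", "disease": "asthma"},
]
# Public record such as a census roll (hypothetical values).
public = [
    {"name": "Alice", "age": 34, "zip": "13053", "sex": "F"},
    {"name": "Bob", "age": 51, "zip": "14850", "sex": "M"},
]

reidentified = link(released, public, ["age", "zip", "sex"], "disease")
print(reidentified)
```

Alice is the only 34-year-old woman in her zip code, so her record is exposed, whereas Bob matches two released rows and stays ambiguous. This is exactly the ambiguity that k-anonymity tries to guarantee for everyone.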
In k-anonymity techniques, we reduce the granularity of representation of these pseudo-identifiers with the use of techniques such as generalization and suppression (Koudas et al. 2007). In the method of generalization, attribute values are replaced by less specific ones. For example, the date of birth could be generalized to a range such as year of birth, so as to reduce the risk of identification. In the method of suppression, the value of the attribute is removed completely (Dwork et al. 2011). It is clear that such methods reduce the risk of identification with the use of public records, while also reducing the accuracy of applications on the transformed data. In order to reduce the risk of identification, the k-anonymity approach requires that every tuple in the table be indistinguishably related to no fewer than k respondents. This can be formalized as follows:
Anonymizing Data: k-anonymity

Figure 1: Data Anonymizing Process Flow
There are four basic methods for anonymizing data:
• Replacement: substitute identifying numbers
• Suppression: omit from the released data
• Generalization: for example, replace birth date with something less specific, like year of birth
• Perturbation: make random changes to the data
Take Table 1 for example, with rows and attributes. Each attribute is either part of a quasi-identifier (like a name or address), or is sensitive information (like the fact you had an operation on a particular afternoon). A quasi-identifier is a set of attributes that, perhaps in combination, can uniquely identify individuals (Kullback et al. 2014). Sensitive information includes the attributes that we want to keep private (Lambert, 1993). The driving license number is an identifier; a driving record is sensitive information. The table satisfies k-anonymity if each sequence of values in any quasi-identifier appears with at least k occurrences. Bigger k is better. If the user removes all the attributes except for the problem, we have a highly anonymized data set (k=11). On the other hand, if the user just removes the name and generalizes the zip code and date of birth, we have a less anonymized set. Exercise: convince yourself that k=2 for this set.
Of course, the issue is utility. There is a tradeoff between keeping the data useful for research and maintaining privacy. Researchers and attackers are doing the same thing after all: looking for useful patterns in the data (Xiaokui et al. 2015). With the k=2 data set you can ask questions about correlation of problems with gender, or with geography to some extent (although not very specific geographical factors, like toxic leaks).
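The four basic methods listed above can be sketched on a single record. This is an illustrative sketch only; the field names, the pseudonym format, the ISO date format and the perturbation range are all assumptions made for the example.

```python
import random

def replace_identifier(record, field, pseudonyms):
    """Replacement: swap an identifying number for a stable pseudonym."""
    out = dict(record)
    original = record[field]
    if original not in pseudonyms:
        pseudonyms[original] = f"ID{len(pseudonyms) + 1:04d}"
    out[field] = pseudonyms[original]
    return out

def suppress(record, field):
    """Suppression: omit the attribute from the released record."""
    out = dict(record)
    out.pop(field, None)
    return out

def generalize_birth_date(record, field="birth_date"):
    """Generalization: keep only the year of a YYYY-MM-DD date string."""
    out = dict(record)
    out[field] = record[field][:4]
    return out

def perturb(record, field, spread, rng):
    """Perturbation: add uniform random noise in [-spread, +spread]."""
    out = dict(record)
    out[field] = record[field] + rng.uniform(-spread, spread)
    return out

# Hypothetical record; none of these values come from the paper.
record = {"name": "Alice", "license": "D1234567",
          "birth_date": "1985-03-17", "zip": "13053", "weight_kg": 61.0}
pseudonyms = {}
rng = random.Random(0)  # fixed seed so the example is repeatable

released = replace_identifier(record, "license", pseudonyms)
released = suppress(released, "name")
released = generalize_birth_date(released)
released = perturb(released, "weight_kg", 2.0, rng)
print(released)
```

Each helper returns a copy, so the original record stays intact and the methods can be combined in any order.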

The l-diversity Method
k-anonymity is an attractive technique because of the simplicity of the definition and the numerous algorithms available to perform the anonymization. Nevertheless, the technique is susceptible to many kinds of attacks, especially when background knowledge is available to the attacker. Some kinds of such attacks are as follows:
Homogeneity Attack: In this attack, all the values for a sensitive attribute within a group of k records are the same. Therefore, even though the data is k-anonymized, the value of the sensitive attribute for that group of k records can be predicted exactly.
Background Knowledge Attack: In this attack, the adversary can use an association between one or more quasi-identifier attributes and the sensitive attribute in order to narrow down the possible values of the sensitive field further (Xiaokui et al. 2015). An example is one in which background knowledge of the low incidence of heart attacks among Japanese could be used to narrow down information for the sensitive field of what disease a patient might have. A detailed discussion of the effects of background knowledge on privacy may be found in the existing literature.
Clearly, while k-anonymity is effective in preventing identification of a record, it may not always be effective in preventing inference of the sensitive values of the attributes of that record. Therefore, the technique of l-diversity was proposed, which not only maintains the minimum group size of k but also focuses on maintaining the diversity of the sensitive attributes.
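A homogeneity attack succeeds exactly on those groups whose sensitive values are all identical, and the entropy form of l-diversity rules such groups out. The sketch below is a hedged illustration: it assumes the usual entropy criterion (the entropy of the sensitive values in each class must be at least log l), and the helper names and data are hypothetical.

```python
import math
from collections import Counter

def class_entropy(sensitive_values):
    """Shannon entropy (natural log) of the sensitive values in one class."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def homogeneous_classes(classes):
    """Indices of classes that are open to a homogeneity attack."""
    return [i for i, values in enumerate(classes) if len(set(values)) == 1]

def is_entropy_l_diverse(classes, l, tolerance=1e-12):
    """Assumed criterion: entropy of every class >= log(l).
    The tolerance only absorbs floating-point rounding."""
    return all(class_entropy(v) >= math.log(l) - tolerance for v in classes)

# Two hypothetical 3-anonymous groups of sensitive values.
classes = [
    ["flu", "flu", "flu"],          # every member shares one disease
    ["flu", "cancer", "asthma"],
]

print(homogeneous_classes(classes))
print(is_entropy_l_diverse(classes, 2))
```

Both groups have three members, so the table is 3-anonymous, yet the first group reveals its members' disease outright and therefore fails the diversity check.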

D-Link
Organizations share data about individuals to drive business and comply with law and regulation. However, an adversary may expose confidential information using quasi-identifying attributes (e.g., age, geocode and gender) across disparate data publications (Xiaokui et al. 2015). Privacy protection models (e.g., k-anonymity and its extensions) fail to protect an individual's privacy against this "composition attack". The objective is to enhance the dLink model by providing privacy preservation using t-closeness for published data sets. It includes generalization and suppression.
The Earth Mover's Distance (EMD) is a method to evaluate dissimilarity between two multi-dimensional distributions in some feature space where a distance measure between single features, which we call the ground distance, is given. The EMD "lifts" this distance from individual features to full distributions (Liu et al. 2015). Intuitively, given two distributions, one can be seen as a mass of earth properly spread in space, the other as a collection of holes in that same space. Then, the EMD measures the least amount of work needed to fill the holes with earth (Geravand et al. 2013).
A distribution can be represented by a set of clusters where each cluster is represented by its mean (or mode), and by the fraction of the distribution that belongs to that cluster (Shu et al. 2012).
3. Find correlation between quasi-identifier attributes and sensitive attributes.
4. Find probabilities for each sensitive attribute.
5. Find distance between probabilities.
6. For every class find the score by applying the following formula.
The following section discusses the implementation details and the experimental test-bed set-up for further evaluation.
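The earth-and-holes intuition can be made concrete in one dimension, where the cheapest plan is to match piles and holes in sorted order. The sketch below is an illustration under stated assumptions, not the dLink procedure: positions lie on a line, the ground distance is the absolute difference, total supply equals total demand, and the function returns the total work without normalizing by the amount of mass moved. The name `emd_1d` and the data are hypothetical.

```python
def emd_1d(supply, demand, eps=1e-12):
    """supply, demand: lists of (position, mass) with equal total mass.
    Returns the minimal total work, i.e. the sum of mass * distance moved,
    assuming ground distance |x - y| on a line."""
    piles = sorted(supply)
    holes = sorted(demand)
    i = j = 0
    pile_left = piles[0][1]
    hole_left = holes[0][1]
    work = 0.0
    while i < len(piles) and j < len(holes):
        moved = min(pile_left, hole_left)
        work += moved * abs(piles[i][0] - holes[j][0])
        pile_left -= moved
        hole_left -= moved
        if pile_left <= eps:           # current pile is used up
            i += 1
            if i < len(piles):
                pile_left = piles[i][1]
        if hole_left <= eps:           # current hole is full
            j += 1
            if j < len(holes):
                hole_left = holes[j][1]
    return work

# Hypothetical piles of earth and holes on a line.
supply = [(0, 2), (4, 1)]
demand = [(1, 1), (2, 1), (6, 1)]
print(emd_1d(supply, demand))
```

Here two units at position 0 fill the holes at 1 and 2 (work 1 + 2), and the unit at 4 fills the hole at 6 (work 2), for a total of 5. Higher-dimensional feature spaces need a general transportation solver rather than this sorted sweep.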

EVALUATION METHOD
The Earth Mover's Distance (EMD) was introduced in computer vision as an improved distance measure between two distributions. The most frequent use of EMD is recorded in multimedia database systems (Duncan et al. 2001). The EMD is based on the minimal amount of work required to transform one distribution into another by moving distribution mass between them. Results from running common machine learning algorithms (such as k-means clustering and logistic regression) on a dataset indicate that EMD does not significantly affect the accuracy of data analysis (Swapnil et al. 2016). Further, we show that the method not only relieves analysts of the burden of distributing a privacy budget between data transformation operations, but also manages to provide superior output accuracy. Evaluation criteria for privacy and utility are the central consideration, and they are assessed through benchmarks on minimizing the composition attack using t-closeness with the dLink model.
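The abstract describes converting the sensitive attribute with an S-shaped fuzzy membership function and running k-means on both the original and the modified data. The sketch below shows one way such a comparison could look. It is an assumption-laden illustration: it uses Zadeh's standard S-function with the break points a and b set to the minimum and maximum of the data (the paper does not state its parameters), a plain one-dimensional Lloyd's k-means with a deterministic start, and hypothetical well-separated values.

```python
def s_membership(x, a, b):
    """Zadeh's S-shaped membership function rising from 0 at a to 1 at b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    mid = (a + b) / 2
    if x <= mid:
        return 2 * ((x - a) / (b - a)) ** 2
    return 1 - 2 * ((x - b) / (b - a)) ** 2

def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm in one dimension. Initial centroids are taken at
    evenly spaced positions of the sorted data (an arbitrary choice)."""
    ordered = sorted(values)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)]
                 for i in range(k)]
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members
                           else centroids[c])
        if updated == centroids:       # converged
            break
        centroids = updated
    return labels

# Hypothetical sensitive attribute with three well-separated groups.
original = [20, 22, 25, 60, 62, 65, 95, 97, 99]
a, b = min(original), max(original)
modified = [s_membership(x, a, b) for x in original]

labels_original = kmeans_1d(original, 3)
labels_modified = kmeans_1d(modified, 3)
print(labels_original == labels_modified)
```

Because the S-function is monotone, the ordering of the values is preserved, and on this toy data the cluster memberships coincide. That is not guaranteed in general, since the function compresses values near both ends of the range.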

CONCLUSION
While k-anonymity protects against identity disclosure, it does not provide sufficient protection against attribute disclosure. The notion of l-diversity attempts to solve this problem by requiring that each equivalence class has at least l well-represented values for each sensitive attribute. We have shown that l-diversity has a number of limitations and have proposed a novel privacy notion called t-closeness, which requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold t). As part of future work, data perturbation will help to preserve data so that sensitivity is maintained. In future, we want to propose a hybrid approach of these techniques (Valake et al. 2014).