Identity disclosure protection: A data reconstruction approach for privacy-preserving data mining
Article code | Publication year | Length of English article |
---|---|---|
22174 | 2009 | 8 pages (PDF) |
Publisher: Elsevier - Science Direct
Journal : Decision Support Systems, Volume 48, Issue 1, December 2009, Pages 133–140
English Abstract
Identity disclosure is one of the most serious privacy concerns in today's information age. A well-known method for protecting against identity disclosure is k-anonymity. A dataset provides k-anonymity protection if the information for each individual in the dataset cannot be distinguished from that of at least k − 1 other individuals whose information also appears in the dataset. There is a flaw in k-anonymity, however, that would still allow an intruder to discern the confidential information of individuals in the anonymized data. To overcome this problem, we propose a data reconstruction approach to achieve k-anonymity protection in predictive data mining. In this approach, the potentially identifying attributes are first masked using aggregation (for numeric data) and swapping (for nominal data). A genetic algorithm technique is then applied to the masked data to find a good subset of it. This subset is then replicated to form the released dataset that satisfies the k-anonymity constraint.
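The k-anonymity condition defined in the abstract can be checked mechanically. The following is a minimal sketch, not the authors' code; the function name `is_k_anonymous`, the attribute names, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in `records` occurs in at least k records."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical toy data: "age" and "zip" play the role of potentially
# identifying attributes, "test_result" is the confidential attribute.
records = [
    {"age": "30-39", "zip": "021**", "test_result": "negative"},
    {"age": "30-39", "zip": "021**", "test_result": "positive"},
    {"age": "40-49", "zip": "021**", "test_result": "negative"},
]

# The third record is alone in its group, so 2-anonymity fails.
print(is_k_anonymous(records, ["age", "zip"], 2))
```

Dropping the third record leaves a single group of size two, which satisfies the condition for k = 2.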
English Introduction
Data-mining technologies have enabled organizations to extract useful knowledge from data in order to better understand and serve their customers, and to gain competitive advantages [6], [21] and [26]. While successful business applications of data mining are encouraging, there are increasing concerns about invasions of the privacy of personal information. A survey by Time/CNN [16] revealed that 93% of respondents believed companies selling personal data should be required to gain permission from the individuals whose information is being shared. In another study [9], more than 70% of participants responded negatively to questions related to the secondary use of private information. Concern about privacy threats has caused data quality and integrity to deteriorate. According to [34], 82% of online users have refused to give personal information and 34% have lied when asked about their personal habits and preferences. This study deals with the conflict between privacy and data mining in organizational decision support. Organizations that use their customers' records in data-mining activities are obligated to take actions to protect the identities of the individuals involved. It has been demonstrated that personal identities cannot be adequately protected by simply removing identity attributes from released data. There has been extensive research in the area of statistical databases (SDBs) on how to protect individuals' sensitive data when providing summary statistical information. The privacy issue arises in SDBs when summary statistics are derived from very few individuals' data. In this case, releasing the summary statistics may result in disclosing confidential data.
The methods for preventing such disclosure can be broadly classified into two categories: (i) query restriction, which prohibits queries that would reveal confidential data, and (ii) data perturbation, which alters individual data in a way such that the summary statistics remain approximately the same. In general, both methods have been extensively investigated and employed [1]. Problems in data mining are somewhat different from those in SDBs. A data-mining task, such as classification or numeric prediction, requires working on individual records contained in a dataset. As a result, query restriction is no longer applicable and data perturbation or anonymization becomes the primary approach for privacy protection in data mining. Further, predictive data mining essentially relies on discovering relationships between data attributes. Preserving such relationships may not be consistent with preserving summary statistics. Researchers in the data-mining community have proposed various methods to resolve the conflict between data mining and privacy protection [4], [7], [14], [22] and [23]. For example, a method for building a decision tree classifier from perturbed data is proposed in [3]. A framework for mining association rules from transaction data that have been randomized is presented in [11]. A set of algorithms for hiding sensitive rules is proposed in [36]. Techniques for preserving privacy in distributed data mining are discussed in [8]. A well-known method for privacy protection, called k-anonymity, was proposed in [31] and [33]. The basic idea is to anonymize the data such that each individual cannot be distinguished from a group of other individuals in the data. The method has gained increasing popularity in privacy-preserving data mining. However, the k-anonymity approach would, in some circumstances, still allow a data intruder to disclose the individual confidential information in the k-anonymized data. 
To overcome this problem, we propose a data reconstruction approach to achieve k-anonymity protection in predictive data mining. In this approach, the potentially identifying attributes are first masked using aggregation (for numeric data) and swapping (for nominal data), without considering the k-anonymity constraint. A genetic algorithm technique is then applied to the masked data to find a good subset of it. This subset is then replicated to form the released dataset that satisfies the k-anonymity constraint. An experimental study is conducted to show the effectiveness of the proposed method.
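The three steps described above (mask, select a subset with a genetic algorithm, replicate) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the excerpt does not specify the group size for aggregation, the swapping rule, the fitness function, or the genetic operators, so the fixed-size mean aggregation, the unconstrained random swap, the nearest-neighbor fitness, the mixed distance, and all function names below are hypothetical choices.

```python
import random
from collections import Counter

def aggregate_numeric(values, group_size):
    """Mask a numeric column: sort it, form groups of `group_size`,
    and replace each value with the mean of its group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = [0.0] * len(values)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
    return masked

def swap_nominal(values, rng):
    """Mask a nominal column by randomly permuting its values (assumed rule)."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled

def distance(r, s):
    # Assumed mixed distance: squared numeric gap plus a 0/1 nominal mismatch.
    return (r[0] - s[0]) ** 2 + (r[1] != s[1])

def fitness(subset, rows):
    """Assumed fitness: share of masked rows whose nearest selected row
    carries the same class label (rows are (numeric, nominal, class))."""
    hits = 0
    for row in rows:
        nearest = min(subset, key=lambda i: distance(rows[i], row))
        hits += rows[nearest][2] == row[2]
    return hits / len(rows)

def select_subset_ga(rows, m, rng, pop_size=20, generations=30):
    """Toy genetic algorithm over index subsets of size m."""
    n = len(rows)
    population = [rng.sample(range(n), m) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda s: fitness(s, rows), reverse=True)
        parents = population[:pop_size // 2]          # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = rng.sample(list(set(a) | set(b)), m)  # crossover
            unused = [i for i in range(n) if i not in child]
            if unused and rng.random() < 0.2:             # mutation
                child[rng.randrange(m)] = rng.choice(unused)
            children.append(child)
        population = parents + children
    return max(population, key=lambda s: fitness(s, rows))

def reconstruct(numeric, nominal, classes, k, seed=0):
    """Mask, select len(rows) // k rows, and replicate each one k times."""
    rng = random.Random(seed)
    rows = list(zip(aggregate_numeric(numeric, 2),
                    swap_nominal(nominal, rng), classes))
    subset = select_subset_ga(rows, len(rows) // k, rng)
    return [rows[i] for i in subset for _ in range(k)]

# Hypothetical toy data.
ages = [23, 25, 31, 35, 42, 47, 52, 58]
statuses = ["single", "married", "single", "married",
            "married", "single", "married", "single"]
classes = ["neg", "neg", "neg", "pos", "pos", "pos", "pos", "neg"]
released = reconstruct(ages, statuses, classes, k=2)
print(Counter(released))
```

Because every selected row is copied k times, each record in the released data is indistinguishable from at least k − 1 others by construction; the genetic algorithm only decides which rows are worth keeping.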
English Conclusion
This paper presents a novel instance selection method based on a genetic algorithm for identity disclosure protection. We introduce a data reconstruction approach to achieve k-anonymity protection in privacy-preserving data mining. The empirical evaluation results indicate that our proposed approach can lead to significantly improved performance. The insights gained from this study can help businesses make effective decisions on privacy protection in data mining. Our work illustrates the usefulness of instance selection for privacy protection, and the effectiveness of genetic algorithms for obtaining solutions to the problem. Future research will take into account more complicated situations and characterize the datasets for which this approach is most likely to work well. In particular, we will consider how such parameters as the number of instances, the number of class values, and the number of attributes influence the performance of the algorithm. We will also explore alternative distance measures in addition to the Euclidean distance. The basic idea behind our proposed approach can also be applied to numeric prediction problems such as regression, and we will investigate ways to extend our work to such problems in future research. In a classification problem, there is only one class attribute. By designating the class attribute confidential and non-class attributes non-confidential, we have implicitly assumed that there is only one confidential attribute in the data. The proposed method can be extended to handle multiple nominal confidential attributes. In this situation, we can consider all confidential attributes together as one compound attribute. Suppose, for instance, the Marital Status attribute in the example in Table 1 is also confidential. A compound attribute called “Marital Status × Test Result” can be created, which would have eight categories, formed by different combinations of Marital Status and Test Result values.
The transformed dataset would have three non-confidential attributes and one (compound) confidential attribute. The proposed method can then be applied to this transformed dataset.
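The compound-attribute transformation can be sketched in a few lines. Table 1 is not reproduced in this excerpt, so the category values below (four marital statuses and two test results, giving eight combinations) and the helper name `add_compound_attribute` are assumptions made only for illustration.

```python
from itertools import product

def add_compound_attribute(records, confidential, name):
    """Return copies of `records` in which the listed confidential
    attributes are merged into a single compound attribute `name`."""
    transformed = []
    for r in records:
        merged = {a: v for a, v in r.items() if a not in confidential}
        merged[name] = " × ".join(str(r[a]) for a in confidential)
        transformed.append(merged)
    return transformed

# Hypothetical category values; the actual values come from Table 1.
marital_values = ["Single", "Married", "Divorced", "Widowed"]
test_values = ["Positive", "Negative"]
categories = [f"{m} × {t}" for m, t in product(marital_values, test_values)]
print(len(categories))  # number of compound categories

sample = [{"Age": 30, "Marital Status": "Single", "Test Result": "Negative"}]
print(add_compound_attribute(sample, ["Marital Status", "Test Result"],
                             "Marital Status × Test Result"))
```

The compound attribute can then be treated as the single class attribute when applying the method to the transformed dataset.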