Data mining is the process of discovering previously unknown, hidden information in large amounts of data, extracting what is valuable, and using it to make important business decisions. It has developed into an information technology in its own right, encompassing regression, decision trees, neural networks, fuzzy sets, rough sets, and support vector machines. This paper puts forward a rough set-based multiple criteria linear programming (RS-MCLP) approach for solving classification problems in data mining. First, we describe the basic theory and models of rough sets and multiple criteria linear programming (MCLP) and analyse their characteristics and advantages in practical applications. Second, we provide a detailed analysis of their respective deficiencies. Because the two methods complement each other, we build RS-MCLP methods and models that integrate their strengths and overcome their weaknesses. We also implement these algorithms and models on the SAS and Windows platforms. Finally, experiments show that the RS-MCLP approach is superior to the single MCLP model and to other traditional classification methods in data mining, and that it markedly improves the accuracy of medical diagnosis and prognosis.
Data mining has been used by many organizations to extract information or knowledge from large volumes of data and then use that information to make critical business decisions. Analysis of the historical data collected in a data warehouse or data mart can yield better insight into customers and, in medicine, a better evaluation of diagnosis and prognosis (Mangasarian et al., 1995); it can improve the quality of decision-making and effectively increase the chance of curing serious illnesses.
From a methodological point of view, data mining can be performed through association, classification, clustering, prediction, sequential patterns, and similar time sequences (Han & Kamber, 2001). For classification, data mining algorithms use the existing data to learn decision functions that map each case of the selected data into a set of predefined classes. Among the various mathematical tools available, including statistics, decision trees, fuzzy sets, rough sets and neural networks, linear programming (Dantzig & Thapa, 1997; Pardalos & Hearn, 2005; Yosukizu, Hirotaki, & Tanino, 1985) has been applied to classification for more than 20 years (Freed & Glover, 1981). Given a set of classes and a set of attribute variables, one can use a linear programming model to define a boundary value separating the classes. Each class is then represented by a group of constraints with respect to that boundary in the linear program. The objective function minimizes the overlapping rate of the classes or maximizes the distance between the classes (Kou et al., 2003; Shi et al., 2001). The linear programming approach yields a classification that is optimal with respect to the chosen objective. It is also flexible enough to build effective models for multi-class problems.
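As an illustration of the boundary-and-constraints idea described above, a two-class linear program of the overlap-minimizing kind can be sketched as follows. The notation is an assumption made for this sketch and is not taken from the text: $A_i$ is the attribute vector of case $i$, $X$ the vector of attribute weights, $b$ the boundary value, $\alpha_i \ge 0$ the amount by which case $i$ falls on the wrong side of the boundary, and $G_1$, $G_2$ the two classes.

```latex
\begin{aligned}
\min_{X,\,b,\,\alpha} \quad & \sum_{i} \alpha_i \\
\text{s.t.} \quad & A_i X \le b + \alpha_i, && A_i \in G_1, \\
                  & A_i X \ge b - \alpha_i, && A_i \in G_2, \\
                  & \alpha_i \ge 0 .
\end{aligned}
```

A distance-maximizing variant would instead reward the margin by which correctly classified cases clear the boundary. In practice such models also need a normalization (for example, fixing $b$ to a nonzero constant) to rule out the trivial solution $X = 0$, $b = 0$.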
However, the MCLP model is not good at dimensionality reduction or at removing redundant information, especially when faced with many attributes and a large amount of data. Fortunately, rough set theory can find a minimal attribute set and efficiently remove redundant information (Pawlak, 1982). Consequently, the RS-MCLP approach to data mining is a promising way to overcome these disadvantages.
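To make the notion of a minimal attribute set concrete, the following is a minimal sketch of rough set attribute reduction on a toy decision table. It is not the reduction algorithm used in this paper: the function names, the brute-force search, and the six-row table are all hypothetical, chosen only to show how an attribute can be dropped without losing the ability to determine the decision.

```python
from itertools import combinations


def partition(rows, attrs):
    """Group row indices into indiscernibility classes over the given attributes."""
    blocks = {}
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        blocks.setdefault(key, []).append(i)
    return list(blocks.values())


def dependency(rows, cond_attrs, decision):
    """Fraction of rows in the positive region, i.e. rows whose
    indiscernibility class maps to a single decision value."""
    positive = 0
    for block in partition(rows, cond_attrs):
        if len({rows[i][decision] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def minimal_reducts(rows, cond_attrs, decision):
    """Brute-force search for the smallest attribute subsets that keep the
    dependency degree of the full attribute set (exponential; toy use only)."""
    full = dependency(rows, cond_attrs, decision)
    for size in range(1, len(cond_attrs) + 1):
        found = [list(subset) for subset in combinations(cond_attrs, size)
                 if dependency(rows, subset, decision) == full]
        if found:
            return found
    return [list(cond_attrs)]


# Hypothetical decision table: "sick" is the decision attribute.
TABLE = [
    {"fever": "yes", "cough": "yes", "age": "young", "sick": "yes"},
    {"fever": "yes", "cough": "no",  "age": "old",   "sick": "yes"},
    {"fever": "no",  "cough": "yes", "age": "young", "sick": "yes"},
    {"fever": "no",  "cough": "no",  "age": "old",   "sick": "no"},
    {"fever": "no",  "cough": "no",  "age": "young", "sick": "no"},
    {"fever": "yes", "cough": "yes", "age": "old",   "sick": "yes"},
]

if __name__ == "__main__":
    print(minimal_reducts(TABLE, ["fever", "cough", "age"], "sick"))
```

In this toy table the attribute `age` is redundant, so a downstream classifier such as an MCLP model would only need the two remaining attributes. Real reduction algorithms use heuristics rather than exhaustive search, since the number of subsets grows exponentially with the number of attributes.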
In this paper, we give a full description of the rough set-based MCLP method and model for classification in data mining. First, the related work section introduces the MCLP model and rough sets in detail, including the algorithms of the MCLP model, rough sets for feature selection, and the strengths of each for classification. We then analyse their respective deficiencies, put forward the methodology of the rough set-based MCLP model, and implement the combined model on the SAS and Windows platforms. Next, we describe the advantages of the RS-MCLP model. Finally, we present a comprehensive example on different data sets, together with the experimental conclusions.
This paper provides a new data mining model and applies it to different decision systems or tables. Experiments show that the rough set-based MCLP model for classification outperforms both the single MCLP model and the rough set method alone. That is, once rough set attribute reduction has removed the redundant information, the speed and performance of the MCLP model improve considerably. For medical diagnosis and prognosis, the model considerably enhances the accuracy of classification and prediction. In future work, we plan to implement a rough set-based fuzzy MCLP model for classification and to extend it to regression methods or unsupervised learning approaches.