Rule induction in data mining: effect of ordinal scales
|Article code||Publication year||English article||Persian translation||Word count|
|22031||2002||9-page PDF||Available to order||Not calculated|
Publisher: Elsevier - Science Direct
Journal: Expert Systems with Applications, Volume 22, Issue 4, May 2002, Pages 303–311
Many classification tasks can be viewed as ordinal. Use of numeric information usually provides possibilities for more powerful analysis than ordinal data. On the other hand, ordinal data allows more powerful analysis when compared to nominal data. It is therefore important not to overlook knowledge about ordinal dependencies in data sets used in data mining. This paper investigates data mining support available from ordinal data. The effect of considering ordinal dependencies in the data set on the overall results of constructing decision trees and induction rules is illustrated. The degree of improved prediction of ordinal over nominal data is demonstrated. When data was very representative and consistent, use of ordinal information reduced the number of final rules with a lower error rate. Data treatment alternatives are presented to deal with data sets having greater imperfections.
Data mining may be viewed as the extraction of patterns and models from observed data (Berry & Linoff, 1997). The area of data mining is very broad, as it incorporates techniques and approaches from different research disciplines. Data mining is usually used to answer two main types of application questions (Edelstein, 1997): (1) generate predictions on the basis of available data, or (2) describe behavior captured in the data. Examples of the first type of task include banking (Kiesnoski, 1999), interested in the success of prospective loans; insurance (Goveia, 1999), interested in the probability of fraud; and marketing (Peacock, 1998), interested in identifying the best prospects for direct-mail list campaigns. Examples of the second type include finding out which products are sold together, which infections are connected with surgery, and in which time ranges which groups of customers use a service. In this article we concentrate on the first type of task. To answer the first type of question, three main approaches are used (Edelstein, 1997): (1) classification, (2) regression, and (3) time-series. These models are differentiated on the basis of what we want to predict. If we want to forecast continuous values of the output attribute, regression analysis is mostly used (or time-series analysis if we are concerned with distinctive properties of time). If we have to predict a categorical value for a specific data item (categorical data fits into a small number of discrete categories such as ‘good credit history’ or ‘bad credit history’), we have a classification task to solve. Examples of this type of task are medical or technical diagnostics, loan evaluation, bankruptcy prediction, etc. Classification is one of the most popular data mining tasks. There are many different methods that may be used to predict the appropriate class for the objects (or situations).
Among the most popular ones are logistic regression, discriminant analysis, decision trees, rule induction, case-based reasoning, neural networks, fuzzy sets, and rough sets (Kennedy, Lee, Van Roy, Reed, & Lippman, 1997). Other methods are used as well. The majority of data mining techniques can deal with different data types. The traditional types of data mentioned in applications are continuous, discrete, and categorical. Among these three categories, continuous scales are usually assumed to be numerical, while categorical and discrete data are more varied. Categorical information may be either ordinal (e.g. ‘high’, ‘medium’, ‘low’) or nominal (e.g. ‘blue’, ‘yellow’, ‘red’). Discrete data is an ambiguous data type, as different models can treat it differently. The majority of data mining models (e.g. regression analysis, neural networks, etc.) will consider discrete data as numeric and apply models suitable for numeric data (Lippmann, 1987). In other cases, discrete data can be treated as categorical, viewing it as numeric codes for nominal data, as is done in decision tree or rough set approaches (Quinlan, 1990; Slowinski, 1995). In the latter case, it can be ordinal as well. For example, if we describe cars, we can have an attribute ‘number of doors’ with possible values of 2, 4, 5, 6. It can be considered numeric (the more doors the better), or ‘4 doors’ can be the most desirable value while the others are less attractive. Use of numeric information usually provides possibilities for more powerful analysis than ordinal data. On the other hand, ordinal data allows more powerful analysis when compared to nominal data. It is therefore important not to overlook the knowledge about ordinal dependencies in the data sets.
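The distinction between nominal and ordinal treatment of the same categorical values can be made concrete with a minimal Python sketch. Everything in it is illustrative rather than taken from the paper: the helper names `one_hot` and `rank` are hypothetical, and the two orderings of the ‘number of doors’ attribute are assumed preference orders chosen only to mirror the example above.

```python
# Nominal scale: labels can only be tested for equality.
COLOURS = ["blue", "yellow", "red"]

# Ordinal scale, listed from worst to best: only the order carries meaning,
# not the distance between neighbouring values.
LEVELS = ["low", "medium", "high"]


def one_hot(value, categories):
    """Encode a nominal value as indicator flags (no order is implied)."""
    return [1 if value == c else 0 for c in categories]


def rank(value, scale):
    """Encode an ordinal value by its position on a worst-to-best scale."""
    return scale.index(value)


# 'Number of doors' read two ways. Both orderings are assumptions made
# for illustration; the second ranks 4 doors as the most desirable value.
DOORS_MORE_IS_BETTER = [2, 4, 5, 6]
DOORS_FOUR_IS_BEST = [2, 6, 5, 4]

print(one_hot("yellow", COLOURS))
print(rank("high", LEVELS) > rank("medium", LEVELS))
print(rank(4, DOORS_MORE_IS_BETTER), rank(4, DOORS_FOUR_IS_BEST))
```

The same raw value (4 doors) receives a different rank under each reading, which is why the ordinal interpretation has to be stated explicitly instead of being inferred from the numeric codes.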
Although not as popular in the area of data mining, the qualities of ordinal data have been examined rather thoroughly in the areas of decision analysis and expert systems (Ben-David, 1992; Larichev & Moshkovich, 1994, 1997; Mechitov et al., 1995, 1996; Yager, 1981). Implementation of this knowledge may be useful in some classification problems. The rest of the paper investigates additional data mining support available from ordinal data. The effect of including information on ordinal dependencies in the data set on the overall results of constructing decision trees and induction rules is illustrated. Possibilities for using ordinal properties to evaluate the quality of the training data set are discussed.
Conclusion
Knowledge discovery is a complicated process of extracting useful information from data. It includes many steps, such as data warehousing; target data selection; data cleaning, preprocessing, and transformation; model development and choice of suitable data mining algorithms; evaluation and interpretation of results; and use and maintenance of the discovered knowledge. In the majority of real cases, the knowledge discovery process is iterative and interactive in nature. Results obtained at any step of the process may stimulate changes at earlier steps. At the core of the knowledge discovery process are the data mining methods for extracting patterns from data. These methods can have different goals and may be applied successively to achieve the desired result in the knowledge discovery process (any method that can help in obtaining more information from data is useful). Although the ultimate goal is automated knowledge discovery, the majority of tools available today are designed for expert analysts. These analysts work with the initial data to provide the right data for the right analysis. They analyze interim results at each step of the process and re-adjust data, models, and techniques as necessary. The understanding that some attributes possess ordinal qualities regarding categorical classes may significantly improve the results of the analysis in classification tasks. The data presented in Section 4 shows that just stating these ordinal dependencies in the traditionally used analytical tools may lead to a better predictive model. It usually leads to a reduced number of rules and nodes with a simultaneous reduction in error rate. The car evaluation data set proved to be highly representative and consistent. The primary gain of using ordinal information in this case was a reduction in the number of final rules together with a lower error rate.
In other tasks with less adequate information, ordinal analysis may indicate the chances of a stable and reliable outcome, as was illustrated in the loan application example. This data was an example of a very inconsistent and incomplete data set, even though the dimensionality of the problem was relatively small. It is important to evaluate this factor before applying a data mining technique, as it may require some action, as long as the representativeness of the data is not distorted. Inconsistent data sets can be improved by eliminating instances involved in too many contradictions, by reclassifying some of them, or by reevaluating the description of cases (the criteria and scales used in the task). In any case, information on the quality of the data set makes it possible to understand why the data set is not good enough for the task being considered. The substantial additional information that can be obtained from ordinal dependencies between attribute scales and decision classes is valuable. Although algorithms for ordinal classification are labor intensive (not meant for very large data sets), they can be used at some stages of data analysis; for example, after reducing the number of attributes with other methods (Saarenvirta, 1999), or after feature construction (Major, 1998). Incorporating ordinal classification into data mining classification techniques can improve overall results for specific tasks.
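One way to picture the consistency check described above is the following Python sketch. It rests on an assumption that is not spelled out in this text: a contradiction is taken to be a pair of cases in which one case is at least as good as the other on every ordinal attribute yet is assigned a worse class. The function names and the four loan-style cases are hypothetical and are not data from the paper.

```python
from collections import Counter


def at_least_as_good(a, b):
    """True if case a is no worse than case b on every ordinal attribute."""
    return all(x >= y for x, y in zip(a, b))


def contradictions(cases):
    """Return pairs (i, j) where case i is at least as good as case j
    on all attributes but was assigned a strictly worse class."""
    found = []
    for i, (attrs_i, cls_i) in enumerate(cases):
        for j, (attrs_j, cls_j) in enumerate(cases):
            if i != j and at_least_as_good(attrs_i, attrs_j) and cls_i < cls_j:
                found.append((i, j))
    return found


# Hypothetical cases: (attribute ranks, class), higher values are better.
cases = [
    ((2, 2), 1),  # case 0: best on both attributes, good class
    ((1, 1), 0),  # case 1
    ((2, 1), 0),  # case 2
    ((1, 0), 1),  # case 3: weakest attributes, yet good class
]

pairs = contradictions(cases)

# Count how many contradictions each case takes part in; cases with the
# highest counts are candidates for removal or reclassification.
involvement = Counter(index for pair in pairs for index in pair)

print(pairs)
print(involvement.most_common(1))
```

The pairwise comparison is quadratic in the number of cases, which is in line with the remark that ordinal classification procedures are not meant for very large data sets.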