Stability of continuous value discretisation: an application within rough set theory
Article code | Year of publication | Length of English article |
---|---|---|
29491 | 2004 | 25 pages (PDF) |
Publisher : Elsevier - Science Direct
Journal : International Journal of Approximate Reasoning, Volume 35, Issue 1, January 2004, Pages 29–53
Abstract
Continuous value discretisation (CVD) is the process of partitioning a set of continuous values into a finite number of intervals (categories). This paper introduces a number of stability measures associated with the resultant CVD. The stability measures are constructed from a series of estimated probability distributions for the individual ‘partitioning’ intervals found using the method of Parzen windows. These measures enable comparisons between the results of alternative methods of CVD on their ability to effectively partition the continuous values. A further utilisation of these measures is exposited within rough set theory (RST). RST is a modern approach to the generation of sets of rules enabling the classification of objects to categories based on sets (reducts) of related characteristics. To avoid rules of poor quality (from RST analysis) induced directly from continuous valued characteristics, CVD methods can be used to reduce the associated granularity and allow higher rule quality. The notion of stability introduced enables the further introduction of novel measures particular to reduct and rule set stability within RST.
Introduction
Continuous value discretisation (CVD) is the process of partitioning a set of continuous values (data) into a finite number of intervals (categories). A simple example of CVD is the categorisation of continuous values into a given number of intervals based on equidistant cut-points (equal width discretisation). The study of CVD is an ongoing research topic, including specific CVD technique development and comparison between techniques [20,29]. Data may need to be discretised to improve the utilisation of certain symbolic machine learning methods, including rough set theory (RST), which is a rule based technique for object classification. In the case of RST, to avoid rules of poor quality induced directly from continuous valued characteristics, CVD techniques can be used to reduce the associated granularity and allow higher rule quality (see [3,29]). Recently, Kane et al. [22], relating to more traditional statistical methods, advocated that continuous variables (e.g. financial ratios) should be rank-transformed (discretised) to improve their distributional properties in a company failure prediction setting. Within RST based studies on company failure prediction, the CVD process has often been based on expert opinion and also tradition, habits or convention (see [8,12]). How appropriate and consistent the effects of a CVD process based on expert opinion are is generally not considered. Articles including [9,39] have identified the need to develop new methods of statistical reasoning (with sparse data). Here, the development is not of the actual CVD techniques employed but of a series of measures to describe the effectiveness of any CVD undertaken. Koczkodaj et al. [24], in a philosophical discussion of RST, considered an information system (set of objects described by characteristics) and asked at what stage the CVD process affects the objectivity of the information system.
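As an illustration of the equal width discretisation mentioned above, the following is a minimal sketch rather than the procedure used in the paper. The function names `equal_width_cut_points` and `discretise`, the sample values, and the convention that a value equal to a cut-point falls into the upper interval are all assumptions made for this example.

```python
from bisect import bisect_right


def equal_width_cut_points(values, n_intervals):
    """Return the n_intervals - 1 interior cut-points that split
    [min(values), max(values)] into intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    return [lo + k * width for k in range(1, n_intervals)]


def discretise(values, cut_points):
    """Map each value to the index of the interval that contains it.
    Assumed convention: a value equal to a cut-point goes to the upper interval."""
    return [bisect_right(cut_points, v) for v in values]


# Hypothetical continuous attribute values.
values = [0.0, 1.0, 2.5, 4.0, 7.5, 9.0]
cuts = equal_width_cut_points(values, 3)
labels = discretise(values, cuts)
print(cuts, labels)
```

Note that the cut-points depend only on the minimum and maximum of the data, which is why this method is unsupervised: no decision class is consulted.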
Hence, discretisation of data may bring with it subjective uncertainty, with consideration given to the subjective judgements in establishing the boundary points (cut-points) of the defined intervals. This notion is compounded by a motto given in Duntsch and Gediga [15, p. 594], who believe that underlying the RST philosophy is “Let the data speak for themselves.” It is suggested here that the voice of the data may be muted when CVD has been applied, with the actual data (continuous values) now described by the intervals within which they exist. This highlights the accuracy versus simplicity problem often described by the Occam (razor) Dilemma (see [9] and references and comments contained therein), whereby here there may be more (accurate) rules on the real data or fewer (simpler) rules using the intervals constructed from CVD. In this paper, while the discretised data may be used, knowledge of the positions of the original data in each of the constructed intervals is available and should also be used. In general, after the utilisation of CVD, the original continuous values may be spread non-uniformly within the different intervals constructed. This spread may involve values near the boundary points of the intervals. Since the continuous values may themselves be estimates (inherently imprecise), intervals created with a relatively large number of included values near their boundary points would be undesirable. This paper uses the method of Parzen windows [30] to help construct a measure of stability for an interval, which takes into account the overall spread of the values in the interval. Importantly, this measure is independent of the a priori CVD process employed. That is, the stability measure is calculated from the boundary points defining an interval and the original continuous values in the interval, irrespective of how the boundary points were calculated.

M.J. Beynon / Internat. J. Approx. Reason. 35 (2004) 29–53
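The exact stability formula is not reproduced in this excerpt, so the following is only a sketch of the general idea under stated assumptions. It assumes a Gaussian Parzen window with a hand-picked bandwidth `h`, builds the density estimate from the values that fall inside an interval, and reports the share of that estimated distribution lying between the interval's boundary points. The function names, the kernel, the bandwidth and the sample data are hypothetical and should not be read as the paper's definitions.

```python
import math


def gaussian_cdf(x, mean, h):
    """Cumulative distribution of a Gaussian kernel centred at mean with width h."""
    return 0.5 * (1.0 + math.erf((x - mean) / (h * math.sqrt(2.0))))


def parzen_density(x, values, h):
    """Parzen window density estimate at x, using a Gaussian kernel of width h."""
    norm = len(values) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values) / norm


def interval_mass(values, lower, upper, h):
    """Share of the Parzen estimate, built from the values inside [lower, upper],
    that lies within the interval (an assumed stand-in for interval stability)."""
    inside = [v for v in values if lower <= v <= upper]
    if not inside:
        raise ValueError("no values fall inside the interval")
    total = sum(gaussian_cdf(upper, v, h) - gaussian_cdf(lower, v, h) for v in inside)
    return total / len(inside)


# Two hypothetical sets of values in the same interval [4.0, 6.0].
central = [4.9, 5.0, 5.1]      # clustered away from the boundary points
boundary = [4.05, 5.0, 5.95]   # two values sit next to the boundary points
central_mass = interval_mass(central, 4.0, 6.0, h=0.2)
boundary_mass = interval_mass(boundary, 4.0, 6.0, h=0.2)
print(central_mass, boundary_mass)
```

Under these assumptions, values close to a boundary point leak part of their kernel mass outside the interval, so the second set scores lower than the first. The calculation uses only the boundary points and the original values, not the method that produced the cut-points.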
Further stability measures are constructed based on the aggregation of certain interval stability measures. It follows that these measures enable the comparison of the results from alternative CVD methods. CVD methods can be partitioned into certain groups. The first partition is between supervised methods, which utilise a decision class during discretisation (e.g. minimum-entropy [16]), and unsupervised methods, which do not utilise a decision class (e.g. equal width discretisation). A second partition is between local methods, which act on one set of continuous values only (e.g. equal width discretisation), and global methods, which act on one or more sets of continuous values (e.g. global discretisation [11]). For a general discussion on aspects relating to CVD, see [4,11,13,20,29]. The global discretisation method [11] uses the quality of approximation measure from RST within its stopping criteria. Further CVD methods closely related to RST include: Nguyen [28], who considered the relationship between CVD and identifying specific reducts of certain size; Stefanowski [36], who discretised continuous attributes specifically with a view to the discovery of strong decision rules; and Sun and Gao [37], who incorporated compatibility rough sets within a CVD method. In this paper the stability measure is elucidated through its application on the well known Iris data set, which is intervalised by four different CVD methods. Without loss of generality to the findings in this paper, the notation and vocabulary used throughout is based around that found in the extant RST literature. Since the inception of RST by Pawlak [31,32] it has continued to establish itself as a versatile method of data mining and knowledge discovery [43]. The domain of RST is the information system (decision table) made up of objects each categorised and described by decision and condition attributes respectively.
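To make the supervised versus unsupervised distinction concrete, the sketch below picks a single cut-point by minimising the class-weighted entropy of the two resulting intervals. It is a simplified illustration only: it performs one binary split, and any recursion or stopping rule used by the minimum-entropy method cited as [16] is omitted. The names `entropy` and `best_entropy_cut` and the sample data are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of decision class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_entropy_cut(values, classes):
    """Return (cut_point, weighted_entropy) for the single binary split that
    minimises the class-weighted entropy of the two resulting intervals."""
    pairs = sorted(zip(values, classes))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut-point can separate two equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            # Candidate cut-point: midpoint between adjacent distinct values.
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut, best_score


# Hypothetical attribute values with their decision classes.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
classes = ["a", "a", "a", "b", "b", "b"]
cut, score = best_entropy_cut(values, classes)
print(cut, score)
```

Unlike equal width discretisation, the chosen cut-point here depends on the decision classes, which is what makes the approach supervised.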
RST then attempts to find subsets of condition attributes (reducts) which describe the information system to the same quality as the whole set of condition attributes. The outcome from RST analysis is a collection of rules (rule set), i.e. “if ... then ...” statements, used to classify objects within the information system. Since the stability measures introduced are utilised on individual intervals and the real values contained therein, they can be used to construct measures specifically related to RST. That is, the rules constructed (in RST) are based on sets of criteria, which are attribute-value pairs each relating to a single interval from a specific condition attribute. The RST related stability measures are based specifically on attributes and criteria, at the reduct and rule set levels respectively. Here the proposed RST related stability measures are utilised on a small example data set, and are shown to aid in reduct selection and rule stability within an analysis based on the variable precision rough sets (VPRS) model [41,42]. These measures could subsequently be used in conjunction with already existing measures describing the reliability of rules within RST analysis (see [20,26]). The structure of the rest of the paper is as follows. In Section 2, the stability of the results of CVD is defined, including a description of Parzen windows. In Section 3, an illustration of these stability measures is given using the well known Iris data set. In Section 4, the utilisation of these stability measures within RST is considered. In Section 5, an application of RST is undertaken, further illustrating the use of the stability measures introduced.
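The notion of a reduct described above can be sketched with the classical RST quality of approximation, taken here as the fraction of objects whose condition attribute values determine their decision class unambiguously. This is a brute-force illustration on a tiny hypothetical decision table; it uses classical RST rather than the VPRS model applied in the paper, and the names `quality` and `reducts` are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def quality(table, attrs, decision):
    """Fraction of objects lying in condition classes (objects that agree on
    attrs) that map to exactly one decision value."""
    decisions = defaultdict(set)
    counts = defaultdict(int)
    for row in table:
        key = tuple(row[a] for a in attrs)
        decisions[key].add(row[decision])
        counts[key] += 1
    consistent = sum(counts[k] for k, ds in decisions.items() if len(ds) == 1)
    return consistent / len(table)


def reducts(table, attrs, decision):
    """Minimal non-empty attribute subsets with the same quality as the full set."""
    full = quality(table, attrs, decision)
    found = []
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            if any(set(r) <= set(subset) for r in found):
                continue  # a smaller reduct is contained in it, so not minimal
            if quality(table, subset, decision) == full:
                found.append(subset)
    return found


# Hypothetical decision table; condition values are interval indices after CVD.
table = [
    {"c1": 0, "c2": 0, "c3": 1, "d": "x"},
    {"c1": 0, "c2": 1, "c3": 1, "d": "y"},
    {"c1": 1, "c2": 0, "c3": 0, "d": "x"},
    {"c1": 1, "c2": 1, "c3": 0, "d": "y"},
]
found = reducts(table, ["c1", "c2", "c3"], "d")
print(found)
```

In this toy table the decision depends on `c2` alone, so `("c2",)` is the only reduct. Exhaustive search over subsets is exponential in the number of attributes and is used here only for clarity.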
Conclusion
In this paper new measures relating to the stability of the discretisation of continuous attribute values are introduced. At its base level, a stability value is found for each interval constructed. This allows the stability measure of an attribute to be known. Analogous stability index measures are also constructed. The measures defined are of use in any study that includes the discretisation of continuous values. Importantly, these measures are independent of the discretisation technique employed. One such study where the stability measures are shown to be effectively used is with RST. Here the stability values are used to aid in reduct selection and reliability of rules. More specifically, reduct, rule and rule set stability measures are defined. A large example is considered, which clearly illustrates the role of these measures within RST. This includes the micro aspects of the rules, that is, the individual criteria in each rule. One drawback of this measure (in RST) is its inability to cope with the inclusion of attributes that are already categorical in nature, that is, when an information system is made up of both continuous and categorical attributes. The associated reducts may then also contain both types of attributes, so the reduct stability measure cannot be found in these cases. The inclusion of these measures in this type of system is left for future research.

Table 12. Rules associated with β-reduct r9 = {c1, c4, c7, c8}. Entries for c1, c4, c7, c8 and d1 are listed as extracted; the column positions of blank cells are not recoverable.

Rule | c1, c4, c7, c8, d1 | N_{r9,q} | P_q | RLSI_{r9,q} | N_{r9,q} · RLSI_{r9,q} |
---|---|---|---|---|---|
1 | 1 2 1 | 25 | 0.960 | 0.83946 | 20.98638 |
2 | 1 1 1 | 4 | 0.750 | 0.82907 | 3.31627 |
3 | 0 1 2 | 14 | 0.929 | 0.83861 | 11.74053 |
4 | 0 0 2 | 8 | 1.000 | 0.83946 | 6.71564 |
5 | 0 2 2 | 7 | 1.000 | 0.83861 | 5.87027 |
6 | 0 2 0 2 | 3 | 1.000 | 0.79805 | 2.39415 |
7 | 1 2 0 2 | 1 | 1.000 | 0.79805 | 0.79805 |
8 | 0 1 0 2 | 1 | 1.000 | 0.79805 | 0.79805 |
9 | 0 1 3 | 22 | 1.000 | 0.80914 | 17.80107 |
10 | 1 0 3 | 3 | 0.667 | 0.83861 | 2.51583 |
11 | 1 1 1 3 | 10 | 0.700 | 0.82907 | 8.29068 |