طبقه بندی آموزش شبکه های بیزی از نسبت های برچسب
|کد مقاله||سال انتشار||مقاله انگلیسی||ترجمه فارسی||تعداد کلمات|
|29274||2013||16 صفحه PDF||سفارش دهید||15960 کلمه|
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Pattern Recognition, Volume 46, Issue 12, December 2013, Pages 3425–3440
This paper deals with a classification problem known as learning from label proportions. The provided dataset is composed of unlabeled instances and is divided into disjoint groups. General class information is given within the groups: the proportion of instances of the group that belong to each class. We have developed a method based on the Structural EM strategy that learns Bayesian network classifiers to deal with the exposed problem. Four versions of our proposal are evaluated on synthetic data, and compared with state-of-the-art approaches on real datasets from public repositories. The results obtained show a competitive behavior for the proposed algorithm.
In classical supervised classification, the objective is to build a predictive model from a dataset of labeled instances such that, given a new unlabeled example, the model will assign it to one of the already-known class labels. In the most common situation, each instance in the dataset consists of the description of the example and its associated class label . Moreover, other problems where obtaining labeled examples is difficult (semi-supervision) have received considerable attention in the literature . However, in recent years new problems in which the available class-membership information of the provided examples (a.k.a. information of supervision) does not consist of the typical class value for each (labeled) instance have been proposed. Thus, standard learning strategies, which have been developed for learning from supervised or semi-supervised domains, cannot be straightforwardly applied. Therefore, new specific strategies (or adaptations of classical strategies) that learn from the new kinds of non-fully labeled datasets are necessary. The specific techniques, in order to be efficient, are expected to extract as much knowledge as possible from the available information of supervision. In this paper, we deal with a problem in which the relation between an instance and its associated class label is lost. This may be due to the black-box nature of the problem, privacy preserving, non-monitoring process, etc. In this framework, the unlabeled instances are grouped and only global class information is available for the instances of each group: the label proportions. A particular application of this general framework is the problem of embryo selection in assisted reproductive technology (ART) . In the most critical step of an ART cycle, gynecologists have to select the embryos to be transferred to the uterus of the woman among a set of embryos that have been cultured for several days (in Spain, by law, three embryos at most can be transferred in an ART cycle). During the culture period, some relevant features are observed and collected for each individual embryo. Then, after the transference, doctors can observe, using preclinical imaging techniques, the number of those transferred embryos that are implanted (and induce a pregnancy), but it is not possible to know which individual embryo is implanted. Thus, in a dataset for this problem, each instance represents a transferred embryo and each group includes the embryos transferred in the ART cycle that it represents. The class label, which should indicate whether or not the specific transferred embryo became implanted or not, is individually unknown for the instances of the dataset. However, some kind of information of supervision is available for each group of instances: the number of positive instances (implanted embryos) in the corresponding ART cycle. Another real case that involves the same kind of data is that of election votes, where some parties stand for institutions and, in each polling station, each party gets a known number of votes. The global election results are known, but which party each citizen voted for is unknown. By knowing the population census and some socioeconomic data of the voters, it could be possible to estimate the probability of a citizen voting for a party . More real instances of the problem include the analysis of single particle mass spectrometry data , e-commerce , spam and image filtering , fraud detection , etc. The presented problem relates to the multiple-instance learning problem since, in both cases, the training dataset is divided into disjoint groups of instances. Multiple-instance learning (MIL)  is a supervised classification problem where an example is represented by a group of instances and there is a global label per group (or example). In MIL, the objective is to learn from and classify groups of instances. However, the problem we are dealing with considers class label assignments to the individual instances, despite being unknown in training time. There exist in the literature several methods to deal with the learning from label proportions (LLP) problem. The first time that a method was proposed to learn from this kind of data was in , where Kück and Freitas present a MCMC strategy. But it was Musicant et al.  who gave the first definition of the LLP problem, which they called aggregated outputs. They use the counts of labels (instead of proportions) as general class information per group. In their paper, basic adaptations of KNN, ANN, SVM and Decision Trees are proposed. Simultaneously, Quadrianto et al.  gave an alternative definition based on label proportions. Their method, called MeanMap, models the conditional class probability using conditional exponential models. Although their method is primarily defined to deal with problems where the label proportions of the test set are known, it incorporates a functionality that estimates these proportions when they are not given. Following a similar definition of the problem but without requiring label proportions of the test set, Rueping  proposes an algorithm to learn SVMs for this problem. Other authors implement a different strategy to learn from LLP datasets. Their contributions consist of a procedure that firstly reduces the uncertainty of the data provided, estimating the class label of each unlabeled instance. This generates a complete dataset which can be used to train a classifier using any classical method for supervised data. In this way, Chen et al.  proposed a method based on kernel K-means for solving this problem of label assignment. Later, Stolpe and Morik  presented a similar method which solves this problem using an evolutionary strategy that looks for the predictive variable weights that lead to the clustering (K-means) that best fits the label proportions. The main contributions of this paper are as follows: • The development of an algorithm based on the Structural Expectation-Maximization (SEM) strategy  to learn Bayesian network classifiers for the LLP problem. • The development of several variants of the method, two of which have been specifically designed to deal with (complex) LLP scenarios with high degree of uncertainty in the class label of the individual instances. • The use of joint label assignments, i.e. only the label assignments which fulfill the label proportions of the groups are considered. • The proposal of a new framework for testing LLP methods, which covers the whole spectrum of LLP scenarios in terms of complexity for a given dataset. The Bayesian network classifiers that our SEM method learns show a good performance behavior through different LLP scenarios of increasing class uncertainty. Moreover, it obtains competitive results with respect to state-of-the-art methods. The rest of the paper is organized as follows. In the next section, a formal description of a LLP problem is given. Then, three Bayesian network classifiers and the Structural Expectation-Maximization (SEM) strategy, the basic methodologies implemented in our proposals, are explained. Later, four versions of a new algorithm based on the Structural EM strategy which learns Bayesian network classifiers in the LLP framework are proposed. In Section V, the experiments are presented in four subsections: an experimental demonstration of the usefulness of the extra class information provided in the LLP problems using the semi-supervised learning approach as a baseline-performance reference, an evaluation of the approximate reasoning of our method by means of local probabilistic label assignments, an analysis on synthetic data that evaluates the efficacy of our proposals in different experimental conditions, and a comparison with state-of-the-art approaches. Finally, some conclusions and future work are presented.
نتیجه گیری انگلیسی
In this paper we have proposed four competitive versions of a Structural EM method to learn Bayesian network classifiers for a classification problem where the only information of supervision provided consists of label proportions associated with subsets of instances (LLP). We have shown that the label proportions associated to the bags (groups of instances) provide relevant class information that can be used to learn more accurate classifiers. Specifically, our proposal shows a competitive behavior with respect to state-of-the-art techniques, as has been shown in the comparison with the most representative and influential LLP methods. Among the four versions of our method, a probabilistic version that performs exact calculations, PEMLLP, shows the best results, but it is not scalable when the uncertainty of the problem grows. We have overcome this issue with the proposal of another probabilistic version, MCEMLLP, that performs an approximation to the exact version, showing a good behavior in situations that are unaffordable for PEMLLP. The associated statistical tests do not show significant differences between both versions. Finally, we propose the PMEMLLP version, which combines the exact and approximate procedures in a method that only uses approximate reasoning (MCEMLLP) when the exact approach (PEMLLP) is unfeasible. It uses the MCMC-specific parameters to establish a threshold (burn-in, bi , plus number of samples for calculating estimations, s ) for the maximum number of explored valid completions. For future work it would be interesting to study the possibility of calculating automatically and for each bag individually the parameters of the MCMC procedure (burn-in and number of samples). This would imply a non-constant threshold in the maximum number of explored valid completions for each bag in PMEMLLP, so it would also be necessary to study the implications of this decision. More questions could be raised about this issue: would it be worth carrying out the exact approach when the number of valid completions (s i) is slightly larger than the current threshold (bi+s)(bi+s)? That is, if s i exceeds the number of samples generated in the MCMC procedure (bi+s)(bi+s) by a small number of valid completions, which approach should be applied? What should be considered “a small number” in this context? For future work, it would also be interesting to follow the idea expressed by Kück and Freitas  regarding the “relaxation” of the notion of label proportions. That is, considering a problem with groups of instances where each group is provided with the probabilities that a randomly chosen instance of the bag belongs to each possible class label. Learning from this new framework would involve new challenges, although we consider that we could apply the knowledge acquired in this paper as the new framework can be seen as a generalization of the LLP problem.