The objective of supervised learning is to find an input–output relationship behind training samples (Bishop, 2006 and Hastie et al., 2001). Once the input–output relationship is successfully learned, outputs for unseen inputs can be predicted, i.e., the learning machine can generalize.
When users are allowed to choose the location of training inputs, it is desirable to design the input locations so that the generalization error is minimized. Such a problem is called active learning (Settles, 2009) or experiment design (Fedorov, 1972 and Pukelsheim, 1993), and has been shown to be useful in various application areas such as text classification (Lewis and Gale, 1994 and McCallum and Nigam, 1998), age estimation from images (Ueki, Sugiyama, & Ihara, 2010), medical data analysis (Wiens & Guttag, 2010), chemical data analysis (Warmuth et al., 2003), biological data analysis (Liu, 2004), and robot control (Akiyama, Hachiya, & Sugiyama, 2010).
If users are allowed to locate training inputs at any position in the domain, the active learning setup is said to be population-based (Kanamori and Shimodaira, 2003, Sugiyama, 2006 and Wiens, 2000). On the other hand, if users need to choose training input locations from a finite pool of candidate points, it is said to be pool-based (Kanamori, 2007, McCallum and Nigam, 1998 and Sugiyama and Nakajima, 2009). Depending on the way training input locations are chosen, active learning is also categorized into sequential and batch approaches: training inputs are selected one by one iteratively in the sequential approach (Box and Hunter, 1965 and Sugiyama and Ogawa, 2000), while all training inputs are selected at once in the batch approach (Kiefer, 1959 and Sugiyama and Ogawa, 2001). In this paper, we focus on pool-based batch active learning.
Active learning generally induces a covariate shift, that is, a situation where the training and test input distributions are different (Quiñonero-Candela et al., 2009, Shimodaira, 2000 and Sugiyama and Kawanabe, 2012). When a model is correctly specified, covariate shifts do not matter in designing active learning methods. However, for a misspecified model, a covariate shift causes a strong estimation bias, and thus classical active learning techniques that require a correct model become unreliable (Fedorov, 1972 and Kiefer, 1959).
To cope with the bias induced by the covariate shift, active learning techniques that explicitly take model misspecification into account have been developed (Beygelzimer et al., 2009, Kanamori, 2007, Kanamori and Shimodaira, 2003, Sugiyama, 2006, Sugiyama and Nakajima, 2009 and Wiens, 2000). The key idea of covariate shift adaptation is importance weighting: the loss function used for training is weighted according to the importance (i.e., the ratio of the test input density to the training input density). Among the importance-weighted active learning methods, the pool-based batch active learning method for approximate linear regression called P-ALICE (Pool-based Active Learning using Importance-weighted least-squares learning based on Conditional Expectation of the generalization error) was demonstrated to be useful (Sugiyama & Nakajima, 2009).
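As a rough illustration of importance weighting, the following sketch fits a straight line by weighted least squares, with each squared residual weighted by the ratio of test to training input density. It is not the P-ALICE criterion itself. The function names, the toy inputs, and the density values are assumptions made for this example, and the target is deliberately quadratic so that the linear model is misspecified.

```python
def importance_weights(test_density, train_density):
    """Importance = test input density / training input density, per input."""
    return [pt / ptr for pt, ptr in zip(test_density, train_density)]


def weighted_line_fit(xs, ys, ws):
    """Return (a, b) minimizing sum_i w_i * (a + b * x_i - y_i) ** 2."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw  # weighted mean of inputs
    my = sum(w * y for w, y in zip(ws, ys)) / sw  # weighted mean of outputs
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


# Hypothetical toy data: the true function is quadratic, the model is a line.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]

# Assumed densities at the training inputs: test inputs are uniform,
# training inputs are concentrated on small x.
test_density = [0.25, 0.25, 0.25, 0.25]
train_density = [0.4, 0.3, 0.2, 0.1]

ws = importance_weights(test_density, train_density)
fit_uniform = weighted_line_fit(xs, ys, [1.0] * len(xs))
fit_weighted = weighted_line_fit(xs, ys, ws)

print("unweighted (a, b):", fit_uniform)
print("importance-weighted (a, b):", fit_weighted)
```

Under a misspecified model the two fits differ, because the weighted fit emphasizes regions that are rare in training but common at test time. When the model is correctly specified and the data are noise-free, both fits recover the same line.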
However, the original P-ALICE assumes that the number of training samples to gather is much smaller than the size of the sample pool. When this assumption is not satisfied, the importance weight used in P-ALICE is not reliable. In this paper, we propose a new method to set the importance weight that does not rely on this assumption. Our new weighting scheme is based on the inclusion probability (Horvitz & Thompson, 1952), which allows us to precisely capture the relation between the training and test input distributions. Through experiments, we show that the active learning performance of P-ALICE can be improved by the proposed weighting method when the training sample size is relatively large.
The rest of this paper is structured as follows. In Section 2, we formulate the problem of pool-based active learning and give an overview of P-ALICE. In Section 3, we point out a limitation of importance estimation in P-ALICE, and propose an alternative method. In Section 4, experimental results on toy and benchmark datasets are reported. Finally, concluding remarks are given in Section 5.
In this paper, we discussed importance weight estimation in the pool-based batch active learning criterion called P-ALICE. We pointed out that when the number of training samples to gather is not small compared with the pool size, the importance weights used in the original P-ALICE are not accurate. This inaccuracy is due to the influence of sampling without replacement.
To cope with this problem, we proposed an alternative method of importance weight estimation based on the inclusion probability. Because the true inclusion probability is generally unknown, we numerically approximated it by the frequency of selection of each sample through Monte Carlo simulations.
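The Monte Carlo approximation described above can be sketched as follows: draw a batch of distinct pool indices many times and record how often each index is selected. This is a minimal sketch under stated assumptions, not the authors' implementation. In particular, the sequential draw-and-renormalize sampling scheme, the function names, the number of trials, and the example resampling probabilities are all assumptions introduced for illustration.

```python
import random


def sample_without_replacement(probs, m, rng):
    """Draw m distinct indices. Each draw picks index i with probability
    proportional to probs[i] among the indices not yet selected."""
    remaining = list(range(len(probs)))
    weights = list(probs)
    chosen = []
    for _ in range(m):
        k = rng.choices(range(len(remaining)), weights=weights)[0]
        chosen.append(remaining.pop(k))
        weights.pop(k)
    return chosen


def estimate_inclusion_probabilities(probs, m, n_trials=2000, seed=0):
    """Approximate each sample's inclusion probability by its selection
    frequency over n_trials simulated batches of size m."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    for _ in range(n_trials):
        for i in sample_without_replacement(probs, m, rng):
            counts[i] += 1
    return [c / n_trials for c in counts]


# Hypothetical resampling distribution over a pool of four candidates.
probs = [0.4, 0.3, 0.2, 0.1]
m = 2  # batch size, deliberately not small relative to the pool size

pi_hat = estimate_inclusion_probabilities(probs, m)
print("estimated inclusion probabilities:", pi_hat)
print("with-replacement approximation m * p_i:", [m * p for p in probs])
```

Each simulated batch contains exactly m distinct indices, so the estimated inclusion probabilities sum to m. The with-replacement approximation m * p_i also sums to m, but it overstates the chance of selecting high-probability points and can exceed 1 when m is large (for example, m = 3 with p_i = 0.4), which is one way to see why weights that ignore sampling without replacement become inaccurate at high sampling rates.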
The importance weights obtained by the proposed approach are more accurate when the sampling rate is not small, and thus they achieve a lower estimation bias. Furthermore, because the importance weights obtained by the proposed approach tend to be flatter than the original ones, the variance is also reduced. Numerical experiments with toy and benchmark datasets showed that our new weighting scheme gave a statistically significant improvement upon the original P-ALICE.
The importance of active learning research has grown significantly in recent years because labeling costs have become a critical bottleneck in real-world machine learning applications. Given this increasing interest in and demand for active learning, further enhancing active learning performance is an important challenge, for instance, in the context of crowdsourcing.