Multiclass sparse logistic regression for classification of multiple cancer types using gene expression data
|Article code||Publication year||English article||Persian translation||Word count|
|24719||2006||13-page PDF||Available to order||5410 words|
Publisher : Elsevier - ScienceDirect
Journal : Computational Statistics & Data Analysis, Volume 51, Issue 3, 1 December 2006, Pages 1643–1655
Monitoring gene expression profiles is a novel approach to cancer diagnosis. Several studies have shown that sparse logistic regression is a useful classification method for gene expression data. Not only does it give a sparse solution with high accuracy, it also provides the user with explicit classification probabilities in addition to the class labels. However, its optimal extension to more than two classes is not obvious. In this paper, we propose a multiclass extension of sparse logistic regression. Analysis of five publicly available gene expression data sets shows that the proposed method outperforms the standard multinomial logistic model in prediction accuracy as well as gene selectivity.
Constructing a classification rule for tissue samples based on gene expression profiles has received much attention recently due to emerging microarray technology. A new challenge is that the number of genes (i.e. the dimension of the inputs) is much larger than the number of tissue samples, in which case standard classification methods either are not applicable or perform badly. Also, identifying a small subset of informative genes, called marker genes, which discriminate between types of tumors or between tumor and normal tissues, has become an important subject. Hence, good learning algorithms for gene expression data should provide a classification rule which not only yields high accuracy but is also able to identify marker genes. In the related literature, Guyon et al. (2002) proposed a recursive feature elimination technique with support vector machines, Li et al. (2002) introduced two Bayesian approaches with the technique of automatic relevance determination, and Shevade and Keerthi (2003) and Roth (2002) applied sparse logistic regression, to name just a few. Among these tools, sparse logistic regression is a useful classification method for gene expression data. It gives a sparse solution with high accuracy, and it also provides the user with explicit classification probabilities in addition to the class labels. However, its optimal extension to more than two classes is not obvious. A standard multiclass extension of sparse logistic regression might be sparse multinomial logistic (SML) regression (Krishnapuram et al., 2004), which is a sparse version of the multinomial logit model, a popular multiclass formulation in statistics (see, for example, Agresti, 1990). SML, however, has a problem in gene selection. The estimates of the regression coefficients depend on the choice of the baseline class (see Section 2 for the definition), and so do the selected genes.
Hence, some important genes are dropped in the final model, which in turn degrades the prediction accuracy. Empirical results in Section 4 confirm this observation. In this paper, we propose a new multiclass extension of sparse logistic regression called sparse one-against-all logistic (SOVAL) regression, whose main idea is to reduce a multiclass problem to multiple binary problems and to construct a classifier using the reduced binary problems simultaneously. By analyzing five real gene expression data sets, we show that SOVAL outperforms SML in prediction accuracy as well as gene selectivity. The paper is organized as follows. In Section 2, SOVAL and SML are presented. A computational algorithm based on the gradient LASSO algorithm of Kim et al. (2005) is given in Section 3. Results of numerical experiments are presented in Section 4, and concluding remarks follow in Section 5.
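The one-against-all reduction described above can be illustrated with a minimal sketch. It is not the authors' algorithm: it swaps in plain proximal gradient descent (soft-thresholding) instead of the gradient LASSO of Kim et al. (2005), uses a penalized rather than constrained L1 formulation, and fits each binary problem separately with a shared penalty `lam`. All function names, the toy data, and the parameter values are hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    # Proximal operator of t * |v|: shrinks v toward zero, exactly 0.0 if |v| <= t.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_sparse_binary(X, y, lam=0.05, lr=0.2, n_iter=500):
    """L1-penalized logistic regression by proximal gradient (intercept unpenalized)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Residuals p_i - y_i of the current fit.
        resid = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - yi
                 for x, yi in zip(X, y)]
        grad_b = sum(resid) / n
        grad_w = [sum(r * x[j] for r, x in zip(resid, X)) / n for j in range(p)]
        b -= lr * grad_b
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b

def fit_one_vs_all(X, labels, n_classes, **kwargs):
    # One sparse binary problem per class: class k against all other classes.
    return [fit_sparse_binary(X, [1.0 if l == k else 0.0 for l in labels], **kwargs)
            for k in range(n_classes)]

def class_probs(models, x):
    # One logistic probability per class; these need not sum to 1.
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) for w, b in models]

def predict(models, x):
    probs = class_probs(models, x)
    return max(range(len(probs)), key=probs.__getitem__)

def make_toy_data():
    # Features 0-2: feature k is elevated only in class k ("marker genes").
    # Features 3-4: +/-1 nuisance values, balanced within every class.
    X, labels = [], []
    for k in range(3):
        for jitter in (-0.5, 0.0, 0.5):
            for s1 in (1.0, -1.0):
                for s2 in (1.0, -1.0):
                    informative = [2.0 + jitter if i == k else 0.0 for i in range(3)]
                    X.append(informative + [s1, s2])
                    labels.append(k)
    return X, labels

X, labels = make_toy_data()
models = fit_one_vs_all(X, labels, n_classes=3)
for k, (w, b) in enumerate(models):
    selected = [j for j, wj in enumerate(w) if wj != 0.0]
    print(f"class {k}: selected features {selected}")
```

On this toy data the two balanced nuisance features receive exactly zero weight in every per-class model, and no baseline class has to be chosen because each class gets its own coefficient vector.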
Conclusion
In this paper, we proposed a multiclass extension of sparse logistic regression, called SOVAL, compared it with SML, and developed an efficient computational algorithm suitable for gene expression data. The numerical experiments showed that SOVAL outperforms SML in many respects. SOVAL (i) gives better accuracies; (ii) has higher power to detect important genes; and (iii) does not require the choice of a baseline class. The main idea of SOVAL is somewhat related to Scott's method of estimating a mixture model (Scott, 2001; Scott, 2004). Scott's method relaxes a constraint on the density function and focuses on a particular component rather than on all components. SOVAL likewise relaxes the constraint that the class probabilities sum to 1 and implicitly finds genes important for a specific class rather than for all classes. This similarity may partially explain the good prediction performance of SOVAL. We leave this conjecture for future work. We have seen that the genes selected by SOVAL differ considerably from those selected by the marginal F-ratio. This is partly because SOVAL measures the classification power of genes for a specific class, while the marginal F-ratio measures the overall effect of genes on all classes. Hence, if one wants to detect genes which affect a specific class, SOVAL is more suitable. In this view, SOVAL can be considered a new way of detecting relevant genes and can be used as a preprocessing procedure for more complicated non-linear classification methods such as the support vector machine or boosting. This use, however, requires efficient computational algorithms, since one must work with large numbers of genes without prescreening; the algorithm proposed in this paper can serve that purpose.
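The relaxed sum-to-one constraint can be seen in a small numerical sketch. It assumes that the multinomial logit model fixes the baseline class score at 0 and that a one-against-all model gives each class its own logistic probability; the function names and the score values are hypothetical and are not taken from the paper.

```python
import math

def multinomial_logit_probs(scores):
    """Softmax over K-1 linear scores plus a baseline class whose score is fixed at 0."""
    full = list(scores) + [0.0]            # baseline class appended last
    m = max(full)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in full]
    total = sum(exps)
    return [e / total for e in exps]

def one_vs_all_probs(scores):
    """One independent logistic probability per class; no sum-to-one constraint."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

sml_like = multinomial_logit_probs([2.0, -1.0])     # 3 classes, the last is the baseline
soval_like = one_vs_all_probs([2.0, -1.0, 0.5])     # 3 classes, one score per class

print("multinomial logit sum:", sum(sml_like))      # constrained to 1 (up to rounding)
print("one-against-all sum:  ", sum(soval_like))    # generally differs from 1
```

In both cases a sample is assigned to the class with the largest probability; only the multinomial logit probabilities are tied together through the baseline class.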