High-dimension, low-sample-size data, such as microarray gene expression levels, pose numerous challenges to conventional statistical methods. In the particular case of binary classification, some methods, such as the support vector machine (SVM), can deal efficiently with high-dimensional predictors but lack accuracy in estimating the probability of class membership. In contrast, traditional logistic regression (TLR) effectively estimates the probability of class membership for data with low-dimensional inputs but does not handle high-dimensional cases. This study bridges the gap between SVM and TLR through their loss functions. Based on the proposed new loss function, a pseudo-logistic regression and classification approach that combines the strengths of both SVM and TLR is also proposed. Simulation evaluations and real data applications demonstrate that, for low-dimensional data, the proposed method produces regression estimates comparable to those of TLR and penalized logistic regression, and that, for high-dimensional data, the new method achieves higher classification accuracy than SVM while also enjoying enhanced computational convergence and stability.
Technological invention and advances in information processing have revolutionized scientific research and technological development. Many sophisticated large-scale data sets have recently been collected. These new data sets and streams pose numerous challenges to conventional statistical and data mining methods, owing not only to their massive size but also to their high dimensionality.
In this paper, we focus on high-dimension, low-sample-size data, the so-called large p small n data, with binary class label responses. Notable examples include clinical assessment of tumor types from microarray gene expression data, in which the number of variables (genes) far exceeds the number of samples (arrays). The traditional logistic regression (TLR) method effectively estimates the probability of class membership for large n small p data, but does not handle data sets with high-dimensional predictors. Moreover, a monotone likelihood problem occurs when the two classes are fully separable in the predictors (Firth, 1993). In that case, logistic regression gives unreliable estimates. See Albert and Anderson (1984) and Santner and Duffy (1986) for details.
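The monotone likelihood problem can be seen in a small numerical sketch. The example below is illustrative only and is not taken from the paper: the toy data, the single-predictor model without intercept, and the function name `log_likelihood` are all assumptions. With perfectly separated classes, the logistic log-likelihood keeps increasing as the coefficient grows, so no finite maximizer exists.

```python
import numpy as np

def log_likelihood(beta, x, y):
    """Logistic log-likelihood for a single predictor x without intercept.

    Labels y are coded 0/1 and mapped to signs s = -1/+1, so that each
    observation contributes -log(1 + exp(-s * beta * x)).
    """
    s = 2.0 * y - 1.0
    # logaddexp(0, -z) evaluates log(1 + exp(-z)) without overflow
    return float(-np.sum(np.logaddexp(0.0, -s * beta * x)))

# Hypothetical toy data: the two classes are perfectly separated at x = 0.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

betas = [1.0, 10.0, 100.0]
values = [log_likelihood(b, x, y) for b in betas]
for b, v in zip(betas, values):
    print(f"beta = {b:6.1f}  log-likelihood = {v:.6g}")
```

Each increase in the coefficient moves the log-likelihood closer to its supremum of zero, which is never attained. An iterative fitting routine run on such data would therefore drift toward ever larger coefficients rather than converge.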
On the other hand, the support vector machine (SVM) has emerged as a powerful pattern classification tool for high-dimensional data. By means of the dual representation, SVM translates an optimization problem in p variables into a counterpart in n variables. This characteristic enables SVM to deal efficiently with high-dimensional predictors. See Vapnik (1996) and Cristianini and Shawe-Taylor (2000), among many others, for details. Nonetheless, unlike logistic regression, SVM lacks accuracy in estimating the probability of membership in each class. SVM is therefore less appropriate for estimating the class probability, which is of significant importance in various scientific disciplines.
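The change from p variables to n variables can be illustrated with a short sketch. It assumes the standard soft-margin SVM dual with a linear kernel, which may differ in notation from the formulation reviewed later in the paper; the sizes, the random data, and the names `dual_objective`, `K`, and `Q` are hypothetical. The point is only that the dual objective depends on the data through an n x n matrix, whatever the value of p.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 5000                      # hypothetical sizes: few samples, many predictors
X = rng.standard_normal((n, p))      # rows are samples, columns are predictors
y = rng.choice([-1.0, 1.0], size=n)  # class labels coded as -1 / +1

# The dual objective involves the data only through the n x n Gram matrix.
K = X @ X.T                          # linear kernel; shape (n, n), independent of p
Q = (y[:, None] * y[None, :]) * K    # matrix of the dual quadratic form

def dual_objective(alpha):
    """Standard soft-margin SVM dual objective: sum(alpha) - 0.5 * alpha' Q alpha."""
    return float(alpha.sum() - 0.5 * alpha @ Q @ alpha)

# An arbitrary point with n coordinates; it is neither optimal nor
# guaranteed to satisfy the dual constraints, and serves only to show shapes.
alpha = np.full(n, 1e-4)
w = X.T @ (alpha * y)                # implied primal weights: sum_i alpha_i y_i x_i

print(K.shape, alpha.shape, w.shape)
print(dual_objective(alpha))
```

Here the optimization variable `alpha` has n = 20 coordinates, whereas the primal weight vector `w` has p = 5000. An actual SVM fit would maximize the dual objective subject to its box and equality constraints, which this sketch does not attempt.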
In this paper, we aim to develop a high-dimensional regression and classification method that combines the strengths of both SVM and TLR. To achieve this goal, we bridge the gap between SVM and TLR through their loss functions. Based on our proposed new loss function, we further propose a pseudo-logistic regression (PsLR) and classification approach that integrates the classification ability of SVM and the regression capability of TLR. Simulation evaluations and real data applications demonstrate that, for low-dimensional data, the proposed method produces regression estimates comparable to those of TLR and penalized logistic regression (PeLR) (Eilers et al., 2001), and that, for high-dimensional data, the new method achieves higher classification accuracy than SVM while also enjoying enhanced computational convergence and stability. As will be discussed in Section 3.2, PeLR, when applied to high-dimensional data, reduces the size of the estimating equations but cannot genuinely resolve the problems of computational instability and solution non-uniqueness. In contrast, our proposed method effectively overcomes these problems.
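For orientation, the two loss functions being connected can be compared numerically. The sketch below uses the textbook hinge loss and the logistic negative log-likelihood, both written as functions of the margin m = y f(x) with labels coded -1/+1; this coding and the function names are assumptions made here, and the new loss function proposed in this paper is not reproduced.

```python
import numpy as np

def hinge_loss(margin):
    """SVM hinge loss max(0, 1 - m), with m = y * f(x) and y coded -1/+1."""
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic regression negative log-likelihood, log(1 + exp(-m))."""
    return np.logaddexp(0.0, -margin)

def logistic_probability(f):
    """Class probability P(y = +1 | x) implied by the logistic model."""
    return 1.0 / (1.0 + np.exp(-f))

margins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 4.0])
for m, h, l in zip(margins, hinge_loss(margins), logistic_loss(margins)):
    print(f"margin = {m:5.1f}  hinge = {h:.3f}  logistic = {l:.3f}")
```

The hinge loss is exactly zero once the margin reaches one, whereas the logistic loss stays strictly positive and smooth at every margin. Only the logistic model comes with an explicit map from the fitted score to a class probability, which is the regression capability referred to above.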
This paper is organized as follows. In Section 2, we review TLR and SVM, and connect them through their loss functions. In Section 3, we propose the PsLR method. In Section 4, we present some properties of PsLR and propose a bias correction procedure for PsLR estimates. We apply our method to simulated data in Section 5 and to real data sets in Section 6. Section 7 concludes the paper with a discussion. All detailed derivations are deferred to the Appendix.