Estimation methods are proposed for fitting logistic regression in which outcome and covariate variables are missing separately or simultaneously. One of the two proposed estimators is an extension of the validation likelihood estimator of Breslow and Cain (1988). The other is a joint conditional likelihood estimator that uses both validation and non-validation data. Large sample properties of the proposed estimators are studied under certain regularity conditions. Simulation results show that the joint conditional likelihood estimator is more efficient than the validation likelihood estimator, weighted estimator, and complete-case estimator. The practical use of the proposed methods is illustrated with data from a cable TV survey study in Taiwan.
Logistic regression is used to describe the relationship between a dichotomous response variable and a set of covariates or explanatory variables; see, e.g., Cox (1970) and Pregibon (1981). The covariates may be continuous or (with dummy variables) discrete. Researchers often use logistic regression to estimate the effects of various covariates on a binary outcome of interest. The model assumes that the log-odds of the outcome is a linear function of the covariates. That is, the variables $(Y, X, Z)$ are assumed to follow the model
$$P(Y=1\mid X,Z)=H(\beta_0+\beta_1^{T}X+\beta_2^{T}Z)=H(\beta^{T}\mathcal{X}). \tag{1}$$
Here $Y$ is a binary outcome, $(X,Z)$ is a vector of covariates, $H(u)=[1+\exp(-u)]^{-1}$, and $\beta=(\beta_0,\beta_1^{T},\beta_2^{T})^{T}$ is the vector of regression parameters for $\mathcal{X}=(1,X^{T},Z^{T})^{T}$. The maximum likelihood method is usually used to estimate $\beta$.
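As an illustration of model (1) with fully observed data, the following minimal sketch fits the logistic regression by maximum likelihood using Newton-Raphson iterations. It is not taken from the paper: the function names, the binary covariates, the sample size, and the parameter values in `beta_true` are all assumptions chosen only for the example.

```python
import numpy as np

def H(u):
    """Logistic function H(u) = 1 / (1 + exp(-u)) from model (1)."""
    return 1.0 / (1.0 + np.exp(-u))

def fit_logistic_mle(design, y, n_iter=25, tol=1e-10):
    """Newton-Raphson maximum likelihood for P(Y=1 | design) = H(design @ beta)."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = H(design @ beta)
        score = design.T @ (y - p)                              # gradient of the log-likelihood
        info = design.T @ (design * (p * (1.0 - p))[:, None])   # observed information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical fully observed data: binary X and Z, illustrative parameter values.
rng = np.random.default_rng(0)
n = 5000
x = rng.integers(0, 2, size=n)
z = rng.integers(0, 2, size=n)
design = np.column_stack([np.ones(n), x, z])   # rows are (1, X, Z)
beta_true = np.array([-0.5, 1.0, 0.5])         # assumed (beta_0, beta_1, beta_2)
y = rng.binomial(1, H(design @ beta_true))

beta_hat = fit_logistic_mle(design, y)
print(beta_hat)
```

With complete data the estimate should lie close to the assumed `beta_true`; the missing-data settings discussed next are exactly the cases in which this direct fit is no longer available for every observation.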
Maximum likelihood estimation requires precise measurements of $Y$ and $(X,Z)$. In practice, however, the recorded data are often measured imperfectly or incompletely, and logistic regression with missing covariates has been an active research area. For example, Breslow and Cain (1988) proposed a pseudo conditional likelihood method for a two-stage case-control study in which, at the second stage, some values of $X$ are observed in each stratum defined by $(Y,W)$, where $W$ is a categorical surrogate. When the missingness of $X$ depends on neither the outcome nor the missing values, Carroll and Wand (1991) and Pepe and Fleming (1991) proposed semiparametric estimation methods that approximate the likelihood without modeling the distribution of $X$ given $(Z,W)$. Little (1992) reviewed related methods in this field. Reilly and Pepe (1995) proposed a mean-score method for discrete covariates when $X$ is missing at random (MAR) (Rubin, 1976). Robins et al. (1994) proposed an efficient estimation method based on computing an optimal score function in semiparametric models. Wang et al. (2002) combined the validation and non-validation data to propose a joint conditional likelihood method. More recently, Chatterjee and Li (2010) developed three estimators (mean score, pseudo-likelihood, and semiparametric maximum likelihood) for regression models under partial questionnaire designs and other study settings that can generate nonmonotone missing data in covariates.
A second common problem in logistic regression analysis arises when outcome data are missing. This topic has been studied by Pepe (1992) and Cheng and Hsueh (1999), who discussed bias correction in the estimation of logistic regression parameters when the binary outcome is subject to missingness and misclassification. Cheng and Hsueh (2003) proposed estimation methods for fitting a logistic regression model when the binary outcome and the covariate values are both subject to measurement error. Note that they assumed the validation data set consists of a primary sample plus a smaller validation subsample, which is obtained by a double sampling scheme. Lee et al. (2012) proposed a semiparametric method to estimate the parameters of a logistic regression model when both covariates and outcome data are missing simultaneously. Zhao et al. (2009) extended the semiparametric maximum likelihood method for missing covariate problems to more general cases in which covariates and/or responses are missing by design; they estimated asymptotic variances and confidence intervals using the profile log-likelihood and the EM algorithm in each case. However, there has been no study on fitting a regression model to a data set in which covariates and outcome may be missing separately or simultaneously. This gap motivates us to propose two estimation methods for that setting.
Let $Y^{0}$ and $W$ be surrogate variables for $Y$ and $X$, respectively. We assume that $W$ is available and is independent of $Y$ given $(X,Z)$. Moreover, $X$, $Z$, and $W$ are assumed to be categorical. Two semiparametric methods are proposed to estimate the logistic regression parameters $\beta$ when the missingness may depend on the observed data. The first method is an extension of the validation likelihood approach of Breslow and Cain (1988). The second is an extension of the joint conditional likelihood method of Lee et al. (2012) and uses both the validation and the non-validation data. For both methods, we make no model assumptions about the probability of missingness and do not specify the conditional distribution of the missing covariates given the observed covariates.
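To make the data structure concrete, the sketch below simulates one hypothetical data set in which the outcome and the covariate are missing separately or simultaneously. Everything numerical here is an assumption made for illustration and does not come from the paper: the indicator names `delta_y` and `delta_x`, the surrogate agreement rates, and the selection probabilities. The sketch also assumes that the surrogates and $Z$ are observed for every subject, so that missingness depends only on observed quantities.

```python
import numpy as np

def H(u):
    """Logistic function H(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(1)
n = 2000

# Categorical (here binary) covariates X and Z.
x = rng.integers(0, 2, size=n)
z = rng.integers(0, 2, size=n)
# Surrogate W for X, generated from X alone, so W is independent of Y given (X, Z).
w = np.where(rng.random(n) < 0.8, x, 1 - x)
# True outcome from a logistic model with assumed coefficients.
y = rng.binomial(1, H(-0.5 + 1.0 * x + 0.5 * z))
# Surrogate outcome: agrees with Y with assumed probability 0.85.
y0 = np.where(rng.random(n) < 0.85, y, 1 - y)

# Observation indicators (1 = observed); selection depends only on (y0, w, z).
delta_y = rng.binomial(1, H(0.5 + 0.5 * y0 + 0.3 * z))
delta_x = rng.binomial(1, H(0.2 + 0.6 * w - 0.3 * z))

# What the analyst actually sees: NaN marks a missing value.
y_obs = np.where(delta_y == 1, y, np.nan)
x_obs = np.where(delta_x == 1, x, np.nan)

# Four observation patterns: missing separately or simultaneously.
counts = {
    "Y and X observed": int(((delta_y == 1) & (delta_x == 1)).sum()),
    "Y observed, X missing": int(((delta_y == 1) & (delta_x == 0)).sum()),
    "Y missing, X observed": int(((delta_y == 0) & (delta_x == 1)).sum()),
    "Y and X missing": int(((delta_y == 0) & (delta_x == 0)).sum()),
}
for label, count in counts.items():
    print(f"{label}: {count}")
```

In this sketch the first pattern plays the role of the validation data, taking "validation" to mean that both $Y$ and $X$ are observed, which is an interpretation adopted here rather than a definition quoted from the text. A complete-case analysis would use only that subset, whereas the remaining three patterns still carry information through $Y^{0}$, $W$, and $Z$.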
The proposed estimators are described in detail in Section 2. In Section 3, we study the asymptotic properties and relative efficiencies of these estimators. Simulation experiments are conducted to investigate their finite-sample performance in Section 4. In Section 5, we apply the proposed methodology and other existing methodology to the cable TV survey data set from Taiwan. Finally, Section 6 provides some concluding remarks.