In this article, two semiparametric approaches are developed for analyzing randomized response data with missing covariates in a logistic regression model. One of the two proposed estimators is an extension of the validation likelihood estimator of Breslow and Cain [Breslow, N.E., and Cain, K.C. 1988. Logistic regression for two-stage case-control data. Biometrika 75, 11–20]. The other is a joint conditional likelihood estimator based on both validation and non-validation data sets. We present a large-sample theory for the proposed estimators. Simulation results show that the joint conditional likelihood estimator is more efficient than the validation likelihood estimator, the weighted estimator, the complete-case estimator and the partial likelihood estimator. We also illustrate the methods using data from a cable TV study.
Randomized response techniques (RRT) have been developed (see Warner (1965), Horvitz et al. (1967) and Greenberg et al. (1969)) to obtain more valid estimates when socially sensitive topics are studied. Topics are considered socially sensitive when they are threatening to respondents, for instance questions about illegal or deviant behavior, or questions on subjects that are very personal or stressful. Because of these threats, respondents are less willing to co-operate, and when they do co-operate they tend to give more socially desirable answers. These tendencies unavoidably result in less valid data.
Under RRT, a respondent’s privacy is protected, so the tendency to refuse co-operation or to give non-incriminating or socially acceptable answers decreases and the validity of the data increases. Warner (1965) developed a related-question RRT by introducing two related questions as follows: (A) I am in favor of capital punishment; (B) I am against capital punishment. Further, Horvitz et al. (1967) and Greenberg et al. (1969) developed an unrelated-question RRT model by introducing the following two questions: (A) I am in favor of capital punishment; (C) I was born in January, February or March, where question (C) is not related to (A). A chance game (for instance with dice, playing cards or coins) then decides which of the two statements the respondent answers with “true” or “untrue”. Since such techniques do not reveal to the interviewer the group to which a respondent belongs, they allow an accurate estimate of the true population prevalence of attitudes towards capital punishment. Several papers provide thorough reviews of RRT (e.g., Kuk (1990), Chaudhuri (2002), Chaudhuri and Mukerjee (1985), Kim and Warde (2004, 2005), Saha (2004), Kim et al. (2006) and Cruyff et al. (2008)). Recently, there has been some research on non-randomized response technique models. For example, Yu et al. (2008) proposed two new models for survey sampling with sensitive characteristics and Tian et al. (2007) presented a new survey technique for assessing the association of two binary sensitive variates.
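The unrelated-question design described above can be illustrated with a short simulation. The sketch below is not taken from the article: the function names, the selection probability of 0.70 for statement (A), and the value 0.25 for statement (C) (which assumes births are spread evenly over the twelve months) are all hypothetical choices made for illustration. It relies on the standard moment identity for this design: if the chance device selects (A) with known probability p and the probability of "true" for (C) is a known q, then P("true") = p·π + (1 − p)·q, which can be solved for the sensitive prevalence π.

```python
import random


def simulate_unrelated_question(n, pi_sensitive, pi_unrelated, p_sensitive, seed=0):
    """Simulate n true/untrue answers under an unrelated-question design.

    All parameter names are illustrative assumptions, not the article's notation.
    """
    rng = random.Random(seed)
    answers = []
    for _ in range(n):
        if rng.random() < p_sensitive:
            # Chance device selects the sensitive statement (A).
            answers.append(rng.random() < pi_sensitive)
        else:
            # Chance device selects the unrelated statement (C).
            answers.append(rng.random() < pi_unrelated)
    return answers


def estimate_prevalence(answers, pi_unrelated, p_sensitive):
    """Moment estimator: solve lam = p * pi + (1 - p) * q for pi."""
    lam_hat = sum(answers) / len(answers)
    return (lam_hat - (1 - p_sensitive) * pi_unrelated) / p_sensitive


answers = simulate_unrelated_question(
    n=100_000, pi_sensitive=0.30, pi_unrelated=0.25, p_sensitive=0.70, seed=1
)
pi_hat = estimate_prevalence(answers, pi_unrelated=0.25, p_sensitive=0.70)
print(f"estimated prevalence of (A): {pi_hat:.3f}")
```

The interviewer only ever sees the list of answers, never which statement was selected, yet the prevalence of (A) is still recoverable because p and q are known by design. The price is a larger variance than direct questioning would give, since the estimate is divided by p.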
For the RRT model with completely observable covariates, Scheers and Dayton (1988) presented a theory for a covariate randomized response model that extends the Warner (1965) procedure, and for a covariate extension of the unrelated-question RRT (Greenberg et al., 1969). Corstange (2004) proposed a method to estimate the parameters in a hidden logistic regression. As far as we know, there has been no research on the analysis of data from the unrelated-question RRT with missing covariates. Therefore, we consider logistic regression analysis of data from the unrelated-question RRT with missing covariates. In Section 2, under the unrelated-question RRT and the logistic regression model, two semiparametric estimators are proposed. One of them is an extension of the validation likelihood estimator of Breslow and Cain (1988) and the other is a joint conditional likelihood estimator based on both validation and non-validation data sets. In Section 3, we derive the asymptotic properties of the proposed estimators. In Section 4, we review some existing estimators. In Section 5, a simulation study is conducted to investigate the performance of the proposed estimators. In Section 6, the proposed estimators are applied to a cable TV data set. In Section 7, we provide some concluding remarks. Technical details for the asymptotic normal theory are provided in the Appendix.
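To make the setting concrete, the following is a minimal sketch of the likelihood for an unrelated-question RRT combined with a logistic model when the covariate is fully observed. It is an illustration under stated assumptions and not the semiparametric estimators proposed in this article, which are designed for missing covariates. The sketch assumes a single scalar covariate x, a known probability `p_sensitive` that the chance device selects the sensitive question, and a known probability `pi_unrelated` of a "true" answer to the unrelated question; all names are hypothetical.

```python
import math


def expit(z):
    """Numerically stable logistic function H(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def yes_probability(x, beta0, beta1, p_sensitive, pi_unrelated):
    """P(answer is 'true' | x) as a mixture of the sensitive and unrelated questions."""
    return p_sensitive * expit(beta0 + beta1 * x) + (1 - p_sensitive) * pi_unrelated


def log_likelihood(data, beta0, beta1, p_sensitive, pi_unrelated):
    """Log-likelihood of observed (x, r) pairs, where r is 1 for 'true' and 0 for 'untrue'."""
    total = 0.0
    for x, r in data:
        prob = yes_probability(x, beta0, beta1, p_sensitive, pi_unrelated)
        total += math.log(prob) if r else math.log(1.0 - prob)
    return total


# Toy data: (covariate value, randomized response).
toy_data = [(0.0, 1), (0.0, 0), (1.5, 1), (-0.5, 0)]
print(log_likelihood(toy_data, beta0=0.0, beta1=1.0, p_sensitive=0.70, pi_unrelated=0.25))
```

Maximizing this function over `beta0` and `beta1` with any general-purpose optimizer gives a likelihood-based fit when every x is observed. Once x is missing for some respondents, only the rows with complete data can be evaluated this way.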
We have proposed two semiparametric approaches for logistic regression analysis of randomized response data when the covariates for some subjects are missing at random. In general, the joint conditional likelihood (JCL) estimator outperforms all the other estimators. When the missing rate is high and $\mathrm{corr}(X, W)$ is not low, the JCL estimator can be much more efficient than all the other estimators. However, when the missing rate and $\mathrm{corr}(X, W)$ are both low, the performances of the JCL, validation likelihood (VL) and weighted (WE) estimators can be close to one another.
The main results are presented for the case when both $X$ and $V$ are discrete. An important feature of our proposed method is that no distributional assumptions on the covariates are needed. Although the main results are presented for the case when $V$ is discrete, both the VL and WE estimators can be extended to the case when $V$ is continuous by the approach of Wang and Wang (1997). Moreover, under a normality assumption for $X$ given $(V, Y)$, the JCL estimator can also be extended to the case when both $V$ and $X$ are continuous. However, when no assumptions are made about the distribution of $X$, non-parametric kernel estimation techniques are required to extend our approaches, since the nuisance components involve estimators of both the selection probabilities and the relative risk $E(\exp(\beta_1^{t} X) \mid Y = 0, V)$. The extension of the present work to the non-randomized response technique model (Yu et al., 2008) requires further study.