Estimating the linear regression model with categorical covariates subject to randomized response
Article code | Publication year | Length of English article |
---|---|---|
24202 | 2006 | 13 pages (PDF) |
Publisher: Elsevier - Science Direct
Journal: Computational Statistics & Data Analysis, Volume 50, Issue 11, 20 July 2006, Pages 3311–3323
Abstract
The maximum likelihood estimation of the iid normal linear regression model where some of the covariates are subject to randomized response is discussed. Randomized response (RR) is an interview technique that can be used when sensitive questions have to be asked and respondents are reluctant to answer directly. RR variables are described as misclassified categorical variables where conditional misclassification probabilities are known. The likelihood of the linear regression model with RR covariates is derived and a fast and straightforward EM algorithm is developed to obtain maximum likelihood estimates. The basis of the algorithm consists of elementary weighted least-squares steps. A simulation example demonstrates the feasibility of the method.
Introduction
Randomized response (RR) is an interview technique that can be used when sensitive questions have to be asked and respondents are reluctant to answer directly (Warner, 1965; Chaudhuri and Mukerjee, 1988). Examples of sensitive questions are questions about alcohol consumption, sexual behavior or fraud. RR variables can be seen as misclassified categorical variables where conditional misclassification probabilities are known. The misclassification protects the privacy of the individual respondent. This paper applies the ideas in Spiegelman et al. (2000) to iid normal linear regression models where some of the covariates are subject to RR. Spiegelman et al. (2000) discuss the logistic regression model with misclassified covariates and estimate the misclassification using main study/validation study designs. The misclassification model of an RR design, however, is different, since the conditional misclassification probabilities are known to the analyst of an RR data set. This paper specifies the misclassification model of RR and shows how the misclassification can be taken into account in the maximum likelihood estimation of the linear regression model. Furthermore, as an alternative to Newton–Raphson maximization of the likelihood function, an EM algorithm (Dempster et al., 1977) is presented.
There is a substantial literature on RR and on adjusting for the misclassification in the analysis; see, e.g., probability estimation in Chaudhuri and Mukerjee (1988), Bourke and Moran (1988), Moors (1981) and Migon and Tachibana (1997), the logistic regression model with an RR dependent variable in Maddala (1983), and loglinear models in Chen (1989) and Van den Hout and Van der Heijden (2004). RR variables as covariates, however, have not been dealt with. Being able to include RR variables in regression models broadens the range of applications of RR. As an example, consider the situation where one variable depends on a second variable that models sexual behavior. If respondents are reluctant to answer about their behavior directly, RR can be used. In that case, a standard regression model is incorrect since it does not take into account the misclassification due to the use of RR.
A second field that may benefit from the discussion in this paper is statistical disclosure control. There is a similarity between RR designs and the post randomization method (PRAM) as a method for disclosure control of data matrices, see Gouweleeuw et al. (1998). Disclosure control aims at safeguarding the identity of respondents after data have been collected, see, e.g., Bethlehem et al. (1990). If privacy is sufficiently protected, data producers, such as national statistical institutes, can safely pass on data to a third party. The idea of PRAM is to misclassify some of the categorical variables in the original data matrix and to release the perturbed data together with information about the misclassification mechanism. In this way, PRAM introduces uncertainty in the data, i.e., the user of the data cannot be sure whether the individual information in the matrix is original or perturbed due to PRAM. Since the variables that are perturbed are typically covariates such as Gender, Ethnic Group and Region, it is important to know how to adjust regression models in order to take into account the misclassification. PRAM can be seen as a specific form of RR, and the idea of using RR in this way goes back to the originator of RR, see Warner (1971). Similarities and differences between PRAM and RR are discussed in Van den Hout and Van der Heijden (2002). Domingo-Ferrer and Torra (2001) compare PRAM with other methods for disclosure control, and Willenborg and De Waal (2001, Chapter 5) discuss the derivation of misclassification probabilities by means of linear programming.
The outline of the paper is as follows. Section 2 introduces the RR model. Section 3 discusses the linear regression model with RR covariates. In Section 4, an EM algorithm is presented that maximizes the likelihood formulated in Section 3. Section 5 discusses the necessity of adjustment for misclassification and presents some simulation results. Section 6 concludes.
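
The view of an RR variable as a misclassified categorical variable with known conditional misclassification probabilities can be illustrated with a small simulation. The sketch below is not taken from the paper: the function names `perturb` and `estimate_true_distribution`, the matrix `P`, and the numbers in the demonstration are assumptions chosen for illustration. It uses the convention `P[j, k] = P(observed category j | true category k)` and recovers the true category distribution with the moment estimator that solves `P @ pi = lambda`, which is a standard device in the RR literature rather than the regression method discussed here.

```python
import numpy as np


def perturb(x_true, P, rng):
    """Misclassify true categories using known probabilities.

    P[j, k] is the (assumed known) probability of observing category j
    when the true category is k, so each column of P sums to one.
    """
    P = np.asarray(P, dtype=float)
    K = P.shape[0]
    u = rng.random(x_true.size)
    # Cumulative distribution of the observed answer for each respondent.
    cum = np.cumsum(P[:, x_true], axis=0)          # shape (K, n)
    # Inverse-CDF draw: count how many cumulative values lie below u.
    x_obs = (u[None, :] > cum).sum(axis=0)
    return np.minimum(x_obs, K - 1)                # guard against rounding


def estimate_true_distribution(x_obs, P):
    """Moment estimate of the true category probabilities."""
    P = np.asarray(P, dtype=float)
    K = P.shape[0]
    lam = np.bincount(x_obs, minlength=K) / x_obs.size  # observed shares
    return np.linalg.solve(P, lam)                      # solves P @ pi = lam


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    P = np.array([[0.8, 0.2],
                  [0.2, 0.8]])                     # hypothetical RR design
    x_true = (rng.random(50_000) < 0.3).astype(int)
    x_obs = perturb(x_true, P, rng)
    print("observed share of category 1:", x_obs.mean())
    print("adjusted estimate:", estimate_true_distribution(x_obs, P))
```

The observed share is pulled toward one half by the perturbation, while the adjusted estimate uses the known `P` to undo it on average. The moment estimator can fall outside [0, 1] in small samples, which is one reason likelihood-based approaches are often preferred.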
Conclusion
This paper presents a method that estimates the iid normal linear regression model with RR covariates. An EM algorithm is presented as an alternative to Newton–Raphson maximization of the loglikelihood. In general, an EM algorithm is considered a stable but somewhat slow maximization routine, and if a Newton–Raphson type of algorithm is possible, it is preferred since it is faster and the estimation of standard errors is almost automatic. However, the present loglikelihood can be quite complex numerically. Especially in the case of PRAM, there might be a large number of perturbed covariates, some of which may have a large number of categories. Consider a model that includes the variables Gender (2 categories), Ethnic Background (4 categories), and Region (20 categories). Assume that PRAM is applied. Besides the regression parameters, there are 2 × 4 × 20 − 1 = 159 nuisance parameters in the maximization of (8). When the number of parameters is not too large, using ready-made general purpose maximization routines may be an option. However, these routines are blind with respect to the specific structure of the maximization problem at hand. The EM routine developed in this study exploits the structure of the normal linear regression model with known misclassification probabilities for discrete covariates. Since in this case the maximization in the M-step is in closed form, the EM algorithm is actually very fast. Moreover, it appeared to be quite robust with respect to starting values: proper convergence was always obtained when using the straightforward OLS estimates for that purpose. To obtain estimated standard errors we applied a final Newton–Raphson type optimization step. For a more general discussion of the possibility to extend the EM algorithm in order to obtain standard errors see Little and Rubin (2002, Section 9.2).
The simulations in Section 5 show that the perturbation caused by using RR or PRAM cannot be ignored. Furthermore, the simulations demonstrate that the method presented in this paper is feasible and that adjustment for the perturbation is possible. However, sample sizes have to be larger than in the standard situation without perturbation. Protecting privacy is not for free.
An important assumption in this discussion is that respondents follow the RR design. This assumption will not always be justified. For instance, it might be that some respondents do not trust the privacy protection offered by the RR design and give a socially desirable answer anyway. These respondents bring about a second perturbation besides the misclassification due to the RR design. This obviously introduces a bias in the results. Nevertheless, one must make do with what one has, and the general idea is that RR performs relatively well (Lensvelt-Mulders et al., 2005). Future RR surveys may profit from research into non-compliance, see Boeije and Lensvelt-Mulders (2002).
Estimation of the parameters in the linear regression model is only the first step in fitting a linear regression model to data. Future research should address possibilities to check some of the assumptions of the model. Can outliers, for instance, be detected when some of the covariates are subject to misclassification? Although this paper only discusses the iid normal linear regression model, it might be interesting to investigate the approach for more sophisticated regression models. The reclassification model will stay the same, but it might be that a straightforward EM is not possible anymore.
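
To make the structure of such an EM routine concrete, the following is a minimal sketch for the simplest possible case: a single categorical RR covariate and no other covariates. It is an illustration under stated assumptions, not the authors' implementation. The function name `em_rr_regression`, the argument names, and the convention `P[j, k] = P(observed j | true k)` are all hypothetical. The E-step computes, for each respondent, the posterior probability of each true category given the response and the observed category; the M-step updates the category probabilities (the nuisance parameters), solves a weighted least-squares problem in closed form, and updates the error variance. Starting values come from OLS on the observed covariate, as the text describes. Standard errors are not computed here.

```python
import numpy as np


def em_rr_regression(y, x_obs, P, n_iter=500, tol=1e-10):
    """EM sketch for y = D[x] @ beta + N(0, sigma2) with x observed via RR.

    P[j, k] = P(observed category j | true category k), assumed known.
    Returns (beta, sigma2, prev), where prev holds the estimated
    probabilities of the true categories.
    """
    y = np.asarray(y, dtype=float)
    x_obs = np.asarray(x_obs, dtype=int)
    P = np.asarray(P, dtype=float)
    n, K = y.size, P.shape[0]
    # Design row for each possible true category: intercept + dummies.
    D = np.hstack([np.ones((K, 1)), np.eye(K)[:, 1:]])
    # Starting values: OLS that ignores the misclassification.
    X0 = D[x_obs]
    beta, *_ = np.linalg.lstsq(X0, y, rcond=None)
    sigma2 = np.mean((y - X0 @ beta) ** 2)
    prev = np.full(K, 1.0 / K)
    loglik_old = -np.inf
    for _ in range(n_iter):
        # E-step: posterior weight of each true category per respondent.
        resid = y[:, None] - (D @ beta)[None, :]             # shape (n, K)
        dens = np.exp(-0.5 * resid**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        joint = P[x_obs, :] * prev[None, :] * dens
        marg = joint.sum(axis=1)
        w = joint / marg[:, None]
        loglik = np.log(marg).sum()
        # M-step: closed-form updates.
        prev = w.mean(axis=0)
        A = D.T @ (w.sum(axis=0)[:, None] * D)               # weighted X'X
        b = D.T @ (w.T @ y)                                  # weighted X'y
        beta = np.linalg.solve(A, b)
        resid = y[:, None] - (D @ beta)[None, :]
        sigma2 = np.sum(w * resid**2) / n
        if abs(loglik - loglik_old) < tol:
            break
        loglik_old = loglik
    return beta, sigma2, prev


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 20_000
    x = (rng.random(n) < 0.3).astype(int)                    # true covariate
    x_obs = np.where(rng.random(n) < 0.2, 1 - x, x)          # 20% flipped
    y = 1.0 + 2.0 * x + rng.normal(size=n)
    P = np.array([[0.8, 0.2], [0.2, 0.8]])
    print("naive OLS slope:", np.polyfit(x_obs, y, 1)[0])
    print("EM estimates:", em_rr_regression(y, x_obs, P))
```

In this toy setting the naive slope is attenuated toward zero, while the EM estimates use the known misclassification probabilities to correct for it. With several perturbed covariates the set of true-category combinations grows multiplicatively, which is where the nuisance parameter count mentioned above comes from.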