Article title

Goodness-of-fit tests for logistic regression models when data are collected using a complex sampling design

Article code	Publication year	Number of pages (English article)
24728	2007	15 pages (PDF)

Source

Publisher: Elsevier - ScienceDirect

Journal: Computational Statistics & Data Analysis, Volume 51, Issue 9, 15 May 2007, Pages 4450–4464

Keywords

Goodness-of-fit, Logistic regression, Survey sampling, Design-based estimation

Abstract

Logistic regression models are frequently used in epidemiological studies to estimate the associations that demographic, behavioral, and risk factor variables have with a dichotomous outcome, such as disease being present versus absent. After the coefficients in a logistic regression model have been estimated, the goodness-of-fit of the resulting model should be examined, particularly if the purpose of the model is to estimate probabilities of event occurrences. While various goodness-of-fit tests have been proposed, the properties of these tests have been studied under the assumption that the observations selected were independent and identically distributed. Increasingly, epidemiologists are using large-scale sample survey data, such as the National Health Interview Survey or the National Health and Nutrition Examination Survey, when fitting logistic regression models. Unfortunately, for such situations no goodness-of-fit testing procedures have been developed or implemented in available software. To address this problem, goodness-of-fit tests for logistic regression models when data are collected using complex sampling designs are proposed. Properties of the proposed tests were examined using extensive simulation studies, and results were compared to traditional goodness-of-fit tests. A Stata ado function, svylogitgof, for estimating the F-adjusted mean residual test after a svylogit fit is available at the author's website, http://www.people.vcu.edu/~kjarcher/Research/Data.htm.

Introduction

Logistic regression is frequently used in epidemiological studies to model the relationship between a categorical outcome variable and a set of predictor variables. Traditionally, logistic regression assumes that the observations represent a random sample from a population (i.e., that they are independent and identically distributed (iid)), where the model is expressed as
$$y_i = \pi(x_i) + \varepsilon_i. \tag{1}$$
In this equation, $y_i$ represents the dichotomous dependent or outcome variable; $\pi(x_i)$ represents the conditional probability of experiencing the event given the independent predictor variables $x_i$, or $\Pr(Y_i = 1 \mid x_i)$; and $\varepsilon_i$ represents the binomial random error term. More formally, the conditional probability $\pi(x_i)$ as a function of the independent covariates $x_i$ is expressed as
$$\pi(x_i) = \Pr(Y_i = 1 \mid x_i) = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}}, \tag{2}$$
where $\beta' = (\beta_0, \beta_1, \beta_2, \ldots, \beta_p)$ are the model parameters to be estimated and $p$ is the number of independent terms in the model.
Under iid-based sampling, elements are selected independently; therefore, the covariance between elements is zero. Under complex sampling, there may be a number of primary sampling units (PSUs); that is, there are $j = 1, \ldots, M$ PSUs (or "clusters") from which $m$ PSUs are sampled. Furthermore, within each sampled PSU there are $i = 1, \ldots, N_j$ units from which $n_j$ are sampled. A disadvantage generally associated with cluster sampling is that elements from the same cluster are often more homogeneous than elements from different clusters. This results in a positive covariance between elements within a cluster. Therefore, the intra-class correlation, which measures the homogeneity within clusters, is generally positive for cluster sample designs, and as a result, traditional maximum likelihood methods for estimation cannot be used.
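As a minimal sketch of equation (2), the conditional probability can be computed from a linear predictor as follows. The function name, the coefficient values, and the covariate values are hypothetical and chosen only for illustration; the intercept is assumed to be stored as the first element of beta.

```python
import math


def logistic_probability(x, beta):
    """Equation (2): pi(x) = exp(x'beta) / (1 + exp(x'beta)).

    x    -- covariate values for one observation (intercept not included)
    beta -- coefficients (beta_0, beta_1, ..., beta_p); beta[0] is the intercept
    """
    if len(beta) != len(x) + 1:
        raise ValueError("beta must have exactly one more element than x")
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    # Numerically stable evaluation of exp(eta) / (1 + exp(eta)).
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


# Hypothetical coefficients and covariates (not taken from the article).
beta = [-1.0, 0.5, 0.25]
x_i = [2.0, 0.0]
pi_i = logistic_probability(x_i, beta)  # linear predictor: -1 + 0.5*2 + 0.25*0 = 0
```

A linear predictor of zero corresponds to a probability of one half, and the probability increases monotonically with the linear predictor.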
Under complex sampling, which involves both stratification and possibly several stages of cluster sampling, pseudo-maximum likelihood is used instead (Skinner et al., 1989). The sampling weight, $w_{ji}$, calculated as the inverse of the product of the conditional inclusion probabilities at each stage of sampling, represents the number of units in the total population that the given sampled observation represents. Expanding each observation by its sampling weight produces a dataset for the $N$ units in the total population. Conceptually, pseudo-maximum likelihood estimation is like obtaining the maximum likelihood estimates for the expanded dataset; in other words, the logistic regression model is fit to the 'census' data. The model parameters $\beta$ for logistic regression models built from complex survey data are found by using pseudo-maximum likelihood. The contribution of a single observation to the pseudo-likelihood is
$$\pi(x_{ji})^{w_{ji} \times y_{ji}} \, [1 - \pi(x_{ji})]^{w_{ji} \times (1 - y_{ji})}. \tag{3}$$
The pseudo-likelihood function is still constructed as the product of the individual contributions to the likelihood, but now the product is taken over the $m$ sampled clusters and the $n_j$ observations within each cluster:
$$L_p(\beta) = \prod_{j=1}^{m} \prod_{i=1}^{n_j} \pi(x_{ji})^{w_{ji} \times y_{ji}} \, [1 - \pi(x_{ji})]^{w_{ji} \times (1 - y_{ji})}. \tag{4}$$
The pseudo-maximum likelihood estimator (PMLE) is the value of $\beta$ that maximizes the pseudo log-likelihood function
$$\ln L_p(\beta) = \sum_{j=1}^{m} \sum_{i=1}^{n_j} \left\{ w_{ji} \times y_{ji} \times \ln \pi(x_{ji}) + w_{ji} \times (1 - y_{ji}) \times \ln[1 - \pi(x_{ji})] \right\}. \tag{5}$$
The survey sampling design may induce correlation among observations, particularly when cluster samples are drawn. To appropriately estimate standard errors associated with model parameters and estimated odds ratios, it is important to account for the sampling design.
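The 'census' interpretation of the pseudo log-likelihood can be checked numerically. The sketch below is illustrative only: the function names, the tiny sample, and the integer weights are hypothetical, and the double index (j, i) is flattened into a single list of sampled units. It evaluates equation (5) with weights and compares the result with the ordinary log-likelihood of a dataset in which each observation is replicated as many times as its weight.

```python
import math


def sampling_weight(stage_inclusion_probs):
    """Inverse of the product of the conditional inclusion probabilities."""
    product = 1.0
    for prob in stage_inclusion_probs:
        product *= prob
    return 1.0 / product


def pseudo_log_likelihood(y, pi, w):
    """Equation (5): sum over sampled units of w*y*ln(pi) + w*(1-y)*ln(1-pi)."""
    return sum(
        w_k * (y_k * math.log(p_k) + (1 - y_k) * math.log(1.0 - p_k))
        for y_k, p_k, w_k in zip(y, pi, w)
    )


# Hypothetical sample: outcomes, model probabilities, and integer weights.
y = [1, 0, 1, 0]
pi = [0.8, 0.3, 0.6, 0.1]
w = [3, 2, 1, 4]

weighted = pseudo_log_likelihood(y, pi, w)

# Expand each observation by its weight and use unit weights instead.
y_expanded = [y_k for y_k, w_k in zip(y, w) for _ in range(w_k)]
pi_expanded = [p_k for p_k, w_k in zip(pi, w) for _ in range(w_k)]
expanded = pseudo_log_likelihood(y_expanded, pi_expanded, [1] * len(y_expanded))

# Two-stage example: PSU selected with probability 0.1, unit within PSU with 0.5.
w_two_stage = sampling_weight([0.1, 0.5])
```

The two log-likelihood values agree up to floating-point rounding, which is why the point estimates from the weighted fit match those from the expanded data. The standard errors do not carry over in the same way, because the expanded dataset ignores the clustering described above.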
The need to account for the sampling design in the statistical analysis of survey data has been widely reported in the literature. A brief tutorial on the importance of accounting for clustering and sampling weights, accompanied by an illustration using the National Health and Nutrition Examination Survey I data, has previously been reported (Korn and Graubard, 1991). A more comprehensive review was subsequently provided by Korn and Graubard (1995). In another example, the difference between "model-based" analyses (which assume the observations are from a random sample) and "design-based" analyses (which account for the survey design) was illustrated using the Personnes Agées Quid study, a stratified cluster sample (Lemeshow et al., 1998). It is of particular importance to model the survey design when estimating standard errors associated with model parameters or odds ratios.
Once a logistic regression model has been fit to a given set of data, the adequacy of the model is examined by overall goodness-of-fit tests and examination of influential observations. One concludes that a model fits if the differences between the observed and fitted values are small and if there is no systematic contribution of the differences to the error structure of the model. A goodness-of-fit test that is commonly used to assess the fit of logistic regression models is the Hosmer–Lemeshow test (Hosmer and Lemeshow, 1980). Other goodness-of-fit tests for logistic regression models have been proposed (Cox, 1958; Tsiatis, 1980; Brown, 1982; Azzalini et al., 1989; le Cessie and van Houwelingen, 1991, 1995; Su and Wei, 1991; Osius and Rojek, 1992; Pigeon and Heyse, 1999a, 1999b). These goodness-of-fit tests have been studied under the assumption of independent and identically distributed random variables, which we refer to as the 'iid-based' setting.
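For orientation, the iid-based Hosmer–Lemeshow statistic can be sketched as follows. This is the usual textbook formulation (observations sorted by fitted probability and split into g near-equal groups, with the statistic referred to a chi-square distribution with g − 2 degrees of freedom); the text above does not spell it out, so the grouping rule, function name, and simulated data are assumptions made for illustration. The sketch ignores sampling weights and clustering, which is exactly the limitation of the iid-based setting.

```python
import random


def hosmer_lemeshow_statistic(y, pi, groups=10):
    """Return (statistic, degrees_of_freedom) for the iid-based test.

    y      -- observed 0/1 outcomes
    pi     -- fitted probabilities, each strictly between 0 and 1
    groups -- number of risk groups (10 gives "deciles of risk")
    """
    n = len(y)
    if n != len(pi) or n < groups:
        raise ValueError("need matching lengths and at least one unit per group")
    order = sorted(range(n), key=lambda k: pi[k])
    statistic = 0.0
    for g in range(groups):
        # Near-equal group sizes based on the ranked fitted probabilities.
        members = order[g * n // groups:(g + 1) * n // groups]
        n_g = len(members)
        observed = sum(y[k] for k in members)
        expected = sum(pi[k] for k in members)
        mean_pi = expected / n_g
        statistic += (observed - expected) ** 2 / (n_g * mean_pi * (1.0 - mean_pi))
    return statistic, groups - 2


# Simulated iid data in which the probabilities are correct by construction.
rng = random.Random(0)
pi = [rng.uniform(0.05, 0.95) for _ in range(200)]
y = [1 if rng.random() < p else 0 for p in pi]
statistic, df = hosmer_lemeshow_statistic(y, pi)
```

A large value of the statistic relative to the chi-square reference distribution would indicate lack of fit under the iid assumption.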
Although appropriate estimation methods that take the sampling design into account when estimating logistic regression model parameters are available in various statistical packages, there is a corresponding absence of design-based goodness-of-fit testing procedures. Because of this absence, it has been suggested that goodness-of-fit be examined by first fitting the design-based model, then estimating the probabilities, and subsequently using iid-based tests for goodness-of-fit and applying any findings to the design-based model (Hosmer and Lemeshow, 2000). Unfortunately, the statistical properties of this method have not been examined. In this article, we study this suggested method and additionally propose alternative design-based goodness-of-fit tests for logistic regression models. Unlike ordinary goodness-of-fit tests, the proposed tests take the sampling design and weights into account.
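The first two steps of the suggested procedure (fit the design-based model, then estimate the probabilities) can be sketched for a model with an intercept and a single covariate. This is a minimal illustration under stated assumptions, not the article's implementation: the Newton-Raphson routine, its name, and the small weighted dataset are hypothetical, and only the point estimates are computed (design-based standard errors would additionally require the cluster and stratum structure).

```python
import math


def fit_weighted_logistic(x, y, w, iterations=25):
    """Pseudo-maximum likelihood fit of logit(pi) = b0 + b1*x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Weighted score vector (g0, g1) and information matrix
        # [[h00, h01], [h01, h11]] of the pseudo log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x_k, y_k, w_k in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x_k)))
            r = w_k * (y_k - p)
            v = w_k * p * (1.0 - p)
            g0 += r
            g1 += r * x_k
            h00 += v
            h01 += v * x_k
            h11 += v * x_k * x_k
        det = h00 * h11 - h01 * h01
        # Newton step beta <- beta + H^{-1} g, with the 2x2 inverse written out.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical sampled units: covariate, outcome, and sampling weight.
x = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]
w = [10, 20, 10, 5, 15, 10, 5, 20, 10, 10, 5, 15]

# Step 1: design-weighted point estimates.
b0, b1 = fit_weighted_logistic(x, y, w)
# Step 2: estimated probabilities for each sampled unit.
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * x_k))) for x_k in x]
```

In the suggested procedure, the fitted probabilities would then be passed to an iid-based goodness-of-fit test together with the unweighted outcomes, which is the step whose statistical properties had not been examined.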