Increasing power: Practical approach for goodness-of-fit test for logistic regression models with continuous predictors
|Article code||Publication year||English article||Persian translation||Word count|
|24752||2008||11-page PDF||Available to order||Not calculated|
Publisher : Elsevier - ScienceDirect
Journal : Computational Statistics & Data Analysis, Volume 52, Issue 5, 20 January 2008, Pages 2703–2713
When continuous predictors are present, classical Pearson and deviance goodness-of-fit tests to assess logistic model fit break down. The Hosmer–Lemeshow test can be used in these situations. While simple to perform and widely used, it does not have desirable power in many cases and provides no further information on the source of any detectable lack of fit. Tsiatis proposed a score statistic to test for covariate regional effects. While conceptually elegant, its lack of a general rule for how to partition the covariate space has, to a certain degree, limited its popularity. We propose a new method for goodness-of-fit testing that uses a very general partitioning strategy (clustering) in the covariate space and either a Pearson statistic or a score statistic. Properties of the proposed statistics are discussed, and a simulation study demonstrates increased power to detect model misspecification in a variety of settings. An application of these different methods to data from a clinical trial illustrates their use. A discussion of further improvements to the proposed tests and of extending this new method to other data situations, such as ordinal response regression models, is also included.
Generalized linear models, and in particular, logistic regression models, are widely used in biomedical research fields. One component of model-fitting is the identification and specification of potential covariates to be included in the linear predictor. Estimation using maximum likelihood and testing the significance of the regression coefficients using either Wald or score tests are usually key goals of the analysis (Cox and Snell, 1989). Significance testing of each coefficient provides information about the relationship between the covariate and response, relative to overall variability. Goodness-of-fit tests, on the other hand, reflect whether the predicted values are an accurate representation of the observed values. Omitted predictors, a misspecified form of the predictor, or an inappropriate link function can all result in poor predictions. If the regression of the response variable on treatment and covariates is linear or exponential, omission of important covariates only reduces the efficiency of the regression coefficient estimates, but has no effect on the consistency of the estimation. For logistic regression, this omission not only reduces the efficiency of the coefficient estimation, but also affects the consistency of the coefficient estimation. It leads to biased estimates of treatment effect, even in randomized experiments (Gail et al., 1984 and Gail et al., 1988; Hauck et al., 1991 and Robinson and Jewell, 1991). In case-control studies, the consistency of the estimators of the population odds ratio is still maintained if a correct logistic regression model is specified (Anderson, 1972 and Xie and Manski, 1988; Nagelkerke et al., 1995 and Nagelkerke et al., 2005). The widely used chi-square statistic can be used as a measure of how far observed sample data deviate from a theoretical model providing expected counts in each of G distinct covariate patterns. 
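As a rough illustration of the chi-square comparison just described, the following sketch computes a Pearson statistic over G covariate patterns. It assumes the standard textbook form for grouped binary data, in which pattern g contributes (y_g - n_g p_g)^2 / (n_g p_g (1 - p_g)); the function name and the (n_g, y_g, p_g) input layout are hypothetical and not taken from the paper.

```python
def pearson_chi_square(patterns, k):
    """Pearson chi-square over distinct covariate patterns (a sketch).

    patterns: list of (n_g, y_g, p_g) tuples, where n_g is the number of
              subjects sharing covariate pattern g, y_g the observed number
              of events, and p_g the fitted event probability (0 < p_g < 1).
    k:        number of regression parameters, not counting the intercept.
    Returns (statistic, degrees_of_freedom) with df = G - k - 1.
    """
    stat = 0.0
    for n_g, y_g, p_g in patterns:
        expected = n_g * p_g
        # Summing (O - E)^2 / E over the event and non-event cells of a
        # pattern simplifies to this single binomial term.
        stat += (y_g - expected) ** 2 / (n_g * p_g * (1.0 - p_g))
    df = len(patterns) - k - 1
    return stat, df


if __name__ == "__main__":
    # Made-up counts for three covariate patterns and one covariate.
    print(pearson_chi_square([(20, 15, 0.5), (10, 5, 0.5), (10, 2, 0.2)], k=1))
```

The chi-square reference distribution is only trustworthy when every n_g is reasonably large, which is exactly what fails once continuous covariates make most patterns unique.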
Under the assumption of no lack-of-fit and suitable regularity conditions, the test statistic is asymptotically distributed as central chi-square with $G-k-1$ degrees of freedom, where k is the number of regression parameters in the model (not counting the intercept). The deviance statistic can also be used for assessing the goodness-of-fit (Nelder and Wedderburn, 1972, Williams, 1987 and Agresti, 1990). Under the assumption of a particular form of the underlying distribution, the deviance, $2[LL_s - LL_f]$, is a measure of the difference between the log likelihood of the fitted model ($LL_f$) and the log likelihood of the saturated model ($LL_s$). Under suitable regularity conditions and a properly specified model, the deviance statistic has approximately a chi-square distribution with $G-k-1$ degrees of freedom, where G is the number of distinct covariate patterns. These two statistics fall within a family known as power divergence statistics (Cressie and Read, 1984 and Read and Cressie, 1988). Although the deviance and Pearson chi-square statistics are routinely provided in most statistical packages, their chi-square limiting null distribution is only valid when the number of observations in each covariate pattern is large. However, this condition is often unrealistic when a large number of categorical covariates or continuous covariates are present in the model. The Hosmer–Lemeshow statistic (Hosmer and Lemeshow, 1980 and Hosmer and Lemeshow, 1989) is a practical goodness-of-fit chi-square test for general logistic regression situations, including those with continuous predictors. To implement this test, the predicted probabilities are grouped into G bins according to either data-driven percentiles of the estimated probabilities or prespecified fixed cutpoints.
The test statistic is calculated by comparing the observed frequency ($O_g$) to the average predicted frequency ($E_g$) in the cell g, $g = 1, 2, \ldots, 2G$, via the familiar form of the statistic $X^2_{HL} = \sum_{g=1}^{2G} (O_g - E_g)^2 / E_g$. Simulation studies indicate that, under the null hypothesis of no model lack-of-fit, the Hosmer–Lemeshow statistic can be approximated by a chi-square distribution with $G-2$ degrees of freedom. The Hosmer–Lemeshow statistic is widely used due to its following properties: (1) it is intuitively appealing and easy to compute; (2) it has sound support from simulation studies; and (3) it is widely available in computer packages. In addition to these properties, lack of a better approach also contributes to its popularity. However, it has the following deficiencies (Hosmer et al., 1997, Pigeon and Heyse, 1999, Hosmer and Hjort, 2002 and Kuss, 2002): (1) its limiting distribution has not been rigorously derived; (2) it is a conservative test and has low power to detect specific types of lack of fit (such as nonlinearity in an explanatory variable); (3) it is highly dependent on how the observations are grouped; (4) if too few groups are used to calculate the statistic (for instance, five or fewer groups), it will almost always indicate that the model fits the data; and (5) when the Hosmer–Lemeshow statistic indicates a lack of fit, it may be difficult to identify what types of subjects are not modeled well. The Tsiatis goodness-of-fit statistic (Tsiatis, 1980) uses a different approach. Instead of grouping observations by their predicted outcomes, he partitions the multidimensional space of covariates into m distinct regions. An additive region effect for each region is added to the model to measure regional lack-of-fit. A score statistic is used to test that all of the m regional effects are zero.
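Returning to the Hosmer–Lemeshow computation described above, a minimal sketch of the percentile-based version might look as follows. It assumes near-equal-sized bins formed by sorting on the fitted probability, fitted probabilities strictly between 0 and 1, and a hypothetical function name; statistical packages may handle ties and bin boundaries differently.

```python
def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow statistic with percentile-style grouping (a sketch).

    y: list of 0/1 outcomes; p: fitted probabilities, each in (0, 1).
    Returns (statistic, degrees_of_freedom) with df = groups - 2.
    """
    order = sorted(range(len(y)), key=lambda i: p[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        # Consecutive slices of the sorted subjects give near-equal bins.
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue  # only happens when there are fewer subjects than bins
        n_g = len(idx)
        obs_events = sum(y[i] for i in idx)
        exp_events = sum(p[i] for i in idx)
        # Each bin contributes two cells (events and non-events), which is
        # why the sum in the text runs over 2G cells.
        for obs, exp in ((obs_events, exp_events),
                         (n_g - obs_events, n_g - exp_events)):
            stat += (obs - exp) ** 2 / exp
    return stat, groups - 2


if __name__ == "__main__":
    outcomes = [0, 1] * 10
    fitted = [0.5] * 20
    print(hosmer_lemeshow(outcomes, fitted, groups=4))
```

Re-running the function with a different value of `groups` is a quick way to see deficiency (3) in practice: the statistic can change noticeably with the grouping alone.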
Tsiatis's procedure is as follows: (1) the space of the covariates $(x_1, x_2, \ldots, x_k)'$ is partitioned into G distinct regions in k-dimensional space, denoted by $R_1, R_2, \ldots, R_G$. The indicator functions $I^{(j)}$ $(j = 1, 2, \ldots, G)$ are defined by $I^{(j)} = 1$ if $(x_1, x_2, \ldots, x_k)' \in R_j$ and $I^{(j)} = 0$ otherwise; (2) the model considered is $\ln[\pi_i/(1-\pi_i)] = \beta' X_i + \gamma' I_i$, where $\beta' = (\beta_0, \beta_1, \beta_2, \ldots, \beta_k)$, $X_i' = (1, x_{1i}, x_{2i}, \ldots, x_{ki})$, $I_i' = (I_i^{(1)}, I_i^{(2)}, \ldots, I_i^{(G)})$, and $\gamma' = (\gamma_1, \gamma_2, \ldots, \gamma_G)$. Note that $\beta' X_i$ models all the original covariates and $\gamma' I_i$ models the regional shifts; (3) a score statistic is then constructed to test that $\gamma_1 = \gamma_2 = \cdots = \gamma_G = 0$. More details on the score statistic are provided in Section 2.2. Tsiatis's approach is conceptually elegant, but it lacks a general rule for how to partition the covariate space, especially when continuous covariates are present. How to choose the number of distinct regions m has also remained largely unstudied. Pulkstenis and Robinson (2002) presented a goodness-of-fit method which draws on the notable strengths of both the Hosmer–Lemeshow approach and the Tsiatis approach. Their method provides guidance on the choice of G and on how to partition the covariate space using two levels of grouping. At level one, G covariate patterns are determined by all distinct combinations of just the categorical explanatory variables in the model. Then at level two, within each unique covariate pattern, all the observations are sorted by model-based fitted probabilities and are split into two subgroups based on the median of fitted probabilities, which incorporates information from the continuous predictors. When the median fitted probability takes on an actual fitted value, Pulkstenis and Robinson suggested the convention of arbitrarily placing it in the first group of lower probabilities.
Since the data include continuous covariates, ties among fitted probabilities should be relatively rare. Based on this grouping strategy, Pulkstenis and Robinson proposed the following test statistics:
$$\chi^{*2} = \sum_{g=1}^{G} \sum_{h=1}^{2} \sum_{j=1}^{2} \frac{(O_{ghj} - E_{ghj})^2}{E_{ghj}},$$
$$D^{*2} = \sum_{g=1}^{G} \sum_{h=1}^{2} \sum_{j=1}^{2} O_{ghj} \log \frac{O_{ghj}}{E_{ghj}},$$
where g indexes covariate patterns of the categorical predictors, h indexes the sub-stratification due to the median-split of the fitted probabilities, and j indexes two response categories. $O_{ghj}$ is the observed count for response j in sub-stratification h of covariate pattern g of the categorical predictors and $E_{ghj}$ is the model expected count for response j in sub-stratification h of covariate pattern g of the categorical predictors. Through their simulation studies, Pulkstenis and Robinson have indicated that, under the null hypothesis, these two test statistics have an approximate chi-square distribution with $2G-k-2$ degrees of freedom, where k is the number of variables in the model needed to represent all covariates, not including the intercept. Pulkstenis and Robinson's approach allows assessment of goodness-of-fit for models with both continuous and categorical covariates, and also incorporates the full design structure of the categorical predictors into the process. They have shown that this approach is more powerful than the Hosmer–Lemeshow test in situations when an interaction term is omitted from the model. The increase in power for their tests is due to the fact that the structure of individual covariate patterns is kept intact rather than collapsed. The other positive aspect of Pulkstenis and Robinson's modified Pearson chi-square and deviance statistics is that one can better determine what area of the covariate space is not fitting well when the tests indicate a significant lack-of-fit.
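The two-level grouping of Pulkstenis and Robinson and the Pearson-type statistic built on it can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the function name and input layout are hypothetical, fitted probabilities are assumed to lie strictly between 0 and 1, and ties at the median are resolved simply by sort order.

```python
from collections import defaultdict


def pulkstenis_robinson_pearson(cat_pattern, y, p, k):
    """Pearson-type statistic on the two-level grouping (a sketch).

    cat_pattern: hashable label per subject giving its combination of
                 categorical covariate values (level-one grouping).
    y: 0/1 outcomes; p: fitted probabilities in (0, 1).
    k: number of model variables, not including the intercept.
    Returns (statistic, degrees_of_freedom) with df = 2G - k - 2.
    """
    by_pattern = defaultdict(list)
    for i, pattern in enumerate(cat_pattern):
        by_pattern[pattern].append(i)

    stat = 0.0
    for idx in by_pattern.values():
        idx.sort(key=lambda i: p[i])
        # Level two: median split on fitted probability. With an odd count
        # the median observation goes to the lower group, as in the text.
        half = (len(idx) + 1) // 2
        for sub in (idx[:half], idx[half:]):
            if not sub:
                continue  # a pattern with a single subject has no upper half
            obs_events = sum(y[i] for i in sub)
            exp_events = sum(p[i] for i in sub)
            for obs, exp in ((obs_events, exp_events),
                             (len(sub) - obs_events, len(sub) - exp_events)):
                stat += (obs - exp) ** 2 / exp
    df = 2 * len(by_pattern) - k - 2
    return stat, df


if __name__ == "__main__":
    patterns = ["a"] * 4 + ["b"] * 4 + ["c"] * 4
    outcomes = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0]
    fitted = [0.5] * 12
    print(pulkstenis_robinson_pearson(patterns, outcomes, fitted, k=3))
```

Keeping the per-pattern, per-half contributions instead of only their sum would show which part of the covariate space drives a significant result.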
Pulkstenis and Robinson's modified Pearson chi-square and deviance statistics are constructed to deal only with situations in which both categorical and continuous covariates are present and the number of cross-classifications of categorical covariates is not too large. It should also be noted that the null distribution of their statistics is not known exactly and their results are based on simulations. Several other studies have provided insights into different aspects of logistic regression model checking. Motivated by Copas's (1983) study, le Cessie and van Houwelingen (1991) proposed a goodness-of-fit test statistic based on smoothing methods. Osius and Rojek (1992) constructed a limiting test statistic by calculating the first two moments of the power-divergence statistic. Other proposals include McCullagh's conditional approach (McCullagh, 1985 and McCullagh, 1986), Farrington's method (Farrington, 1996), the information matrix test (White, 1982, Orme, 1988 and Zheng, 2001), and Copas's (1989) residual sum of squares test. Extensions of goodness-of-fit tests to correlated data situations have also been proposed (Barnhart and Williamson, 1998 and Horton et al., 1999). In Section 2, a general method for assessing goodness-of-fit in logistic regression models, applicable to a wide range of continuous and categorical covariate configurations, is presented. The focus, however, is on the situation where there is either a mix of continuous and discrete covariates, or only continuous covariates. Theoretical considerations of the proposed statistics are discussed. A simulation study comparing the proposed tests with existing tests is presented in Section 3. An illustration of the proposed tests in a clinical trial study is provided in Section 4, and concluding remarks are in Section 5.
Conclusion
We proposed using cluster analysis on the covariates to partition observations and using either a Pearson or a score statistic to test goodness-of-fit of logistic regression models with continuous covariates. Simulation studies have shown favorable properties compared with the currently widely used Hosmer–Lemeshow test. The proposed tests hold closely to the nominal significance level and have demonstrated consistently higher power in all the simulated model scenarios. The two proposed tests also outperform the Pulkstenis–Robinson tests, except in situations where both categorical and continuous covariates are present and an interaction term is omitted. Application of all five tests to the motivating example study indicates that the model fit decisions based on the two proposed tests are in agreement with current clinical knowledge and that the Hosmer–Lemeshow test and the Pulkstenis–Robinson tests do not show ample power. Based on these studies, we recommend that the two proposed tests be used for testing logistic regression model fit when continuous covariates are present. When the fitted model has both continuous and categorical covariates, the Pulkstenis–Robinson test is also recommended. Another advantage of the two proposed tests over the Hosmer–Lemeshow test is that, when a lack of fit is indicated, the Hosmer–Lemeshow test does not provide information about where the model does not fit, while the two proposed tests automatically identify what types of subjects are not modeled well. Improvement and future work on the two proposed tests include the following. First, when both continuous and categorical covariates are present, instead of treating dichotomous and ordinal covariates as interval data (when applying cluster analysis), another option is to combine our proposal and Pulkstenis and Robinson's method.
In step one, all the observations are grouped into a small number of mutually exclusive bins by cluster analysis based on the continuous covariates only; in step two, each of these bins is further partitioned by covariate patterns determined by only the categorical covariates. This may further improve the power of the two proposed tests. Second, the choice of the number of partitioned groups G and of the clustering method should be evaluated further. In this study, we used an ad hoc rule to determine G and applied only Ward's method in clustering observations. Further study may result in a better or optimal rule for determining G and for choosing the clustering method. Third, a better approximation for the Pearson chi-square statistic is needed. Although our recommendation for determining the degrees of freedom for the approximated Pearson chi-square statistic works well in our study, it appears to be somewhat conservative. More studies are warranted to find an optimal approximation for the Pearson chi-square statistic. Fourth, simulation studies evaluating the performance of the two proposed tests on ordinal data and on correlated data will be carried out. Fifth, extensions of the two proposed tests to multinomial data (generalized logit model) are reasonably straightforward, but the issue of determining the asymptotic distribution of the test statistics remains.
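The two-step grouping suggested above (cluster on the continuous covariates, then split each cluster by categorical pattern) could be sketched as follows. This is only an illustration: the naive Ward-style agglomeration, the function names, and the choice of the number of clusters are assumptions made here, and neither the ad hoc rule for G nor the exact form of the proposed test statistics is reproduced. The sketch stops at the observed and expected event counts per bin.

```python
def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering with Ward's merge criterion.

    points: list of equal-length numeric tuples (continuous covariates).
    Returns one integer cluster label per point.
    """
    clusters = [{"members": [i], "centroid": list(pt)}
                for i, pt in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = clusters[a], clusters[b]
                na, nb = len(ca["members"]), len(cb["members"])
                d2 = sum((u - v) ** 2
                         for u, v in zip(ca["centroid"], cb["centroid"]))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * d2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        cb = clusters.pop(b)  # b > a, so index a is still valid
        ca = clusters[a]
        na, nb = len(ca["members"]), len(cb["members"])
        ca["centroid"] = [(na * u + nb * v) / (na + nb)
                          for u, v in zip(ca["centroid"], cb["centroid"])]
        ca["members"] += cb["members"]
    labels = [0] * len(points)
    for g, cluster in enumerate(clusters):
        for i in cluster["members"]:
            labels[i] = g
    return labels


def two_step_bins(continuous, categorical, n_clusters):
    """Step one: cluster on continuous covariates; step two: split each
    cluster by the categorical covariate pattern."""
    labels = ward_clusters(continuous, n_clusters)
    return list(zip(labels, categorical))


def observed_expected(bins, y, p):
    """Per-bin (observed events, expected events, bin size)."""
    table = {}
    for b, yi, pi in zip(bins, y, p):
        obs, exp, n = table.get(b, (0, 0.0, 0))
        table[b] = (obs + yi, exp + pi, n + 1)
    return table


if __name__ == "__main__":
    x = [(0.0,), (0.1,), (5.0,), (5.2,)]
    sex = ["m", "f", "m", "m"]
    bins = two_step_bins(x, sex, n_clusters=2)
    print(observed_expected(bins, [1, 0, 1, 0], [0.5] * 4))
```

Bins with a large gap between observed and expected counts point to the types of subjects that are not modeled well. In practice the covariates would normally be standardized before clustering, and the quadratic-time pair search here is only suitable for small samples.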