Analysis of healthcare coverage: A data mining approach
Article code | Publication year | Length of English article |
---|---|---|
21430 | 2009 | 9 pages (PDF) |
Publisher: Elsevier - Science Direct
Journal: Expert Systems with Applications, Volume 36, Issue 2, Part 1, March 2009, Pages 995–1003
Abstract
The existing disparity in healthcare coverage is a pressing issue in the United States. Unfortunately, many in the US do not have healthcare coverage, and much research is needed to identify the factors leading to this phenomenon. Hence, this study aims to examine the healthcare coverage of individuals by applying popular machine learning techniques to a wide variety of predictive factors. Twenty-three variables and 193,373 records from the 2004 Behavioral Risk Factor Surveillance System survey data were used for this study. Artificial neural network and decision tree models were developed and compared with each other for predictive ability. Sensitivity analysis and variable importance measures were calculated to analyze the importance of the predictive factors. The experimental results indicated that the most accurate classifier for this phenomenon was the multi-layer perceptron type artificial neural network model, which had an overall classification accuracy of 78.45% on the holdout sample. The most important predictive factors were income, employment status, education, and marital status. Using two popular machine learning techniques, this study identified the factors that can be used to accurately classify those with and without healthcare coverage. The ability to identify those likely to be without healthcare coverage, and to explain why, through accurate classification models can potentially be used to reduce the disparity in healthcare coverage.
Introduction
Healthcare coverage in general, and the existing disparity in this coverage in particular, is a pressing issue in the United States. Many in the US do not have healthcare coverage; much research has been conducted, and much more is needed, to identify the factors leading to this disparity in coverage. Previous work has identified two key situations where understanding of these factors is beneficial (Cunningham & Ginsburg, 2001). First, given that these factors exist at both the state and local level, it is imperative that the individuals responsible for funding decisions correctly interpret the reasons that the uninsured rates may be elevated. Second, it is important that these individuals are able to determine whether the uninsured rates are elevated simply because of unchangeable characteristics within the population. While existing research has identified factors that differ between those with and without coverage, it has not progressed to building discriminatory models to separate the two groups. Further, despite identification of differences on some factors, the disparity has not been reduced (Glover, Moore, Probst, & Samuels, 2004). This study advances the current research by building classification models to identify those belonging to each group: those that do and those that do not have healthcare coverage. Such a model may eventually be used to help reduce healthcare coverage disparity. Similar techniques have previously been introduced in a more general manner for use in targeting customers in the insurance industry (Wu, Kao, Su, & Wu, 2005). Identifying those without healthcare coverage is important, as those without coverage may have reduced access to medical care (Monheit & Vistnes, 2000) and may have more preventable hospitalizations (US Department of Health and Human Services, 2003). Lack of healthcare coverage has also been linked to poor health and early death.
This issue has been exacerbated by the clear increasing trend of the number of uninsured in the last 20 years (Cunningham and Ginsburg, 2001 and Herring, 2005). From 1989 to 1999, there was an 18.4% increase in the number of people without insurance. The level of uninsured has grown to approximately 16% of the population as of 2003 (Jonk et al., 2005). It has been estimated that 60 million people are uninsured at some point during a given year (Cunningham & Ginsburg, 2001). The factors leading to the increasing number of uninsured have been studied at both the state and local level (Cunningham & Ginsburg, 2001), and across many socio-demographic factors (Hendryx et al., 2002 and Holtz-Eakin, 2002). Lifestyle factors may also contribute to this problem. Gender is one socio-demographic variable that has been found to be related to healthcare coverage (Carrasquillo et al., 2000, Monheit and Vistnes, 2000 and Shi, 2000). Many studies have found that females are less likely to be insured than males (Hendryx et al., 2002, Holtz-Eakin, 2002 and Monheit and Vistnes, 2000), though the opposite has also been found (Carrasquillo et al., 1999 and Nelson et al., 2004). Race or ethnicity may also be a factor contributing to healthcare coverage disparity (Carrasquillo et al., 2000, Lucas et al., 2003 and Monheit and Vistnes, 2000), with minorities generally having less healthcare coverage (Carrasquillo et al., 1999, Glover et al., 2004 and Monheit and Vistnes, 2000). According to the results of multiple studies, those with lower incomes are less likely to have healthcare coverage (Cardon and Hendel, 2001, Carrasquillo et al., 1999 and Lucas et al., 2003), and having healthcare coverage also has a relationship with employment status (Schmidt & Deichert, 1996) and type of employment (Krieger, Barbeau, & Soobader, 2005). 
There are also differences in the rate of healthcare coverage among states, regions of the country, and even counties (Cardon and Hendel, 2001, Carrasquillo et al., 1999, Nelson et al., 2004 and Schmidt and Deichert, 1996). For example, in the Northeast and Midwest, one is more likely to be insured, while in the South or the West, one is more likely to be uninsured (Cardon & Hendel, 2001). Some studies suggest that younger adults have less insurance (Cardon and Hendel, 2001 and Carrasquillo et al., 1999). Also, studies (Jonk et al., 2005 and Woolhandler et al., 2005) that looked at differences between veterans and non-veterans found that fewer veterans lacked insurance relative to the remaining population. Education and marital status have also been found to have a relationship with this insurance disparity (Shi, 2000). Disabled Americans may lack coverage (Landerman et al., 1998). For some populations studied, the role of household size in this disparity has also been examined (Glover et al., 2004 and Pol et al., 2002). In addition to the socio-demographic factors described above, lifestyle may also play a part in the existing disparity. Those that are already in poor health may have reduced coverage (Cunningham and Ginsburg, 2001 and Glover et al., 2004). These studies did not specify whether poor health included mental and/or physical health. Smoking status has been studied in relation to type of insurance coverage held (King & Mossialos, 2005). Exercise and alcohol consumption are additional lifestyle variables that have previously been used in classification models for insurance policy purposes (Chae, Ho, Cho, Lee, & Ji, 2001). Further, the extent to which someone is a risk taker may affect whether they secure healthcare coverage (Cunningham & Ginsburg, 2001).
Past studies have often looked at a subgroup of the overall US population, such as the near-elderly (Monheit, Vistnes, & Eisenberg, 2001) or immigrants (Herring, 2005), and have succeeded in identifying variables that seem to contribute to the disparity in healthcare coverage for that subgroup. This study will look at healthcare coverage disparity across the population of the US, rather than within a smaller subgroup of the population. It will also address the numerous possible contributing factors, both socio-demographic and lifestyle, and their contribution to the growing disparity in healthcare coverage. Further, the study will utilize machine learning techniques in building classification models. Previously, the issue of healthcare coverage disparity has been studied using primarily statistical techniques such as logistic regression (Glover et al., 2004) and basic descriptive statistics (Shi, 2000 and Woolhandler et al., 2005). For many years linear regression has been the primary technique used to capture and represent functional relationships between dependent and independent variables, largely because of its well-known, statistically explainable optimization strategies. However, in many problem scenarios, model accuracy suffers because the assumed linear approximation of a function is not valid. With current technology, machine learning techniques can readily model scenarios such as healthcare coverage, as addressed in this paper. These techniques are not constrained by the Gauss–Markov assumptions (such as those concerning multicollinearity and normality), which are a major concern for more traditional models (Uysal & Roubi, 1999). Previously, these techniques have been used to study other healthcare issues such as factors affecting inpatient mortality (Chae, Kim, Tark, Park, & Hoa, 2003) and influencing prenatal care (Prather et al., 1997). In this study, we have attempted to build an accurate classification model using machine learning techniques. The model could then be used to predict whether or not an individual has healthcare coverage based on specific socio-demographic and lifestyle information, and to assess the importance of the various factors in the model.
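The idea of ranking predictive factors by their contribution to a classifier can be illustrated with a small sketch. It uses permutation importance (the drop in holdout accuracy when one input column is shuffled), which is one common form of sensitivity measure and is not necessarily the procedure the authors used. The toy rule-based model, the feature names, and the synthetic data are all hypothetical and are not drawn from the survey data.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(rows)

def permutation_importance(model, rows, labels, n_features, seed=0):
    """Return baseline accuracy and, per feature, the accuracy drop
    observed when that feature's column is shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        shuffled = [row[:j] + (v,) + row[j + 1:] for row, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return baseline, drops

# Hypothetical stand-in for a trained classifier: predicts coverage (1)
# from an income band and an employment flag, ignoring the third input.
def toy_model(row):
    income_band, is_employed, _noise = row
    return 1 if income_band >= 2 or is_employed == 1 else 0

# Synthetic holdout sample; labels are generated by the same rule,
# so the baseline accuracy is perfect by construction.
data_rng = random.Random(42)
holdout_rows = [
    (data_rng.randint(0, 4), data_rng.randint(0, 1), data_rng.randint(0, 9))
    for _ in range(500)
]
holdout_labels = [toy_model(row) for row in holdout_rows]

names = ["income_band", "is_employed", "noise"]
baseline, drops = permutation_importance(toy_model, holdout_rows, holdout_labels, 3)
ranking = sorted(zip(names, drops), key=lambda pair: pair[1], reverse=True)

for name, drop in ranking:
    print(f"{name}: accuracy drop {drop:.3f}")
```

A feature the model never uses, such as `noise` here, shows no accuracy drop, while features the model relies on show a positive drop. A ranking of this kind is what allows variables to be compared against one another rather than simply kept or discarded.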
Conclusion
If programs designed to address healthcare coverage disparity are to be effective, those without coverage must be accurately identified. By utilizing factors that are captured, quantified and represented in the data set, this study built classification models that identify these individuals with about 78.86% accuracy, greatly exceeding the percentage that would be expected by chance. Based on these socio-demographic and lifestyle variables, the model identifies those with and without coverage, enabling providers to more efficiently target services to those without coverage, or the government to design additional services more effectively for those in need. This study takes into consideration a wide range of factors that can impact healthcare coverage disparity, drawing primarily from the existing literature. Consistent with previous results, the classification model constructed in this study also found income, employment, education, marital status and location to be among the most significant variables. Even so, the study contributes by considering all the variables found significant in past studies in the context of a larger set of variables. It thereby illustrates that some variables retain the highest discriminating power even when many other variables are taken into consideration. An important strength of this study is the non-linear nature of the machine learning techniques employed, which helps overcome issues related to correlation between income, employment and education. Hence the finding of all three variables as important is no longer problematic.
Among the variables included in this study, lifestyle variables have not received as much attention in the past. Although smoking status and alcohol consumption have been used in classification models before (Chae et al., 2001 and King and Mossialos, 2005), they have not been among the important variables, sometimes even being non-significant in the multivariate models built. However, two of these variables, binge drinking and smoking status, were found to be within the ten most important variables in this study, suggesting a need to further examine this type of variable and its relationship to healthcare coverage, even though not all lifestyle variables were found to be important. It is important to understand that even though certain variables may not be important when the contribution of other variables like income, education and employment is considered, they may be significant in understanding the disparity issue in greater detail. A further strength of this study is that it gives a ranking of the variables in terms of their importance. Thus, instead of leaving out variables that models may find unimportant, we can now look at the relative importance of individual variables in comparison with the other variables that are included.
Unlike past studies, which have found veteran and disabled status to impact healthcare coverage, our study found these variables to be relatively unimportant in the current model. This could imply that veterans and the disabled have received improved coverage recently, or that in the context of a larger model, these variables are less important than others in determining coverage availability.
It is known that lack of healthcare coverage can have dire consequences. It may lead to poor health, preventable hospitalizations, and even premature death. The problem is becoming more serious, as the number of people in the US without insurance is increasing. The use of machine learning techniques is an additional insightful way to examine this problem that may bring us closer to understanding and addressing the issue of healthcare coverage disparity. The findings from this study can be used as a basis to understand the sections of society that primarily fall in the uninsured category, leading to solutions or alternatives that can help reduce this divide.
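The comparison between a classifier's accuracy and the accuracy expected by chance, mentioned in the conclusion, can be made concrete with a short sketch. The text does not say which chance criterion was used, so the two baselines below (always predicting the majority class, and the proportional chance criterion) are assumptions, and the labels and predictions are hypothetical toy values rather than results from the study.

```python
def chance_baselines(labels):
    """Return two common chance-level accuracies for binary labels (0/1):
    the majority-class rate and the proportional chance criterion."""
    p = sum(labels) / len(labels)          # share of the positive class
    majority = max(p, 1 - p)               # always guess the larger class
    proportional = p ** 2 + (1 - p) ** 2   # guess in proportion to class shares
    return majority, proportional

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

# Hypothetical holdout labels (1 = has coverage) and model predictions.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

majority, proportional = chance_baselines(y_true)
acc = accuracy(y_true, y_pred)

print(f"model accuracy:        {acc:.2f}")
print(f"majority-class chance: {majority:.2f}")
print(f"proportional chance:   {proportional:.2f}")
```

Reporting the baseline alongside the accuracy matters most when the classes are unbalanced, because a model that always predicts the larger class can then score well without learning anything.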