ISI article No. 21436 (English): download

Title
A new data mining approach to estimate causal effects of policy interventions
Article code: 21436
Publication year: 2010
Length of the English article: 11 pages (PDF)
Source

Publisher: Elsevier - Science Direct

Journal: Expert Systems with Applications, Volume 37, Issue 1, January 2010, Pages 171–181

Keywords
Selection bias, Program evaluation, Data mining, Conditional space, Matrix decomposition

Abstract

This paper presents a data-driven approach that yields a measure of comparability between groups in observational data. The main idea is to use the general framework of conditional multiple correspondence analysis as a tool for investigating the dependence relationship between a set of observable categorical covariates X and an assignment-to-treatment indicator variable T, in order to obtain a global measure of comparability between groups according to their dependence structure. We then propose a strategy for finding treatment groups that are directly comparable with respect to pre-treatment characteristics, on which local causal effects can be estimated.

Introduction

The availability of information on processes aimed at monitoring the operational activities of bodies, institutions, and private and public companies has increased over the last decades. This phenomenon has led to the proliferation of semi-automatic control processes, which rely largely on advances in information technology and on the development of statistical techniques peculiar to modern data mining. In many different fields, new demands arise for an evaluation of the impacts that large-scale actions and policies have on the various stakeholders, users or managers, involved in the production of relevant goods or services. This concerns not only the evaluation of marketing campaigns but also the assessment of the impact of social or economic policies on individual citizens or businesses. The modern dataflow within organizations has turned monitoring into a step of the ordinary production process. Verifying the validity of the actions that are developed and implemented becomes one of the tools available to private or public decision-makers.

The evaluation process is, by definition, a process that takes place after a large-scale action and provides a retrospective view of the events. It is clearly possible, but only in a pre-test setting, to validate a certain action or campaign ex ante, though this validation cannot be made on large samples. For this reason, and precisely because the setting is that of the social sciences, the pre-test is usually part of qualitative research. The logical reference framework is the one well known in the literature as quasi-experiments or observational studies. Data analysis thus falls within the broader frame of causal inference, which is used to estimate causal effects. The approach closest to the modern treatment of information by means of information technology and statistics, i.e. data mining, is the one mainly developed by Rubin and known as the Potential Outcome Framework (Rosenbaum and Rubin, 1983; Rubin, 1991, 2004, 2005, 2006). In this paper we will not dwell on the cultural proximity between data mining and the potential outcome framework, but it will clearly emerge that the similarity of purposes generates major synergies in operational applications.

One of the operational aims of modern data mining, as a statistical semi-industrial process linked to private and public production organizations, is to minimize the researcher's subjective choices in data analysis strategies. For example, the recent literature on data mining gives ample room to the idea that processes such as variable selection, model selection and the measurement of model stability have to be handled through IT implementations of strategies that are, to the largest possible extent, semi-automatic. Furthermore, modern data mining is increasingly faced with the need to estimate, or allocate, the result of more general predictive models for each individual (customers, citizens, points of sale, patients, etc.). As a consequence, these models have to be selected among those that best treat data at the individual level and that can provide business and management intelligence systems with results for the individual units that make up the reference databases. The latter concept also applies to the evaluation framework, where the effect of an action (a policy or a campaign) should be measured for each treated individual, and the estimated effect then written back to the reference database at the level of the individual record.

Conclusion

The conditional space obtained allows us to measure the influence of conditioning on the data. In fact, this space represents the variability (inertia) common to treated and untreated units that is not due to the selection mechanism (within-groups inertia). If the within-groups inertia equals the total inertia, so that the between-groups inertia is near zero, then an unbiased comparison between treated and untreated units can be made in the unconditional space, because the between-groups inertia is a correct, global and model-free measure of the influence of conditioning. This follows from some important properties that hold in the conditional space: (1) The conditional space, whose coordinates are free of any dependence on the external categorical covariate T, provides an indicator of the existing covariate imbalance between the groups to be compared. (2) If a variable X of the X matrix is independent of the assignment-to-treatment indicator variable T, then the contribution of X to the new conditional space is the usual one (according to the eigenvalue and eigenvector decomposition of the matrix that generates a new multidimensional space). (3) If a variable X is perfectly dependent on T, then X contributes nothing to the construction of the new multidimensional conditional space. This implies that the influence of the assignment mechanism, as well as the influence of all variables associated with it, has been eliminated.
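
The split of total inertia into a within-groups and a between-groups part can be illustrated with a short sketch. This is not the authors' implementation: it assumes integer-coded categorical covariates, uses the row profiles of the indicator matrix with the usual chi-square metric, and the function name `inertia_decomposition` is hypothetical.

```python
import numpy as np

def inertia_decomposition(X, t):
    """Return (total, within, between) inertia for categorical data X given groups t.

    X: (n, Q) array of category codes, one column per covariate (assumed layout).
    t: (n,) array of group labels, e.g. the treatment indicator T.
    """
    X = np.asarray(X)
    t = np.asarray(t)
    n, Q = X.shape

    # Indicator (one-hot) matrix: one block of columns per covariate,
    # one column per observed category.
    blocks = []
    for q in range(Q):
        cats = np.unique(X[:, q])
        blocks.append((X[:, q][:, None] == cats[None, :]).astype(float))
    Z = np.hstack(blocks)

    profiles = Z / Q                     # row profiles, each row sums to 1
    col_mass = profiles.mean(axis=0)     # overall centroid (column masses)

    # Replace each row by the centroid of its own group.
    centroids = np.empty_like(profiles)
    for g in np.unique(t):
        mask = t == g
        centroids[mask] = profiles[mask].mean(axis=0)

    def chi2_inertia(D):
        # Mean squared chi-square distance, every row weighted 1/n.
        return float(np.mean(np.sum(D ** 2 / col_mass, axis=1)))

    total = chi2_inertia(profiles - col_mass)
    within = chi2_inertia(profiles - centroids)
    between = chi2_inertia(centroids - col_mass)
    return total, within, between

# Toy data: the first covariate coincides with T, the second is independent of T.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t_demo = np.array([0, 0, 1, 1])
total, within, between = inertia_decomposition(X_demo, t_demo)
# total = 1.0, within = 0.5, between = 0.5
```

In this toy case the covariate that coincides with T contributes nothing to the within-groups inertia, while the independent covariate contributes nothing to the between-groups inertia, in line with properties (2) and (3).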
The data mining approach to causal inference could be innovative and promising in several respects: like the propensity score, it helps design observational studies in a way analogous to the way randomized experiments are designed, without seeing any answers involving outcome variables; it uses only the available data; it makes no assumptions about unobservable covariates that could generate selection bias, assuming instead that the available pre-treatment covariates carry enough information (in this sense, we might expect the information matrix X to include all variables, both continuous and categorical, that are causally prior to the treatment assignment T and that affect the outcome Y conditional on T); it avoids many problems related to the multi-dimensionality of a data matrix, especially if the aim is to implement a matching strategy; and it is a method that potentially works for treatments with more than two levels. However, the key result is that, once it has been established whether the amount of conditioning by T is important, units are similar with respect to a distance measure, and not with respect to an estimated value of the propensity score. Thus, similarity does not depend on which model specification the researchers have chosen. The idea of using the between-groups inertia as a measure of comparability between groups represents an initial stage in learning more from data. Future work in this area might concern the use of a statistical test to determine whether the between-groups inertia (Ib) is statistically significant. Future work might also concern other classification methods, since cluster analysis is sensitive to the nature of the data, the method and the dissimilarity measure adopted. Finally, future work might explore the analytic properties of the conditional space in order to understand whether its coordinates could be used to compute the missing counterfactual at the individual (micro) level.
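
To make the idea of similarity by distance rather than by estimated propensity score concrete, here is a minimal sketch. It swaps in plain nearest-neighbour matching with replacement in place of the cluster analysis mentioned in the text, and it assumes that unit coordinates in some space (for example the conditional space) are already available; the function name `match_by_distance` and the input layout are hypothetical.

```python
import numpy as np

def match_by_distance(coords, t):
    """Pair each treated unit with its nearest untreated unit.

    coords: (n, k) array of unit coordinates (assumed to be given).
    t: (n,) array with 1 for treated units and 0 for untreated units.
    Returns (treated_idx, matched_control_idx, distances).
    """
    coords = np.asarray(coords, dtype=float)
    t = np.asarray(t)
    treated = np.flatnonzero(t == 1)
    controls = np.flatnonzero(t == 0)

    # Pairwise squared Euclidean distances, shape (n_treated, n_controls).
    diff = coords[treated][:, None, :] - coords[controls][None, :, :]
    d2 = np.sum(diff ** 2, axis=2)

    # Matching with replacement: a control may serve several treated units.
    nearest = controls[np.argmin(d2, axis=1)]
    return treated, nearest, np.sqrt(d2.min(axis=1))

# Toy coordinates: units 0 and 1 are treated, units 2 and 3 are untreated.
coords_demo = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [0.9, 1.2]])
t_demo = np.array([1, 1, 0, 0])
treated_idx, control_idx, dist = match_by_distance(coords_demo, t_demo)
# Unit 0 is paired with unit 2, and unit 1 with unit 3.
```

No model for the assignment mechanism is fitted here, so the pairs depend only on the chosen coordinates and distance, not on a propensity-score specification.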