Predicting and analyzing secondary education placement-test scores: A data mining approach
Article code | Publication year | Length of English article |
---|---|---|
21448 | 2012 | 9 pages (PDF) |
Publisher: Elsevier - Science Direct
Journal : Expert Systems with Applications, Volume 39, Issue 10, August 2012, Pages 9468–9476
English abstract
Understanding the factors that lead to the success (or failure) of students in placement tests is an interesting and challenging problem. Since centralized placement tests and future academic achievement are considered to be related concepts, analysis of the success factors behind placement tests may help to understand and potentially improve academic achievement. In this study, using a large and feature-rich dataset from the Secondary Education Transition System in Turkey, we developed models to predict secondary education placement test results, and using sensitivity analysis on those prediction models we identified the most important predictors. The results showed that the C5 decision tree algorithm is the best predictor with 95% accuracy on the hold-out sample, followed by support vector machines (with an accuracy of 91%) and artificial neural networks (with an accuracy of 89%). Logistic regression models turned out to be the least accurate of the four, with an overall accuracy of 82%. The sensitivity analysis revealed that previous test experience, whether a student has a scholarship, the student's number of siblings, and the previous years' grade point average are among the most important predictors of the placement test scores.
English introduction
The rapid development of a variety of information technologies has increased the amount of data collected and used in decision-making processes. As the data collected increased in size and complexity, so did the challenges associated with storing, managing and analyzing it. Overwhelming the capabilities of simple relational database systems, this exponential surge led to the development of new data management systems called data warehouses. On the analysis side, a new term was coined: "data mining". Simply put, data mining is the non-trivial, iterative process of extracting novel patterns (e.g., associations, trends, relationships, natural groupings, etc.) from large data sources in order to enhance evidence-based decision making. Even though data mining is still considered a new paradigm, it has been successfully applied to a variety of domains, including education.
Understanding the factors that lead to the success (or failure) of students in secondary education is an interesting and difficult problem. Therefore, determining the variables that are related to the academic achievement of students has always aroused the curiosity of researchers. Centralized placement tests and future academic achievement are often considered related concepts that derive from each other. That is, students who are successful in placement tests are assumed to be (in general) successful in their academic endeavors. Though controversial, the reasoning behind many placement tests is attributed to this assumption. Placement of students in secondary education institutions in Turkey has been carried out through a centralized and standardized placement test since the 2007–2008 academic year. Scores obtained from the examination taken by primary education students are combined with other factors using a preset formula to determine the final placement scores of the Secondary Education Transition System (SETS).
The aim of this study is twofold: (i) to investigate the predictive power of different data mining methods by employing a k-fold cross-validation methodology, and (ii) to determine the ranked importance of predictive variables (i.e., factors) by applying sensitivity analysis to already trained prediction models. It is thought that revealing the variables that directly or indirectly affect achievement would be beneficial to students, parents, teachers, and administrators who are interested in maximizing success. Moreover, the results of the study would be of great value to researchers as well as practitioners in evaluating the effectiveness of these single-test-based placement systems. This manuscript is organized as follows. In the next section (Section 2), a literature review of the analysis of centralized placement testing is presented. In Section 3 the research methodology is given, where the data, the prediction and analysis methods, and the evaluation techniques are all explained in detail. In Section 4, the comparative analyses of the prediction models and the results of the aggregated sensitivity analyses are presented. In the last section (Section 5), the discussion of the results as well as the concluding remarks are given.
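The k-fold cross-validation idea named in aim (i) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the helper names (`k_fold_indices`, `cross_validate`), the nearest-centroid classifier used as a stand-in model, and the synthetic two-class data. It is not the study's data, models, or tooling.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(fit, predict, X, y, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average the accuracy."""
    accuracies = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k

# Hypothetical stand-in model: assign each row to the nearest class centroid.
def fit_centroids(X, y):
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_centroid(model, row):
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(model[label], row))
    return min(model, key=squared_distance)

# Synthetic, made-up data: feature 0 separates the two classes, feature 1 is noise.
rng = random.Random(42)
X = [[rng.gauss(label * 3.0, 1.0), rng.gauss(0.0, 1.0)]
     for label in (0, 1) for _ in range(100)]
y = [label for label in (0, 1) for _ in range(100)]

mean_accuracy = cross_validate(fit_centroids, predict_centroid, X, y, k=10)
print(f"10-fold mean accuracy: {mean_accuracy:.3f}")
```

Because `cross_validate` takes the `fit` and `predict` functions as arguments, several model types could be compared on identical folds by reusing the same `seed`.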
English conclusion
Data mining is a very useful tool for a wide variety of real-world problems, domains and industries where large amounts of data are being collected and stored. Although educational institutions have not taken advantage of it as much as some other domains (e.g., banking, marketing, healthcare and medicine/biology), they are now moving at an increasing pace to utilize this methodology for a variety of purposes. Some of the noteworthy emerging data mining application areas in the field of education include student need assessment, retention management, major identification and placement test improvement. As awareness of the capabilities of data mining increases, administrators, practitioners and researchers in educational institutions will identify ways to use it in analyzing and solving seemingly unsolvable domain-related problems. Success in data mining depends on following a sound methodology, paying due attention to every step in the process, and being critical and thorough about everything from the start to the end of the experimentation process. Arguably the most important steps are the ones early in the CRISP-DM methodology, namely understanding the domain and understanding the data. According to some experts, up to 85% of the total project time is spent on these early phases (Turban et al., 2010). Our study is another example confirming this claim. Following a sound methodology such as CRISP-DM helped us greatly in organizing our work and staying focused throughout the study. As this study illustrated, data mining techniques can accurately predict placement test outcomes, and hence allow for the analysis and determination of important predictors. Such analyses would help to better understand the internal structure of these standardized tests, and potentially help in designing more effective and fair assessment tools and techniques.
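One common way to rank predictors on an already trained model is to shuffle one input column at a time and record how much the accuracy drops. The sketch below uses that permutation-based approach as a stand-in; the text does not spell out the exact sensitivity procedure used in the study, so this should not be read as the authors' method. The toy `predict` rule and the random data are hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_sensitivity(predict, X, y, seed=0):
    """Return, for each input column, the accuracy lost when it is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values.
        X_permuted = [row[:j] + [value] + row[j + 1:]
                      for row, value in zip(X, column)]
        drops.append(baseline - accuracy(predict, X_permuted, y))
    return drops

# Hypothetical "already trained" model: it looks only at feature 0.
def predict(row):
    return 1 if row[0] > 0.5 else 0

# Made-up data; labels come from the toy model, so baseline accuracy is 1.0.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(500)]
y = [predict(row) for row in X]

drops = permutation_sensitivity(predict, X, y)
ranking = sorted(range(len(drops)), key=lambda j: drops[j], reverse=True)
print("accuracy drop per feature:", drops)
print("features ranked by importance:", ranking)
```

A larger drop means the model leans more heavily on that input. Averaging the drops over several models or folds would give an aggregated ranking.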
As the cross-validation results indicate, in this study data mining techniques predicted significantly better than their statistical counterpart, logistic regression (i.e., the C5 decision tree algorithm predicted with an accuracy of close to 95%, while logistic regression predicted with an accuracy of close to 83%). When choosing one method over another for a prediction problem, in addition to prediction accuracy, one should also consider factors like efficiency (the time it takes to build a model), interpretability (ease of understanding the developed model), deployability (ease of deploying the model for actual use) and theoretical justification. Taking all of these factors into account collectively may result in using logistic regression instead of a decision tree to analyze this problem. It all depends on the priorities of these factors in a given problem situation. Recently, instead of choosing one method, researchers have been proposing the use of many models collectively (often called ensembles) for better and more robust prediction results.
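The ensemble idea mentioned above can be sketched as a simple majority vote over several classifiers. The three rule-based "models" and the feature names (`gpa`, `prior_tests`, `scholarship`) are hypothetical placeholders, loosely inspired by the predictors named in the abstract. They are not fitted models or results from the study.

```python
from collections import Counter

def majority_vote(models, row):
    """Return the label that the largest number of models predict for this row."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for three separately trained classifiers.
def model_a(row):
    return "pass" if row["gpa"] >= 3.0 else "fail"

def model_b(row):
    return "pass" if row["prior_tests"] >= 1 else "fail"

def model_c(row):
    return "pass" if row["gpa"] >= 2.5 and row["scholarship"] else "fail"

# Made-up student record: model_a and model_c vote "pass", model_b votes "fail".
student = {"gpa": 3.4, "prior_tests": 0, "scholarship": True}
ensemble_label = majority_vote([model_a, model_b, model_c], student)
print(ensemble_label)
```

Using an odd number of models avoids ties when there are only two labels. Weighted voting or averaging predicted probabilities are other common ways to combine models.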