A comparative analysis of data mining methods in predicting NCAA bowl outcomes
Article code | Publication year | Length of English article |
---|---|---|
22256 | 2012 | 10 pages (PDF) |
Publisher : Elsevier - Science Direct
Journal : International Journal of Forecasting, Volume 28, Issue 2, April–June 2012, Pages 543–552
Abstract
Predicting the outcome of a college football game is an interesting and challenging problem. Most previous studies have concentrated on ranking the bowl-eligible teams according to their perceived strengths, and using these rankings to predict the winner of a specific bowl game. In this study, using eight years of data and three popular data mining techniques (namely artificial neural networks, decision trees and support vector machines), we have developed both classification- and regression-type models in order to assess the predictive abilities of different methodologies (classification versus regression-based classification) and techniques. In the end, the results showed that the classification-type models predict the game outcomes better than regression-based classification models, and of the three classification techniques, decision trees produced the best results, with better than an 85% prediction accuracy on the 10-fold holdout sample. The sensitivity analysis on trained models revealed that the non-conference team winning percentage and average margin of victory are the two most important variables among the 28 that were used in this study.
Introduction
College football has always been one of the most widely watched sports in the US, with over 50 million in attendance during the course of a single season. It is common to find a college which has a football stadium with a greater seating capacity than the total population of the city in which the college is located. The popularity of American football can be attributed partly to its nature of being ruled by both intricate strategy and physical strength. Because of the physical demands of the game, teams can only play one game a week, and thus they end up playing only 14 competitive games through a season (which includes the end of the season bowl game). Unlike most other competitive team sports, college football does not follow a playoff system for identifying the national champion in a given season. Instead, the annual national champion is determined by a single game between the two “best” teams, which are selected based on a combination of BCS (Bowl Championship Series) rating formulae and polls (the tallied votes) of sports writers and football coaches. Of the remaining hundreds of teams, the sixty or more most successful teams are invited to play in one of thirty or more end-of-season bowl games. The selection process of the “successful” teams for these bowl games is also partially based on a highly subjective, and mostly controversial, poll-driven rating and ranking process.
Predicting the outcome of a college football game (or any sports game) is an interesting and challenging problem. Therefore, challenge-seeking researchers among both academics and industry have spent a great deal of effort on forecasting the outcome of sporting events. Large quantities of historic data are available (often publicly available) from different media outlets regarding the structure and outcomes of sporting events, in the form of a variety of numerically or symbolically represented factors which are assumed to contribute to those outcomes.
However, despite the large number of studies in sports (more than 43,000 hits on digital literature databases), only a small percentage of papers have focused exclusively on the characteristics of sports forecasts. Instead, many papers have been written about the efficiency of sports markets. Since most previous betting-market studies have been concerned with economic efficiency (Van Bruggen, Spann, Lilien, & Skiera, 2010), they have not evaluated the actual (or implied) forecasts associated with such events. As it turns out, it is possible to derive a considerable amount of information about the forecasts and the forecasting process from studies that have tested the markets for economic efficiency (Stekler, Sendor, & Verlander, 2010).
Bowl games are very important for colleges, both financially (bringing in millions of dollars of additional revenue) and for recruiting highly regarded high school athletes for their football programs. The teams that are selected to compete in a given bowl game split a purse, the size of which depends on the specific bowl (some bowls are more prestigious and have higher payouts for the two teams), and therefore securing an invitation to compete in a bowl game is the main goal of any Division I-A college football program. The decision makers in the bowl games are given the authority to select and invite, as per the ratings and rankings, successful bowl-eligible teams (teams that have six wins against their Division I-A opponents in that season) which will play an exciting and competitive game, attract fans of both schools, and keep the remaining fans tuned in via a variety of media outlets for advertising (West & Lamsal, 2008). Every year, people either casually (e.g., recreational office pools for bragging rights) or somewhat seriously (e.g., wagering/betting for monetary gain) put their knowledge of the game on the line in an attempt to accurately predict the outcomes of the bowl games.
The emotional and highly dynamic nature of college football, coupled with the selection process, which aims to bring together equally rated opponents from different conferences (which often have not played each other in the recent past), makes this prediction even more challenging and exciting. Many statisticians and quantitative analysts have explored ways to quantify the variables of a college football bowl game numerically and/or symbolically, and to use these variables in a wide variety of models for predicting the outcome of a game (Stekler et al., 2010). As can be seen in the literature review section, many of these studies rely on the ranking-based selection, and even though some have claimed to have met with limited success, many have reported the difficulty of this prediction problem.
In this paper, we report on a data mining study where we used eight years of bowl game data, along with three popular data mining techniques (decision trees, neural networks and support vector machines), to predict both the classification-type outcome of a game (win versus loss) and the regression-type outcome (projected point difference between the scores of the two opponents).
The rest of the paper is organized as follows. The next section provides a review of the relevant literature in this prediction domain. Section 3 describes the methodology (i.e., the data, prediction model types and evaluation methods used in the study), followed by Section 4, which provides the prediction results. Finally, Section 5 summarizes the study, discusses the findings, and identifies the limitations and future research directions.
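To make the two outcome types concrete, the following is a minimal sketch of how a regression-type prediction (a projected point difference) could be converted into a win/loss call and then scored like a direct classifier. The zero threshold, the function names and the toy margins are assumptions made for illustration only; none of them are taken from the study.

```python
from typing import List


def margin_to_outcome(predicted_margin: float) -> str:
    """Map a projected point difference to a win/loss label.

    Assumption: a positive margin means the reference team wins;
    a margin of zero or less is called a loss.
    """
    return "win" if predicted_margin > 0 else "loss"


def accuracy(predicted: List[str], actual: List[str]) -> float:
    """Fraction of games whose predicted label matches the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one game is required")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Hypothetical projected margins from a regression model (illustrative only).
projected_margins = [7.5, -3.0, 0.5, -10.0]
actual_outcomes = ["win", "loss", "loss", "loss"]

regression_based_labels = [margin_to_outcome(m) for m in projected_margins]
print(regression_based_labels)
print(accuracy(regression_based_labels, actual_outcomes))
```

A direct classification model would produce the win/loss labels itself, so the same `accuracy` function can compare both approaches on identical games.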
Conclusion
The results of the study show that the classification-type models predict the game outcomes better than regression-based classification models. Of the three classification techniques, classification and regression trees produced the best results, with a prediction accuracy better than 86% on the 10-fold holdout sample, followed by support vector machines (79.51% prediction accuracy) and neural networks (75.00% prediction accuracy). For other assessment metrics (i.e., sensitivity and specificity), we see once again that classification and regression trees produce better results than either support vector machines or neural networks. Even though these results are specific to the application domain and data used in this study, and therefore should not be generalized beyond the scope of the study, they are still exciting because decision trees are not only the best predictors but also easier to understand and deploy than the other two machine learning techniques employed in this study.
In order to understand the relative importance of the factors used in the study, we conducted a sensitivity analysis on trained prediction models, where we measured the comparative importance of the input variables in predicting the output. That is, a sensitivity analysis measures the relative importance of a variable based on the difference in modeling performance with and without the inclusion of a variable (i.e., the sensitivity of a specific predictor variable is the error of the prediction model without the predictor variable divided by the error of the model with the predictor variable) (Saltelli, Chan, & Scott, 2008).
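The sensitivity ratio defined above can be sketched in a few lines. This is an illustrative sketch only: the function names, the variable names (`var_a`, `var_b`, `var_c`) and the error values are hypothetical placeholders, not figures or code from the study.

```python
from typing import Dict, List, Tuple


def sensitivity(error_without: float, error_with: float) -> float:
    """Ratio of the model error without a variable to the error with it."""
    if error_with <= 0:
        raise ValueError("error_with must be positive")
    return error_without / error_with


def rank_by_sensitivity(
    error_with_all: float, errors_without: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (variable, sensitivity) pairs, most important first."""
    scores = {
        name: sensitivity(err, error_with_all)
        for name, err in errors_without.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical error rates (illustrative placeholders, not study results).
full_model_error = 0.20
error_when_dropped = {"var_a": 0.30, "var_b": 0.20, "var_c": 0.25}

ranked = rank_by_sensitivity(full_model_error, error_when_dropped)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Under this definition, a ratio above 1 means the error grew when the variable was removed, so the variable carried predictive information, while a ratio close to 1 means it contributed little.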
After normalizing, combining and consolidating the sensitivity analysis results of all classification and regression models, the following input variables made it to the top of the list (presented in ranking order): NCTW (non-conference team winning percentage), HMWIN (home win percentage), MARGOVIC (average margin of victory during the current season), TOP25 (success against the top 25 teams during the current season), and LAST7 (success in the last seven games of the season). It is somewhat surprising that none of the “against the odds” variables made it to the top five. The ordered list of variable importance is presented in a horizontal bar-chart in Fig. 4, where the size of the horizontal bars represents the relative importance of each variable with respect to the rest of the predictive variables.
The results obtained herein should be interpreted within the scope of the study. The use of other prediction techniques and/or other variables may produce somewhat different results. In order to be able to comment further on the generalizability of the findings in this study, more elaborate experimentation with much larger data sets and more prediction techniques is required. The main directions for future research following this study include (a) the enrichment of the variable set (e.g., identifying and including more input variables, representing variables in different forms for better expressiveness, etc.), (b) the employment of other classification and regression methods and methodologies (the use of other prediction techniques such as rough sets, genetic algorithm based classifiers, etc., and the use of ensemble models), (c) experimentation with seasonal game predictions (which may need a combination of static and time series variable identifications), and (d) experimentation with other college and professional sports predictions.