pi-football: A Bayesian network model for forecasting football match outcomes
Article code | Year of publication | Length of English article |
---|---|---|
29185 | 2012 | 18 pages (PDF) |
Publisher : Elsevier - Science Direct
Journal : Knowledge-Based Systems, Volume 36, December 2012, Pages 322–339
English abstract
A Bayesian network is a graphical probabilistic model that represents the conditional dependencies among uncertain variables, which can be both objective and subjective. We present a Bayesian network model for forecasting Association Football matches in which the subjective variables represent the factors that are important for prediction but which historical data fails to capture. The model (pi-football) was used to generate forecasts about the outcomes of the English Premier League (EPL) matches during season 2010/11 (but is easily extended to any football league). Forecasts were published online prior to the start of each match. We show that: (a) using an appropriate measure of forecast accuracy, the subjective information improved the model such that posterior forecasts were on par with bookmakers’ performance; (b) using a standard profitability measure with discrepancy levels at ⩾5%, the model generates profit under maximum, mean, and common bookmakers’ odds, even allowing for the bookmakers’ built-in profit margin. Hence, compared with other published football forecast models, pi-football not only appears to be exceptionally accurate, but it can also be used to ‘beat the bookies’.
English introduction
Association Football (hereafter referred to simply as ‘football’) is the world’s most popular sport [11], [43] and [12], and constitutes the fastest growing gambling market [7]. As a result, researchers continue to introduce a variety of football models which are formulated by diverse forecast methodologies. While some of these focus on predicting tournament outcomes [36], [4], [35], [26] and [27] or league positions [34], our interest is in predicting outcomes of individual matches. A common approach is the Poisson distribution goal-based data analysis whereby match results are generated by the attack and defence parameters of the two competing teams [41], [9], [38] and [32]. A similar version is also reported in [10] where the authors demonstrate profitability against the market only at very high levels of discrepancy, but which relies on small quantities of bets against an unspecified bookmaker. A time-varying Poisson distribution version was proposed by [53] in which the authors demonstrate profitability against Intertops (a bookmaker located in Antigua, West Indies), and refinements of this technique were later proposed in [8] which allow for a computationally less demanding model. In contrast to the Poisson models that predict the number of goals scored and conceded, all other models restrict their predictions to match result, i.e. win, draw, or lose. Typically these are ordered probit regression models that consist of different explanatory variables. For example, [37] considered team performance data as well as published bookmakers’ odds, whereas [24] and [22] considered team quality, recent performance, match significance and geographical distance. Ref. [23] compared goal-driven models with models that only consider match results and concluded that both versions generate similar predictions. Techniques from the field of machine learning have also been proposed for prediction. 
In [55], the authors claimed that a genetic programming based technique was superior in predicting football outcomes to two other methods based on fuzzy models and neural networks. More recently, [52] claimed that acceptable match simulation results can be obtained by tuning fuzzy rules using parameters of fuzzy-term membership functions and rule weights by a combination of genetic and neural optimisation techniques. Models based on team quality ratings have also been considered, but they do not appear to have been extensively evaluated. Knorr-Held [33] used a dynamic cumulative link model to generate ratings for top division football teams in Germany. The ELO rating that was initially developed for assessing the strength of chess players [13] has been adapted to football [3]. In [29], the authors used the ELO rating for match predictions and concluded that the ratings appeared to be useful in encoding the information of past results for measuring the strength of a team, but the forecasts generated were not on par with market odds. Ref. [40] also assessed an ELO rating based model along with the FIFA/Coca-Cola World rating model and concluded that both were inferior to bookmakers’ forecasts for EURO 2008. Numerous studies have considered the impact of specific factors on match outcome. These factors include: home advantage [28], ball possession [28], and red cards [51] and [56]. Recently researchers have considered Bayesian networks and subjective information for football match predictions. In particular, [31] demonstrated the importance of supplementing data with expert judgement by showing that an expert-constructed Bayesian network model was more accurate in generating football match forecasts for matches involving Tottenham Hotspur than machine learners of MC4, naive Bayes, Bayesian learning and K-nearest neighbour.
A model that combined a Bayesian network along with a rule-based reasoner appeared to provide reasonable World Cup forecasts in [42] through simulating various predefined strategies along with subjective information, whereas in [2] a hierarchical Bayesian network model that did not incorporate subjective judgments appeared to be inferior in predicting football results when compared to standard Poisson distribution models. In this paper we present a new Bayesian network model for forecasting the outcomes of football matches in the distribution form of {p(H), p(D), p(A)}, corresponding to home win, draw and away win. We believe this study is important for the following reasons: (a) the model is profitable under maximum, mean and common bookmakers’ odds, even allowing for the bookmakers’ introduced profit margin; (b) the model priors are dependent on statistics derived from predetermined scales of team-strength, rather than statistics derived from a particular team (hence enabling us to maximise historical data); (c) the model enables us to revise forecasts from objective data, by incorporating subjective information for important factors that are not captured in the historical data; (d) the significance of recent information (objective or subjective) is weighted using degrees of uncertainty resulting in a non-symmetric Bayesian parameter learning procedure; (e) forecasts were published online before the start of each match [49]; (f) although the model has so far been applied to one league (the English Premier League) it is easily applicable to any other football league. The paper is organised as follows: section 2 describes the historical data and method used to inform the model priors, section 3 describes the Bayesian network model, section 4 describes the assessment methods and section 5 provides our concluding remarks and future work.
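To illustrate the forecast form {p(H), p(D), p(A)}, the following is a minimal sketch of the goal-based Poisson approach surveyed above, not of the pi-football Bayesian network itself. It assumes home and away goals are independent Poisson variables; the function names and the scoring rates used in the example are hypothetical and chosen only for illustration.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals under a Poisson rate lam."""
    return exp(-lam) * lam ** k / factorial(k)


def match_outcome_probs(home_rate: float, away_rate: float, max_goals: int = 10):
    """Return (p_home, p_draw, p_away) from two independent Poisson goal rates.

    Assumption: goals of the two teams are independent. Scorelines above
    max_goals are truncated, so the result is renormalised to sum to 1.
    """
    p_home = p_draw = p_away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                p_home += p
            elif h == a:
                p_draw += p
            else:
                p_away += p
    total = p_home + p_draw + p_away
    return p_home / total, p_draw / total, p_away / total


# Hypothetical expected goals: 1.6 for the home team, 1.1 for the away team.
p_h, p_d, p_a = match_outcome_probs(1.6, 1.1)
print(f"p(H)={p_h:.3f}  p(D)={p_d:.3f}  p(A)={p_a:.3f}")
```

In the literature cited above, the two rates would be derived from fitted attack and defence parameters; here they are simply passed in as numbers.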
English conclusion
We have presented a novel Bayesian network model called pi-football (v1.32) that was used to generate the EPL match forecasts during season 2010/11. The model considers both objective and subjective information for prediction, in which time-dependent data is weighted using degrees of uncertainty. In particular, objective forecasts are generated first and revised afterwards according to subjective indicators. Because of the ‘anonymous’ underlying approach which generates predictions by only considering the strength of the two competing teams given results data and total points, the entire model is easily applicable to any other football league. For assessing the performance of our model we have considered both accuracy and profitability measurements since earlier studies have shown conflicting conclusions between the two and suggested that both measurements should be considered. In [9], the authors claimed that for a football forecast model to generate profit against bookmakers’ odds without eliminating the in-built profit margin it requires a determination of probabilities that is sufficiently more accurate than those obtained by published odds, and [25] suggested that if such a work was particularly successful, it would not have been published. Ours is the first study to demonstrate profitability against all of the (available) published odds. Previous studies have only considered a single bookmaker, since it was only recently shown that the published odds of a single bookmaker cannot be representative of the overall market [7]. In fact, pi-football was able to generate profit against maximum, mean, and common bookmakers’ odds, even allowing for the bookmakers’ in-built profit margin. We showed that subjective information improved the forecast capability of our model significantly. Our study also emphasises the importance of Bayesian networks, in which subjective information can both be represented and displayed without any particular effort.
Because of the nature of subjective information, we have been publishing our forecasts online [49] prior to the start of each match (earlier studies which incorporated subjective information have not done so). Appendix G provides examples of both objective (fO) and subjective (fS) forecasts for match instances at the beginning of the EPL season 2010/11. At standard discrepancy levels of 5% the profitability of this model ranges from 2.87% to 9.48%, whereas at higher discrepancy levels (8–11%) the maximum profit observed ranges from 8.86% to 35.63%, depending on the various bookmakers’ odds considered. No other published work appears to be particularly successful at beating all of the various bookmakers’ odds over a large period of time, which highlights the success of pi-football. Clearly the real potential benefits of a model such as this are critically dependent on both the structure of the model and the knowledge of the expert. A perfect BN model would still fail to beat the bookmakers at their own game if the subjective expert inputs are inaccurate. Because of the weekly pressure to get all of the model predictions calculated and published online, there was inevitable inconsistency in the care and accuracy taken to consider all the subjective inputs for each match; in most cases the subjective inputs were provided by a member of the research team who is certainly not an expert on the English Premier League. If the model were to be used by more informed experts we feel it would provide posterior beliefs of both higher precision and confidence. An individual component-based analysis failed to provide us with strong conclusions about their distinct efficiency due to the relatively low number of relevant occurrences.
Planned extensions of this research will determine the distinct component-based effectiveness by adding further evidence of relevant occurrences, and a reverse engineering approach will help us understand how specific model components help in matching bookmakers’ odds (hinting at why bookmakers are indeed experts and also what information they might consider for prediction). We have already summarised several aspects concerning bookmakers’ inefficiency in [7]. Other extensions of the research will determine whether revising the strength of the team (given subjective information) rather than the probability distribution itself would improve the performance of the model; this is important because the former represents a natural causality whereas the latter does not. Further, since we have not yet assessed the impact of time-dependent uncertainty for weighting the more recent information, we plan to determine the degree of irrelevance to prediction per preceding information, as well as the degree of efficiency of the various time-series methodologies introduced throughout the sports academic literature (none of the previous football studies have attempted to measure their efficiency).
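The profitability figures quoted in the conclusion depend on betting only when the model and the bookmaker disagree by at least a given discrepancy level. The sketch below shows one plausible reading of such a rule and is not taken from the paper: it assumes the discrepancy is the model probability minus the bookmaker's implied probability (1 / decimal odds, margin included), and the function names, forecast, and odds are all hypothetical.

```python
def select_bets(forecast: dict, odds: dict, threshold: float = 0.05) -> list:
    """Return the outcomes whose discrepancy meets the threshold.

    Assumption: discrepancy = model probability - 1 / decimal odds.
    """
    bets = []
    for outcome in ("H", "D", "A"):
        implied = 1.0 / odds[outcome]
        if forecast[outcome] - implied >= threshold:
            bets.append(outcome)
    return bets


def profit(bets: list, odds: dict, result: str, stake: float = 1.0) -> float:
    """Net profit of unit-stake bets given the observed match result."""
    total = 0.0
    for outcome in bets:
        if outcome == result:
            total += stake * (odds[outcome] - 1.0)  # winnings net of stake
        else:
            total -= stake                          # stake is lost
    return total


# Hypothetical forecast and decimal odds for a single match.
forecast = {"H": 0.55, "D": 0.25, "A": 0.20}
odds = {"H": 2.10, "D": 3.40, "A": 3.60}

bets = select_bets(forecast, odds, threshold=0.05)
print(bets, profit(bets, odds, result="H"))
```

Raising the threshold selects fewer bets, each with a larger disagreement, which is the trade-off behind the different profit ranges reported at the 5% and 8–11% levels.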