In 1954 Paul Meehl (1954, reprinted 1996) published an influential study reviewing 20 pieces of research that compared decisions made by human experts with decisions indicated by the fitted values of simple linear statistical models parameterised on the experts' decisions. The applications were in fields as disparate as the diagnosis of schizophrenia, the probability of released prisoners reoffending, and the academic attainments of college students. In every case, the statistical model performed as well as, and generally better than, the human judges, in spite of the fact that the amount of information available to the human judges was usually greater than the limited number of quantitative inputs available to the statistical model. The logic of this “judgemental bootstrapping” procedure is that in many contexts quite simple models can remove the noise and inconsistency from human decisions, and this more than compensates for the lower information content of the models. Meehl's work stimulated a fierce and negative reaction from medical practitioners, and a summary and rebuttal of their arguments is found in Grove and Meehl (1996). Subsequent studies have reinforced Meehl's findings. The meta-analysis by Grove, Zald, Lebow, Snitz, and Nelson (2000) found 136 studies, including several in business and finance-related areas such as bankruptcy prediction and credit rating by banks. Of these studies, 64 show the statistical approach of weighted linear prediction to be superior, 64 show approximately the same outcome from human and statistical approaches, and only 8 favour the human judges.
This paper aims to extend the domain of judgemental bootstrapping further by investigating whether simple models of the trading recommendations of technical analysts in bond futures markets can outperform the analysts themselves. Technical analysts believe that patterns in the time series of prices can be used to identify profitable trading opportunities in financial markets, in contrast to fundamental analysts, who look at news about interest rates, inflation, company earnings and other economic variables. For determining short-term intra-day and day-to-day trading positions, technical analysis is overwhelmingly the more common decision support system used by traders (Lui and Mole, 1998; Taylor and Allen, 1992).
Our exercise promises to be interesting for five reasons. First, there is little published evidence on the validity of financial market trading systems built on the performance of experts. The many books, journals and articles on expert systems in finance are mostly concerned with finding complex nonlinear models to determine “ideal” trading positions. Second, technical analysts use a combination of quantifiable indicators, subjective pattern recognition tools, and the occasional piece of economic and business news, to support their decisions. Only the quantifiable indicators can be readily used as inputs to a high-frequency statistical model, so in this field the human judges have a substantial information advantage. Third, the criterion of success in the markets is profit rather than the percentage of correct outcomes, the typical metric used in the studies surveyed by Grove et al. (2000). Because of the non-normal distribution of price movements in financial markets, profitability and accuracy are only weakly related, as was demonstrated in the study of interest rate futures trading by Leitch and Tanner (1991). Fourth, trading involves making judgements over time against a changing environment, whereas most models of judgement have been developed from more static cross-sectional “case-based” data. Finally, and again in contrast to, say, corporate bankruptcy events and loan defaults, price movements in financial markets do not have clear-cut drivers. On the contrary, a key prediction of the mainstream modern theory of finance is that futures price changes are made near-random by the actions of profit-motivated traders. Timmermann and Granger (2004) give a balanced review of reasons why prices might nonetheless be forecastable even in an efficient market.
A major barrier to the development of expert financial market trading systems is the paucity of objectively verifiable experts (as opposed to self-proclaimed experts, of whom there are many). Section 2 of this paper introduces the German bond futures markets, and the track records of two analysts who followed these markets in the years 2000–01. We conduct tests suggesting that these analysts show genuine expertise. Section 3 reviews the methods of technical analysis, and the results of a survey designed to establish what technical indicators these particular analysts use. In Section 4, we relate the analysts' recommended trading positions to a subset of relevant indicators using the ordered-response model of Aitchison and Silvey (1957). In the clinical literature this procedure of modelling experts by simple quasi-linear models is termed the “statistical” or “actuarial” approach. We have instead followed current practice in calling our procedure “judgemental bootstrapping”, the terminology of Dawes (1971), Dawes and Corrigan (1974) and Armstrong (2001a). Even then, there is the potential for confusion with statistical bootstrap inference methods, compounded here by the fact that we use a resampling methodology when testing for analyst expertise.
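The ordered-response model of Aitchison and Silvey (1957) is, in modern terminology, an ordered probit: each recommendation (short, neutral or long) is assumed to be generated by a latent index of the indicator inputs crossing estimated thresholds. The sketch below illustrates the idea on synthetic data using the OrderedModel class in statsmodels; the indicator names, the data-generating process and the library choice are illustrative assumptions, not the specification estimated in Section 4.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 700  # comparable to the number of recommendations in our sample

# Hypothetical quantitative technical indicators (illustrative stand-ins
# for the kinds of inputs discussed in Section 3).
X = pd.DataFrame({
    "ma_crossover": rng.normal(size=n),  # fast MA minus slow MA
    "momentum": rng.normal(size=n),      # recent price change
})

# Synthetic latent index generating an ordered recommendation.
latent = 0.8 * X["ma_crossover"] + 0.5 * X["momentum"] + rng.normal(size=n)
y = pd.cut(latent, bins=[-np.inf, -0.5, 0.5, np.inf],
           labels=["short", "neutral", "long"])  # ordered categorical

# Fit an ordered probit of recommendations on the indicators.
result = OrderedModel(y, X, distr="probit").fit(method="bfgs", disp=False)
print(result.summary())

# Predicted category probabilities; the modal category gives the
# model's recommended trading position (0 = short, 1 = neutral, 2 = long).
probs = np.asarray(result.predict(X))
positions = probs.argmax(axis=1)
```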
Armstrong (2001a) sets out some criteria for best practice in the implementation of conventional judgemental bootstrap exercises. Benchmarking our data against these criteria may help us to clarify their likely strengths and limitations. The criteria relate to the choice of cases, experts and driver variables, and, at a more fundamental level, the choice of problem. The number of cases should ideally be large and disparate. With over 700 trading recommendations in very different trading environments (rising/stable/falling markets), our data satisfy this criterion. The number of experts should be “more than one”, and experts should differ and have demonstrable expertise. We have only two experts, both from the same organisation, though with somewhat different approaches to technical analysis. This does limit the extent to which our study can be used to make generalisations about the value of technical analysis, and the value of the judgemental bootstrap in trader support. Finally, the driver variables in any bootstrap model should ideally be quantifiable, comprehensive and “valid”. We do use some of the most popular statistical indicators cited by technical analysts. However, the sheer number and variety of technical indicators makes it impossible for any model to be comprehensive. Validity may also be an issue. Unlike, say, symptoms of disease or financial distress, technical indicators cannot always and everywhere be predictors of price changes in financial markets, since this would be a clear and easily exploitable violation of market efficiency. At best, when used judiciously and in combination with other information, technical indicators may be useful in identifying those intermittent phases when financial markets show short-lived trends. If the underlying relationships between technical indicators and trading profits are complicated but reasonably stable, the modelling process can help to make the judgements about trading positions consistent. However, if the underlying relationships are too complex or too fragile to be approximated by a simple quasi-linear model, then the models may not prove robust over time.
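To make the notion of a quantifiable indicator concrete, two of the most widely used examples, a moving-average crossover and the relative strength index (RSI), can be computed directly from a price series. The following is a minimal pandas sketch with conventional default windows (and a simple-average variant of the RSI); the specific indicators and parameters our analysts report using are set out in Section 3.

```python
import pandas as pd

def ma_crossover(prices: pd.Series, fast: int = 5, slow: int = 20) -> pd.Series:
    """Fast moving average minus slow moving average.

    Positive values suggest an up-trend, negative values a down-trend.
    """
    return prices.rolling(fast).mean() - prices.rolling(slow).mean()

def rsi(prices: pd.Series, window: int = 14) -> pd.Series:
    """Relative strength index on a 0-100 scale.

    Readings above about 70 are conventionally read as overbought,
    below about 30 as oversold.
    """
    delta = prices.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    return 100.0 - 100.0 / (1.0 + gain / loss)
```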
In spite of these potential pitfalls, we find, as is consistent with the clinical literature, that trades based on models of analyst recommendations are more profitable than trades based on the recommendations themselves. Moreover, among the alternative specifications investigated, we find that the models which in statistical terms perform best in-sample also yield the highest profits out-of-sample, suggesting that there is some stability in performance over time. However, the pattern of model trades is rather different from that of analyst trades, with the models trading more often and holding positions for longer. This introduces additional volatility and liquidity risks, so on a risk-adjusted basis it is not clear that the models represent an improvement over human judgement.
Section 5 concludes by looking at the effect of marginally complicating the models, without moving to a full-blooded rule-based expert systems approach. The possibilities considered are:

1. combining model and analyst judgements,
2. using lagged analyst recommendations as inputs to the model,
3. pooling data from the two analysts, and
4. applying the average analyst stop-loss and limit orders to model-determined trading positions.
We also experiment with a single-layer neural network in place of the ordered-response model. Some of these modifications yield more realistic patterns of trading, but only data pooling significantly improves both return and risk relative to the model-only benchmark.
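As an illustration of the neural alternative, the sketch below fits a network with a single small hidden layer to the same kind of indicator inputs and recommendation labels as in the ordered-response sketch above. We use scikit-learn's MLPClassifier purely for exposition; the architecture and training details are assumptions rather than the specification estimated in Section 5, and, unlike the ordered-response model, the classifier ignores the ordering of the three positions.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X (indicator inputs) and y (short/neutral/long labels) as in the
# ordered-response sketch above.
net = make_pipeline(
    StandardScaler(),                       # put indicators on comparable scales
    MLPClassifier(hidden_layer_sizes=(8,),  # one small hidden layer
                  activation="logistic",
                  max_iter=2000,
                  random_state=0),
)
net.fit(X, y)
positions = net.predict(X)  # model-determined trading positions
```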