Download English ISI Article No. 22139

English Title
Flexible least squares for temporal data mining and statistical arbitrage
Article code: 22139 | Year of publication: 2009 | Length: 12 pages (PDF)
Source

Publisher: Elsevier - Science Direct

Journal: Expert Systems with Applications, Volume 36, Issue 2, Part 2, March 2009, Pages 2819–2830

English Keywords
Temporal data mining, Flexible least squares, Time-varying regression, Algorithmic trading system, Statistical arbitrage

English Abstract

A number of recent emerging applications call for studying data streams, potentially infinite flows of information updated in real-time. When multiple co-evolving data streams are observed, an important task is to determine how these streams depend on each other, accounting for dynamic dependence patterns without imposing any restrictive probabilistic law governing this dependence. In this paper we argue that flexible least squares (FLS), a penalized version of ordinary least squares that accommodates time-varying regression coefficients, can be deployed successfully in this context. Our motivating application is statistical arbitrage, an investment strategy that exploits patterns detected in financial data streams. We demonstrate that FLS is algebraically equivalent to the well-known Kalman filter equations, and take advantage of this equivalence to gain a better understanding of FLS and to suggest a more efficient algorithm. Promising experimental results obtained from an FLS-based algorithmic trading system for the S&P 500 Futures Index are reported.
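For concreteness, the penalized criterion referred to in the abstract can be written out. The following is the standard FLS cost function as originally formulated by Kalaba and Tesfatsion, with notation chosen here for illustration (not necessarily the paper's): y_t is the target observation at time t, x_t the vector of explanatory values, beta_t the time-varying coefficient vector, and mu > 0 the smoothness weight.

```latex
% FLS cost: squared measurement residuals plus a penalty, weighted
% by mu, on period-to-period changes in the regression coefficients.
C(\beta_1,\dots,\beta_T;\mu)
  = \sum_{t=1}^{T} \bigl(y_t - \mathbf{x}_t^{\top}\beta_t\bigr)^2
  + \mu \sum_{t=1}^{T-1} (\beta_{t+1}-\beta_t)^{\top}(\beta_{t+1}-\beta_t)
```

As mu grows large the coefficients are forced towards a constant vector and ordinary least squares is recovered; small values of mu allow the coefficients to track local structure in the data.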

English Introduction

Temporal data mining is a fast-developing area concerned with processing and analyzing high-volume, high-speed data streams. A common example of a data stream is a time series, a collection of univariate or multivariate measurements indexed by time. Each record in a data stream may also have a complex structure involving both continuous and discrete measurements collected in sequential order. There are several application areas in which temporal data mining tools are being increasingly used, including finance, sensor networking, security, disaster management, e-commerce and many others. In the financial arena, data streams are being monitored and explored for many different purposes, such as algorithmic trading, smart order routing, real-time compliance, and fraud detection. At the core of all such applications lies the common need to make time-aware, instant, intelligent decisions that exploit, in one way or another, patterns detected in the data.

In the last decade we have seen an increasing trend among investment banks, hedge funds, and proprietary trading boutiques to systematize the trading of a variety of financial instruments. These companies resort to sophisticated trading platforms based on predictive models to transact market orders that serve specific speculative investment strategies. Algorithmic trading, otherwise known as automated or systematic trading, refers to the use of expert systems that enter trading orders without any user intervention; these systems decide on all aspects of the order, such as its timing, price, and final quantity. They effectively implement pattern recognition methods in order to detect and exploit market inefficiencies for speculative purposes. Moreover, automated trading systems can automatically slice a large trade into several smaller trades in order to hide its impact on the market (a technique called iceberging, sketched below) and lower trading costs. According to the Financial Times, the London Stock Exchange foresees that about 60% of all its orders in the year 2007 will be entered by algorithmic trading.

Over the years, a plethora of statistical and econometric techniques have been developed to analyze financial data (De Gooijer & Hyndman, 2006). Classical time series models, such as ARIMA and GARCH, as well as many other extensions and variations, are often used to obtain insights into the mechanisms that generate the observed data and to make predictions (Chatfield, 2004). However, in some cases, conventional time series and other predictive models may not be up to the challenges that we face when developing modern algorithmic trading systems. First, as a result of developments in data collection and storage technologies, these applications generate massive amounts of data streams, thus requiring more efficient computational solutions. Such streams are delivered in real-time; as new data points become available at very high frequency, the trading system needs to adjust quickly to the new information and take almost instantaneous buying and selling decisions. Second, these applications are mostly exploratory in nature: they are intended to detect patterns in the data that may be continuously changing and evolving over time. Under this scenario, little prior knowledge should be injected into the models; the algorithms should require minimal assumptions about the data-generating process, as well as minimal user specification and intervention.
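As a side illustration of the order-slicing idea mentioned above, the sketch below splits a parent order into equally sized visible child orders. The function and parameter names are our own, purely illustrative, and ignore real-world details such as venue rules and randomized child sizes.

```python
# Minimal "iceberging" sketch: only a small visible slice of the total
# intended quantity is exposed to the market at any one time.
def slice_order(total_qty: int, visible_qty: int) -> list[int]:
    """Split a parent order into child orders of at most visible_qty."""
    children = []
    remaining = total_qty
    while remaining > 0:
        child = min(visible_qty, remaining)
        children.append(child)
        remaining -= child
    return children

# Example: a 10,000-share parent order exposed 500 shares at a time.
print(slice_order(10_000, 500))   # 20 child orders of 500 shares each
```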
In this work we focus on the problem of identifying time-varying dependencies between co-evolving data streams. This task can be cast as a regression problem: at any specified point in time, the system needs to quantify to what extent a particular stream depends on a possibly large number of other explanatory streams. In algorithmic trading applications, a data stream may comprise daily or intra-day prices or returns of a stock, an index, or any other financial instrument. At each time point, we assume that a target stream of interest depends linearly on a number of other streams, but the coefficients of the regression model are allowed to evolve and change smoothly over time.

The paper is organized as follows. In Section 2 we briefly review a number of common trading strategies and formulate the problem arising in statistical arbitrage, thus providing some background material and motivation for the proposed methods. The flexible least squares (FLS) methodology is introduced in Section 3 as a powerful exploratory method for temporal data mining; this method fits our purposes well because it imposes no probabilistic assumptions and relies on minimal parameter specification. In Section 4 some assumptions of the FLS method are revisited, and we establish a clear connection between FLS and the well-known Kalman filter equations. This connection sheds light on the interpretation of the model and naturally yields a modification of the original FLS that is computationally more efficient and numerically stable. Experimental results obtained using the FLS-based trading system are described in Section 5; in that section, in order to deal with the large number of predictors, we complement FLS with a feature extraction procedure that performs on-line dimensionality reduction. We conclude in Section 7 with a discussion of related work and directions for further research.
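To make the time-varying regression set-up concrete, here is a minimal numpy sketch that computes filtered coefficient paths through the FLS/Kalman correspondence mentioned above: the coefficients follow a random walk whose state-noise covariance is taken as (1/mu) * I, with unit measurement variance. This is a sketch under our own simplifying assumptions, not the authors' implementation; all names and parameter values are illustrative.

```python
import numpy as np

def fls_kalman(y, X, mu=100.0):
    """Filtered time-varying regression coefficients.

    Kalman-filter sketch of FLS: observation model y_t = x_t' beta_t + e_t,
    with beta_t a random walk whose state-noise covariance is (1/mu) * I.
    Larger mu penalizes coefficient changes more, giving smoother paths.
    Unit measurement-noise variance is assumed for simplicity.
    """
    T, k = X.shape
    beta = np.zeros(k)           # current state estimate
    P = 1e3 * np.eye(k)          # diffuse initial state covariance
    Q = (1.0 / mu) * np.eye(k)   # random-walk (state-noise) covariance
    betas = np.zeros((T, k))
    for t in range(T):
        x = X[t]
        P = P + Q                         # predict: uncertainty grows by Q
        S = x @ P @ x + 1.0               # innovation variance
        K = P @ x / S                     # Kalman gain
        beta = beta + K * (y[t] - x @ beta)
        P = P - np.outer(K, x @ P)        # posterior covariance
        betas[t] = beta
    return betas

# Toy usage: one regressor whose true coefficient drifts over time.
rng = np.random.default_rng(0)
T = 500
X = rng.normal(size=(T, 1))
true_beta = np.linspace(0.5, 2.0, T)          # slowly drifting coefficient
y = X[:, 0] * true_beta + 0.1 * rng.normal(size=T)
print(fls_kalman(y, X, mu=100.0)[-5:, 0])     # estimates near 2.0 at the end
```

Larger values of mu correspond to a stronger smoothness penalty in the FLS cost, which is exactly the trade-off the practitioner controls.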

English Conclusions

We have argued that the FLS method for regression with time-varying coefficients lends itself to being a useful temporal data mining tool. We have derived a clear connection between FLS and the Kalman filter equations, and have demonstrated how this link enhances the interpretation of the smoothing parameter featuring in the cost function that FLS minimizes, and naturally leads to a more efficient algorithm. Finally, we have shown how FLS can be employed as a building block of an algorithmic trading system. There are several aspects of the simple system presented in Section 5 that can be further improved upon, and the remainder of this discussion points to a few general directions and related work that we intend to explore in the future.

The problem of feature selection is an important one. In Section 5 the system relies on a set of 432 constituents of the S&P 500 Price Index under the assumption that they explain well the daily movements in the target asset. These explanatory data streams could be selected automatically, perhaps even dynamically, from a very large basket of streams, on the basis of their similarity to the target asset. This line of investigation relates to the correlation detection problem for data streams, a well-studied and recurrent issue in temporal data mining. For instance, Guha, Gunopulos, and Koudas (2003) propose an algorithm that aims at detecting linear correlation between multiple streams. At the core of their approach is a technique for approximating the SVD of a large matrix by using a (random) matrix of smaller size, at a given accuracy level; the SVD is then periodically and randomly re-computed over time, as more data points arrive. The SPIRIT system for streaming pattern detection of Papadimitriou, Sun, and Faloutsos (2005) and Sun, Papadimitriou, and Faloutsos (2006) incrementally finds correlations and hidden variables summarising the key trends in the entire stream collection. Of course, deciding which similarity measure to adopt in order to quantify how close explanatory and target assets are is not an easy task, and is indeed a much debated issue (see, for instance, Gavrilov, Anguelov, Indyk, & Motwani, 2000). Shasha and Zhu (2004) adopt a sliding window model and the Euclidean distance as a measure of similarity among streams; their StatStream system can be used to detect pairs of financial time series with high correlation among many available data streams (a small sliding-window sketch of this idea is given below). Cole, Shasha, and Zhao (2005) combine several techniques (random projections, grid structures, and others) in order to compute Pearson correlation coefficients between data streams. Other measures, such as dynamic time warping, have also been suggested (Capitani & Ciaccia, 2005).

Real-time feature selection can be complemented by feature extraction. In our system, for instance, we incrementally reduce the original space of 432 explanatory streams to a handful of dimensions using an on-line version of SVD. Other dynamic dimensionality reduction models, such as incremental independent component analysis (Basalyga & Rattray, 2004) or non-linear manifold learning (Law, Zhang, & Jain, 2004), as well as on-line clustering methods, would offer potentially useful alternatives.

Our simulation results have shown gross monetary results, and we have assumed that transaction costs are negligible. Better trading rules that explicitly model the mean-reverting behavior (or other patterns) of the spread data stream and account for transaction costs, as in Carcano, Falbo, and Stefani (2005), can be considered.
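Before turning to further refinements of the trading rule, here is a minimal numpy sketch of the sliding-window correlation idea discussed above, in the spirit of (but far simpler than) StatStream: candidate explanatory streams are ranked by their absolute Pearson correlation with the target over the most recent window. The names, window length, and synthetic data are our own illustrative choices.

```python
import numpy as np

def rank_streams_by_correlation(target, candidates, window=60):
    """Rank candidate streams by Pearson correlation with the target,
    computed over the most recent `window` observations only."""
    t_win = target[-window:]
    scores = []
    for name, series in candidates.items():
        corr = np.corrcoef(t_win, series[-window:])[0, 1]
        scores.append((name, corr))
    # Rank by absolute correlation: strongly negatively correlated
    # streams are as useful for hedging as positively correlated ones.
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)

# Toy usage with two synthetic streams, one related to the target.
rng = np.random.default_rng(1)
target = rng.normal(size=250).cumsum()
candidates = {
    "stream_a": target + rng.normal(scale=0.5, size=250),  # related
    "stream_b": rng.normal(size=250).cumsum(),             # unrelated
}
print(rank_streams_by_correlation(target, candidates))
```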
The trading rule can also be modified so that trades are placed only when the spread is, in absolute value, greater than a certain threshold chosen so as to maximize profits, as in Vidyamurthy (2004); a minimal sketch of such a threshold rule is given below. In a realistic scenario, rather than trading one asset only, the investor would build a portfolio of models; the resulting system may be optimized using measures that capture both the forecasting and financial capabilities of the system, as in Towers and Burgess (2001). Finally, we point out that the FLS method can potentially be used in other settings and applications, such as predicting co-evolving data streams with missing or delayed observations, as in Yi et al. (2000), and for outlier and fraud detection, as in Adams, Hand, Montana, and Weston (2006).
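As a final illustration, the threshold rule just mentioned can be sketched in a few lines: go short the spread when it is above a positive threshold, long when it is below the negative threshold, and stay flat otherwise. The threshold value, the AR(1) toy spread, and the gross P&L bookkeeping are our own illustrative choices, not Vidyamurthy's calibrated procedure, and transaction costs are ignored as in our simulations.

```python
import numpy as np

def threshold_positions(spread, threshold):
    """Illustrative rule: short the spread above +threshold, long below
    -threshold, flat otherwise (betting on mean reversion)."""
    positions = np.zeros_like(spread)
    positions[spread > threshold] = -1.0   # spread rich: sell it
    positions[spread < -threshold] = 1.0   # spread cheap: buy it
    return positions

# Toy mean-reverting spread (AR(1)) and a fixed illustrative threshold.
rng = np.random.default_rng(2)
spread = np.zeros(500)
for t in range(1, 500):
    spread[t] = 0.9 * spread[t - 1] + rng.normal(scale=0.1)
positions = threshold_positions(spread, threshold=0.2)
# Gross P&L: yesterday's position times today's change in the spread.
pnl = float((positions[:-1] * np.diff(spread)).sum())
print(f"gross P&L: {pnl:.3f}")
```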