Download English ISI article No. 22309
Article title

Asynchronism-based principal component analysis for time series data mining
Article code: 22309 | Publication year: 2014 | Length: 9 pages (PDF)
Source

Publisher: Elsevier - Science Direct

Journal: Expert Systems with Applications, Volume 41, Issue 6, May 2014, Pages 2842–2850

Keywords
Asynchronous correlation, Covariance matrix, Principal component analysis, Time series data mining, Dynamic time warping
Article preview

English abstract

Principal component analysis (PCA) is often applied to dimensionality reduction in time series data mining. However, PCA is based on the synchronous covariance, which is not very effective in some cases. In this paper, an asynchronism-based principal component analysis (APCA) is proposed to reduce the dimensionality of univariate time series. In APCA, an asynchronous method based on dynamic time warping (DTW) is developed to obtain interpolated time series derived from the original ones. The correlation coefficient or covariance between the interpolated time series represents the correlation between the original ones. In this way, a novel and valid principal component analysis based on the asynchronous covariance is achieved for dimensionality reduction. The results of several experiments demonstrate that the proposed APCA outperforms PCA for dimensionality reduction in the field of time series data mining.
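To make the mechanism described in the abstract concrete, the following is a minimal sketch in Python/NumPy (the function names, the plain O(nm) DTW, and the toy data are our own assumptions, not the authors' implementation): find the best DTW warping path between two series, form the two "interpolated" sequences by reading each series along that path, and take their Pearson correlation. A shape that recurs at different points in time typically scores much higher under this asynchronous variant than under the plain synchronous correlation.

```python
import numpy as np

def dtw_path(x, y):
    """Classic O(n*m) dynamic-programming DTW; returns the best warping path
    as a list of index pairs (i, j)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the corner to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def asynchronous_correlation(x, y):
    """Pearson correlation of the two 'interpolated' sequences formed by
    reading x and y along the DTW warping path (values are repeated where
    the path stretches one series against the other)."""
    path = dtw_path(x, y)
    xi = np.array([x[i] for i, _ in path], dtype=float)
    yi = np.array([y[j] for _, j in path], dtype=float)
    return np.corrcoef(xi, yi)[0, 1]

if __name__ == "__main__":
    t = np.linspace(0, 2 * np.pi, 60)
    a = np.sin(t)                   # a sine wave
    b = np.sin(t - np.pi / 2)       # the same shape, shifted in time
    print("synchronous r :", np.corrcoef(a, b)[0, 1])
    print("asynchronous r:", asynchronous_correlation(a, b))
```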

English introduction

Time series are among the most important research objects in the field of data mining, and the techniques applied to such data are collectively called time series data mining (TSDM) (Esling & Agon, 2012). However, since their high dimensionality often renders standard data mining techniques inefficient, methods have been devised to reduce the dimensionality. The existing methods fall into two categories. The first operates on a single (univariate) time series, e.g., the discrete Fourier transform (DFT) (Agrawal, Faloutsos, & Swami, 1993), the discrete wavelet transform (DWT) (Maharaj and Urso, 2011, Struzik and Siebes, 1998 and Struzik and Siebes, 1999), polynomial representation (PR) (Fuchs, Gruber, Pree, & Sick, 2010), piecewise linear approximation (PLA) (Keogh et al., 2001, Papadakis and Kaburlasos, 2010 and Shatkay and Zdonik, 1996), piecewise aggregate approximation (PAA) (Keogh et al., 2000 and Li and Guo, 2011), and symbolic aggregate approximation (SAX) (Lee et al., 2009 and Lin et al., 2003). These methods concentrate on transforming a single time series so that the reduced representation has a lower dimension than the original one. The second category operates on the time series dataset as a whole, e.g., singular value decomposition (SVD) (Spiegel, Gaebler, & Lommatzsch, 2011), principal component analysis (PCA) (Singhal & Seborg, 2002) and independent component analysis (ICA) (Cichocki & Amari, 2002). SVD and PCA are often regarded as the same method, retaining the first few principal components to represent the whole dataset, while ICA is a development of principal component analysis and factor analysis. (A short code sketch contrasting the two categories appears at the end of this review.)

In time series data mining, these methods are combined with appropriate measurements to discover information and knowledge from a time series dataset. Krzanowski (1979) used PCA to construct the principal components and chose the first k of them to represent a multivariate time series; the similarity between two time series is then calculated from the cosine of the angle between the corresponding principal components. Singhal and Seborg (2005) proposed a new PCA-based similarity measure, Sdist, which outperforms the earlier methods. Karamitopoulos and Evangelidis (2010) used PCA to construct the feature space of the queried time series, projected every time series onto that space, and took the reconstruction error between two time series as the distance between the query and the queried one. SVD is closely related to PCA and uses the Karhunen–Loève decomposition to reduce the dimensionality of time series. Li, Khan, and Prabhakaran (2006) proposed two methods to choose the feature vectors and used them to classify time series. Weng and Shen (2008) extended the traditional SVD to a two-dimensional SVD (2dSVD) that extracts the principal components from the column–column and row–row directions to compute the covariance matrix. Since feature extraction is one of the most important tasks for ICA, it has also been applied to the analysis of time series. Wu and Yu (2005) used FastICA (Hyvärinen, 1999) to obtain independent components for multivariate time series and clustered them with a corresponding distance measure.
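To make the two categories concrete, here is a short sketch (our own illustration, not code from any of the cited works): piecewise aggregate approximation (PAA) reduces a single series to segment means, while standard PCA reduces a whole equal-length dataset by projecting each series onto a few principal components.

```python
import numpy as np

def paa(x, segments):
    """Piecewise aggregate approximation: represent one series by the mean
    of each of `segments` equal-width windows (univariate reduction)."""
    x = np.asarray(x, dtype=float)
    return np.array([seg.mean() for seg in np.array_split(x, segments)])

def pca_reduce(X, k):
    """Dataset-level reduction: project each row of X (one series) onto the
    first k principal components of the column covariance matrix."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # leading eigenvectors
    return Xc @ top                                   # k-dimensional scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    series = np.sin(np.linspace(0, 4 * np.pi, 128))
    print(paa(series, 8))                  # one series -> 8 numbers
    dataset = rng.normal(size=(20, 128))   # 20 equal-length series
    print(pca_reduce(dataset, 3).shape)    # (20, 3)
```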
Baragona and Battaglia (2007) used ICA to detect anomalies by extracting the unusual components. PCA is the basic theory and is widely used to reduce the dimensionality of time series (Karamitopoulos and Evangelidis, 2010 and Bankó and Abonyi, 2012). It uses the variance to measure how much information is retained, and the covariance to measure the correlation between two different time series. However, the traditional PCA computes the covariance between two time series in a linear and synchronous way, which is not effective when the two series are similar or correlated at different points in time. In other words, the same shape trends appearing in two time series at different points in time will be regarded as uncorrelated or negatively correlated, so in some cases PCA works ineffectively. Moreover, the time series must have equal length when they are analyzed by PCA, and PCA is usually applied to multivariate time series datasets rather than univariate ones.

The research motivations of this work are to overcome the problems mentioned above. Firstly, the dimensionality of time series with different lengths should be reducible by the principle of principal components; that is, the proposed method should handle time series with different lengths, whereas the existing work, including SVD, PCA and ICA, only processes series of equal length. Secondly, the existing methods only consider the synchronous relationship between two variables or two time series and neglect the asynchronous relationships, so the proposed method must take the asynchronous relationships into account. Thirdly, the important information in a time series should be emphasized by the proposed method, because some points reflect the key shape trends and provide much more information than the others.

Driven by these motivations, this work covers the measurement of the asynchronous correlation coefficient, the design of asynchronism-based PCA, and the representations of univariate time series for dimensionality reduction. The asynchronous correlation is the correlation coefficient between a pair of interpolated time series formed from the elements of the best warping path, which can be found by dynamic time warping (DTW) (Yu, Yu, & Hu, 2011). The interpolated time series improve the effectiveness of the correlation coefficient (Pearson's product-moment correlation coefficient) (Rodgers & Nicewander, 1988) in measuring the similarity (or correlation) between time series whose shape trends coincide at different points in time. The asynchronism-based PCA uses the asynchronous correlation to measure the whole time series dataset and obtains the first several principal components, which retain as much of the important information about the dataset as possible. In particular, the tuple of the first several principal components is regarded as the corresponding representation, so that every time series can be represented by a short tuple for dimensionality reduction. In comparison with the traditional PCA, the proposed method (asynchronism-based PCA, APCA) not only measures the synchronous correlation as PCA does, but also captures the asynchronous correlation.
It is therefore a good approach for measuring the similarity between two time series whose similar shape trends appear at different points in time; a minimal sketch of the PCA step on such an asynchronous correlation matrix follows this introduction. The remainder of the paper is organized as follows. In Section 2, we provide some necessary background material and discuss related work. In Section 3, we present the proposed method. The experimental evaluation of the new method is described in Section 4. Finally, we discuss our results further and conclude in Section 5.
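As one plausible reading of the PCA step described above (an illustrative sketch only, not the authors' exact algorithm), suppose the pairwise asynchronous correlations over a dataset of N series have already been collected into a symmetric matrix R, for instance with a routine like the asynchronous_correlation sketch after the abstract. The reduction then eigendecomposes R, keeps the first k principal components as each series' short representation, and reports the cumulative energy retained.

```python
import numpy as np

def apca_like_reduction(R, k):
    """Given a symmetric matrix R of pairwise (a)synchronous correlations
    between the N series in a dataset, eigendecompose it and represent each
    series by its coordinates on the first k principal components.
    Returns the (N, k) representations and the cumulative energy retained."""
    eigvals, eigvecs = np.linalg.eigh(R)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    energy = np.cumsum(eigvals) / eigvals.sum()   # cumulative energy ratio
    reps = eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k], 0.0))
    return reps, energy[k - 1]

if __name__ == "__main__":
    # Toy stand-in for a pairwise asynchronous correlation matrix of 5 series;
    # in APCA it would be filled with DTW-based asynchronous correlations.
    rng = np.random.default_rng(1)
    R = np.corrcoef(rng.normal(size=(5, 8)))      # symmetric, unit diagonal
    reps, retained = apca_like_reduction(R, k=2)
    print(reps.shape, round(float(retained), 3))  # (5, 2) and energy ratio
```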

English conclusion

An asynchronism-based principal component analysis (APCA) is proposed to reduce dimensionality in light of the asynchronous correlation between time series. Existing methods such as SVD, PCA and ICA are often used to reduce the dimensionality of time series: SVD and PCA share a similar principle based on principal components, ICA is a development of principal component analysis and factor analysis, and all of them are widely applied in different fields including time series data mining. Unfortunately, they only consider the synchronous relationship between two time series (or two variables) and neglect the asynchronous relationship. In contrast, the proposed method APCA considers the asynchronous covariance in the process of computation and is an improved version of PCA. Since the covariance is normally used to measure the correlation between two equal-length variables, it does not work for unequal-length time series. To better measure the covariance between unequal-length time series with similar shape trends, we use DTW to find the best warping path and obtain interpolated time series that reflect the similar shape trends of the original series at different points in time. Moreover, the interpolated time series can be stretched to equal length, so PCA can be applied to the equal-length interpolated series instead of the unequal-length originals. In addition, the experimental results indicate that APCA is more powerful than PCA in the field of time series data mining.

The advantages of APCA over PCA can be summarized as follows. (1) APCA can reduce the dimensionality of time series with different lengths in the dataset, whereas PCA only processes series of equal length; moreover, the idea of PCA is applied within the procedure of APCA, so APCA is an extended version of PCA. (2) Time series with similar shape trends at different points in time can be mapped to each other by APCA, and a correlation including both synchronism and asynchronism between two time series with different lengths can be reflected, while PCA only reflects the synchronous one. (3) Since repeated values are interpolated to form the interpolated time series, more important information is considered by APCA than by PCA, which means that APCA can retain more cumulative energy than PCA for the same number of reduced dimensions (the same number of retained principal components). So, extending the traditional PCA to process time series with different lengths, making PCA consider asynchronous relationships, and reflecting the important and repeated information in time series are the unique research contributions of this paper. All of these contributions overcome the shortcomings of PCA and widen the application of the principle of principal components.

Although APCA is effective for time series data mining, extra time is spent on DTW in the process of APCA. Much work (Lemire, 2009 and Salvador and Chan, 2007) has concentrated on decreasing the time cost of DTW and could speed up the computation of APCA; however, these methods depend on some factors that are difficult to set (see the sketch below for one typical example of such a factor). Therefore, one direction for future work is to propose a suitable method, free of such factors, to find the best warping path in APCA so that the algorithm executes faster.
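As an illustration of the kind of factor such speed-up methods require (a hedged sketch of a common band constraint in the spirit of those works, not their exact algorithms), restricting the DTW path to a band of half-width `window` around the diagonal shrinks the dynamic-programming table, but the band width itself has to be chosen for the data at hand.

```python
import numpy as np

def dtw_distance_banded(x, y, window):
    """DTW cost restricted to a band of half-width `window` around the
    diagonal. The band reduces the O(n*m) work but introduces a parameter
    that must be tuned: too narrow a band can miss the best warping path."""
    n, m = len(x), len(y)
    w = max(window, abs(n - m))                   # band must reach the corner
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return np.sqrt(cost[n, m])

if __name__ == "__main__":
    t = np.linspace(0, 2 * np.pi, 200)
    a, b = np.sin(t), np.sin(t - 0.5)
    for w in (5, 20, 200):                        # tighter band: faster, riskier
        print(w, round(float(dtw_distance_banded(a, b, w)), 4))
```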
At the same time, since APCA considers asynchronous relationships and is able to process time series with different lengths, it, like SVD, PCA and ICA, can be used to solve problems in image recognition, speech recognition, financial analysis, and so on. In particular, in the field of multivariate stock data analysis, the asynchronous relationships between different variables are very important for volatility analysis. Therefore, the application of APCA to volatility analysis and co-movement analysis of stock markets is another research direction.