PM2.5 concentration prediction using a hidden semi-Markov model based on time series data mining
Article code | Year of publication | Number of pages (English article) |
---|---|---|
22161 | 2009 | 10 pages (PDF) |
Publisher: Elsevier - Science Direct
Journal : Expert Systems with Applications, Volume 36, Issue 5, July 2009, Pages 9046–9055
English abstract
In this paper, a novel framework and methodology based on hidden semi-Markov models (HSMMs) for high PM2.5 concentration value prediction is presented. Because it lacks an explicit time structure and has only a short-term memory of past history, a standard hidden Markov model (HMM) has limited power in modeling the temporal structures of prediction problems. To overcome the limitations of HMMs in prediction, we develop HSMMs by adding temporal structures to the HMMs and use them to predict the concentration levels of PM2.5. As a model-driven statistical learning method, the HSMM assumes that both data and a mathematical model are available. In contrast to data-driven statistical prediction models such as neural networks, a mathematical functional mapping between the parameters and the selected input variables can be established in HSMMs. In the proposed framework, the states of the HSMMs represent the PM2.5 concentration levels. The model parameters are estimated through a modified forward–backward training algorithm, and the re-estimation formulae for the model parameters are derived. The trained HSMMs can be used to predict high PM2.5 concentration levels. The proposed framework and methodology are validated in a real-world application: the prediction of high PM2.5 concentrations at O’Hare airport in Chicago. The results show that the HSMMs provide accurate predictions of high PM2.5 concentration levels for the next 24 h.
English introduction
Prediction of particulate matter (PM) in the air is an important issue in control and reduction of pollutants in the air. Particulate matter is the term used for a mixture of solid particles and liquid droplets found in the air. In particular, fine particles that are smaller than 2.5 or 10 μm (millionths of a meter) in diameter are defined as PM2.5 or PM10. Fine particles (especially, PM2.5) harm human health. The US Environmental Protection Agency (EPA) recently promulgated revised standards for PM and established new annual and 24-h fine particulate standards with PM2.5 mass as the indicator due to scientific data associating fine particle pollution with significant increases in the risk of death from lung cancer, pulmonary illness (e.g., asthma), and cardiovascular disease (Dockery and Pope, 1994, EPA, 2002, Katsouyanni, 1997, Levy, 2000 and Pope et al., 2002). These fine particles are generally emitted from activities such as industrial and residential combustion and from vehicle exhaust. The health effects of exposure to fine particles include: (1) increased premature deaths, primarily in the elderly and those with heart or lung disease, (2) aggravation of respiratory and cardiovascular illness, leading to hospitalizations and emergency room visits, particularly in children, the elderly, and individuals with heart or lung conditions, (3) decreased lung function and symptomatic effects such as those associated with acute bronchitis, particularly in children and asthmatics, (4) new cases of chronic bronchitis and new heart attacks, (5) changes to lung structure and natural defense mechanisms. Fine particles in the air also decrease visibility. The benefits to human health and the environment by reducing fine particles and ozone can be significant. By 2020, the benefits of reductions in fine particles and ozone are estimated to be $113 billion annually (The Clear Skies Act, 2003). 
By 2010, reductions in fine particles and ozone are estimated to result in substantial early benefits of $54 billion, including 7900 fewer premature deaths, annually (The Clear Skies Act, 2003). Other significant health and environmental benefits include reduced human exposure to mercury, fewer acidified lakes, and reduced nitrogen loads to sensitive ecosystems that cannot currently be quantified and/or monitored but are nevertheless expected to be significant. Predictive models for PM2.5 vary from the extremely simple to extremely complex, yet the ability to accurately forecast air quality remains elusive. Much of the variability in PM concentrations is driven by meteorological conditions, which fluctuate on multiple time and spatial scales. Another significant source of variability is changes in the temporal and spatial patterns of emissions activity. Qualitative and quantitative models to forecast PM and ozone were described in a recent EPA document (EPA, 2003). As documented also by Schlink, Pelikan, and Dorling (2003), a particular technique often has good performance in one respect and poor performance in others. The quantitative models are briefly summarized here. Note that, because PM2.5 has been regulated only since 1997, and a national measurement program implemented only since 1999, fewer forecasting applications have been developed to date for PM2.5 than for ozone. The need for accurate forecasts of PM2.5 continues to grow as epidemiological evidence of PM2.5’s acute health impacts mounts. In the past, a number of techniques have been developed for the prediction of PM concentrations. Essentially, approaches for PM prediction can be classified into five categories: (1) empirical models, (2) fuzzy logic-based systems, (3) simulation models, (4) data-driven statistical models, and (5) model-driven statistical learning methods. Empirical models are developed by field experts and validated by data sets of the studied area. 
Generally, method performance depends on the variable under study, the geographic location, and the underlying assumptions of the methods. Therefore an empirical method is “best” only for specific situations. Fuller, Carslaw, and Lodge (2002) devised an empirical model to predict concentrations of PM10 at background and roadside locations in London. The model accurately predicts daily mean PM10 across a range of sites from curbside to rural. Predictions of future PM10 can be made using the expected reductions in non-primary PM10 and site specific annual mean NOX predicted from emission inventories and dispersion modeling. However, the model has a limited geographical extent covering London and its immediate surrounding area. The model performance depends on a consistent relationship between PM10 and NOX emissions. The fuzzy logic approach makes it possible to deal with problems affected by uncertainty and to obtain reliable models for non-linear phenomena whose characterization is based on rough and poor data. However, like rule-based systems, the determination of a fuzzy model knowledge base is obtained by the contribution of experts of the field. Raimondi et al., 1997a and Raimondi et al., 1997b proposed a fuzzy logic-based model for predicting ground level pollution. The procedure consists of two different phases. The first phase concerns prediction of meteorological and emission variables (model input) and is implemented through fuzzy prediction of time series. The second phase of modeling concerns the determination (using fuzzy inference methods) of the predicted meteorological classes, each of which contributes in determining model output (i.e. prediction of air pollutant concentration). In recent years, the use of three-dimensional high frequency mesoscale data sets derived from dynamical models to drive air quality simulation models has been growing. Three-dimensional air quality models have been employed to forecast pollutant concentrations. 
These models use meteorological model output such as the Penn State Mesoscale Model (MM5) and emissions model output for the forecasting period, then apply a mathematical model to simulate transport, diffusion, reactions, and deposition of air pollutants over the geographical area of interest, from urban scale to national scale. These models are extremely complex to set up and require enormous computing resources. They are capable of predicting air quality in areas where no monitoring data exist, but accuracy is limited by the scale at which they are applied – small scale meteorological and emissions variability may not be represented in the models. Emissions data are notoriously uncertain. Performance of these models for ozone has been reasonably good, but to date their ability to model PM2.5 has been poor, due in part to the reasons above but also to the complexities of PM2.5 atmospheric chemistry (Baker, 2004). Data-driven statistical models are developed from collected input/output data. Data-driven statistical models can process a wide variety of data types and exploit the nuances in the data that cannot be discovered by rule-based or fuzzy logic-based systems. Therefore, they are potentially superior to the rule-based systems. Data-driven statistical models include Classification and Regression Tree analysis (CART), regression models, clustering techniques, and neural networks. CART is based on binary recursive partitioning. Each predictor variable is examined (whether it is a continuous or discrete variable) and the data set is split into two groups based on the value of that predictor that maximizes the dissimilarity between groups. The tree is ‘grown’ by exhaustively searching the predictor variables at each branch for the best split. Typical predictors include meteorological conditions (especially temperature, wind speed, wind direction) and also air quality conditions. Seasonal or activity data can be incorporated as well. 
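The binary recursive partitioning step described above for CART can be sketched as follows. This is a minimal illustration for a single continuous predictor, assuming that "dissimilarity between groups" is measured by the reduction in within-group squared error (one common choice for regression trees; the text does not specify the criterion). The function names and the wind-speed data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Find the threshold on predictor x that best separates response y.

    Every midpoint between consecutive distinct x values is tried; the one
    with the smallest total within-group squared error is returned, which
    is equivalent to maximising the between-group dissimilarity.
    """
    pairs = sorted(zip(x, y))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # equal predictor values cannot be separated
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold = (pairs[i][0] + pairs[i - 1][0]) / 2.0
            best_cost = cost
    return best_threshold, best_cost


# Hypothetical data: wind speed (m/s) against PM2.5 concentration.
wind = [1.0, 1.5, 2.0, 6.0, 7.0, 8.0]
pm25 = [40.0, 42.0, 41.0, 12.0, 10.0, 11.0]
threshold, cost = best_split(wind, pm25)
```

A full tree would apply `best_split` to every predictor at each node, keep the best predictor and threshold, and recurse on the two resulting groups until a stopping rule is met.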
For PM, these models generally account for about 60% of the variability in the data and, for ozone, about 80%.

Regression equations have a long history of use as forecasting tools in multiple disciplines. As with CART, multiple predictors are typically incorporated into a regression model that seeks to predict pollutant concentrations. Regression models are most useful and accurate for predicting mean concentrations and less dependable for the extreme values that are generally of most interest when forecasting concentrations for the purpose of warning the public about health risks. Regression models have the advantage of simple computation and easy implementation. However, regression models are based on the assumption of normally distributed data, whereas air quality and meteorological data are generally log-normally distributed. Transformations of the data can improve model performance. Many of the relationships between PM and meteorological variables are curvilinear, which requires additional transformations of the predictor variables. Because they assume linear relationships, regression models may not provide accurate predictions in some complex situations. Researchers have applied regression models in different areas, including the downtown area of Santiago, Chile; Ontario, Canada; Taiwan, China; Delhi, India; Maryland, USA; and about 100 Canadian sites (Burrows et al., 1997, Chaloulakou et al., 2003, Chelani et al., 2001, Fraser and Yap, 1997, Lu and Fang, 2003, Ojha et al., 2002, Rizzo et al., 2002 and Walsh and Sherwell, 2002).

The main purpose of clustering techniques is to identify distinct classes among the data. Clustering can be used for spatial classification of ambient air quality data, in the absence of the huge data sets needed for more sophisticated space–time modeling (Surneet, Veena, & Patil, 2002). However, the analysis is based on grossly-average-level data, not intensive daily data.
The clustering algorithm developed by Sanchez, Pascual, Ramos, and Perez (1990) has been applied to PM concentrations recorded at each sampler point, and different pollution levels have been obtained at each of them. This algorithm has revealed a satisfactory relationship between PM concentrations and the identified meteorological types. However, clustering techniques such as k-means are very sensitive to the presence of noise and cannot classify outliers. Good-quality clustering algorithms are usually computationally expensive. For example, the exact solution of the k-medoids (p-median) clustering problem is NP-hard (Estivill-Castro & Houle, 2001).

Artificial neural networks (ANN) are computer programs that attempt to simulate human learning and pattern recognition, and should be well suited to extracting information from imprecise and non-linear data such as air quality and meteorology. They are useful tools for prediction, function approximation and classification. Despite their theoretical superiority, they have produced only a slight improvement over the linear statistical model in forecast accuracy. Extreme values are represented well if they are present in the data set that the network was trained on, but the network cannot accurately extrapolate values outside the training set. For example, Chaloulakou et al. (2003) examine the possibility of using neural network methods as tools for daily average PM10 concentration forecasting. Their results show that, compared with linear regression, root mean square error values are reduced by 8.2–9.4% and false alarm rate values by 7–13%. Other advantages of neural networks include superior learning, noise suppression, non-linear function and parallel computation abilities. One of the major problems with neural networks is that they are not designed with an explanatory capability, the so-called black box approach.
In addition, successful implementation of a neural network-based system strongly depends on proper selection of the type of network structure and amount of training data, which are not always available. Recently, applications of neural networks have become more popular (Chaloulakou et al., 2003, Chelani et al., 2002, Gardner and Dorling, 1998, Kukkonen et al., 2003, McKendry, 2002 and Perez et al., 2000).

The objective of Thomas and Jacko’s work is to develop a reliable model for forecasting hourly PM2.5 and CO concentrations at a microscale (adjacent to the expressway). A linear regression model and a neural network model are developed to forecast hourly PM2.5 and CO concentrations using the year 2002 traffic, pollutant, and meteorological data. Both models had reasonable accuracy in predicting hourly PM2.5 concentration. A major problem for these models is that they are developed specifically for the Borman Expressway; some modifications would be needed before they could be used for other expressways and roadways (Thomas & Jacko, 2007).

The model-driven statistical learning methods assume that both operational data and a mathematical model are available. State space models and Bayesian networks belong to this category. In contrast to black box approaches such as neural networks, a mathematical functional mapping between the drifting parameters and the selected input variables can be established. Moreover, the model-driven statistical learning methods can be adapted to increase accuracy and to address subtle performance problems. Consequently, model-driven methods can significantly outperform data-driven approaches. Cossentino, Raimondi, and Vitale (2001) employed Bayesian networks to model the temporal series of the particulate matter during the day and the influence that meteorological parameters have upon them.
Typical inputs of the networks have been the pollutant concentration at a certain hour and the meteorological parameters at the further hours of the day. The output provided by the networks is the estimate of the probability of reaching a certain pollutant level in the various hours of the day. They concluded that the results are satisfactory and this approach can be profitably used to foresee critical episodes. As indicated by authors, the quality of the results depends on the number of the evidences that are supplied to the network. Chelani et al. (2001) presented a state space model coupled with Kalman filter to forecast metal concentrations observed at Delhi, India. Wind speed is used as an external input. Compared to an autoregressive model, the state space model gives better predictions. The state space model also provides a way of incorporating model and measurement uncertainty into the forecasts (Harnandez, Martin, & Valero, 1992). However, the prediction may not be accurate for peak forecasting. The HMM (hidden Markov-model) approach has become increasingly popular and quite effective in some applications such as speech processing and handwritten word recognition. There are two major reasons for this. First, the models have a rich mathematical structure and can form the solid theoretical foundation for a wide variety of applications. Second, the models have many successful applications in practice (Rabiner, 1989). An added benefit of employing HMMs is the ease of model interpretation in comparison with pure “black box” modeling methods such as artificial neural networks (Baruah & Chinnam, 2003). Through the detection of the adulterated words from a blacklist of words frequently used by spammers, Gordillo and Conde (2007) applied HMMs to classify spam mails. In order to forecast financial market behaviour, a fusion model by combining the HMM, ANN and Genetic Algorithms (GA) was proposed. 
Using ANN, the daily stock prices are transformed to independent sets of values that become input to HMM. The initial parameters of HMM are optimized by GA (Rafiul Hassan, Nath, & Kirley, 2007). However, there is an inherent limitation associated with the HMMs. This limitation is that the state duration of HMM follows an exponential distribution. In other words, HMM does not provide adequate representation of temporal structure for prediction problems. For example, a HMM does not provide adequate representation of the temporal structure of speech and segmental structure of the handwritten word. To overcome the limitations of HMMs in prediction, a novel prediction methodology is developed using a model-driven statistical learning method, called the hidden semi-Markov model (HSMM). A HSMM is constructed by adding a temporal component into the well-defined HMM structures (Guédon, 1999, Guédon, 2003, Guédon, 2005, Rabiner, 1989, Schmidler, 2000, Yu and Kobayashi, 2003a, Yu and Kobayashi, 2003b, Yu and Kobayashi, 2006, Aydin et al., 2006, Dong and He, 2007a and Dong and He, 2007b). Instead of holding time distributions attached to states, Guédon (1999) proposed a hidden semi-Markov chain in which the time distributions are attached to transitions. Then, Guédon (2003) further extended previously proposed HSMMs in which the end of a sequence systematically coincides with the exit from a state, that is, the sequence length is not independent of the process. This article defines hidden semi-Markov chains with absorbing states and thus defines the likelihood of a state sequence generated by an underlying semi-Markov chain with a right censoring of the time spent in the last visited state. A new forward–backward algorithm is proposed with complexities that are quadratic in the worst case in time and linear in space, in terms of sequence length. This opens the way to the application of the full machinery of hidden semi-Markov chains to long sequences such as DNA sequences. 
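The duration limitation of HMMs noted above can be made explicit. The following is a sketch in standard HMM notation; the symbols $a_{ii}$, $p_i(d)$ and $d$ are introduced here for illustration and do not appear in the text. In a discrete-time HMM, if $a_{ii}$ is the self-transition probability of state $i$, the probability of remaining in that state for exactly $d$ consecutive steps is geometric, the discrete-time counterpart of the exponential distribution:

```latex
p_i(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad d = 1, 2, \dots,
\qquad \mathbb{E}[d] = \frac{1}{1 - a_{ii}}.
```

Because $p_i(d)$ decreases monotonically in $d$, the most probable duration is always a single step, whatever the value of $a_{ii}$. HSMMs replace this implicit form with an explicit duration distribution for each state.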
In order to retain the flexibility of hidden semi-Markov chains for the modeling of short or medium size homogeneous zones along sequences and enable the modeling of long zones with Markovian states at the same time, Guédon (2005) investigated hybrid models that combine Markovian states with implicit geometric state occupancy distributions and semi-Markovian states with explicit state occupancy distributions. The Markovian nature of states is no longer restricted to absorbing states since non-absorbing Markovian states can now be defined. In the context of the application to gene finding, the incorporation of non-absorbing Markovian states is critical since the distributions of the lengths of the longest homogeneous zones are approximately geometric.

The underlying assumption in the existing HMM and HSMM models is that there is at least one observation produced per state visit and that observations are exactly the outputs (or “emissions”) of states. In some applications, these assumptions are too restrictive. Yu and Kobayashi (2003a) extended the ordinary HMM and HSMM to models with missing data and multiple observation sequences. They proposed a new and computationally efficient forward–backward algorithm for HSMM with missing observations and multiple observation sequences. The computation required for the forward and backward variables is reduced to O(D), where D is the maximum allowed duration in a state. Existing algorithms for estimating the model parameters of a HSMM usually require computations as large as O((MD² + M²)T), where M is the number of states and T is the period of observations used to estimate the model parameters. Because of such computational requirements, these algorithms are not practical for constructing a HSMM with a large state space, a large explicit state duration and a large amount of measurement data.
Yu and Kobayashi (2003b) proposed a new forward–backward algorithm whose computational complexity is only O((MD + M²)T), a reduction by almost a factor of D when D > M. Since the joint probabilities associated with an observation sequence often decay exponentially as the sequence length increases, an implementation of the forward–backward algorithms on a real computer would suffer from a severe underflow problem. Yu and Kobayashi (2006) redefined the forward–backward variables in terms of posterior probabilities to avoid possible underflows. The existing algorithms for HSMM are not practical for implementation in hardware because of their computational or logic complexity; thus, a forward recursion is used that is symmetric to the backward one and can reduce the number of logic gates required for implementation on a field-programmable gate-array (FPGA) chip.

HSMM has been widely used in protein secondary structure prediction. Schmidler (2000) considered the intra-segment residue independence and geometric length distributions implied by HMMs to be inappropriate for modeling protein secondary structure and presented a Bayesian inference-based method for predicting the secondary structure of a protein from its amino acid sequence. Their model is structurally similar to the class of semi-Markov source models described in Rabiner (1989). Aydin et al. (2006) introduced an improved residue dependency model by considering the patterns of statistically significant amino acid correlation at structural segment borders. The results show that new dependency models and training methods bring further improvements to single-sequence protein secondary structure prediction.

In this paper, we developed the hidden semi-Markov models by adding the temporal structures into the HMMs and used them for the prediction of high PM2.5 concentration values.
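The two complexity bounds quoted above for HSMM parameter estimation can be compared numerically. The sketch below simply evaluates the expressions inside the O(·) terms, ignoring constant factors; the values chosen for M, D and T are hypothetical and are not taken from the paper.

```python
def cost_existing(M, D, T):
    """Operation count proportional to (M*D^2 + M^2)*T."""
    return (M * D ** 2 + M ** 2) * T


def cost_improved(M, D, T):
    """Operation count proportional to (M*D + M^2)*T."""
    return (M * D + M ** 2) * T


# Hypothetical sizes: 5 states, durations up to 48 steps, 8760 observations.
M, D, T = 5, 48, 8760
ratio = cost_existing(M, D, T) / cost_improved(M, D, T)
print(f"speed-up factor: {ratio:.1f} (D = {D})")
```

With these values the ratio is roughly 43.6, close to but below D = 48, which is consistent with the statement that the saving approaches a factor of D when D > M.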
The term ‘prediction’ as used in this paper means establishing the relationship between observed independent variables (predictors, such as meteorological variables) and an observed dependent variable (in this case concentration). When the predictors are forecast by some method, we can “forecast” or “predict” the concentrations.
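The underflow problem mentioned earlier in connection with the forward–backward algorithms can be illustrated on an ordinary HMM. The sketch below is not the posterior-probability formulation of Yu and Kobayashi (2006); it uses the standard per-step scaling of the forward variable as a simpler stand-in, and the two-state model and its parameter values are hypothetical.

```python
import math


def forward_naive(pi, A, B, obs):
    """Unscaled forward pass; returns P(obs), which underflows for long obs."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


def forward_scaled(pi, A, B, obs):
    """Scaled forward pass; returns log P(obs).

    After every step alpha is normalised to sum to one and the logarithm
    of the normalising constant is accumulated.
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        c = sum(alpha)
        log_lik += math.log(c)
        alpha = [a / c for a in alpha]
    return log_lik


# Hypothetical two-state model with two observation symbols.
pi = [0.6, 0.4]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.3], [0.1, 0.9]]
long_obs = [0, 1] * 2000

naive = forward_naive(pi, A, B, long_obs)        # collapses to 0.0
log_lik = forward_scaled(pi, A, B, long_obs)     # remains finite
```

On a short sequence the two functions agree (the logarithm of the naive result equals the scaled result), so the scaling changes only the numerical behaviour, not the quantity being computed.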
English conclusion
This investigation presents a HSMM-based framework and methodology for the prediction of PM2.5 concentration levels. A HMM is a probabilistic function of a Markov chain and is strictly controlled by the Markov property, which states that the current state of the system depends only on the previous one. Therefore, a HMM normally has only a short-term memory of the past history, and this limits its power in predicting future events. The proposed HSMM-based framework and methodology overcomes this problem by adding a temporal component to the HMM structures. In the HSMM case, the conditional independence between the past and the future is only ensured when the process moves from one state to another distinct state (this property holds at each time step in the classical Markovian case). Since a HSMM is equipped with a temporal structure, it can be used to predict pollutant concentrations in air monitoring applications. The evaluation of our proposed approach is carried out in a real-world application: the prediction of PM2.5 concentrations in Chicago. The results show that the classification accuracy for PM2.5 concentrations is very promising and that HSMMs are able to provide accurate predictions of extreme concentration levels 24 h in advance.
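The temporal component described above can be illustrated by generating a state sequence from an explicit-duration model. This is a minimal sketch and not the authors' implementation: the function name, the two concentration levels (0 for low, 1 for high), and all probabilities are hypothetical, and observations are omitted so that only the state process is shown.

```python
import random


def sample_hsmm(trans, durations, n_steps, start_state, rng):
    """Sample a state sequence of length n_steps from an explicit-duration HSMM.

    trans[i][j]   probability of moving from state i to a different state j
                  (trans[i][i] is 0: leaving a state always changes the state)
    durations[i]  dict mapping a duration in steps to its probability
    """
    states = []
    state = start_state
    while len(states) < n_steps:
        # Draw how long the process stays in the current state.
        d = rng.choices(list(durations[state].keys()),
                        weights=list(durations[state].values()))[0]
        states.extend([state] * d)
        # Only at the end of the stay does a Markov transition take place.
        state = rng.choices(range(len(trans)), weights=trans[state])[0]
    return states[:n_steps]  # the last stay may be cut short


# Hypothetical two-level model: 0 = low concentration, 1 = high concentration.
trans = [[0.0, 1.0], [1.0, 0.0]]
durations = [{1: 0.2, 2: 0.3, 6: 0.5}, {2: 0.5, 3: 0.5}]
path = sample_hsmm(trans, durations, 24, 0, random.Random(0))
```

The past and the future of `path` are conditionally independent only at the points where the state changes, which matches the property stated in the conclusion; within a stay, the remaining time in the state depends on how long the process has already been there.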