Data mining to aid policy making in air pollution management
Article code | Publication year | Length of English article |
---|---|---|
22052 | 2004 | 9 pages (PDF) |
Publisher: Elsevier - ScienceDirect
Journal: Expert Systems with Applications, Volume 27, Issue 3, October 2004, Pages 331–340
Abstract
In the past two decades, heavy environmental loading has led to the deterioration of air quality in Taiwan. The task of controlling and improving air quality has attracted a great deal of national attention. The Taiwanese government has since set up the National Air Quality Monitoring Network (TAQMN) to monitor nationwide air quality and has adopted an array of measures to combat this problem. This study applies data mining to uncover the hidden knowledge of air pollution distribution in the voluminous data retrieved from monitoring stations in TAQMN. The mining process consists of data acquisition from the Web sites of 71 data gathering stations nationwide, data pre-processing using multi-scale wavelet transforms, data pattern identification using cluster analysis, and a final analysis that maps the identified clusters to geographical locations. The application of multi-scale wavelet transforms contributes greatly to removing noise and identifying trends in the data. In addition, the proposed two-level self-organization map neural network demonstrates its ability to identify clusters in the high-dimensional wavelet-transformed space. The identified distribution of suspended particulate PM10 represents a complete, national picture of the present air quality situation, which contrasts with the present pollution districts, and could serve as an important reference for government agencies in evaluating present air pollution policies and devising future ones.
Introduction
Data mining, also known as knowledge discovery in databases (KDD) (Fayyad, Piatetsky-Shapiro, & Smyth, 1996), is the process of discovering useful knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. It is a hybrid discipline (Zhou, 2003) that integrates technologies from databases, statistics, machine learning, signal processing, and high-performance computing. This rapidly emerging technology is motivated by the need for new techniques to help analyze, understand, or even visualize the huge amounts of stored data gathered from business and scientific applications. The major data mining functions developed in the commercial and research communities include summarization, association, classification, prediction, and clustering (Zhou, 2003). Data mining has been shown to be capable of providing a significant competitive advantage to an organization by exploiting the potential knowledge in large databases (Bose & Mahapatra, 2001). Recently, a number of data mining applications and prototypes have been developed for a variety of domains (Liao, 2003; Mitra et al., 2002), including marketing, banking, finance, manufacturing, and health care. In addition, data mining has been applied to other types of scientific data (Abidi, 2001; Read, 2000), such as bioinformatics, astronomical, and medical data. In general, the techniques and functions to be applied in a data mining process depend very much on the application domain and the nature of the data available. This creative process generally involves the phases of data understanding, data preparation, modeling, and evaluation (Fayyad et al., 1996). Data understanding starts with an initial data collection and proceeds with activities to become familiar with the data, to identify data quality problems, and to discover first insights into the data. Data preparation covers all activities that construct the final dataset to be modeled from the initial raw data.
The tasks of this phase may include data cleaning to remove noise and inconsistent data, and data transformation to extract the embedded features. The modeling phase applies various modeling techniques, determines the optimal values for the model parameters, and finds the model most suitable for meeting the objectives. The evaluation phase evaluates the model found in the previous phase to confirm that it fits the problem requirements. No matter which area data mining is applied to, most of the effort is directed toward the data preparation phase (Pyle, 1999). In this study of mining air pollution data, our data preparation phase particularly emphasizes the data scale issue. The purpose of this study is to apply data mining technology to identify the national air quality distribution of Taiwan, whose hourly air quality data are continuously collected and archived through a network of 71 EPA stations. In dealing with voluminous data, we combine wavelet transform (WT) and self-organization map (SOM) neural networks as our data mining technology. The former is credited with the ability to investigate temporal variation at different scales, and the latter is known to be effective in isolating clusters in high-dimensional space. With both technologies, one can benefit from a better understanding and interpretation of the pollution data. The rest of this paper is organized as follows. Section 2 provides a brief review of air pollution management in Taiwan. Section 3 presents the issues of mining air quality data from the EPA Web site and the underlying technologies for dealing with those issues. Section 4 elaborates on the mining procedure, which consists of data acquisition, missing-value handling, data transformation, modeling, and performance evaluation. Section 5 discusses the mining results and their comparison with the official distribution districts. Section 6 concludes this paper.
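To make the idea of examining temporal variation at different scales concrete, the following is a minimal sketch of a multi-level decomposition. It uses a discrete, unnormalized Haar transform as a simple stand-in, not the wavelet transform applied in the study. The function names and the PM10 readings are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def haar_step(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One Haar level: pairwise means (approximation) and half-differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_multiscale(signal: List[float], levels: int) -> Tuple[List[float], List[List[float]]]:
    """Decompose a series whose length is divisible by 2**levels.

    Returns the coarsest approximation (the trend) and the detail
    coefficients of every level, finest scale first.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def trend_only(signal: List[float], levels: int) -> List[float]:
    """Rebuild the series from the coarse approximation alone (a crude denoising)."""
    approx, _ = haar_multiscale(signal, levels)
    block = 2 ** levels
    # Each coarse value stands for `block` consecutive original samples.
    return [value for value in approx for _ in range(block)]


# Hypothetical hourly PM10 readings (illustrative values, not from the paper).
pm10 = [40.0, 44.0, 38.0, 42.0, 70.0, 74.0, 68.0, 72.0]
trend, details = haar_multiscale(pm10, levels=2)
smoothed = trend_only(pm10, levels=2)
```

Here `details[0]` holds the finest-scale fluctuations and `trend` the slowly varying component. Stacking coefficients from several scales into one vector per station is one way, under these assumptions, to obtain the kind of high-dimensional input that a clustering step would then work on.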
Conclusion
Heavy environmental loading has led to the deterioration of air quality in Taiwan over the past two decades. The task of controlling and improving air quality has attracted a great deal of national attention. The government has since adopted an array of measures to combat this problem. This study applies data mining to identify the national PM10 pollutant distribution, using data retrieved from 71 monitoring stations across the nation. The mining results are contrasted with the present pollution districts and could serve as an important reference for policy makers in formulating future policies. In carrying out this study, we first retrieved 1 year of relevant PM10 data from the archived information of the 71 stations, and then filled in all missing data. Because the data are spatio-temporal in nature, we paid particular attention to the scale issue of time series data and decided to apply the continuous wavelet transform, so that the mining results are applicable for both short-term and long-term reference purposes. An SOM neural network was applied to identify clusters in the resulting high-dimensional space. The results confirm that regions determined with the wavelet transform approach reduce the small local regions produced by small-scale input data and improve on the over-smoothed regions produced by a single large-scale input. Most important of all, the results clearly indicate the distribution of the national PM10 pollutant through 7 clusters and their individual severity. Mapping the findings onto the present air quality districts, it is striking that a single district contains from 2 to 4 clusters and can span from 2 to 7 cluster levels. Based on these findings, we feel strongly that the effectiveness of the present pollution control policy, which is based entirely on administrative convenience, may be further improved by taking into consideration the grouping of pollutants, which may be best described in zones.
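The clustering step can be illustrated with a minimal sketch of a one-dimensional SOM. This is a simplified single-level map, not the two-level network used in the study. The initialization along the data range, the decay schedules, and the station feature vectors are assumptions made for the example.

```python
import math
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_unit(weights: List[List[float]], vector: Sequence[float]) -> int:
    """Index of the map unit whose weight vector is closest to `vector`."""
    return min(range(len(weights)), key=lambda k: sq_dist(weights[k], vector))


def train_som(data: List[List[float]], n_units: int, epochs: int = 50,
              lr0: float = 0.5, radius0: float = 1.0) -> List[List[float]]:
    """Train a chain of `n_units` (>= 2) SOM units on `data`."""
    dims = len(data[0])
    lo = [min(v[d] for v in data) for d in range(dims)]
    hi = [max(v[d] for v in data) for d in range(dims)]
    # Assumed initialization: spread the units evenly between the data extremes.
    weights = [[lo[d] + (hi[d] - lo[d]) * k / (n_units - 1) for d in range(dims)]
               for k in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac                      # learning rate decays linearly
        radius = max(radius0 * frac, 1e-3)   # neighborhood shrinks over time
        for vector in data:
            winner = best_unit(weights, vector)
            for k in range(n_units):
                # Gaussian neighborhood on the chain: units near the winner move more.
                h = math.exp(-((k - winner) ** 2) / (2.0 * radius ** 2))
                for d in range(dims):
                    weights[k][d] += lr * h * (vector[d] - weights[k][d])
    return weights


# Hypothetical per-station feature vectors (illustrative, not from the paper).
stations = [[10.0, 1.0], [12.0, 2.0], [50.0, 8.0], [52.0, 9.0]]
units = train_som(stations, n_units=2)
# Hard clustering: every station is assigned to exactly one unit.
labels = [best_unit(units, s) for s in stations]
```

In this toy setup the two low-valued stations end up on one unit and the two high-valued stations on the other. Each station receives a single label, with no indication of how close it lies to a neighboring cluster.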
One limitation of the current study is that the SOM network provides only hard clustering, meaning that each data item is assigned to one and only one cluster. However, most environmental science data show spatial transition characteristics over a time period, which means there could be a transition zone between adjacent regions. This suggests a potential need for clustering models that can provide a data item's 'membership degree' in a cluster. Fuzzy logic-based approaches such as fuzzy SOM and fuzzy c-means, as well as the rotated principal component analysis (RPCA) approach, are potential candidates for future work on this aspect.
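As an illustration of what a membership degree looks like, the sketch below evaluates the standard fuzzy c-means membership formula for fixed cluster centers. It covers only the membership step, not the full iterative algorithm. The centers, the fuzzifier `m`, and the sample points are hypothetical.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fcm_memberships(point: Sequence[float], centers: List[Sequence[float]],
                    m: float = 2.0) -> List[float]:
    """Fuzzy c-means membership of `point` in each cluster, for fixed centers.

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), where d_i is the distance
    to center i and m > 1 is the fuzzifier. The memberships sum to 1.
    """
    if m <= 1.0:
        raise ValueError("fuzzifier m must be greater than 1")
    dists = [euclidean(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


# Hypothetical cluster centers in a one-dimensional feature space.
centers = [[0.0], [10.0]]
near_first = fcm_memberships([2.0], centers)   # mostly in the first cluster
transition = fcm_memberships([5.0], centers)   # halfway between the centers
on_center = fcm_memberships([10.0], centers)   # exactly on the second center

# A hard label can still be recovered as the cluster with the largest membership.
hard_label = max(range(len(near_first)), key=lambda i: near_first[i])
```

A point midway between two centers receives equal membership in both, which is the kind of transition-zone information that a hard assignment discards.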