PolyA-iEP: A data mining method for the effective prediction of polyadenylation sites
Article code | Publication year | Length of English article |
---|---|---|
22238 | 2011 | 11 pages (PDF) |
Publisher : Elsevier - Science Direct
Journal : Expert Systems with Applications, Volume 38, Issue 10, 15 September 2011, Pages 12398–12408
English abstract
This paper presents a study on polyadenylation site prediction, an important problem in bioinformatics and medicine whose solution promises to answer many open questions, especially in cancer research. We describe PolyA-iEP, a method that we developed for predicting polyadenylation sites, and we use it in a systematic study of the problem of recognizing mRNA 3′ ends that contain a polyadenylation site. PolyA-iEP is a modular system consisting of two main components, both of which contribute substantially to its descriptive and predictive potential. Specifically, PolyA-iEP exploits the advantages of emerging patterns, namely high understandability and discriminating power, together with the strength of a distance-based scoring method that we propose. The extracted emerging patterns may span many elements around the polyadenylation site and can provide novel and interesting biological insights. The outputs of these two components are finally combined by a classifier in a highly effective framework, which in our setup reaches a sensitivity of 93.7% and a specificity of 88.2%. PolyA-iEP can be parameterized and used for both descriptive and predictive analysis. We evaluated our method on Arabidopsis thaliana sequences and drew important conclusions.
English introduction
Over the last decades two scientific areas, biology and computer science, have seen major advances that have attracted worldwide interest. The growth of the World Wide Web and the completion of the Human Genome Project are two representative examples of how far these areas have developed. However, biology and computer science have not grown separately. The need for collaboration between biologists and computer scientists has grown year by year as the two areas have progressed and new scientific questions have arisen. Bioinformatics is a research area that has emerged in response to this need. It is a very promising field that aims to provide the means to analyze and explain the vast amounts of biological data, thereby contributing to the development of related areas such as medicine. Two related subfields of computer science with strong ties to artificial intelligence, namely data mining and machine learning, have given biologists, as well as experts from other areas, a powerful set of tools for analyzing new data types and extracting various kinds of knowledge efficiently and effectively. These tools combine techniques from artificial intelligence, statistics, mathematics, and database technology. This fusion of technologies aims to overcome the obstacles and constraints posed by traditional statistical methods. Many interesting applications of artificial intelligence in bioinformatics are presented in Ezziane (2006).
In this paper we deal with polyadenylation site (or poly(A) site) prediction. Poly(A) site prediction is a challenging problem that has attracted the attention of the scientific community in recent years, because a successful solution promises to provide many answers in various fields of medicine, such as cancer research.
In many organisms, such as Arabidopsis thaliana, a plant model organism, there are few highly conserved signals or patterns around the poly(A) site, and consequently recognizing the poly(A) site is not trivial. Discriminating mRNA 3′ ends that contain a poly(A) site from intronic or 5′ UTR sequences without a poly(A) site appears to be very difficult (mainly for intronic sequences), and the performance of the approaches proposed so far is moderate. On the other hand, mRNA 3′ ends can be easily discriminated from coding sequences. This variability in the difficulty of discrimination motivated our work and led us to study the problem and define an approach that can improve prediction accuracy. Current research in this field focuses on discovering new patterns around the poly(A) site and on predicting the poly(A) site accurately. The method we propose can be used for both pattern discovery and accurate prediction. The prediction of poly(A) sites can be divided into two sub-problems. The first deals with discriminating the sequences that contain a poly(A) site from those that do not, and the second deals with predicting the position of a poly(A) site inside a sequence. The advantage of this decomposition is twofold. Firstly, a large number of irrelevant sequences are filtered out before the search for the position of a poly(A) site begins, which notably increases prediction accuracy. Secondly, the position predictor can be more specific: because it is applied only to sequences that contain a poly(A) site, it can lead to better models. This decomposition can therefore outperform a more general method that handles the discrimination of sequences and the prediction of poly(A) sites inside a sequence concurrently. The first sub-problem of the approach described above has not been studied yet.
In this paper we focus on this sub-problem. Our contribution is an approach that combines the concept of emerging patterns (Dong & Li, 1999), and more specifically the interesting ones, with a novel distance-based scoring method. Our approach maintains the high interpretability of emerging patterns and offers high prediction performance. The extracted emerging patterns may span many elements around the polyadenylation site and can provide novel and interesting biological insights. Our method significantly increases the performance of poly(A) site prediction and reaches a sensitivity of 93.7% and a specificity of 88.2%. Moreover, the method we propose can be parameterized and re-trained in order to deal with poly(A) site prediction in any organism. Beyond the proposed method, we draw important conclusions on the problem of discriminating mRNA 3′ ends with poly(A) sites from other sequences without a poly(A) site. This paper is organized as follows. Section 2 provides the necessary background knowledge. Section 3 presents a concise review of the research related to the problem addressed in this study. Section 4 provides some preliminary technical terminology and Section 5 is dedicated to the detailed description of our approach. The results of the experiments conducted to evaluate our method are presented in Section 6 and, finally, the paper is concluded in Section 7.
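The emerging-pattern concept cited above (Dong & Li, 1999) can be illustrated with a small sketch: a pattern is "emerging" when its support grows sharply from one dataset to another, measured by the growth rate (support in the target set divided by support in the background set). The code below is only an illustration under loud assumptions: treating each sequence as the set of its 3-mers is a simplification chosen here, the function names are hypothetical, and the toy sequences are invented. It does not reproduce the item encoding, the "interesting" pattern filter, or the distance-based scoring of PolyA-iEP.

```python
from typing import FrozenSet, List


def kmers(sequence: str, k: int = 3) -> FrozenSet[str]:
    """Return the set of distinct k-mers in a sequence (assumed item encoding)."""
    return frozenset(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def support(pattern: FrozenSet[str], dataset: List[FrozenSet[str]]) -> float:
    """Fraction of sequences whose k-mer set contains every item of the pattern."""
    if not dataset:
        return 0.0
    return sum(pattern <= items for items in dataset) / len(dataset)


def growth_rate(pattern: FrozenSet[str],
                background: List[FrozenSet[str]],
                target: List[FrozenSet[str]]) -> float:
    """Growth rate of a pattern from the background set to the target set."""
    supp_bg = support(pattern, background)
    supp_tg = support(pattern, target)
    if supp_bg == 0.0:
        # Absent in the background: infinite growth if present in the target.
        return float("inf") if supp_tg > 0.0 else 0.0
    return supp_tg / supp_bg


# Invented toy sequences, for illustration only (not data from the paper).
positives = [kmers(s) for s in ["AATAAAGT", "CAATAAAC", "TTAATAAA"]]
negatives = [kmers(s) for s in ["GGCCGGCC", "AATAGGCC", "GTAATGCC", "CCGGAATA"]]

pattern = frozenset({"AAT", "TAA"})
rate = growth_rate(pattern, background=negatives, target=positives)
print(rate)  # support 1.0 in positives vs 0.25 in negatives
```

A pattern would be kept as an emerging pattern when its growth rate meets a user-chosen threshold; the threshold itself is a parameter of the mining step and is not specified in this excerpt.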
English conclusion
Polyadenylation site prediction is a challenging problem that attracts the interest of many researchers in medicine, biology, and bioinformatics. Current research in this field focuses on discovering new patterns and on predicting the poly(A) site accurately. The approach we have proposed addresses both of these dimensions of the problem. The difficulties in poly(A) site prediction stem mainly from the absence of highly conserved signals around the poly(A) site. In August 2009 Mayr and Bartel published a study of normal and cancerous cells. Their results showed a strong correlation between 3′ UTR length and the expression of oncogenes. The important point is that the 3′ UTR length is determined by the position of the poly(A) site along the sequence. Polyadenylation is therefore a key element in understanding biological processes and diseases such as cancer, and as a result it is likely to become one of the most interesting topics in bioinformatics. In this work we studied the problem of poly(A) site prediction and proposed a method (PolyA-iEP) that can be used for both descriptive and predictive analysis. PolyA-iEP exploits emerging patterns as well as a distance-based scoring method and provides a significant increase in effectiveness, which in our setup reaches a sensitivity of 93.7% and a specificity of 88.2%. An important benefit of our approach is that it is general, and thus can be re-trained and parameterized for use with other sequences, possibly from different organisms. In the future we intend to study the use of more sophisticated classification methods, such as classifier ensembles, in order to further increase the effectiveness of our approach. Our future plans also include experimentation with mRNA sequences of other organisms. The datasets we used and the tool we developed are available at http://mlkd.csd.auth.gr/PolyA/index.html.
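For readers unfamiliar with the two metrics quoted above, sensitivity is the fraction of true poly(A) sequences that are correctly recognized, and specificity is the fraction of non-poly(A) sequences that are correctly rejected. The following minimal sketch computes both from binary labels using these standard definitions. The function name and the toy labels are hypothetical, and the numbers are unrelated to the paper's results.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels, where 1 = contains a poly(A) site."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels for illustration only (not results from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

Reporting the two values together matters for this task because a classifier could reach high sensitivity simply by labelling every sequence as containing a poly(A) site, at the cost of very low specificity.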