Download English ISI article No. 22251
Title

Data mining for grammatical inference with bioinformatics criteria
Article code: 22251
Publication year: 2012
Length of English article: 5 pages (PDF)
Source

Publisher : Elsevier - Science Direct

Journal : Expert Systems with Applications, Volume 39, Issue 3, 15 February 2012, Pages 2330–2334

Keywords
Grammatical inference, Bioinformatics, Context-free grammar, Sequential patterns
Article preview

Abstract

This work describes a novel data mining process that combines hybrid techniques of association analysis with classical sequencing algorithms from genomics to generate the grammatical structures of a specific language. These structures are subsequently converted to Context-Free Grammars. Initially the method applies to context-free languages, with the possibility of being applied to other languages: structured programming, the language of the book of life expressed in the genome and proteome, and even natural languages. We used a compiler-generator system that allows the development of a practical application within the area of grammarware, where the concepts of language analysis are applied to other disciplines, such as bioinformatics. The tool allows the complexity of the obtained grammar to be measured automatically from textual data.

Introduction

In recent years, many data mining methods have been introduced for pattern recognition in biological databases. Bioinformatics employs computational and data-processing technologies to develop methods, strategies and programs for handling, ordering and studying the immense quantity of biological data that has been, and continues to be, generated. To this end, computational linguistics has received considerable attention in bioinformatics. Searls et al. (1999) indicated that a relation exists between formal language theory and DNA; the linguistic view of DNA sequences is a rich source of ideas for modeling strings with correlated symbols. Most of the work (Jiménez-Montaño, 2009; Jiménez-Montaño, Feistel, & Diez-Martínez, 2010) has involved examining the occurrences of “words” in DNA. Searls and Dong (1993) found that such a linguistic approach proves useful not only in the theoretical characterization of certain structural phenomena in sequences, but also in generalized pattern recognition in this domain, via parsing. Recognizing patterns in the information encoded in sequences involves grammatical inference. This work describes a novel data mining process that combines hybrid techniques of association analysis with classical sequencing algorithms from genomics to generate the grammatical structures of a specific language. These structures are subsequently converted to Context-Free Grammars (CFG). Initially the method applies to context-free languages, with the possibility of being applied to other languages: structured programming, the language of the book of life expressed in the genome and proteome, and even natural languages.
We used the compiler generator system GAS 1.0 (López, Sánchez, Alonso, & Moreno, 2009), an Integrated Development Environment (IDE) that allows the development of practical applications in the area of automatic generation of language-based tools. It starts from the traditional solutions and facilitates the use of formal language theory in other disciplines: Grammar-Based Systems (GBSs) (Mernik, Crepinsek, Kosar, Rebernak, & Zumer, 2004). The tool allows the complexity of the obtained grammar to be measured automatically from textual data.
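
This preview does not spell out the authors' procedure, so the following is only a rough illustration of the general idea: turning frequently repeated sequential patterns into production rules, then measuring the size of the resulting grammar. It is a minimal Python sketch that uses Re-Pair-style pair replacement, a swapped-in technique and not the algorithm of the paper or of GAS 1.0. The names `infer_grammar`, `expand`, `grammar_size`, and the nonterminal labels `N1`, `N2`, ..., `S` are hypothetical, and the sketch assumes no input token collides with those labels.

```python
from collections import Counter


def infer_grammar(tokens, min_count=2):
    """Infer a small CFG by repeatedly replacing the most frequent adjacent pair.

    Assumption: this is a Re-Pair-style illustration, not the paper's method.
    Returns a dict mapping each nonterminal to the tuple on its right-hand side.
    """
    seq = list(tokens)
    rules = {}
    while True:
        # Count adjacent pairs (overlapping occurrences are counted too).
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < min_count:
            break
        name = f"N{len(rules) + 1}"  # hypothetical nonterminal label
        rules[name] = pair
        # Rewrite the sequence left to right, replacing non-overlapping matches.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    rules["S"] = tuple(seq)  # start symbol derives what is left of the sequence
    return rules


def expand(rules, symbol="S"):
    """Expand a symbol back into the terminal sequence it derives."""
    if symbol not in rules:
        return [symbol]
    return [t for s in rules[symbol] for t in expand(rules, s)]


def grammar_size(rules):
    """A simple complexity measure: (number of rules, total right-hand-side symbols)."""
    return len(rules), sum(len(rhs) for rhs in rules.values())


tokens = list("acgtacgtacgt")
rules = infer_grammar(tokens)
for head, body in rules.items():
    print(head, "->", " ".join(body))
print("size:", grammar_size(rules))
```

Expanding the start symbol with `expand(rules)` reproduces the input tokens, which gives a quick check that the inferred rules still generate the original sequence. The pair returned by `grammar_size` is only one possible notion of complexity; the measure computed by the tool described above may differ.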

Conclusion

In the experiments, a language Læ generated by a predetermined CFG Gæ′ was considered. However, none of the properties of that grammar were then used to generate the set of production rules that make up the grammar Gæ′. We have proposed a new method for automatically generating syntactic categories for a codified language. The approach extends to the processing of data that are believed to have a grammatical structure that could be generated automatically. One could imagine something similar for the analysis of biosequences or for natural languages; that question remains open. The IDE reduces the complexity of designing the grammar specification, improves the quality of the resulting product, and appreciably reduces development time and cost. We tried to reduce the learning time for users who are not experts in compiler generation. The tool allows the complexity of the obtained grammar to be measured automatically from textual data.