دانلود مقاله ISI انگلیسی شماره 21394
ترجمه فارسی عنوان مقاله

یک روش داده کاوی بر اساس روش های یادگیری ماشین برای طبقه بندی سکانس بیولوژیکی

عنوان انگلیسی
A data mining approach based on machine learning techniques to classify biological sequences
کد مقاله سال انتشار تعداد صفحات مقاله انگلیسی
21394 2002 7 صفحه PDF
منبع

Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)

Journal : Knowledge-Based Systems, Volume 15, Issue 4, May 2002, Pages 217–223

ترجمه کلمات کلیدی
طبقه بندی - داده کاوی - یادگیری ماشین - قوانین تولید - سکانس زیستی - زیر رشته تشخیصی - ساختار اولیه -
کلمات کلیدی انگلیسی
Classification, Data mining, Machine learning, Production rules, Biological sequences, Discriminant substrings, Primary structures,
پیش نمایش مقاله
پیش نمایش مقاله  یک روش داده کاوی بر اساس روش های یادگیری ماشین برای طبقه بندی سکانس بیولوژیکی

چکیده انگلیسی

In molecular biology, biological macromolecules, like desoxyribonucleic acids (DNA) and proteins are coded by strings, called ‘primary structures’. For a long time, biologists gathered these primary structures in large databases. Now, they focus on analyzing these primary structures in order to extract useful knowledge. Data mining approaches can be helpful to reach this goal. In this paper, we present a data mining approach based on machine learning techniques to do classification of biological sequences. By using our approach, we use four steps as follows. (1) In the first step, we construct the set of the discriminant substrings, called discriminant descriptor (DD), associated with each family of primary structures. This construction is made thanks to an adaptation of the Karp, Miller and Rosenberg (KMR) algorithm. (2) In the second step, we use the DDs constructed during the first step to code the families of primary structures by a table of examples vs attributes, called ‘context’. (3) In the third step, we extract knowledge from the context constructed during the second step and represent it by production rules. This extraction is made by using an incremental production rules approach. (4) Finally, during the last step, we use the obtained production rules to do classification of primary structures.

مقدمه انگلیسی

Over the last decades, biologists have sequenced a large number of biological macromolecules [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11] and [12]. Mining biological data can help biologists to do classification in biological sequences. Indeed, 1. for proteins, mining data representing proteins can help biologists in the classification of a newly sequenced protein. The classification of a newly sequenced protein in a known family of proteins can be helpful in learning information, not only about the evolution of this protein, but also about its biological functions [13], [14], [15] and [16]. 2. for desoxyribonucleic acids (DNA), mining data representing functional sites and non-functional sites [17], or data representing protein coding regions and protein non-coding regions [18] and [19], of DNA macromolecules can help biologists in the classification of a candidate site in the set of functional sites or the one of non-functional sites, or the classification of a candidate region in the set of protein coding regions or the one of protein non-coding regions. And hence, can help them to make a step towards gene recognition in DNA macromolecules [20], [21], [22], [23] and [24]. Statistical approaches [20], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34] and [35] used to do classification of biological sequences do not give explanations concerning the classification done. Whereas, machine learning approaches [18], [36], [37], [38] and [39] give such explanations and, hence, can be helpful to improve biologists know-how and knowledge. In this paper, we present a data mining approach based on machine learning techniques to do classification of biological sequences. Our approach uses four steps. 1. In the first step, we construct the set of the discriminant substrings, called ‘discriminant descriptor’ (DD) [40] and [41], associated with each family of primary structures. This construction is made thanks to an adaptation [40] of the Karp, Miller and Rosenberg (KMR) algorithm [42]. 2. In the second step, we code the families of primary structures by a table of examples vs attributes. Indeed, in machine learning, data are coded by sets of attributes describing sets of examples, i.e. a table of examples vs attributes called ‘context’. On the other hand, in molecular biology, biological macromolecules are coded by strings. We build the context by representing each element of the DD by a binary attribute. 3. In the third step, we extract knowledge from the context and represent it by production rules, by using an incremental production rules approach [43] and [44]. 4. Finally, during the last step, we use the obtained production rules to do classification of primary structures. In Section 2 of this paper, we present some definitions and notations. In Section 3, we present our algorithm of construct DDs. In Section 4, we show how we build a context. In Section 5, we present our incremental production rules approach, to do classification of primary structures. In Section 6, we give an illustrative example. In Section 7, we present experimental results. Finally, in the last section, we present our conclusions and perspectives.

نتیجه گیری انگلیسی

In this paper, we have presented a data mining approach based on machine learning techniques to do classification of biological sequences. By using our approach, we use four steps as described above. One of the advantages of this work is that we obtained a knowledge base. The rules in this knowledge base can be used for classification purposes. They can also be used to improve the know-how of the experts, since the rules can contain newly discovered concepts. More importantly, the formalism of production rules allows the explanation of decisions. As an example, we can identify the subsequences responsible for some biological functions [13], [14], [15] and [16]. In an earlier work [41], we used DDs separately to do classification of biological sequences. With this approach, a production rule may use many DDs simultaneously. This simultaneity in the use of DDs makes the use of production rules more significant than the one of single DDs. After inducing production rules, we use a punishment/award algorithm to give weights to these rules. The algorithm applies the rules to a training set. The right rule is awarded by increasing its weight and the wrong one is punished by decreasing its weight. The final weights constitute an order relation between the rules. This relation allows us to choose the good rule among the conflictual ones. Applied on proteins primary structures, our data mining approach gave low classification error rates. Presently, we are bending on experimenting our data mining approach on genes primary structures.