In molecular biology, biological macromolecules, like desoxyribonucleic acids (DNA) and proteins are coded by strings, called ‘primary structures’. For a long time, biologists gathered these primary structures in large databases. Now, they focus on analyzing these primary structures in order to extract useful knowledge. Data mining approaches can be helpful to reach this goal. In this paper, we present a data mining approach based on machine learning techniques to do classification of biological sequences. By using our approach, we use four steps as follows. (1) In the first step, we construct the set of the discriminant substrings, called discriminant descriptor (DD), associated with each family of primary structures. This construction is made thanks to an adaptation of the Karp, Miller and Rosenberg (KMR) algorithm. (2) In the second step, we use the DDs constructed during the first step to code the families of primary structures by a table of examples vs attributes, called ‘context’. (3) In the third step, we extract knowledge from the context constructed during the second step and represent it by production rules. This extraction is made by using an incremental production rules approach. (4) Finally, during the last step, we use the obtained production rules to do classification of primary structures.
Over the last decades, biologists have sequenced a large number of biological macromolecules [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11] and [12]. Mining biological data can help biologists to do classification in biological sequences. Indeed,
1.
for proteins, mining data representing proteins can help biologists in the classification of a newly sequenced protein. The classification of a newly sequenced protein in a known family of proteins can be helpful in learning information, not only about the evolution of this protein, but also about its biological functions [13], [14], [15] and [16].
2.
for desoxyribonucleic acids (DNA), mining data representing functional sites and non-functional sites [17], or data representing protein coding regions and protein non-coding regions [18] and [19], of DNA macromolecules can help biologists in the classification of a candidate site in the set of functional sites or the one of non-functional sites, or the classification of a candidate region in the set of protein coding regions or the one of protein non-coding regions. And hence, can help them to make a step towards gene recognition in DNA macromolecules [20], [21], [22], [23] and [24].
Statistical approaches [20], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34] and [35] used to do classification of biological sequences do not give explanations concerning the classification done. Whereas, machine learning approaches [18], [36], [37], [38] and [39] give such explanations and, hence, can be helpful to improve biologists know-how and knowledge.
In this paper, we present a data mining approach based on machine learning techniques to do classification of biological sequences. Our approach uses four steps.
1.
In the first step, we construct the set of the discriminant substrings, called ‘discriminant descriptor’ (DD) [40] and [41], associated with each family of primary structures. This construction is made thanks to an adaptation [40] of the Karp, Miller and Rosenberg (KMR) algorithm [42].
2.
In the second step, we code the families of primary structures by a table of examples vs attributes. Indeed, in machine learning, data are coded by sets of attributes describing sets of examples, i.e. a table of examples vs attributes called ‘context’. On the other hand, in molecular biology, biological macromolecules are coded by strings. We build the context by representing each element of the DD by a binary attribute.
3.
In the third step, we extract knowledge from the context and represent it by production rules, by using an incremental production rules approach [43] and [44].
4.
Finally, during the last step, we use the obtained production rules to do classification of primary structures.
In Section 2 of this paper, we present some definitions and notations. In Section 3, we present our algorithm of construct DDs. In Section 4, we show how we build a context. In Section 5, we present our incremental production rules approach, to do classification of primary structures. In Section 6, we give an illustrative example. In Section 7, we present experimental results. Finally, in the last section, we present our conclusions and perspectives.
In this paper, we have presented a data mining approach based on machine learning techniques to do classification of biological sequences. By using our approach, we use four steps as described above.
One of the advantages of this work is that we obtained a knowledge base. The rules in this knowledge base can be used for classification purposes. They can also be used to improve the know-how of the experts, since the rules can contain newly discovered concepts. More importantly, the formalism of production rules allows the explanation of decisions. As an example, we can identify the subsequences responsible for some biological functions [13], [14], [15] and [16].
In an earlier work [41], we used DDs separately to do classification of biological sequences. With this approach, a production rule may use many DDs simultaneously. This simultaneity in the use of DDs makes the use of production rules more significant than the one of single DDs.
After inducing production rules, we use a punishment/award algorithm to give weights to these rules. The algorithm applies the rules to a training set. The right rule is awarded by increasing its weight and the wrong one is punished by decreasing its weight. The final weights constitute an order relation between the rules. This relation allows us to choose the good rule among the conflictual ones.
Applied on proteins primary structures, our data mining approach gave low classification error rates. Presently, we are bending on experimenting our data mining approach on genes primary structures.