Most traditional approaches to annotating protein families are inefficient: sequences arrive at high throughput, analytic tools are complex, the literature is unorganized, and results cannot be reused. Here, we describe a framework, knowledge sharing for protein families (KSPF), that uses sequence pattern data mining and knowledge management to improve upon traditional approaches. It is divided into three modules: automation, retrieval and refinement. The framework provides an environment in which biological researchers can submit an unknown protein sequence and search for information on its subfamily. Once the subfamily has been found, the related literature and the knowledge records provided by previous users can be retrieved, and the possible functions of the protein can then be predicted from them. The proposed framework is applicable to all types of protein families. We demonstrate it with a search for plant lipid transfer proteins (PLTPs). The resulting system, KS-PLTP, maps an unknown sequence to a subfamily in the PLTP knowledge base and predicts the sequence's possible function. The prediction rate of KS-PLTP reached 89.6%.
Gene annotation is an important source of functional information because it reveals many sequence functions (Eisenberg et al., 2000, Hieter and Boguski, 1997 and Nowak, 1995). However, although gene annotation is an important reference for research on protein families, the annotation of a single gene is not enough to establish the function of a protein family, because protein family annotation is more complex than sequence annotation.
Most traditional methods can predict neither the possible functions of an unknown sequence nor the subfamily to which the protein belongs. Here, we propose a framework, knowledge sharing for protein families (KSPF), that combines data mining and knowledge management to annotate protein families. The framework builds a decision tree and a knowledge base. The knowledge base collects domain-related literature and function terms defined by team researchers. Data for the protein family are downloaded from public databases and organized into a decision tree with the C4.5 algorithm. The literature and function terms are divided into subfamily groups. When an unknown sequence is categorized to a subfamily of the decision tree, the researcher can refer to the literature for that subfamily and use the pre-defined function terms to predict the functions of the unknown sequence.
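To make the tree-building step concrete, the following is a minimal sketch of the gain ratio, the criterion C4.5 uses to choose the attribute on which to split. It is an illustration under stated assumptions, not the KSPF implementation: the attribute names (`motif_A`, `motif_B`) and the four training rows are invented toy data, and only the subfamily labels PLTP6 and PLTP7 are taken from the case study.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, labels, attribute):
    """C4.5 split criterion: information gain divided by split information."""
    total = len(rows)
    # Group the class labels by the value each row takes for this attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted entropy that remains after splitting on the attribute.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


def best_attribute(rows, labels):
    """Return the attribute with the highest gain ratio (the next tree node)."""
    return max(rows[0].keys(), key=lambda a: gain_ratio(rows, labels, a))


# Hypothetical toy data: presence or absence of two invented sequence motifs.
rows = [
    {"motif_A": True, "motif_B": False},
    {"motif_A": True, "motif_B": True},
    {"motif_A": False, "motif_B": False},
    {"motif_A": False, "motif_B": True},
]
labels = ["PLTP6", "PLTP6", "PLTP7", "PLTP7"]

root = best_attribute(rows, labels)
print(root)
```

In this toy set, `motif_A` separates the two subfamilies perfectly while `motif_B` carries no information, so `motif_A` is chosen as the root test. A full C4.5 tree would apply the same selection recursively to each partition and then prune.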
KSPF is applicable to all kinds of protein families, including lipid transfer proteins. Lipid transfer proteins have many different functions in plant physiology. For example, the surface layers of the cell wall of plants are made up of hydrophobic polyesters (Hincha et al., 2001 and Pyee et al., 1994) that protect plant organs against biotic and abiotic stresses (Blein et al., 2002, Buhot et al., 2001 and Wijaya et al., 2000). Determining the exact biological functions of these proteins is thus an important task. Plant lipid transfer proteins (PLTPs) are small, soluble, basic proteins characterized by their ability to catalyze the transfer or exchange of lipids between membranes in vitro. PLTPs are abundant in expressed sequence tag databases of plants (Douliez et al., 2001, Hincha et al., 2001, Pastorello et al., 2000, Segura et al., 1993 and Sohal et al., 1999). However, manually annotating the functions of unknown PLTP genes from a large amount of data is time-consuming and unmanageable. KSPF can be an efficient way to predict the functions of PLTP families.
In the following sections, we give a brief background on research into searching for protein functions; introduce the KSPF framework in detail; discuss the implementation of a practical system, knowledge sharing for plant lipid transfer proteins (KS-PLTP); and demonstrate the efficiency of this system and the functions it supports.
Here, we propose a framework, KSPF, that combines data mining and knowledge management to annotate protein families. The knowledge base is designed for a specific research domain and collects domain-related literature and function terms defined by team researchers. The literature describing different function terms is divided into subfamily groups. When an unknown sequence is categorized to a protein subfamily on the decision tree, the researcher can refer to the literature for that subfamily and use the pre-defined function terms to predict the functions of the unknown sequence.
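The retrieval step described above can be pictured as a lookup keyed by subfamily. The sketch below is illustrative only. The record layout, the `classify` stand-in for the trained decision tree, the `motif_A` attribute, and all function terms and references are hypothetical placeholders rather than contents of the KS-PLTP knowledge base.

```python
from dataclasses import dataclass, field


@dataclass
class SubfamilyRecord:
    """Knowledge-base entry for one subfamily (hypothetical layout)."""
    function_terms: list = field(default_factory=list)
    literature: list = field(default_factory=list)


# Placeholder entries; a real knowledge base would hold curated terms and papers.
knowledge_base = {
    "PLTP6": SubfamilyRecord(["function term A"], ["reference 1"]),
    "PLTP7": SubfamilyRecord(["function term B"], ["reference 2", "reference 3"]),
}


def classify(features):
    """Stand-in for the trained decision tree: one invented motif test."""
    return "PLTP6" if features.get("motif_A") else "PLTP7"


def annotate(features, kb):
    """Map a sequence's features to a subfamily and fetch its shared knowledge."""
    subfamily = classify(features)
    record = kb.get(subfamily, SubfamilyRecord())
    return {
        "subfamily": subfamily,
        "candidate_functions": list(record.function_terms),
        "literature": list(record.literature),
    }


result = annotate({"motif_A": True}, knowledge_base)
print(result["subfamily"], result["candidate_functions"])
```

Keeping the classifier and the knowledge base separate means that researchers can add literature or refine function terms for a subfamily without retraining the tree.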
In our case study, although the results of KS-PLTP are impressive, the decision tree could be overfitted. One possible cause is too little training data, which left the PLTP6 and PLTP7 classes not easily distinguishable in the matrix of Fig. 6. Another is that some attributes, such as signal peptides and transmembrane regions, were not included among the attributes of the decision tree. Thus, the decision tree may not represent all classes of PLTP.
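Class pairs that a classifier fails to separate can be identified from a confusion matrix by counting the off-diagonal entries. The following minimal sketch shows one way to do this. The label lists are invented toy data, not the results in Fig. 6, and the class name `other` is a placeholder.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count each (true class, predicted class) pair."""
    return Counter(zip(true_labels, predicted_labels))


def per_class_recall(true_labels, predicted_labels):
    """Fraction of each true class that was predicted correctly."""
    counts = confusion_counts(true_labels, predicted_labels)
    totals = Counter(true_labels)
    return {c: counts[(c, c)] / totals[c] for c in totals}


def most_confused_pair(true_labels, predicted_labels):
    """Return the (true, predicted) pair with the most misclassifications."""
    counts = confusion_counts(true_labels, predicted_labels)
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    return max(errors, key=errors.get) if errors else None


# Invented toy predictions for illustration only.
true_labels = ["PLTP6", "PLTP6", "PLTP6", "PLTP7", "PLTP7", "other", "other"]
predicted = ["PLTP6", "PLTP7", "PLTP7", "PLTP7", "PLTP6", "other", "other"]

recall = per_class_recall(true_labels, predicted)
pair = most_confused_pair(true_labels, predicted)
print(recall, pair)
```

Low recall concentrated in a single pair of classes suggests that the current attributes do not separate them, which is consistent with either too few training examples or a missing attribute.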
Future development will involve integrating more data resources into KSPF and completing the automation of the whole framework. Coordinating with other data resources, such as microarray, sequence feature, and phenotype data, may give a higher prediction rate. In addition, more training data for the decision trees need to be collected. Finally, linking the stages in a batch process and regularly redefining the relationships between the raw data and predicted targets may enhance the power of the KSPF framework.