The Medical and Pharmaceutical industries have shown high interest in the precise engineering of protein hormones and enzymes that perform existing functions under a wide range of conditions. Proteins are responsible for the execution of different functions in the cell: catalysis in chemical reactions, transport and storage, regulation and recognition control. Computational Protein Design (CPD) investigates the relationship between 3-D structures of proteins and amino acid sequences and looks for all sequences that will fold into such 3-D structure. Many computational methods and algorithms have been proposed over the last years, but the problem still remains a challenge for Mathematicians, Computer Scientists, Bioinformaticians and Structural Biologists. In this article we present a new method for the protein design problem. Clustering techniques and a Dead-End-Elimination algorithm are combined with a SAT problem representation of the CPD problem in order to design the amino acid sequences. The obtained results illustrate the accuracy of the proposed method, suggesting that integrated Artificial Intelligence techniques are useful tools to solve such an intricate problem.
Computational Protein Design (CPD) is one of the most important research problems in Computational Molecular Biology (Lippow and Tidor, 1997 and Tian, 2010). The Medical and Pharmaceutical industries are widely interested in precisely understanding how to engineer protein hormones and enzymes that perform existing functions under a wide range of conditions. Proteins are long sequences of 20 different amino acid residues that in physiological conditions adopt a unique 3-D structure (Lehninger et al., 2005). This structure is important because it determines the function of the protein in the cell, for example, catalysis in chemical reactions, transport and storage, regulation, recognition and control (Lesk, 2002). Protein design has become a powerful approach for understanding the relationship between amino acid sequence and 3-D structure and consequently to study the functional aspects of the protein (Sander et al., 1992). The ability to design sequences compatible with a fold may also be useful in structural and functional Genomics, with the identification of functionally important domains in sequences of proteins.
The general goal of CPD is to identify an amino acid sequence that folds into a particular protein 3-D structure with a desired function (Fig. 2). Protein design can be considered as the inverse of the protein folding (PF) problem (Osguthorpe, 2000) because it starts with the structure rather than the sequence and looks for all sequences that will fold into such 3-D structure. Considering that there are 20 naturally occurring amino acids for each position, the combinatorial complexity of the problem amounts to 20110 or 10130 (Floudas et al., 2006).
Over the last decade, a tremendous advance in protein design was witnessed. The maturation of a number of component technologies, including optimization algorithms and combinatorial discrete search, contributed for advances in CPD research. Techniques such as Dead-End-Elimination (Georgiev et al., 2008 and Desmet et al., 1992), Integer Programing (Xie and Sahinidis, 2006) and Monte Carlo (Yang and Saven, 2005, Hom and Mayo, 2006 and Allen and Mayo, 2006) have been applied to protein design, but the problem still remains very challenging. In this article, we present a new method based on clustering strategy, Dead-End-Elimination techniques and SAT-based methods to determine the amino acid sequence of protein 3-D structures.
The remainder of the paper is structured as follows. Section 2 contextualizes the protein design problem and basic concepts used in this article. Section 3 introduces the computational and AI techniques used in the proposed method. Section 4 introduces the new hybrid method for protein design. Section 5 reports several results illustrating the effectiveness of our method. Section 6 concludes and points out directions for further research.
The general goal of a Computational Strategy to Protein design is to identify an amino acid sequence that folds into a particular protein 3-D structure. In this article we presented a new strategy to design protein sequences from 3-D protein structures determined experimentally. The obtained results illustrate the efficacy of the proposed strategy. The main goal of the developed strategy was to design in an efficient way approximate sequences of target protein structures.
The overall contribution of this work is threefold: First, the use of AI and computational techniques and concepts to develop a new, effective algorithm for the protein design problem. Second, the use of clustering strategy to reduce the number of rotamers which in turn reduce the number of SAT clauses to be solved. Third, the use of a Dead-End-Elimination strategy to pruning bad solutions in the set of clauses. This opens several interesting research avenues, with a range of applications in computational biology and bioinformatics. For instance, one could apply the developed method to other classes of proteins; second, one could test other different clustering algorithms; third, one could test the use of different pruning techniques to reduce the number of SAT clauses eliminating the rotamers that increase the potential energy of the 3-D structure of the polypeptide.