Automated trend analysis of proteomics data using an intelligent data mining architecture
|Article code||Publication year||English article||Persian translation||Word count|
|22069||2006||10-page PDF||Available to order||6504 words|
Publisher : Elsevier - Science Direct
Journal : Expert Systems with Applications, Volume 30, Issue 1, January 2006, Pages 24–33
Proteomics is a field dedicated to the analysis and identification of proteins within an organism. Within proteomics, two-dimensional electrophoresis (2-DE) is currently unrivalled as a technique to separate and analyse proteins from tissue samples. The analysis of post-experimental data produced by this technique has been identified as an important step within the overall process. Some of the long-term aims of this analysis are to identify targets for drug discovery and proteins associated with specific organism states. The large quantities of high-dimensional data produced by such experimentation require expertise to analyse, which results in a processing bottleneck that limits the potential of this approach. We present an intelligent data mining architecture that incorporates both data-driven and goal-driven strategies and is able to accommodate the spatial and temporal elements of the dataset under analysis. The architecture is able to automatically classify interesting proteins with a low number of false positives and false negatives. Using a data mining technique to detect variance within the data before classification offers performance advantages of between 16 and 46% over other statistical variance techniques.
Following the explosive growth in research into the genome, the study of the proteome has become fundamental to biochemical research (Righetti, Stoyanov and Zhukov, 2001). Proteomics is defined as the large-scale identification and characterisation of the proteins encoded in an organism's genome (Alberts, Bray, Lewis, Raff, Roberts and Watson, 2002) and is often described in the literature as the next step to dramatically advance drug discovery (Whittaker, 2003). More specifically, proteomics is concerned with the analysis of the structure and function of proteins as well as of protein-protein interactions. Within proteomics, a particular area of interest is the mapping of protein posttranslational modifications (Liebler, 2002). RNA, which is initially transcribed from the genetic details stored in DNA, is translated to protein. Following this translation, the state of a protein can alter during its lifetime, for example following the introduction of a disease (Crenshaw and Cory, 2002). The protein's state within a particular tissue can alter as conditions change and, hence, is indicative of the current physiological state. These posttranslational modifications have a direct effect on the structure, function and turnover of proteins; hence, analysis of these trends of variation may lead to novel avenues to determine how chemical modifications to the proteome affect living systems (Liebler, 2002). Consequently, the analysis of the posttranslational modifications of proteins is particularly important for the study of conditions such as cancer, neurodegenerative diseases, heart disease and diabetes. In order to perform this analysis, a method of measuring the expression of proteins is required. The most popular, and currently unrivalled, technique to perform protein expression analysis is that of two-dimensional electrophoresis (2-DE) (Jenkins and Pennington, 2001 and Pennington et al., 1997).
This technique uses two successive electrophoresis runs to separate the proteins from a tissue sample with regard to their isoelectric point and molecular weight. The first run separates the proteins in one dimension; the gel is then rotated 90° and the second run is performed to separate them in the second dimension. Each protein expressed using this method appears as a dark spot on these gels (see Fig. 1), following the use of staining techniques, and is then individually analysed for features such as relative abundance, shape, and appearance and disappearance across an experimental series (such as over time or between different control groups). Such analysis is often assisted by image analysis software which can automatically detect spot correspondence from one gel to the next (Pederson and Ersboll, 2001 and Pleissner et al., 2001). Following this process, these images can be converted into data which describe each protein, such as volume, area, height and x and y coordinates on a gel. These attributes can be representative of changes to the function of the protein; changes to these attributes can be indicative of an intrinsic link to a particular condition. For example, a protein which has physically altered under a diseased state compared with that of a healthy state may well be intrinsically linked to the physiological state of the organism and, hence, worthy of further investigation.
The analysis of this protein data, however, is not a trivial task (Marengo, Leardi, Robotti, Righetti, Antonucci and Cecconi, 2003). Disadvantages of 2-DE include that it is inherently labour-intensive and requires a skill level such that only trained experts can perform the analysis, often manually. The potentially useful trends are encapsulated within large volumes of multi-dimensional, spatio-temporal post-experimental data, making this manual interpretation of results impractical (Fenyo and Beavis, 2002).
Without the availability of reliable tools for post-experimental data analysis, the technique is essentially a descriptive one, limiting the potential for fully automated analysis (Griffin and Aebersold, 2001). The full value of this technique cannot, then, be realised until this processing bottleneck is resolved; fully automatic approaches for identifying intrinsic trends in gels will go some way towards this goal (Dowsey, Dunn and Yang, 2003). In this paper, we present an intelligent data mining architecture that is able to analyse post-experimental, 2-DE gel data and identify interesting proteins automatically. This approach uses a combination of a data-driven, data mining technique and a goal-driven, machine learning technique which incorporates expert heuristics, such as those used in manual analysis. Data mining is the process of finding trends and patterns in large data sets (Toroslu and Yetisgen-Yildiz, 2005). The data-mining element employed here is that of differential ratio (dFr) data mining, a technique which measures variance of a given object in terms of the log of pair-wise ratios of the elements describing the data over time (or within any given linear series). The machine-learning element concerns the use of a BackPropagation, Multi-Layer Perceptron (MLP) neural network in order to classify the results of the data mining into discrete classes of interesting behaviour. Such classes are defined using expert heuristics, optimised through the use of an Adaptive Neuro-Fuzzy Inference System (ANFIS) as described by Malone et al. (2004b). A comparison is drawn to MLPs trained using Principal Component Analysis (PCA) and Covariance as variance measures. Finally, a comparison to an MLP trained on normalised data alone is conducted to quantify any relative benefits of using a variance analysis step before classification of the dataset. The remainder of this paper is organised as follows.
Section 2 discusses current strategies used in the analysis of 2-DE gel data. Section 3 describes the proposed intelligent data mining architecture. Section 4 presents the results of experimentation and discusses these findings. Section 5 outlines the conclusions.
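The idea behind the differential ratio (dFr) measure mentioned in the introduction, the log of pair-wise ratios of a protein's values across a series, can be illustrated with a minimal sketch. This excerpt does not give the authors' exact formulation, so the following is an assumption-laden illustration only: the function names `log_pairwise_ratios` and `max_abs_log_ratio`, the choice to compare every pair of time points, and the example spot volumes are all hypothetical.

```python
import math
from itertools import combinations


def log_pairwise_ratios(series):
    """Return log(series[j] / series[i]) for every pair of positions i < j.

    Assumption: all pairs of points in the series are compared; the
    published dFr method may pair the elements differently.
    """
    if any(v <= 0 for v in series):
        raise ValueError("all values must be positive to take log ratios")
    return [
        math.log(series[j] / series[i])
        for i, j in combinations(range(len(series)), 2)
    ]


def max_abs_log_ratio(series):
    """Summarise a series by its largest absolute log ratio (hypothetical score)."""
    return max((abs(r) for r in log_pairwise_ratios(series)), default=0.0)


# Hypothetical spot volumes for two proteins measured on four successive gels.
stable = [100.0, 102.0, 98.0, 101.0]
changing = [100.0, 150.0, 220.0, 400.0]

if __name__ == "__main__":
    print("stable  :", round(max_abs_log_ratio(stable), 3))
    print("changing:", round(max_abs_log_ratio(changing), 3))
```

Working in log space makes a doubling and a halving equal in magnitude and opposite in sign, so a protein whose abundance changes markedly across the series scores higher than one that merely fluctuates around a constant value. In the architecture described here, values of this kind would then be passed to the classifier instead of the raw measurements.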
English conclusion
In this paper we presented an intelligent data mining architecture and performed experiments using two post-experimental, 2-DE gel datasets. Three variance analysis methods were applied to the datasets to use as training and testing data for a BackPropagation, Multi-Layer Perceptron (MLP) neural network in order to classify the results of the data mining into discrete classes of interesting behaviour. The neural network was also trained and tested using normalised data only to assess the benefits of using a variance analysis step before machine learning. Of the three variance analysis methods employed and tested, the differential ratio data mining proved to be the most successful in identifying and representing the salient trends within the data. The intelligent data mining architecture also provided the lowest number of false negatives and false positives of all strategies, an important consideration when attempting a comprehensive and accurate analysis of the data. The architecture also allows the encapsulation of expert opinions through the use of an adaptive fuzzy logic system (ANFIS). This offers the advantage of optimising initially approximate data in an effective manner whilst, following training, allowing fuzzy rules to be extracted which represent the optimised fuzzy membership functions. Such membership functions form the basis of our output classes, which correspond to interesting features of protein behaviour. This research goes some way to addressing the processing bottleneck that exists within post-experimental 2-DE gel data analysis by providing a technique that automatically extracts potentially interesting proteins from within the datasets. Since the technique involves the use of a supervised neural network, normal considerations of suitability apply, i.e. that empirical data must be available in order to train and test the network's ability to learn and classify correctly. 
Future work will concentrate on expanding the technique to further proteomics data sets. We also aim to show that this approach is suitable more generally as a spatio-temporal data mining technique by expanding to other spatio-temporal datasets such as robotics and meteorological data.