Ant colony optimisation to identify genetic variant association with type 2 diabetes
Article code | Publication year | Length of English article |
---|---|---|
7708 | 2011 | 14 pages (PDF) |
Publisher : Elsevier - Science Direct
Journal : Information Sciences, Volume 181, Issue 9, 1 May 2011, Pages 1609–1622
English abstract
Around 1.8 million people in the UK have type 2 diabetes, representing about 90% of all diabetes cases in the UK. Genome wide association studies have recently implicated several new genes that are likely to be associated with this disease. However, the common genetic variants identified so far explain only a small proportion of the heritability of type 2 diabetes. The interaction of two or more gene variants may explain a further element of this heritability, but full interaction analyses are currently highly computationally burdensome or infeasible. For this reason, this study investigates an ant colony optimisation (ACO) approach for its ability to identify common gene variants associated with type 2 diabetes, including putative epistatic interactions. This study uses a dataset comprising 15,309 common (>5% minor allele frequency) SNPs from chromosome 16, genotyped in 1924 type 2 diabetes cases and 2938 controls. This chromosome contains two previously determined associations, one of which is replicated in additional samples. Although no epistatic interactions have been previously reported on this dataset, we demonstrate that ACO can be used to discover single SNP and plausible epistatic associations from this dataset, and that it is both accurate and computationally tractable on large, real datasets of SNPs with no expert knowledge included in the algorithm.
English introduction
According to [6] around 1.8 million people in the UK have type 2 diabetes, representing around 90% of all diabetes cases in the UK. Diabetes is the leading cause of blindness in the working age population and although type 2 diabetes is traditionally associated with ageing, it is appearing increasingly in younger adults. In 2006, clinical care for diabetic patients accounted for 5% of the NHS budget (about £10 million per day) and this is expected to rise to 10% by 2011. There is currently no cure, but the identification of genetic risk factors has implicated several new genes that are likely to be involved. It is as yet unclear how this will translate into improved prediction and prevention.
1.1. Genetic association studies
Genetic association studies aim to discover which genetic variations increase or decrease the likelihood of an individual contracting a given disease by comparing common DNA variants between affected and unaffected individuals. Recent efforts, including this study, aim to link small changes in the DNA of individuals known as single nucleotide polymorphisms (SNPs) in particular positions of the genome to increased risk of the individual developing particular diseases [14]. SNPs consist of two alleles (two of the four bases of the genetic code, A, C, G or T) and because humans have two copies of the genome (diploid), each individual has one of three genotypes at each SNP position. For example, at an A/C SNP, individuals will be one of AA, AC or CC genotypes. These differences in genotype, although small, can have profound effects on the probability of individuals developing certain diseases. In monogenic diseases, such as cystic fibrosis [12], the presence or absence of a single allele completely predicts the presence or absence of the disease [17].
However, for many common diseases, such as type 2 diabetes [15], an individual’s susceptibility is influenced by a complex interaction of environmental and genetic factors which yields a probabilistic connection between genetic variation and the disease. In polygenic diseases, the presence of a risk allele will increase or decrease the probability of the disease, and there may be many different risk alleles [17] associated with a disease. The discovery of single risk alleles requires a computationally tractable search through all SNPs (∼400,000 in the human genome) and their association with the disease. Whilst single risk alleles have been shown to be very informative, much of the genetic variation in a trait or disease remains uncharacterised. One of the leading contenders for this “missing heritability” is epistasis, the interaction between genes in a genome. Moore [21] asserts that epistasis may influence predisposition to common diseases and the same author, in [20], asserts that it is likely to be ubiquitous, or at least widespread. Furthermore, epistasis has previously been implicated in insulin resistance [2], HIV [4] and Alzheimer’s disease [31]. The discovery of gene–gene interactions in the same dataset (i.e. the search for pairs, triplets and higher order combinations of SNPs) yields a combinatorial complexity that is no longer computationally tractable.
Until recent technological advances made it possible to investigate a large proportion of variation in the human genome, genetic association studies were limited to looking only at small numbers of genes where research had identified possible biological reasons for their involvement in disease phenotypes. By the end of 2006, the Wellcome Trust Case Control Consortium (WTCCC) had performed one of the first and largest genome wide association studies using data from 400,000 SNPs in 17,000 individuals.
The WTCCC investigated seven diseases including type 2 diabetes [35] and many other genome wide association studies (GWAS) have since been reported. This explosion of information has triggered the search for methods that are able to analyse such high-dimensional data for statistical correlations with disease status, leading some to apply artificial intelligence (AI) techniques (e.g. [3], [22], [23], [24], [25], [26] and [27]). The task for these AI techniques in genetic association studies is to discover a small number of SNPs that are informative about disease status (feature selection) and then to establish how the alleles for those SNPs combine to classify an individual's susceptibility to the disease.
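To give a sense of the scale behind the tractability argument above, the following minimal Python sketch counts the candidate SNP sets at each interaction order. It assumes the 15,309 chromosome 16 SNPs quoted in the abstract and an exhaustive, unordered enumeration; the function name `n_combinations` is illustrative and does not come from the paper.

```python
from math import comb

# Number of SNPs on chromosome 16 in the dataset described in the abstract.
N_SNPS = 15_309


def n_combinations(n_snps: int, order: int) -> int:
    """Number of unordered SNP sets of size `order` drawn from `n_snps` SNPs."""
    return comb(n_snps, order)


# Order 1 = single SNPs, order 2 = pairs, order 3 = triplets.
for order in (1, 2, 3):
    print(f"order {order}: {n_combinations(N_SNPS, order):,} candidate sets")
```

Under these assumptions there are about 117 million pairs and roughly 6 × 10^11 triplets on this single chromosome, before any statistical test is evaluated.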
English conclusion
This paper has described the use of ant colony optimisation for the discovery of single SNP-diabetes and epistatic associations in type 2 diabetes data. The results demonstrate the capability of the algorithm to discover statistically significant, previously discovered associations in a dataset containing real samples of one chromosome. The results also indicate that the algorithm is capable of finding statistically significant epistatic associations, one of a number of possible contenders for the missing heritability in GWAS, although more experimentation is required before these can be confirmed due to the small number of individuals identified with the uncommon genotypes. Both the single SNP and epistatic associations were discovered without including explicit expert knowledge regarding SNP analysis in the fitness functions. Experimentation with the algorithm has shown that a number of variables should be considered when applying ACO to problems in bioinformatics and specifically the discovery of SNP associations. The generation of biologically realistic path lengths in particular appears to affect the performance of the algorithm considerably. Additionally, the fitness function is highly important in terms of discovering the desired associations in the data. The ACO algorithm is ideally suited to the area of SNP data analysis as it is able to extract associations from large databases such as this one on standard PC hardware. Additionally, the results of the algorithm run can be presented as a histogram and the best SNPs identified either in isolation or together with other SNPs in single SNP or epistatic associations. This presentation of learned information is not a feature of most search techniques and is especially valuable in this application domain where data are correlated due to phenomena such as linkage disequilibrium. 
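The ingredients named above (a fitness function, a path length, a number of ants and iterations, and a per-SNP summary of what was learned) can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the authors' implementation: the chi-square fitness, the pheromone update rule, the parameter defaults, the toy data generator and every function name are assumptions chosen for brevity.

```python
import random
from collections import Counter


def chi_square(keys, status):
    """Pearson chi-square of genotype keys against case/control status."""
    n = len(status)
    cell, row, col = Counter(zip(keys, status)), Counter(keys), Counter(status)
    stat = 0.0
    for k in row:
        for s in col:
            expected = row[k] * col[s] / n
            stat += (cell[(k, s)] - expected) ** 2 / expected
    return stat


def fitness(path, genotypes, status):
    """Assumed fitness: chi-square of the joint genotype over the SNPs in `path`."""
    keys = [tuple(g[i] for i in path) for g in genotypes]
    return chi_square(keys, status)


def sample_path(rng, tau, path_len):
    """Pick `path_len` distinct SNP indices with probability proportional to pheromone."""
    candidates, chosen = list(range(len(tau))), []
    for _ in range(path_len):
        pick = rng.choices(candidates, weights=[tau[i] for i in candidates], k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    return tuple(sorted(chosen))


def aco_select(genotypes, status, path_len=2, n_ants=20, n_iter=30, rho=0.1, seed=0):
    """Return (best path, its fitness, final pheromone per SNP)."""
    rng = random.Random(seed)
    n_people = len(status)
    tau = [1.0] * len(genotypes[0])          # uniform initial pheromone
    best_path, best_fit = None, -1.0
    for _ in range(n_iter):
        scored = []
        for _ in range(n_ants):
            path = sample_path(rng, tau, path_len)
            scored.append((path, fitness(path, genotypes, status)))
        tau = [t * (1.0 - rho) for t in tau]  # evaporation
        for path, fit in scored:
            for i in path:
                tau[i] += fit / n_people      # deposit, at most 1 per ant for two status classes
            if fit > best_fit:
                best_path, best_fit = path, fit
    return best_path, best_fit, tau


def make_toy_data(n_people=400, n_snps=30, causal=7, seed=1):
    """Hypothetical cohort: genotypes coded 0/1/2, one planted risk SNP."""
    rng = random.Random(seed)
    genotypes, status = [], []
    for _ in range(n_people):
        g = [rng.choice((0, 1, 2)) for _ in range(n_snps)]
        genotypes.append(g)
        status.append(1 if rng.random() < 0.2 + 0.3 * g[causal] else 0)
    return genotypes, status


genotypes, status = make_toy_data()
best_path, best_fit, tau = aco_select(genotypes, status)
ranked = sorted(range(len(tau)), key=tau.__getitem__, reverse=True)
print("best path:", best_path, "chi-square: %.1f" % best_fit)
print("SNPs ranked by pheromone (top 5):", ranked[:5])
```

In this sketch the final pheromone vector `tau` plays a role similar to the histogram mentioned above: it summarises how strongly each SNP was reinforced across the run, which is one way to inspect correlated SNPs side by side rather than reading off a single best path.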
The algorithm has demonstrated flexibility whereby a variety of fitness functions, path lengths, numbers of ants and iterations can be tested to deliver biologically plausible results within a reasonable timeframe on relatively modest machines. The discovery of single SNP associations confirms the capability of the algorithm to identify important single associations, but this capability can be replicated by statistical testing. It is the ability of the algorithm to discover putative epistatic associations which is especially encouraging for the ACO approach, as they can be difficult or impossible to find in real-world data with standard methods. In an epistatic association, the two SNPs in isolation do not represent statistically significant associations; it is the combination which yields good results, and therefore a combinatorial approach must be sought. Given these properties, it is proposed that the ACO algorithm is an ideal tool for investigating the new wave of large-scale SNP datasets.
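The point that neither SNP is significant alone while the pair is can be seen in an idealised toy case. The sketch below is an assumption-laden illustration, not data from the study: it uses two binary markers rather than three-level genotypes, a perfectly balanced hypothetical cohort of 100 people, and an exclusive-or rule for disease status, which is a far cleaner interaction than anything expected in real data.

```python
from collections import Counter
from itertools import product


def chi_square(keys, status):
    """Pearson chi-square of genotype keys against case/control status."""
    n = len(status)
    cell, row, col = Counter(zip(keys, status)), Counter(keys), Counter(status)
    stat = 0.0
    for k in row:
        for s in col:
            expected = row[k] * col[s] / n
            stat += (cell[(k, s)] - expected) ** 2 / expected
    return stat


# Hypothetical cohort: 25 people for each of the four marker combinations.
people = [(a, b) for a, b in product((0, 1), repeat=2) for _ in range(25)]

# Assumed disease model: a case if exactly one of the two markers is present.
status = [a ^ b for a, b in people]

snp_a = [a for a, _ in people]
snp_b = [b for _, b in people]

print("SNP A alone:    ", chi_square(snp_a, status))   # 0.0
print("SNP B alone:    ", chi_square(snp_b, status))   # 0.0
print("A and B jointly:", chi_square(people, status))  # 100.0
```

Each marker on its own splits cases and controls exactly evenly, so a single-SNP scan would rank both at the bottom; only a search over combinations sees the signal.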