Analysing Quantitative Trait Loci data
The biological community has recently seen a significant increase in the identification of Quantitative Trait Loci (QTL) underlying numerous genetically complex traits (1). The identification of QTL is vitally important to unravel the genetic architecture contributing to such traits. Identifying QTL correlating with a given phenotype, however, can yield multiple chromosomal regions encompassing a vast quantity of candidate genes (2,3). This can be extremely problematic when attempting to identify candidate Quantitative Trait genes (QTg) underlying a given phenotype (3).
Understanding the precise mechanisms, and identifying suitable candidate QTg, will help to provide insights that will surely enhance our knowledge of these genetically complex traits and may lead to the development of novel therapeutic treatments.
The challenge now facing researchers, however, is no longer focussed on the generation of such data; instead a significant bottleneck has emerged in data analysis, and the subsequent generation of novel biological hypotheses. With the wealth of information resulting from QTL investigations, researchers can quickly become ‘over-loaded’ with information, and may be unable to establish a definitive hypothesis.
The chances of making critical mistakes in hypothesis generation are exacerbated not only by the quantity of data, but also by the current manual methods of data analysis (3,4). With no definitive means of readily identifying genes underlying Quantitative Traits, researchers are required to progressively analyse data until sufficient information has been collected to form a hypothesis. Implementing this non-systematic approach can result in hypotheses being drawn from assumptions and biased filtering of data (5,6).
Tens to hundreds of genes may lie under even a well-defined QTL. It is therefore vital that the identification, prioritisation and functional testing of Quantitative Trait genes (QTg) are carried out systematically, without bias introduced from prior assumptions about candidate genes (7).
Because candidate Quantitative Trait genes (QTg) need to be prioritised for further investigation, researchers have opted to investigate QTL data at the level of biological pathways (3,8,9). By investigating at the level of biological pathways, the functional relationships between underlying Quantitative Trait genes and a phenotype become more explicit (1,10,11). It is also possible to obtain a global view of the processes which may contribute to the expression of the phenotype (12). Additionally, the explicit identification of responding pathways naturally leads to experimental verification in the laboratory.
As a consequence, this case study will focus on the analysis of QTL data, showcasing how it is possible to identify candidate genes, and their corresponding pathways, believed to correlate with a given phenotype.
In order to identify which genes lie within the QTL region of choice, the physical boundaries of the QTL need to be determined. Each gene is then annotated with its associated biological pathways, obtained from the KEGG pathway database (13). This process is summarised in Figure 1.
Figure 1. An illustration showing the prioritisation of phenotype candidates, from the pathway-driven approach. Those pathways which contain genes from the QTL region are assigned a higher priority (pathways A and B) than those with no link to the QTL region (pathway C). Higher priority pathways are then ranked according to their involvement in the phenotype's expression, based on literature evidence. Abbreviations: CHR – Chromosome; QTL – Quantitative Trait Loci.
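The first prioritisation step illustrated in Figure 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `prioritise_pathways` and all gene and pathway names are hypothetical placeholders, not part of the workflows described in this case study, and the subsequent literature-based ranking is not modelled.

```python
def prioritise_pathways(pathway_genes, qtl_genes):
    """Split pathways into those linked to the QTL region and those that are not.

    pathway_genes: dict mapping a pathway name to the set of genes it contains.
    qtl_genes: set of genes lying within the QTL region.
    Returns (high_priority, low_priority) as sorted lists of pathway names.
    """
    high, low = [], []
    for pathway, genes in pathway_genes.items():
        # A pathway is high priority if at least one of its genes is in the QTL region.
        if genes & qtl_genes:
            high.append(pathway)
        else:
            low.append(pathway)
    return sorted(high), sorted(low)


# Placeholder data mirroring Figure 1 (all names are invented for illustration).
pathway_genes = {
    "pathway A": {"gene1", "gene2"},
    "pathway B": {"gene3", "gene4"},
    "pathway C": {"gene5", "gene6"},
}
qtl_genes = {"gene2", "gene3"}

high_priority, low_priority = prioritise_pathways(pathway_genes, qtl_genes)
print(high_priority)  # pathways containing at least one QTL gene
print(low_priority)   # pathways with no link to the QTL region
```

Using set intersection keeps the rule explicit: a single shared gene is enough to link a pathway to the QTL region.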
For such an approach to be conducted systematically, any Web resources used (including their parameters) must be stated explicitly. To do this, we must identify the Web resources to be used and determine if a Web Service interface has been provided. If, however, no Web Service interface is available, then a different service must be used, or a Web Service must be created to access data behind the specific resource.
In order to determine the genes that lie within a given QTL region, the positions of the flanking markers used in the original mapping studies should be used. The precise base pair positions of these markers can then be used to identify the left and right boundaries of the QTL. It should be noted that the precise version of the database chosen for mapping these genetic markers should be recorded, to provide an explicit method of data analysis. Failure to record such details hinders the reproducibility of further investigations.
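As a minimal sketch of this step, the boundaries can be taken as the outermost flanking marker positions and stored together with the database version used to map them. The class and function names, the marker positions and the release label below are all hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QTLRegion:
    chromosome: str
    start_bp: int
    end_bp: int
    database_version: str  # recorded so that the analysis can be repeated


def qtl_from_markers(chromosome, marker_positions_bp, database_version):
    """Use the outermost flanking markers as the left and right QTL boundaries."""
    if len(marker_positions_bp) < 2:
        raise ValueError("at least two flanking markers are needed")
    return QTLRegion(
        chromosome=chromosome,
        start_bp=min(marker_positions_bp),
        end_bp=max(marker_positions_bp),
        database_version=database_version,
    )


# Hypothetical marker positions (in base pairs) and release label.
region = qtl_from_markers("17", [45_200_000, 31_500_000], "example release 1")
print(region)
```

Keeping the database version inside the same record as the coordinates means the boundaries cannot be passed on without the provenance needed to reproduce them.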
The qtl_pathway workflow was constructed to identify genes within a QTL region, and subsequently map them to pathways held in the KEGG pathway database.
In order to construct this workflow, three input parameters were required: chromosome_name, start_position and end_position. These allow the QTL region to be placed on the physical map of the chromosome.
The three input parameters can then be connected to BioMart, via a [BioMart processor]. This processor allows for direct communication with the Ensembl dataset, enabling us to retrieve a list of genes for a given chromosome region.
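Outside of the workflow environment, the same three inputs can be expressed as a BioMart XML query. The sketch below only builds the query document and does not contact any server. The dataset name `mmusculus_gene_ensembl` and the filter and attribute names are assumptions based on the public Ensembl BioMart configuration; they should be checked against the release actually used, and the function `build_biomart_query` is hypothetical rather than part of the qtl_pathway workflow.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(chromosome_name, start_position, end_position,
                        dataset="mmusculus_gene_ensembl"):
    """Return a BioMart XML query string for the genes in a chromosome region."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # The three workflow inputs become three filters (filter names are assumptions).
    filters = {
        "chromosome_name": str(chromosome_name),
        "start": str(start_position),
        "end": str(end_position),
    }
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # Attributes to return for each gene (attribute names are assumptions).
    for name in ("ensembl_gene_id", "start_position", "end_position"):
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Hypothetical region on mouse chromosome 17.
query_xml = build_biomart_query("17", 31_500_000, 45_200_000)
print(query_xml)
```

Stating the dataset, filters and attributes in one explicit document is in keeping with the requirement that the resources used, and their parameters, are recorded.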
The list of genes from Ensembl, together with UniProt (14) and Entrez Gene (15) identifiers, is subsequently cross-referenced to KEGG gene and pathway identifiers. A fragment of this workflow can be seen in Figure 2, which shows the mapping from QTL region to KEGG gene identifiers, to KEGG gene descriptions. Note that only three Web Services were used in this workflow.
Additional shim services were added to format data into the correct input/output style; these services have not been assigned labels in Figure 2. An example workflow where these have been highlighted is given at: domain services and shim services.
Figure 2. Annotation workflow to gather genes in a QTL region, and provide information on the pathways involved with a phenotype. This workflow, shown as a sub-set of the complete workflow, requires a chromosome, and QTL start and stop positions in base pairs. The genes in this QTL region are then returned from Ensembl via a BioMart plug-in. These genes are subsequently annotated with UniProt and Entrez identifiers, start and end positions, Ensembl Transcript ids, and Affymetrix probeset identifiers. The UniProt and Entrez ids are submitted to the KEGG gene database, retrieving a list of KEGG gene ids.
The process of collating the results into single output files is important when analysing results returned from the workflows. A number of bioinformatics services require their input as an array, or list, of values. More information on these data types can be found at: Lists and Iterations. The resulting output may itself be a list of outputs, with each output containing a list of gene or pathway identifiers. This mass of inter-connected data makes it difficult to interpret the results gathered from the workflows, where just a few input values can be amplified into hundreds of results. This problem means that we must shift our methods from data gathering alone to data gathering and management.
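The collation of nested outputs can be illustrated with a short sketch. It assumes, for illustration only, that a service returns one inner list of pathway identifiers per input gene; the function name `collate` and the example identifiers are hypothetical.

```python
from itertools import chain


def collate(nested_outputs):
    """Flatten a list of lists into one list, dropping duplicates but keeping order."""
    seen = set()
    collated = []
    for item in chain.from_iterable(nested_outputs):
        if item not in seen:
            seen.add(item)
            collated.append(item)
    return collated


# Hypothetical output: one inner list of pathway identifiers per input gene.
nested = [["path:mmu04010", "path:mmu04620"], ["path:mmu04010"], []]
print(collate(nested))
```

Note that flattening in this way discards the link between each gene and its pathways, which is why the cross-references also need to be stored.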
To assist the bioinformatician in the analysis of these workflow outputs, we constructed a simple relational flat-file system, where cross-references to identifiers held in additional files were stored with identifiers from the service that had been invoked. An example of this is the storage of KEGG genes and their KEGG pathway identifiers. The gene identifiers and pathway identifiers were obtained from separate services. To correlate these identifiers, however, the researcher would require a means of knowing which pathway identifiers were the product of a search with a gene identifier. We resolved this issue with a number of shim services that stored the KEGG gene id with all of its associated KEGG pathway identifiers in a tab-delimited format. This enabled the bioinformaticians to query via a gene or pathway identifier and obtain its corresponding gene or pathway identifier, e.g. a query for gene, mmu:13163, would return “mmu:13163 path:mmu04010”.
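A lookup over such a tab-delimited file might be sketched as follows. This is not the shim service code itself: the function name `load_gene_pathway_table` is hypothetical, the file is replaced by an in-memory stand-in, and the second row is invented purely to show a pathway shared by two genes. Each row is assumed to hold a gene identifier followed by one or more pathway identifiers.

```python
import csv
import io
from collections import defaultdict


def load_gene_pathway_table(handle):
    """Read tab-delimited 'gene<TAB>pathway[<TAB>pathway...]' rows into two indexes."""
    by_gene = defaultdict(list)
    by_pathway = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank lines and genes with no pathway
        gene, *pathways = row
        for pathway in pathways:
            by_gene[gene].append(pathway)
            by_pathway[pathway].append(gene)
    return by_gene, by_pathway


# In-memory stand-in for a flat file; the second row is invented for illustration.
flat_file = io.StringIO("mmu:13163\tpath:mmu04010\nmmu:00000\tpath:mmu04010\n")
by_gene, by_pathway = load_gene_pathway_table(flat_file)

print(by_gene["mmu:13163"])         # pathways recorded for one gene
print(by_pathway["path:mmu04010"])  # genes recorded for one pathway
```

Building both indexes in one pass allows a query in either direction, by gene or by pathway, as described above.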
The generality of these workflows allows them to be re-used for the integration of mapping data in other cases; the QTL gene annotation workflow may be utilised in any project that uses the mouse model organism.
It should be noted that an unavoidable ascertainment bias is introduced into the methodology by utilising remote resources for candidate selection. Because the approach relies on extant knowledge, the lack of pathway annotations limits the ability to narrow down the true candidate genes from the total genes identified in a QTL region. A rapid increase in the number of genes annotated with their pathways, however, means that the number of candidate QTg identified in subsequent analyses is sure to increase. The workflows described here provide the means to readily repeat the analysis.
The KEGG pathway database was chosen as the primary source of pathway information because it is publicly available and contains a large set of biological pathway annotations. This results in a bias, relying on extant knowledge from a single data repository; however, this investigation was established as a proof of concept for the proposed methodology and, with further work, may be modified to query any number of pathway databases, provided they offer Web Service functionality. All the workflows developed in this investigation are available on myExperiment.
1. Doerge, R. (2002) Mapping and analysis of quantitative trait loci in experimental populations. Nat Rev Genet, 3, 43-52.
2. Iraqi, F., Clapcott, S., Kumari, P., Haley, C., Kemp, S. and Teale, A. (2000) Fine mapping of trypanosomiasis resistance loci in murine advanced intercross lines. Mamm Genome, 11, 645-648.
3. Fisher, P., Hedeler, C., Wolstencroft, K., Hulme, H., Noyes, H., Kemp, S., Stevens, R. and Brass, A. (2007) A systematic strategy for large-scale analysis of genotype phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Res, 35, 5625-5633.
4. Stevens, R., Tipney, H., Wroe, C., Oinn, T., Senger, M., Lord, P., Goble, C., Brass, A. and Tassabehji, M. (2004) Exploring Williams-Beuren syndrome using myGrid. Bioinformatics, 20 Suppl 1.
5. Hedeler, C., Paton, N., Behnke, J., Bradley, J., Hamshere, M. and Else, K. (2006) A classification of tasks for the systematic study of immune response using functional genomics data. Parasitology, 132, 157-167.
6. Kaminski, N. and Rosas, I. (2006) Gene expression profiling as a window into idiopathic pulmonary fibrosis pathogenesis: can we identify the right target genes? Proceedings of the American Thoracic Society, 3, 339-344.
7. Glazier, A., Nadeau, J. and Aitman, T. (2002) Finding genes that underlie complex traits. Science, 298, 2345-2349.
8. Levison, S., McLaughlin, J., Zeef, L., Fisher, P., Grencis, R. and Pennock, J. (2010) Colonic transcriptional profiling in resistance and susceptibility to trichuriasis: Phenotyping a chronic colitis and lessons for iatrogenic helminthosis. Inflammatory bowel diseases, 16, 2065-2079.
9. Jouffe, V., Rowe, S., Liaubet, L., Buitenhuis, B., Hornshoj, H., SanCristobal, M., Mormede, P. and de Koning, D. (2009) Using microarrays to identify positional candidate genes for QTL: the case study of ACTH response in pigs. BMC Proceedings, 3, S14.
10. Brown, A., Olver, W., Donnelly, C., May, M., Naggert, J., Shaffer, D. and Roopenian, D. (2005) Searching QTL by gene expression: analysis of diabesity. BMC Genet, 6.
11. Fischer, G., Ibrahim, S., Brockmann, G., Pahnke, J., Bartocci, E., Thiesen, H., Serrano-Fernández, P. and Möller, S. (2003) Expressionview: visualization of quantitative trait loci and gene-expression data in Ensembl. Genome Biol, 4.
12. Schadt, E. (2006) Novel integrative genomics strategies to identify genes for complex traits. Anim Genet, 37 Suppl 1, 18-23.
13. Kanehisa, M. and Goto, S. (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res, 28, 27-30.
14. Bairoch, A., Apweiler, R., Wu, C., Barker, W., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M. et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res, 33.
15. Maglott, D., Ostell, J., Pruitt, K. and Tatusova, T. (2007) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res, 35.