Posted on December 13, 2010 by Katy Wolstencroft in Case Studies

Functional Genomics in Taverna

Functional genomics exploits the wealth of biological information now available in the post-genomics era to understand functional processes (e.g. transcription, translation and protein-protein interaction) on a genomic scale. To do this effectively, however, you must dynamically gather and integrate data from a variety of sources. Taverna provides a mechanism to gather and integrate distributed data, allowing previously isolated data sets to be combined and compared.

The CASIMIR consortium (Coordination and Sustainability of International Mouse Informatics Resources), which had a remit to assess the technical and social aspects of database interoperability in the mouse model organism community, used Taverna workflows to overcome problems associated with accessing distributed functional genomics data [1].

A typical data integration problem from CASIMIR is the gathering of data from various sources to find evidence for linkage between particular phenotypes and genes. For example, this could involve gathering data on phenotype differences and allelic variants between strains, genotypes (gene-related data for those in particular genomic locations) and pathways in which these genes may be involved.

The following workflow (Figure 1) performs this kind of integration. It first uses a BioMart database interface to Ensembl to recover Ensembl Gene IDs and their corresponding EMBL IDs for a list of known Mouse Genome Informatics (MGI) symbols. Then, for each Ensembl gene, KEGG is used to recover pathway IDs and HTML links to marked-up pathways, and the Mouse Phenome Database (MPD) is used to retrieve Single Nucleotide Polymorphisms (SNPs) per gene and the allelic variation per mouse strain.

Figure 1: A workflow for mouse functional genomics from the CASIMIR consortium
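
The shape of this integration can be sketched outside Taverna in a few lines of Python. This is only an illustration of the join logic, not the CASIMIR workflow itself: the three lookup tables stand in for the BioMart, KEGG and MPD services, and every identifier, gene symbol and strain name in them is an invented placeholder. The names `GeneRecord` and `integrate` are likewise hypothetical.

```python
from dataclasses import dataclass, field

# Placeholder tables standing in for the remote services.
# All identifiers here are invented for illustration only.

# BioMart/Ensembl step: MGI symbol -> [(Ensembl gene ID, EMBL ID), ...]
SYMBOL_TO_ENSEMBL = {
    "GeneA": [("ENSMUSG_FAKE_1", "EMBL_FAKE_1")],
    "GeneB": [("ENSMUSG_FAKE_2", "EMBL_FAKE_2")],
}

# KEGG step: Ensembl gene ID -> pathway IDs
ENSEMBL_TO_PATHWAYS = {
    "ENSMUSG_FAKE_1": ["path:fake00010"],
}

# MPD step: Ensembl gene ID -> SNPs with the allele seen in each strain
ENSEMBL_TO_SNPS = {
    "ENSMUSG_FAKE_1": [
        {"snp": "rs_fake_1", "alleles": {"StrainX": "A", "StrainY": "G"}},
    ],
}


@dataclass
class GeneRecord:
    """One integrated row: a gene plus its pathway and SNP evidence."""
    symbol: str
    ensembl_id: str
    embl_id: str
    pathways: list = field(default_factory=list)
    snps: list = field(default_factory=list)


def integrate(symbols):
    """Map each MGI symbol to Ensembl genes, then attach pathways and SNPs."""
    records = []
    for symbol in symbols:
        # Symbols with no Ensembl mapping are skipped.
        for ensembl_id, embl_id in SYMBOL_TO_ENSEMBL.get(symbol, []):
            records.append(
                GeneRecord(
                    symbol=symbol,
                    ensembl_id=ensembl_id,
                    embl_id=embl_id,
                    pathways=list(ENSEMBL_TO_PATHWAYS.get(ensembl_id, [])),
                    snps=list(ENSEMBL_TO_SNPS.get(ensembl_id, [])),
                )
            )
    return records


if __name__ == "__main__":
    for record in integrate(["GeneA", "GeneB", "GeneC"]):
        print(record)
```

In a real setting each dictionary lookup would be replaced by a call to the corresponding service, which is the part Taverna handles in the workflow. Genes with no pathway or SNP data are kept with empty lists, so missing evidence stays visible rather than silently dropping the gene.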

The workflow is available for download from myExperiment (http://www.myexperiment.org/workflows/126.html). It can be used as it is, or modified to provide further data integration functionality or analysis (e.g. to identify non-synonymous SNPs, or SNPs from regulatory or other coding regions).

1. Damian Smedley, Morris A. Swertz, Katy Wolstencroft, Glenn Proctor, Michael Zouberakis, Jonathan Bard, John M. Hancock, and Paul Schofield (2008). Solutions for data integration in functional genomics: a critical assessment and case study. Brief Bioinform 9(6): 532-544. doi:10.1093/bib/bbn040
