Identification of differentially-expressed genes from microarray data
There has been a dramatic increase in the amount of quantitative data derived from the measurement of changes at different levels of biological complexity during the post-genomic era. This is especially true in the case of genome-scale analyses of the transcriptome, proteome and metabolome, particularly when such measurements have been made in parallel using high throughput technologies involving microarrays and mass spectrometry techniques . Analyses of these data rely on the performance of in silico experiments, involving the inductive detection of patterns in the data to which some phenotypic significance can be attributed . Such analyses usually rely on statistical testing and linking the results of these tests with information stored in biological databases to summarise and develop conclusions. For example, the analysis of gene expression data generated from microarray experiments consists of a number of steps. The process begins with the normalization and standardization of transcript data, followed by statistical evaluation, and finally, interpretation of the statistical results via the annotation of genes with information relating to their biological function.
In silico experiments on bioinformatics data may be realised as workflows consisting of a pre-defined series of tasks that are related to one another by the flow of data between them. Such workflows can be constructed and enacted using applications such as Taverna that automatically direct the flow of data between the information repositories and computational tools responsible for performing the tasks within an in silico experiment. In this post, we report on how the Taverna workflow system can be used for the statistical analysis of quantitative, post-genomic data with an example from the transcriptomics domain to detect differentially-expressed genes in microarray data. We show a workflow which retrieves data using customised maxdBrowse web services from the maxdLoad2 microarray database . This workflow then performs statistical analysis on the gene expression data using R to generate a list of differentially-expressed genes which is followed by the annotation of the genes with information stored in biological databases.
Microarray data analysis workflow
The purpose of this workflow is to discover differentially-expressed genes by t-test analyses between two sets of microarray data followed by the identification of common terms from the Gene Ontology (GO) associated with these genes. These terms were identified from the biological process, cellular component and molecular function sub-ontologies of the GO using a web service wrapping of the GOTermFinder tool .
Web service access to maxdBrowse
Prior to the analysis of the microarray data, the workflow was also responsible for the selection and retrieval of the gene expression data. The source of these data will differ with every research group but in this workflow example, our data is stored in a maxdLoad2 database, part of the maxd software suite for microarray data storage and visualisation (Figs. 1 and 2) . Microarray data and their associated metadata as defined by MIAME can be loaded into the database using the maxdLoad2 desktop application, whilst data can be queried and retrieved from a web browser using a PHP web application called maxdBrowse . To enable data in maxdLoad2 to be accessed from Taverna workflows, maxdBrowse now provides programmatic access to the database using web services. A default set of operations is provided by the maxdBrowse web service, where each operation performs distinct queries providing access to different metadata about the microarray experiments held within maxdLoad2 as well as the actual gene expression level measurements (Fig. 2).
Use of Beanshell scripts for reconciling mismatches in data interoperability between services and implementing user interaction during workflow execution
It is often necessary to transform data during their transfer between services within workflows due to incompatible mismatches in their syntactic format. In such cases, manual intervention is required to convert the data into the correct form prior to their consumption by a service in order for the workflow to enact successfully. To this end, the ability to execute scripts written in the Beanshell programming language, an interpreted form of Java, was incorporated into a Taverna processor. Beanshell scripts can be written to provide a mechanism of transforming data so that the data are consumable to services within a workflow. Within the example microarray analysis workflow, Beanshell scripts have been used to merge multiple sets of microarray data retrieved from the maxdLoad2 database prior to their analysis using R. A Beanshell script was also used to generate one of the final outputs of the workflow which was in the form of a text file of comma-separated values (CSV) containing a list of the differentially-expressed genes identified from t-tests combined with functional information obtained from public databases (Fig. 1).
An ad hoc, albeit basic, form of user interaction can also be implemented using Beanshell scripts through the use of Java Swing classes as part of a workflow enactment. This was used by the microarray workflow for the multiple selection of two sets of data for t-test analysis by R (Fig. 1).
Invocation of R from Taverna workflows
The next step in the workflow is the statistical analysis of the two sets of gene expression data once they have been selected and retrieved from maxdLoad2. A Taverna processor called RShell has been developed to invoke scripts in the R computing environment, and was used to perform the statistical calculations for the t-test analyses (http://www.r-project.org/) (Figs. 1 and 2). This processor acts as a client to R when it has been deployed as a TCP/IP server using the RServe library (http://rosuda.org/Rserve), relaying the script and its inputs to be executed in the R environment (Fig. 3). The script contains statements in the R programming language to implement the statistical calculations. In the example workflow, the RShell processor invokes the R server with a script to perform a t-test for every gene expression level, identifying those genes which are differentially-expressed from the data retrieved from maxdLoad2 (Fig. 1). The location of the R server to be used for analysis is defined by a hostname and port number declared when configuring the processor. Data are fed to the RShell processor via its input ports and are made available to the R script as variables named after the ports (Fig. 4). In a similar fashion, the results generated after the execution of the R script are available from the RShell processor via its output ports by reading the last value assigned to the variable named after them (Fig. 4). In addition, graphs generated by R can also be returned to Taverna in the form of images as outputs by the workflow.
Use of plugins for extending Taverna functionality
Both the Beanshell and RShell processors provide Taverna with access to generic tools for analysing different types of quantitative data. This is accomplished by these two processors without the need to develop and deploy web services, thereby enabling rapid prototyping without a web services infrastructure. Taverna can also be tailored for use in specific scientific domains by extending its functionality with additional source code through the use of software plugins (add link here). Plugins and their dependencies are shared with the scientific community by their deployment to Maven repositories which can be made available on a web server. Plugins are installed in the Taverna workbench through its Plugin Manager by providing a URL which points to the XML file that lists the available plugins; those required by the user can then be selected for installation and use. In the microarray data analysis workflow, plugins were developed for displaying PDF documents (Fig. 5) as well as for visualising quantitative data when formatted as CSV as a table. These two renderer plugins can be downloaded and installed for use in Taverna 2.2 from http://dbk-ed.mib.man.ac.uk/taverna/t2/plugins/ using the Plugin manager on the Taverna workbench.
The example workflow shown in Figure 1 was evaluated using microarray data originating from a study into the effects of growth rate on the transcriptome of the yeast, Saccharomyces cerevisiae, when grown under different nutrient-limiting conditions (see additional file 1: workflow.xml) . The data from this study were normalized by GC Robust Multi-array Average background adjustment and then stored in the maxdLoad2 database . The enactment of the workflow started with the selection and retrieval of data from the microarray database using the maxdBrowse web service interface and Beanshell scripts. This was followed by the invocation of R with a script to perform a series of t-tests to identify the genes that are differentially-expressed between growth rates and culture conditions (Figs. 1 and 4). Common terms from the GO associated with these genes were then identified using the GOTermFinder web service. The GOTermFinder service returns PDF reports of these common GO terms which were browsed using the Taverna workbench with the PDFRenderer plugin (Fig. 5).
Using the example workflow, the gene expression levels of yeast cultures growing under carbon limitation were selected for comparison between two different growth rates; those genes whose expression differs at 0.07 h-1 compared with 0.1 h-1. When the genes identified from t-test analyses at a p-value of less than 0.05 were subjected to GO analysis by the GOTermFinder tool with a p-value at the 1% level, only broad GO categories relating to the regulation of metabolic processes were identified (see additional file 2: carbon-ttest-0_01.zip). However, when the same set of genes was analysed by the GO tool with a less stringent threshold (p-value < 0.05), more GO terms appeared relevant including those associated with gene expression (see additional file 1: carbon-ttest-0_01.zip). The corresponding analysis of GO molecular functions for the same set of genes showed that some were involved in transcription regulation as well as encoding for a number of protein kinases.
The same analysis was applied to S. cerevisiae cultures growing at rates of 0.07 h-1 and at 0.10 h–1 under nitrogen limitation by selecting the appropriate data stored in the maxdLoad2 database during the enactment of the example workflow. A different pattern from that of the carbon limitation was observed since t-test analysis (p-value < 0.05) and GO analyses (p-value < 0.05) found a higher number of genes whose products are associated with the metabolism of nitrogen compounds (see additional file 3: nitrogen-ttest-0_05.zip). The most over-represented GO molecular function categories for this gene set corresponded to a variety of catalytic activities such as oxidoreductases, transmembrane transporter activities, and transferases.
The full set of results from the above analyses can be downloaded from http://www.mcisb.org/software/taverna/microarray/gcrma.zip.
1. Castrillo J, Zeef L, Hoyle D, Zhang N, Hayes A, Gardner D, Cornell M, Petty J, Hakes L, Wardleworth L et al: Growth control of the eukaryote cell: a systems biology study in yeast. J Biol 2007, 6:4.
2. Kell DB, Oliver SG: Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis-driven science in the post-genomic era. 2004, 26(1):99-105.
3. Hancock D, Wilson M, Velarde G, Morrison N, Hayes A, Hulme H, Wood A, Nashar K, Kell D, Brass A: maxdLoad2 and maxdBrowse: standards-compliant tools for microarray experimental annotation, data management and dissemination. BMC Bioinformatics 2005, 6:264.
4. Boyle E, Weng S, Gollub J, Jin H, Botstein D, Cherry J, Sherlock G: GO::TermFinder–open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 2004, 20:3710-3715.