The State of the Bioinformatics Nation
Bioinformatics tools and resources have a set of characteristics that make workflows a useful way of undertaking bioinformatics experiments. As well as coping with this situation, workflows have other benefits that will emerge as we explore workflow in this k-blog. We talk about the “state of the bioinformatics nation” [ref] in response to Lincoln Stein’s article calling for the formation of a bioinformatics nation [ref], rather than a set of city states, and suggesting that this nation could be formed around the notion of Web services. In the world of bioinformatics, the “city states” arise as local centres of power centred on tools and databases; each city state has its own way of doing things, and it is this diversity that makes doing bioinformatics difficult. A nation, by contrast, has a common identity and way of doing things that helps achieve common goals. The difficulties of the bioinformatics “city states” and the emerging nation can be summarised as follows:
Numerous data sources;
Heterogeneity in the technology and conceptualization of these resources;
Distribution of these resources;
The complexity of the data being represented;
The volatility of how these data are delivered, understood and produced.
So, what is the “state of this nation”?
In the 2009 Database special issue of Nucleic Acids Research there were over 1000 different biological databases listed (Galperin 2008). Each is a published data resource available for bioinformatics, and many also have associated analysis tools and search algorithms, increasing the number of possible resources to several thousand. These resources have been developed over time by different institutions; consequently, they are distributed and highly heterogeneous. There are also few standards for data representation or data access. Therefore, despite the availability of resources, integration and interoperability present significant challenges to researchers (Davidson, Overton et al. 1995).
A typical bioinformatics experiment involves gathering data from many of these sources and performing a series of connected sub-experiments on it. If the initial data set is small, this process can be managed by the scientist manually transferring data and results between resources. If the data set is large, however, as in the high-throughput experiments of microarray analyses or proteomics, this process becomes impractical and automated methods have to be employed.
Many of these data sets and tools have been generated by individual groups around the world, and they control their data sets in an autonomous fashion. This means that each group of providers can, and often does, devise their own data formats, terminologies and conceptualization for capturing these data. When it comes to using many resources together, this heterogeneity is a problem.
Public data are either housed in public repositories, such as GenBank or UniProt, or generated and published by individual laboratories in bespoke data stores. In the case of public data stores, some data are mirrored between resources. For example, GenBank, the DDBJ and EMBL all contain primary DNA data and share new data submissions with the other two every 24 hours; in spite of this, each retains its own data format, and the data entries themselves can sometimes contain different metadata. Most of the bespoke data stores also use their own formats and control their data sets in an autonomous fashion. The result is that information about the same biological objects can be spread across data resources and, to gain the most from existing knowledge, this needs to be gathered together.
How to identify when the same biological object is being referred to is the next problem facing bioinformaticians [ref]. The importance of unique identifiers for biological objects in individual data resources is clear to all, and any data entry in such a resource will have a unique identifier, but it is usually an identifier for that database alone. The identifier for the database entry is often taken as an identifier for the object being described; a UniProt accession number becomes an identifier for the protein, proteins or pool of proteins being described. The same object in another database will typically have a different identifier. There are still few naming conventions or identification schemes that are globally adopted, so a large effort is expended in mapping identifiers across resources. The dynamic nature of the data and the frequency with which it is updated also mean that this mapping is a continuous, and error-prone, process.
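The identifier-mapping problem can be sketched as a simple look-up structure. This is a minimal illustration, not a real service: the database names and accession numbers below are hypothetical, and a real mapping table would be continuously curated as the source databases change, which is exactly where errors creep in.

```python
# Hypothetical cross-database identifier mapping: each (database, accession)
# pair points at a shared object identifier. All values are illustrative.
ID_MAP = {
    ("uniprot", "P00001"): "protein:42",
    ("genbank", "AB000001"): "protein:42",
    ("embl", "X00001"): "protein:7",
}

def same_object(db_a, id_a, db_b, id_b):
    """Return True if two database entries resolve to the same shared object."""
    obj_a = ID_MAP.get((db_a, id_a))
    obj_b = ID_MAP.get((db_b, id_b))
    # Unknown identifiers cannot be matched to anything.
    return obj_a is not None and obj_a == obj_b
```

The fragility lies not in the look-up itself but in keeping `ID_MAP` synchronised with every upstream resource as entries are added, merged and retired.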
There are several initiatives designed to lessen these problems, but none has offered a complete solution. The Life Science Identifiers (LSID) initiative showed great promise [ref]. It was a system designed to unify the process of uniquely naming, referencing and retrieving distributed data objects and concepts, and it was proposed as an industry standard for identification and access in biology. The success of this scheme hinged on mass uptake by data providers. Although some did adopt LSIDs for biological objects, many did not, and the system never became widespread. Instead, individual data providers keep their own identification systems and maintain mappings to the same biological objects in other data sources, leading to a proliferation of mapping or ‘look-up’ services and further increasing the number of resources for bioinformaticians to deal with. The “shared names” initiative is another attempt to solve the same problem: the idea is to “share names” of database records across databases, tools and projects. The shared names Web site has a good introduction to the problems of identity in bioinformatics.
Another popular approach to unifying resources is to use a controlled vocabulary or ontology to provide the descriptions necessary to identify the same biological objects. One of the most successful of these is the Gene Ontology and the wider Open Biomedical Ontologies (OBO) Foundry activity. The OBO Foundry is a community-wide effort to provide a common vocabulary for describing entities across biology. The Gene Ontology is the most prominent of these ontologies and is now used to describe the major functional attributes of gene products across some 40 genome resources.
Understanding when you are looking at the same biological objects is also important when you look back at previous analyses. We have been focusing here on the problems the community faces when designing experiments and gathering data, but identity is just as much of an issue when analysing results after experiments have been performed. The mapping resources are potential places for errors to arise, so it is important that all data items can be traced back to their source.
The problems of distribution and identity mapping are barriers to integration, and integration is vital for extracting the most from the wealth of public data available, often in conjunction with data produced locally by the laboratory performing the analysis. Combining data resources, however, is only half of the problem. In a typical bioinformatics experiment, once data has been gathered from a variety of sources, a series of analyses is applied to it. These analysis tools are also distributed, and the formats of the data they accept and produce vary widely. Ideally, a scientist should be able to automatically extract data from distributed sources in real time, then combine and analyse those data with tools from other locations. On a small scale, this can be achieved by the scientist performing the integration by hand, visiting web sites to gather data and analysing it with online tools and resources. In other words, the primary “integration layer” is often the bioinformaticians themselves. Using their knowledge of the domain, they navigate through the various web pages offering data or tool access, making decisions along the way about what to save and what to discard.
This manual process can be effective, but it is not scalable, subject to user errors, and difficult to reproduce.
Scientists often produce bespoke code to automate this kind of analysis, often screen-scraping the same Web pages that the manual process would use, and sometimes using more programmatically amenable forms of access. These scripts, however, tend to be brittle. If a web page changes, the script can break. If data is downloaded, the local copies need to be maintained to prevent out-of-date information being used in experiments. The volatility of knowledge in biology and the dynamic nature of the landscape mean that data analysis solutions spanning many resources have to be flexible. New data generation techniques frequently arise; new data resources and tools frequently arise; and as our understanding of biology changes, so will the data and tools. All of these aspects make experiments across resources difficult.
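The brittleness of screen-scraping can be shown in miniature. This sketch scrapes a made-up result page held in a string (a real script would fetch a live page); the page markup, element id and sequence are all invented for illustration. The point is that the extraction pattern is welded to the exact markup, so any cosmetic redesign of the page breaks the script.

```python
import re

# Snapshot of a hypothetical result page; real scripts would fetch this live.
PAGE = '<html><body><pre id="seq">MKTAYIAKQR</pre></body></html>'

def scrape_sequence(html):
    """Pull the sequence out of the page markup.

    Brittle by design: the regular expression matches one exact piece of
    markup, so a cosmetic change to the page (say, <pre> becoming <div>)
    silently stops it matching and the scraper fails.
    """
    match = re.search(r'<pre id="seq">([A-Z]+)</pre>', html)
    if match is None:
        raise ValueError("page layout changed; scraper needs updating")
    return match.group(1)
```

A programmatic service interface removes this particular failure mode, but as the text notes, the underlying resources still change shape over time.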
Data standards have begun to emerge. Many resources now use common vocabularies. Many data sources and tools are available as services, meaning that tools can be built around these resources more easily than a decade ago. There is now more of a bioinformatics nation. In spite of this, the problems of distribution, volatility and heterogeneity still exist and will continue to exist; they are in the nature of bioinformatics. We still hope and expect that bioinformatics will become more nation-like, while maintaining innovation.
Workflow systems such as Taverna are inherently designed to manage access to distributed resources. Taverna forms pipelines that visit these resources, passing data from one source to a tool and then on to another component in the pipeline. Heterogeneity can be addressed by adding local fixes to the data as they stream through the pipeline, resolving incompatibilities. Finally, workflows inherently address scalability: they do the same thing to each and every piece of data, so the all-too-human slips are avoided.
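The pipeline idea can be sketched in a few lines. This is a toy in the spirit of a workflow system, not Taverna itself: the step names, record format and "shim" are invented for illustration. Each step is applied to every input item in turn, which is where the scalability and repeatability come from, and the shim step stands in for the local fixes that resolve format incompatibilities between services.

```python
def fetch(record_id):
    # Stand-in for retrieving a record from a remote resource.
    return {"id": record_id, "seq": "atgaaa"}

def shim_uppercase(record):
    # A local "shim": the next tool expects upper-case sequences, so we
    # fix the incompatibility as data streams through the pipeline.
    record["seq"] = record["seq"].upper()
    return record

def analyse(record):
    # Stand-in analysis step: compute the GC content of the sequence.
    seq = record["seq"]
    gc = sum(1 for base in seq if base in "GC") / len(seq)
    return {"id": record["id"], "gc": gc}

def run_pipeline(record_ids, steps):
    """Apply the same steps, in the same order, to every input item."""
    results = []
    for record_id in record_ids:
        item = record_id
        for step in steps:
            item = step(item)
        results.append(item)
    return results
```

Because every item flows through identical steps, there is no opportunity for the copy-paste slips that creep into manual analysis, and re-running the experiment is just re-running the pipeline.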
This is the state of the bioinformatics nation that Taverna workflows, and other workflow systems, tackle. To undertake scalable, repeatable bioinformatics experiments, bioinformaticians need to deal with distribution and heterogeneity, as well as the other problems outlined in this k-blog. Workflows are one solution to the state of the bioinformatics nation, and the one upon which we concentrate here.