Case study: Which proteins are associated with my favourite protein?
This blog post collects a few movies that I made as a mini-tutorial for colleagues working in bioinformatics and genomics. It demonstrates how a basic text mining workflow can be built using Taverna, myExperiment and BioCatalogue, in this case to answer the question of which proteins may be associated with a protein of interest.
These are the steps of the workflow:
- Start with your favourite protein name.
- Find publications mentioning your favourite protein.
- Extract putative protein names mentioned in the abstracts.
- Check if the protein names are genuine proteins (i.e. have a UniProt reference).
- Show a list of validated protein names.
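The five steps above can be sketched in plain Python to make the data flow explicit. This is only an illustration, not the Taverna workflow itself: the function names are hypothetical, the abstracts and protein names are invented toy data, the regular expression is a crude stand-in for the machine-learning name recognizer, and the set lookup stands in for a UniProt check.

```python
import re
from typing import Callable, Iterable, List


def find_associated_proteins(
    protein: str,
    search_abstracts: Callable[[str], Iterable[str]],
    extract_names: Callable[[str], Iterable[str]],
    is_known_protein: Callable[[str], bool],
) -> List[str]:
    """Run the workflow steps and return the validated protein names."""
    # Step 2: find publications (here: abstracts) mentioning the protein.
    abstracts = search_abstracts(protein)
    # Step 3: collect putative protein names from every abstract.
    candidates = set()
    for abstract in abstracts:
        candidates.update(extract_names(abstract))
    # Step 4: keep only names that pass the validation check.
    validated = [name for name in candidates if is_known_protein(name)]
    # Step 5: return a sorted list for display.
    return sorted(validated)


# Invented toy data so the sketch runs offline (not real abstracts).
TOY_ABSTRACTS = {
    "FAV1": [
        "FAV1 binds PRTA and PRTB during DNA repair.",
        "Loss of FAV1 changes PRTC levels in HeLa cells.",
    ],
}
TOY_KNOWN_PROTEINS = {"FAV1", "PRTA", "PRTB", "PRTC"}


def toy_search(protein: str) -> List[str]:
    """Stand-in for the document retrieval service."""
    return TOY_ABSTRACTS.get(protein, [])


def toy_extract(text: str) -> List[str]:
    """Stand-in for the name recognizer: upper-case tokens of 3+ characters."""
    return re.findall(r"\b[A-Z][A-Z0-9]{2,}\b", text)


def toy_is_known(name: str) -> bool:
    """Stand-in for checking that a name has a UniProt reference."""
    return name in TOY_KNOWN_PROTEINS


if __name__ == "__main__":
    print(find_associated_proteins("FAV1", toy_search, toy_extract, toy_is_known))
```

Passing the three stages in as functions mirrors the way services and nested workflows are swapped in and out in the movies: the toy "DNA" match is extracted as a candidate but dropped by the validation step, which is the false-positive filtering the last check is meant to provide.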
These are the movies:
- Creating a workflow using a service from BioCatalogue.org to perform a MEDLINE search [5’]
- Replacing a service with a nested workflow to take the MEDLINE search results apart [2’]
- Use machine learning to recognize protein names in the abstracts [7’]
- Use the complete workflow from myExperiment.org to perform all steps [3’]
- Replace the workflow input with an import from an Excel spreadsheet [2’]
- Use the workflow as a tool in Galaxy [Work in progress!!]
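Movie 1: Creating a workflow using a service from BioCatalogue.org to perform a MEDLINE search [5’]
- Creating an input in Taverna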
- Finding a document retrieval service through the BioCatalogue plugin
- Adding the operation to the workflow and browsing the BioCatalogue page about the service to find details and example data for the inputs of an operation
- Showing that the result of the operation is one large XML document
Movie 2: Replacing a service with a nested workflow to take the MEDLINE search results apart [2’]
- Importing a workflow from your local hard disk to replace the search service with a workflow that additionally takes the XML document apart (this workflow could also be found on myExperiment, or taken from a larger workflow).
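Taking the XML search result apart, as the nested workflow does, can be sketched with Python's standard library. This is an illustration under assumptions: the element names follow the PubMed-style layout (`PubmedArticle`, `MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`), which may differ from what the service in the movie returns, and the sample records are invented.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Invented sample in an assumed PubMed-style layout (not real records).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>First invented title</ArticleTitle>
        <Abstract>
          <AbstractText>First invented abstract.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <ArticleTitle>Second invented title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def split_records(xml_text: str) -> List[Dict[str, str]]:
    """Split one large XML document into one small record per citation."""
    root = ET.fromstring(xml_text)
    records = []
    for citation in root.iter("MedlineCitation"):
        # An abstract may be missing or split over several AbstractText parts.
        parts = [
            "".join(element.itertext()).strip()
            for element in citation.findall("Article/Abstract/AbstractText")
        ]
        records.append(
            {
                "pmid": citation.findtext("PMID", default=""),
                "title": citation.findtext("Article/ArticleTitle", default=""),
                "abstract": " ".join(parts),
            }
        )
    return records


if __name__ == "__main__":
    for record in split_records(SAMPLE_XML):
        print(record["pmid"], record["title"], record["abstract"] or "(no abstract)")
```

Returning one small record per citation gives the later steps a list of abstracts to iterate over, rather than a single large document.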
Movie 3: Use machine learning to recognize protein names in the abstracts [7’]
- Search BioCatalogue for a service that extracts protein names from text
- Importing the service manually into Taverna (not using the BioCatalogue plugin this time)
- Use BioCatalogue to find details and examples for the inputs of the operation
Movie 4: Use the complete workflow from myExperiment.org to perform all steps [3’]
- Find the protein discovery workflow on myExperiment
- Load the complete workflow with additional nested workflows to (i) ‘tweak’ the initial input to prioritize the list of abstracts returned by the search service, and (ii) filter out false positives by checking if the protein name is associated with a UniProt identifier.
Movie 5: Replace the workflow input with an import from an Excel spreadsheet [2’]
- Show how to use the Excel spreadsheet tool to obtain a list of inputs from an Excel sheet.
Movie 6: Use the workflow as a tool in Galaxy [Work in progress!!]
- A glimpse of the tooling that enables a Taverna workflow to be used as a tool in Galaxy, a popular platform in the field of genomics. This is work in progress, in collaboration with the Netherlands Bioinformatics Centre (NBIC).