Lists and iterations
Many workflows need to iterate over the items in a list. For instance, if the first service in your workflow connects to BioMart to retrieve all sequences in a certain region of a chromosome, and you want to connect this to a second service that performs a BLAST sequence alignment on each of the sequences, you’ll need Taverna to iterate over the returned sequences.
In the Biomart and BLAST workflow, the hsapiens_gene_ensembl BioMart service returns a list of genome sequences from a region on human chromosome 22. After running the workflow, this list can be inspected on the output port transcript_exon_intron. The next service, blast_ddbj, takes a single input query (plus the constants for the BLAST parameters program and database) and returns a BLAST report from a sequence alignment against the DDBJ rodent gene database.
By simply connecting these services together, Taverna will recognize the depth mismatch between the list and the expected single argument, and perform implicit iteration. Taverna will execute blast_ddbj multiple times, once for each element in the list, and return a new list of BLAST reports to the text_blast_out output port.
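The idea behind implicit iteration can be sketched in a few lines of Python. This is only an illustration of the concept, not Taverna's implementation: implicit_iterate and fake_blast are hypothetical names, and fake_blast is a stand-in that does not run BLAST.

```python
from typing import Callable, List, Union


def implicit_iterate(service: Callable[[str], str],
                     value: Union[str, List]) -> Union[str, List]:
    """Call `service` once per element when given a list, else call it directly."""
    if isinstance(value, list):
        # Depth mismatch: the service expects a single value but got a list,
        # so invoke it for each element and collect the results in order.
        return [implicit_iterate(service, item) for item in value]
    return service(value)


def fake_blast(query: str) -> str:
    # Hypothetical stand-in for blast_ddbj; it only labels its input.
    return f"report for {query}"


sequences = ["ACGT", "GGCC", "TTAA"]
reports = implicit_iterate(fake_blast, sequences)
print(reports)
```

The result is a list with one report per input sequence, in the same order as the inputs, which mirrors the list of BLAST reports delivered to text_blast_out.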
While running the workflow and inspecting the text_blast_out output, you can see individual BLAST reports appear as soon as they are returned. The Progress report shows how many iterations have been done, and how many are still queued.
If provenance capture is enabled, it is possible to inspect the individual iterations by selecting the blast_ddbj service in the progress report. Each iteration is listed with its individual input and output values, together with the time and the duration of the invocation.
The list output that is created by this implicit iteration can be used as the basis for further iterations for the next steps in the workflow. Taverna takes advantage of individual service outputs being available before the full iteration is finished, pipelining the list items to start iterations over the next services downstream.
In the modified Biomart and BLAST with concatenated gene id workflow, the Concatenate_gene_id local worker (which adds the Ensembl gene ID to the BLAST report) therefore iterates at the same time as blast_ddbj, processing each BLAST report as soon as it is available. As a result, the overall execution of larger workflows can be much faster than if each iteration had to finish completely before the downstream iterations could start. Pipelining also allows you to see parts of the final results before the workflow is complete.
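The ordering effect of pipelining can be illustrated with Python generators. This is a rough sketch under simplifying assumptions: blast_each and annotate_each are hypothetical stand-ins for the two services, and generators interleave the steps in a single thread, whereas Taverna may run iterations concurrently.

```python
from typing import Iterable, Iterator, List

events: List[str] = []  # records the order in which the steps run


def blast_each(sequences: Iterable[str]) -> Iterator[str]:
    # Hypothetical upstream step: yields each report as soon as it is ready.
    for seq in sequences:
        events.append(f"blast {seq}")
        yield f"report for {seq}"


def annotate_each(reports: Iterable[str]) -> Iterator[str]:
    # Hypothetical downstream step: consumes reports one at a time.
    for report in reports:
        events.append(f"annotate {report}")
        yield f"[annotated] {report}"


results = list(annotate_each(blast_each(["seq1", "seq2"])))
print(events)
```

The recorded events alternate between the two steps: the first report is annotated before the second sequence is submitted, rather than after the whole upstream list is complete.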
Configuring list handling
The Biomart and BLAST with concatenation workflow shows how Taverna deals with iteration over multiple input ports. In the Concatenate_gene_id service, both string1 and string2 receive a list while each expects a single value. By default, Taverna handles this with the so-called cross product, which combines every string1 with every string2.
In this workflow that behaviour is not desirable, as it would combine every gene ID with every BLAST report. Instead, the list handling on Concatenate_gene_id has been configured to perform a dot product, combining the first element of string1 with the first element of string2, the second with the second, and so on.
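The difference between the two strategies can be sketched in Python, where itertools.product plays the role of the cross product and zip the role of the dot product. The gene IDs and reports below are made-up placeholders, and the assignment of gene IDs to string1 and reports to string2 is an assumption. The sketch returns flat lists for simplicity and does not attempt to reproduce how Taverna structures its results.

```python
from itertools import product
from typing import List


def concatenate(string1: str, string2: str) -> str:
    # Hypothetical stand-in for the Concatenate_gene_id local worker.
    return string1 + string2


gene_ids: List[str] = ["ENSG_A: ", "ENSG_B: "]  # placeholder identifiers
reports: List[str] = ["report 1", "report 2"]   # placeholder BLAST reports

# Cross product: every gene ID is paired with every report (2 x 2 = 4 calls).
cross = [concatenate(a, b) for a, b in product(gene_ids, reports)]

# Dot product: elements are paired by position (2 calls).
dot = [concatenate(a, b) for a, b in zip(gene_ids, reports)]

print(cross)
print(dot)
```

With two gene IDs and two reports, the cross product yields four strings, two of which attach a gene ID to the wrong report, while the dot product yields exactly the two matching pairs.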