Behaviors of workflow runs¶
This section describes the expected behavior of a workflow run in the context of the various options and features that Tibanna_pony and Tibanna_zebra support. Overall, Pony and Zebra behave in a very similar way, with just a few specific differences (see below). The features and behaviors of Pony are tightly associated with fourfront (https://github.com/4dn-dcic/fourfront) and those of Zebra with cgap-portal (https://github.com/dbmi-bgm/cgap-portal).
Metadata (overview)¶
The database schemas implemented in fourfront and cgap-portal that are relevant to Pony and Zebra are the following:
- Created by Tibanna Pony and Zebra
- WorkflowRun
- FileProcessed
- QualityMetric (and those inherited from QualityMetric)
- QualityMetricWorkflowrun (resource metric report for the run itself)
- processed QualityMetric (e.g. QualityMetricFastqc, QualityMetricBamcheck, … )
- May be handled as an input of a workflow run
- FileFastq
- FileMicroscopy
- FileReference
- FileProcessed (a processed file created from a previous run can be an input)
- Workflow (workflow itself is an input of a workflow run)
For every workflow run, Pony and Zebra both create a WorkflowRun object (more specifically a WorkflowRunAwsem, which inherits from WorkflowRun) with a new uuid and an awsem_job_id that matches the job id of the run. They also create FileProcessed items for output files that we want to keep (Output processed file) that have a legit file format defined in the portal (e.g. bam) and sometimes an accompanying file (extra_file), again with a legit file format (e.g. bai). Not all workflow runs create a processed file output; depending on the type of output, a QualityMetric object may be created (Output QC file), some field of the input file may be filled in (e.g. md5sum and file_size) (Output report file), or a new extra file of an input file may be created (Output to-be-extra-input file). Each of the FileProcessed and QualityMetric objects created is assigned a new uuid. Input files, processed files and QualityMetric objects are linked from the current WorkflowRun object, and the QualityMetric objects are linked from a specified file (either input or processed).
If you rerun the same workflow run, it will not overwrite the existing WorkflowRun, FileProcessed or QualityMetric objects, but will create new ones. However, if a QualityMetric item is linked from any input file, this link is replaced by the new QualityMetric; the old QualityMetric still exists but is no longer linked from the input file. If the workflow run creates a new extra_file of an input file, though, a rerun will replace the file on S3 without changing the metadata of the input file. This is harder to trace, so to be safe, one can use the option "overwrite_input_extra" : true to allow the overwrite - without this option, by default, the rerun will fail to start.
The metadata are created at the beginning of a workflow run, except QualityMetric objects. At the end of a run, they are patched with the status: if the run is successful, the WorkflowRun object is patched with status complete; if there was an error, it is patched with status error.
A resource metric report is linked from the WorkflowRun at the end of each run as a QualityMetricWorkflowrun object.
Workflow Run Identifier¶
Tibanna provides two identifiers for each workflow run: the AWSEM Job ID (or in short, job id) and the Execution ARN. For example:
- AWSEM Job ID (job id): a 12-letter random string; each EC2 instance is tagged with its corresponding AWSEM Job ID.
  - e.g. Aq1vSsEAEiSM, esAlf8LRZvNd, XbYhNcbtFKwG
- Execution ARN: an AWS ARN-style string that contains info about the AWS account, region, step function and execution name on the step function.
  - e.g. arn:aws:states:us-east-1:643366669028:execution:tibanna_pony:hic-bam-run_workflow_test_6dc8d3d6-8eb9-42b0-bfa8-81bcbe57273c
In addition to these two identifiers, a Tibanna Pony or Zebra run is also associated with a WorkflowRun uuid, which is uniquely created for each workflow run and serves as an identifier for the WorkflowRun object.
- WorkflowRun uuid
  - e.g. b94e6891-c649-4178-88e9-fad59f04dd7a
  - One can access the workflow run on the 4DN data portal if it has been released to the public or if you are logged in and have permission to view the object, e.g. https://data.4dnucleome.org/b94e6891-c649-4178-88e9-fad59f04dd7a
Config¶
The config of the pony/zebra input json is directly passed to unicorn and is pretty much the same. There are some additional fields for pony and zebra that can be specified in config.
Additional fields for pony and zebra¶
- "overwrite_input_extra" : true|false (default false) : if an output file type is Output to-be-extra-input file, a rerun of the same workflow run will fail to start unless this flag is set to true, to protect an existing extra file of an input file created by a previous run (or by an existing run that will create one). One should use this flag only if one is sure that the previous or the current run has a problem and the output needs to be overwritten.
- "email" : true|false (default false) : if this flag is set to true, an email is sent from 4dndcic@gmail.com to itself (in the case of pony) or from cgap.everyone@gmail.com to itself (in the case of zebra). To enable this, these two emails were manually registered and verified on AWS Simple Email Service (SES). Since it requires manual registration of an email, this feature is not currently supported by Unicorn.
There are also recommended fields for pony and zebra, even though they are not pony/zebra-specific (unicorn also supports these features).
Recommended fields for pony and zebra¶
- "public_postrun_json" : true : it is recommended to set this flag to true so that the postrun json files are open to the public (and can be accessed through a Web browser). Both public and private postrun json files can be accessed through the CLI command tibanna log -j jobid -p, given that the right AWS credentials are set up.
- "key_name": "4dn-encoded" : for security reasons, it is recommended to use the 4dn-encoded key rather than just passwords for sshing to an AWSEM instance.
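Putting the additional and recommended fields together, the config section of an input json might look like the following sketch (the instance_type and ebs_size values are illustrative, not recommendations):

```json
{
  "config": {
    "instance_type": "t3.medium",
    "ebs_size": 30,
    "overwrite_input_extra": false,
    "email": true,
    "public_postrun_json": true,
    "key_name": "4dn-encoded"
  }
}
```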
Input file handling¶
Dimension¶
An input file may have dimension 0~3 (single element, a 1D array, a 2D array, or a 3D array).
Extra files¶
An input file may have extra files. Extra files are equivalent to secondary files in CWL and usually include index files (e.g. px2, idx, tbi, bai, fai, …). If there are multiple extra files, they should have different formats (extensions). The workflow objects and Tibanna input jsons do not have to specify any extra files; all the extra files associated with a specified input file's File object are automatically transferred along with the file itself to the AWSEM instance. However, if CWL requires a secondary file for an input, the input file's File object must contain a corresponding extra file.
Renaming files¶
The file key on S3 follows the convention <uuid>/<accession>.<extension>. Some workflows require certain input files to have specific names; to handle this, we use the field rename in the individual input file dictionary in the input json to specify the target name. When the file is downloaded to the AWSEM instance, before running the workflow, the file is renamed to this target name. By default, the name is the same as the key on S3.
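As a sketch, an individual input file dictionary using rename might look like the following (the argument name, uuid and file names are hypothetical, and sibling fields may vary with the portal's input json schema):

```json
{
  "input_files": [
    {
      "workflow_argument_name": "chromsizes",
      "uuid": "1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b5c6d",
      "rename": "hg38.chrom.sizes"
    }
  ]
}
```

Without the rename field, the file would keep its S3 key name (e.g. <accession>.chrom.sizes) on the AWSEM instance.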
Output file handling¶
There are four types of output - processed file, QC file, report file and to-be-extra-input file.
Output processed file handling¶
Tibanna creates a FileProcessed item for each processed file output at the beginning of the workflow run (through start_run) with status to be uploaded by workflow. At the end of the run, it patches the FileProcessed objects with status (uploaded), md5 and file_size (through update_ffmeta).
If an output processed file has an extra file, the metadata for the extra files is likewise created at the beginning of the run, with status to be uploaded by workflow. At the end of the run, the extra files are patched with status (uploaded), md5 and file_size (through update_ffmeta). In order for an output processed file to have extra file(s), secondary_file_formats must be specified in the workflow arguments field for the corresponding output processed file.
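For example, a Workflow arguments entry for a bam output with a bai extra file might look roughly like this (a sketch; the exact set of sibling fields depends on the Workflow schema and is an assumption here):

```json
{
  "arguments": [
    {
      "workflow_argument_name": "out_bam",
      "argument_type": "Output processed file",
      "argument_format": "bam",
      "secondary_file_formats": ["bai"]
    }
  ]
}
```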
Quality metric handling¶
For QC-type output, Tibanna does not create a FileProcessed item but instead creates a QualityMetric item. The QualityMetric item is created at the end of a workflow run, not at the beginning, since it is linked from one of the File items involved (either input or output); if a new QualityMetric object were created at the beginning, it would inevitably replace the existing one, and if the run then failed, the new one would remain linked despite the failure.
Format of QC output¶
An example QC-type output is the output of a fastqc run or a pairsqc run, which is a zipped file containing an html file, some text files and image files used by the html. However, a regular, non-QC workflow may also create a QC-type output. For example, each of the first few steps of the CGAP upstream pipeline creates a bam file along with a simple QC called bam-check, which simply checks that the bam file has a header and is not truncated. These workflows have two (or more, if there are additional outputs) output files: one Output processed file, which is the bam file, and one Output QC file, which is the bam-check report. The bam-check report does not have any html file and is not zipped; it is a single text file, which is parsed to create a QualityMetricBamcheck object.
To allow flexibility in the format of QC-type output, certain qc flags are specified in the Workflow object (not in the tibanna input json), in the arguments field. There may be multiple QC-type output files for a single workflow run, and for each, the following must be specified:
- "qc_zipped": true|false : the output file is zipped
- "qc_html": true|false : the output file is an html file
- "qc_json": true|false : the output file is a json file
- "qc_table": true|false : the output file is a table file (tab-delimited text file)
- "qc_zipped_html": <name_of_html_file> : the name of the html file in case the output zipped file contains an html file
- "qc_zipped_tables": <array_of_name(or_suffix)_of_table_files> : the names of the table files in case the output zipped file contains table files
- "qc_type": <name_of_quality_metric_type> : the name of the QualityMetric item type (e.g. quality_metric_fastqc, quality_metric_bamcheck). This field can be skipped, in which case no QualityMetric item is created even though the other QC processing (e.g. unzipping the contents, moving the file to a specific location, creating an html, etc.) may still happen. This None option was originally added to handle bamsnap output files as QC files without generating a QualityMetric item. However, we ended up moving the bamsnap handling to EC2 since it frequently hit the lambda memory and runtime limits while unzipping the output (see qc_unzip_from_ec2).
- "argument_to_be_attached_to": <argument> : the workflow argument name of the file (either input or output) from which the QualityMetric object should be linked. (e.g. if the QualityMetric object will be linked to the processed bam file whose argument name is raw_bam, this field can be set to raw_bam.) This is a required parameter.
- "qc_unzip_from_ec2": true|false : whether the output zip file should be unzipped to s3 directly from ec2 (default false). This is relevant only if the qc output is zipped and we want the contents of the zip file extracted to a folder in S3.
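As a sketch, a fastqc-style zipped QC output could be described in the Workflow arguments roughly as follows (the argument names, file names inside the zip, and flag combination are illustrative assumptions):

```json
{
  "arguments": [
    {
      "workflow_argument_name": "report_zip",
      "argument_type": "Output QC file",
      "argument_to_be_attached_to": "input_fastq",
      "qc_zipped": true,
      "qc_zipped_html": "fastqc_report.html",
      "qc_zipped_tables": ["summary.txt", "fastqc_data.txt"],
      "qc_type": "quality_metric_fastqc"
    }
  ]
}
```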
As you can see above, a text-style QC output can be in either JSON or TSV format. The main difference is that for TSV format, the corresponding fields must exist and be specified in the schema of the QualityMetric item. A JSON-format output goes directly into the QualityMetric item; to allow this, the schema must have additional_properties set to true.
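For instance, a JSON-format QC output whose keys are copied directly into the QualityMetric item might look like the following (the field names here are hypothetical; with additional_properties set to true in the schema, arbitrary keys are accepted):

```json
{
  "number_of_reads": 1000000,
  "coverage": "30x",
  "overall_quality_status": "PASS"
}
```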
Behavior of Tibanna-ff given different QC parameters¶
The following table is an example scenario in which a workflow generates six different QC outputs, which is highly unlikely in the real world but useful to illustrate the behaviors.
The first column is the argument name, which is unique to each of the six QC outputs. Each of them could have a different QC parameter combination. The second column is the argument_to_be_attached_to, which is a required parameter. The third column is the QualityMetric item type for each QC output. The fourth column describes the QualityMetric items to be created, and the fifth column describes the QualityMetricQclist item to be created to accommodate multiple QualityMetric items created for a given argument_to_be_attached_to. A QualityMetricQclist item is not created if there is only one QualityMetric item linked from the file metadata of the argument_to_be_attached_to, unless the same file metadata already has a QualityMetric item from a different workflow run.
The first row is a typical scenario in which one QC argument is attached to one input file, so one QualityMetric item is created for it.
The second and third rows show two different QC arguments attached to the same file, each with its own qc_type. In this case, a QualityMetric item is created for each QC argument and a QualityMetricQclist item is also created to link the two QualityMetric items to a single file item. An example could be a BAM file linked to both bamcheck and bamqc outputs, the former a sanity check of the output file integrity and the latter statistics on the contents of the file, including the number of reads, coverages, etc.
The fourth and fifth rows show a similar case except that the two QC arguments have the same qc_type. Since a file item cannot have multiple QualityMetric items of the same type, only one QualityMetric item is created and the two QC arguments are merged into this single QualityMetric item. Since only one QualityMetric item is generated, no QualityMetricQclist item is created. This case may not be intuitive, but an example is a workflow that generates two separate QC output files, one in JSON format and the other in HTML format. In such a case, we want to create a single QualityMetric item with the fields taken from the JSON output and a link to the HTML report.
The sixth row shows a case where qc_type is not set (set to None). In this case, no QualityMetric item is generated. This is used for hidden QC files that do not have a QualityMetric item associated with them, such as the output for bamsnap.
Note that QC arguments with the same qc_type do not lead to a merge if their argument_to_be_attached_to is different, since in that case each of the file items has its own QualityMetric item.
Multiple QC metrics¶
A single workflow run may produce multiple QualityMetric objects, and Tibanna Pony/Zebra supports this.
On the CGAP portal, a single File item may have multiple QualityMetric objects, but only through a QualityMetricQclist. A File item cannot directly link to multiple QualityMetric objects, since the field quality_metric in a File object is not a list. A QualityMetricQclist object has a field qc_list, which is a list of links to other QualityMetric objects. The 4DN portal currently does not support QualityMetricQclist.
When there are multiple QC-type outputs, Tibanna Pony/Zebra segregates the QC output files by argument_to_be_attached_to. Say there are three QC output files, two of which are associated with out_bam and the third with out_bw. The first two are merged into a single QualityMetric object and the third gets its own QualityMetric object, i.e. Tibanna creates two QualityMetric objects even though there are three QC output files, because there are only two distinct groups based on argument_to_be_attached_to. The first two QC output files must have the same qc_type, but may have different formats - e.g. one of them is html and the other is JSON. A File item is never associated with more than one QualityMetric object of the same type.
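The grouping described above could be sketched as three QC arguments in the Workflow object, which would yield two QualityMetric items (the argument names follow the example in the text; the qc_type values and other fields are assumptions):

```json
{
  "arguments": [
    {
      "workflow_argument_name": "qc_bam_json",
      "argument_type": "Output QC file",
      "argument_to_be_attached_to": "out_bam",
      "qc_json": true,
      "qc_type": "quality_metric_bamqc"
    },
    {
      "workflow_argument_name": "qc_bam_html",
      "argument_type": "Output QC file",
      "argument_to_be_attached_to": "out_bam",
      "qc_html": true,
      "qc_type": "quality_metric_bamqc"
    },
    {
      "workflow_argument_name": "qc_bw",
      "argument_type": "Output QC file",
      "argument_to_be_attached_to": "out_bw",
      "qc_json": true,
      "qc_type": "quality_metric_bigwig"
    }
  ]
}
```

The first two entries share both qc_type and argument_to_be_attached_to, so they are merged into one QualityMetric item; the third forms its own group.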
Tibanna checks whether the File item to be associated with a new QualityMetric object already has any QualityMetric associated with it. If it does, Tibanna does the following:
- If the existing QualityMetric object is of the same type as the new QualityMetric object, replace the old one with the new one.
- If the existing QualityMetric object is of a different type from the new QualityMetric object, create a new QualityMetricQclist object, link it to the corresponding File object, move the old QualityMetric object to the QualityMetricQclist object, and add the new QualityMetric object to the QualityMetricQclist object.
- If the existing QualityMetric object is of type QualityMetricQclist, check the types of the QualityMetric objects in the QualityMetricQclist object; if there exists a QualityMetric object of the same type as the new QualityMetric object, replace it with the new one. If not, add the new QualityMetric object to the existing QualityMetricQclist object.
In theory, a single workflow run could create multiple QualityMetric types by creating a new QualityMetricQclist and adding all of the QualityMetric objects to it, but currently Tibanna does not support this. It may be implemented if we encounter a case where multiple types of QC are generated by a single workflow for a single file item.
Report-type output handling¶
A report-type output is different from a QC-type output in that no QualityMetric object is created from it. A good example of a report-type output is md5, which calculates the md5sum of an input file; the resulting report output file containing the md5sum value is parsed, and the value is patched to the md5sum (and content_md5sum, if the file is compressed) of the input File item.
Handling output that becomes an extra file of an input file¶
An example of an Output to-be-extra-input file is the output of the workflow bed2beddb, where the output beddb file is attached as an extra_file of the input bed file instead of creating a separate processed file with the beddb format.
By default, a second run of the same workflow run fails to start, to avoid overwriting the output extra file without any metadata log, unless "overwrite_input_extra": true is set in the config of the input json.
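To allow such a rerun to start, the flag goes in the config section of the input json:

```json
{
  "config": {
    "overwrite_input_extra": true
  }
}
```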
The extra file in the input File metadata is created at the beginning of the run (through start_run) with status to be uploaded by workflow, and the AWSEM instance uploads the output file to the right bucket with the right key, including the right extension (the extension of the extra file). If this upload fails, check_task throws an AWSEM error. The last step, update_ffmeta, makes sure that the key with the right extension exists in the right bucket, but it does not check whether the file is new. If the key exists, it updates the status of the extra file to uploaded.
Custom fields¶
In case we want to pass custom fields to the WorkflowRun, FileProcessed or QualityMetric objects created by a workflow run, we can do so by adding custom fields to the input json. Common examples of custom fields are lab and award for pony and project and institution for zebra. One could also set genome_assembly to be passed to a FileProcessed object.
Common Custom fields¶
Common custom fields can be specified in the common_fields field of the input json; the contents are passed to all created items, including WorkflowRun, FileProcessed and QualityMetric (including QualityMetricWorkflowrun and QualityMetricQclist). This can be used for lab and award for pony and project and institution for zebra. More specific custom fields can be defined using wfr_meta, custom_pf_fields and custom_qc_fields.
Custom fields for workflow run¶
The wfr_meta field specifies custom fields to be passed to a WorkflowRun object.
"wfr_meta": { "key1": "value1", "key2": "value2", ... }
In the above example, the WorkflowRun object will have field key1 with value value1 and field key2 with value value2.
Custom fields for processed files¶
The custom_pf_fields field specifies custom fields to be passed to a FileProcessed object. This field has one additional level to specify whether the fields should apply to all processed files (ALL) or to a specific processed file (keyed by the argument name of that processed file).
"custom_pf_fields": {
    "ALL": { "key1": "value1", "key2": "value2" },
    "out_bam": { "key3": "value3" }
}
In the above example, if we have two output files with argument names out_bam and out_bw, the processed file(s) associated with both out_bam and out_bw will have field key1 with value value1 and field key2 with value value2, but only the processed file(s) associated with out_bam will have field key3 with value value3.
Custom fields for quality metrics¶
The custom_qc_fields field specifies custom fields to be passed to all QualityMetric items, except QualityMetricWorkflowrun and QualityMetricQclist.
"custom_qc_fields": { "key1": "value1", "key2": "value2", ... }
In the above example, the QualityMetric items will have field key1 with value value1 and field key2 with value value2.