
Workflow management

Whenever you have to deal with multiple jobs on an HPC system, automating part or all of the job management process means describing and implementing so-called 'workflows'. Options for managing workflows are numerous and range from basic scheduler features such as job arrays and job dependencies up to complex systems backed by a central, multi-user database.
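The basic scheduler features alone already cover simple cases. The following sketch assumes a SLURM cluster and hypothetical job scripts named preprocess.sh and consolidate.sh; it submits a job array of independent tasks plus a consolidation job that starts only after all array tasks have finished successfully:

```bash
#!/bin/bash
# Submit a job array of 10 independent tasks; --parsable makes sbatch print only the job ID.
array_id=$(sbatch --parsable --array=1-10 preprocess.sh)

# Submit a follow-up job that may start only once every array task has completed successfully.
sbatch --dependency=afterok:${array_id} consolidate.sh
```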

From a cluster computing point of view, a workflow is considered to be a collection of possibly interdependent jobs that are all part of the same study.

If we represent jobs with dots and dependencies between jobs with arrows, and arrange them from top to bottom, we can distinguish between "wide", "deep", and "cyclic" workflows.

Wide workflows are composed of many jobs that are independent of one another, plus a few dependent jobs, for instance to consolidate the results. Deep workflows are composed of many jobs that each depend on only one previous job. Cyclic workflows are more specific in that they express the dependency of a job on itself, which must be understood as the same job being re-submitted multiple times. For each type of workflow, specific techniques and tools exist; see the sketches after the figure below.

Wide, deep and cyclic workflows
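With plain SLURM, a deep workflow can be built by chaining job dependencies, and a cyclic workflow by letting a job script re-submit itself. The following sketches use hypothetical script names and a hypothetical convergence marker file; they illustrate the idea rather than prescribe a recipe.

```bash
#!/bin/bash
# Deep workflow: each stage may start only after the previous stage succeeded.
prev=$(sbatch --parsable stage1.sh)
for stage in stage2.sh stage3.sh stage4.sh; do
    prev=$(sbatch --parsable --dependency=afterok:${prev} "${stage}")
done
```

```bash
#!/bin/bash
#SBATCH --time=01:00:00
# Cyclic workflow: the job script re-submits itself until a (hypothetical)
# convergence marker file appears, i.e. the same job runs multiple times.
./run_one_iteration
if [ ! -f converged.flag ]; then
    sbatch "$0"
fi
```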

Our colleagues from the Belgian Consortium des Équipements de Calcul Intensif (CÉCI) have written very nice documentation on how workflows can be organized on a SLURM-based HPC cluster such as the terrabyte cluster. While the SLURM tips mentioned in this documentation are also applicable to our cluster, some of the proposed helper software packages are not available as modules on our cluster. However, they may be installed individually by using Spack, Conda Environments or Containers.
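For instance, a workflow manager such as Snakemake (used here purely as an example; any of the proposed tools could be substituted) can usually be installed into a personal Conda environment:

```bash
# Example only: install the Snakemake workflow manager into a personal
# Conda environment; the environment name "workflow" is arbitrary.
conda create -n workflow -c conda-forge -c bioconda snakemake
conda activate workflow
snakemake --version
```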