(PP23) Decimate: A Portable and Fault-Tolerant Scheduler Extension Efficiently Handling a Large Number of Dependent Jobs
Clouds and Distributed Computing
System Software & Runtime Systems
Time: Tuesday, June 26th, 3:15pm - 3:45pm
Description: Simultaneously accommodating the needs of our scalability and capacity users on our KAUST XC-40 Shaheen II is challenging. To support workflows composed of a large number of jobs, the KAUST Supercomputing Laboratory has developed Decimate to ease the submission, monitoring and dynamic steering of workflows of dependent jobs. Decimate is written in Python 2.7, distributed under the BSD 2-Clause license, and directly installable from pypi.org without requiring any privileges. It extends the SLURM scheduler by transparently adding a prologue and an epilogue to any user script and by submitting the right job dependencies, which automatically add new chunks of work or relaunch a job in case of a hardware, software or numerical convergence failure.
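The dependency mechanism described above can be illustrated with a minimal sketch. This is not Decimate's code or API: the function names (sbatch_command, submit_with_slurm, submit_chain) are hypothetical, and the sketch is written for Python 3.7+ rather than the Python 2.7 that Decimate targets. It relies only on the standard SLURM options `sbatch --parsable` and `--dependency=afterok:<jobid>`, and chains jobs so that each step starts only after the previous one has completed successfully.

```python
import subprocess


def sbatch_command(script, after_ok=None):
    """Build the sbatch argument list for one job (hypothetical helper)."""
    cmd = ["sbatch", "--parsable"]
    if after_ok is not None:
        # Start only after job `after_ok` has finished with exit code 0.
        cmd.append(f"--dependency=afterok:{after_ok}")
    cmd.append(script)
    return cmd


def submit_with_slurm(cmd):
    """Run sbatch and return the job ID as a string.

    With --parsable, sbatch prints "jobid" or "jobid;cluster".
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip().split(";")[0]


def submit_chain(script, n_steps, submit=submit_with_slurm):
    """Submit n_steps copies of `script`, each depending on the previous one."""
    job_ids = []
    previous = None
    for _ in range(n_steps):
        previous = submit(sbatch_command(script, after_ok=previous))
        job_ids.append(previous)
    return job_ids
```

Passing the submit function as a parameter keeps the chaining logic separate from the scheduler call, so the chain can be exercised with a stand-in submitter on a machine without SLURM.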
Decimate allows a user to:
• Submit an arbitrary number of jobs regardless of the limits imposed by the scheduling policy.
• Manage a set of jobs as a single workflow easing their submission, monitoring, deletion or reconfiguration.
• Ease the definition, submission, and management of jobs run on a large set of combinations of parameters.
• Benefit from a centralized log file.
• Send user-configurable mail messages at any step of the workflow's execution.
• Evaluate the outputs of a job via a user-configurable shell or Python script and decide whether to stop the whole workflow, to resubmit only the failing components as is, or to modify the workflow dynamically.
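Two of the capabilities listed above, running jobs over combinations of parameters and deciding what to do after checking job outputs, can be sketched as follows. This is an illustrative assumption, not Decimate's interface: expand_parameters, decide, and the max_failures threshold are hypothetical names introduced here, and the stop/resubmit rule is only one possible policy.

```python
from itertools import product


def expand_parameters(space):
    """Return one dict per combination of the parameter values in `space`.

    `space` maps a parameter name to the list of values to try.
    """
    names = sorted(space)
    combos = product(*(space[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


def decide(results, max_failures=1):
    """Choose the next action from per-job check results.

    `results` maps a job label to True (output accepted) or False.
    Returns (action, failed_labels) where action is "continue",
    "resubmit" (only the failed jobs) or "stop" (whole workflow).
    """
    failed = sorted(label for label, ok in results.items() if not ok)
    if not failed:
        return ("continue", [])
    if len(failed) <= max_failures:
        return ("resubmit", failed)
    return ("stop", failed)
```

For example, a space of two time steps and three resolutions expands to six parameter sets, each of which could be submitted as one job of the workflow; the per-job checks would then feed decide to pick the next action.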
The proposed poster details Decimate’s implementation, features and its application to a complex and challenging ocean ensemble data assimilation and forecasting workflow.