Research Paper
Zeno: A Straggler Diagnosis System for Distributed Computing Using Machine Learning
Event Type
Research Paper
AI/Machine Learning/Deep Learning
Big Data Analytics
Clouds and Distributed Computing
Performance Analysis and Optimization
Performance Tools
Time
Wednesday, June 27th, 9am - 9:30am
Location
Substanz 1, 2
Description
Modern distributed computing frameworks for cloud computing
and high performance computing typically accelerate job performance
by dividing a large job into small tasks for execution parallelism.
Some tasks, however, may run far behind others, jeopardizing the job
completion time. In this paper, we present Zeno, a novel straggler diagnosis
system using machine learning methods to automatically identify
and diagnose stragglers for jobs in distributed computing. First, the system
identifies stragglers with an unsupervised clustering method which
groups the tasks based on their execution time. It then uses a supervised
rule learning algorithm to learn diagnosis rules that infer stragglers
from their resource assignment and usage data. Zeno is evaluated
on traces from Google’s Borg system and Alibaba’s Fuxi system.
The results demonstrate that our system is able to generate simple and
easy-to-read rules with both valuable insights and decent performance
in predicting stragglers.
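
The two-stage idea described above (cluster tasks by execution time, then learn a readable rule from resource data) can be illustrated with a minimal sketch. The abstract does not name the specific clustering or rule learning algorithms Zeno uses, so this example substitutes a one-dimensional two-cluster k-means and a single-threshold rule learner as stand-ins. The function names, the feature names (cpu_util, mem_util), and the task data are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def find_stragglers(durations, iters=20):
    """Stand-in for stage 1: 1-D two-cluster k-means on task durations.

    Returns one boolean per task; True marks the slow cluster.
    """
    lo, hi = min(durations), max(durations)
    if lo == hi:
        return [False] * len(durations)  # all tasks took the same time
    for _ in range(iters):
        cut = (lo + hi) / 2  # midpoint between the two cluster centers
        fast = [d for d in durations if d <= cut]
        slow = [d for d in durations if d > cut]
        lo, hi = mean(fast), mean(slow)  # recompute the centers
    cut = (lo + hi) / 2
    return [d > cut for d in durations]


def learn_rule(rows, labels):
    """Stand-in for stage 2: pick the single 'feature > threshold' rule
    with the highest training accuracy (ties keep the first rule found).

    Returns (feature_name, threshold, accuracy).
    """
    best = None
    for name in rows[0]:
        for t in sorted({r[name] for r in rows}):
            preds = [r[name] > t for r in rows]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(rows)
            if best is None or acc > best[2]:
                best = (name, t, acc)
    return best


# Hypothetical task records: execution time in seconds plus usage features.
durations = [10, 11, 12, 10, 13, 48, 52, 11]
usage = [
    {"cpu_util": 0.30, "mem_util": 0.40},
    {"cpu_util": 0.35, "mem_util": 0.70},
    {"cpu_util": 0.32, "mem_util": 0.45},
    {"cpu_util": 0.28, "mem_util": 0.65},
    {"cpu_util": 0.40, "mem_util": 0.50},
    {"cpu_util": 0.92, "mem_util": 0.55},
    {"cpu_util": 0.95, "mem_util": 0.60},
    {"cpu_util": 0.33, "mem_util": 0.42},
]

flags = find_stragglers(durations)  # unsupervised labels from stage 1
rule = learn_rule(usage, flags)     # supervised rule trained on those labels
print(f"IF {rule[0]} > {rule[1]} THEN straggler (training accuracy {rule[2]:.2f})")
```

The output of the first stage serves as the training labels for the second, which is the coupling the description implies. A real rule learner would typically produce multi-condition rules and be evaluated on held-out data rather than on training accuracy.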