Machine Learning Day
Carme - An Open Source Framework for Multi-User, Interactive Machine Learning on Distributed GPU-Systems
Event Type
Machine Learning Day
AI/Machine Learning/Deep Learning
Clouds and Distributed Computing
HPC workflows
Scientific Software Development
System Software & Runtime Systems
Time: Wednesday, June 27th, 4:30pm - 5pm
Location: Panorama 2
Description: In order to make high performance computing (HPC) clusters attractive and effectively usable for machine learning and data science users, we provide an open source framework that manages resources for multiple users running Jupyter notebooks on an HPC cluster. To this end we combine established open source machine learning tools, such as Docker images and Jupyter notebooks, with HPC back-ends such as Slurm and BeeGFS. Using Docker gives us the freedom to provide a container store from which users can choose, while also allowing users to add their own containers. On the HPC side, the user management lets us enforce user quotas (e.g. GPU time or job priority) and define different user roles. The scheduler makes it possible to reserve resources and to use a wide variety of queues according to the needs of the respective job. Reservations make it easy to build a scalable framework for adding GPUs to a running job, or for strong and weak scaling of deep learning trainings. Using Slurm for this makes it easy to integrate our framework into existing clusters. As a bonus, most of this back-end remains invisible to the user, who interacts with a web interface to log in and submit jobs via Jupyter notebooks. Last but not least, Carme uses only open source components.
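The workflow the abstract describes, reserving GPUs through the scheduler and launching a Jupyter notebook inside a user-chosen container, might be sketched in spirit as a Slurm batch script. Everything here is an illustrative assumption (partition name, container image, Singularity as the container runtime), not Carme's actual configuration:

```shell
#!/bin/bash
#SBATCH --job-name=carme-notebook   # hypothetical job name
#SBATCH --partition=gpu             # assumes the cluster has a GPU partition
#SBATCH --gres=gpu:2                # reserve two GPUs for the notebook session
#SBATCH --time=04:00:00             # wall-time limit enforced by the scheduler

# Launch a Jupyter notebook inside a container chosen from the container store.
# Image name and runtime invocation are illustrative; Carme's web interface
# would hide this step from the user entirely.
singularity exec --nv my-ml-image.sif \
    jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
```

In such a setup the scheduler, not the user, decides where the notebook runs, which is what allows quotas, queues, and GPU reservations to apply to interactive sessions just as they do to batch jobs.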