(PhD14) Hybrid Hierarchical Data Management System: Accelerating Data Processing on HPC Systems
Big Data Analytics
System Software & Runtime Systems
Time: Monday, June 25th, 1:53pm - 1:57pm
Location: Analog 1, 2
Description: High-performance computing (HPC) systems provide an indispensable computing infrastructure for scientific modeling and simulation. Although HPC systems were designed for compute-intensive workloads, the explosion of scientific data generated by large-scale simulations and advanced sensors is making scientific workflows more complex and more data-intensive.
System architectures are designed to best support the typical workloads running on the clusters. Supporting coupled workflows that contain both compute-intensive and data-intensive workloads on HPC systems presents new challenges in data management because of their scale, coordination behavior, and overall complexity.
1) First, the growing volume of data is likely to be stored across different data sources, including cloud storage, databases, and diverse file systems. The lack of a uniform data access interface further increases workflow complexity.
2) Second, the performance gap between memory and disk keeps growing, so there is an urgent need for an efficient data sharing mechanism that allows different workloads to share data in memory without repeated data movement.
3) Third, heterogeneous storage devices (e.g., memory, SSDs, HDDs) are commonly used and the memory and storage hierarchy on HPC systems is getting deeper, so it is critical that data management systems manage data on tiered storage efficiently.
4) Finally, existing HPC parallel file systems, such as Lustre, GPFS, and OrangeFS, face serious challenges in managing metadata because of their centralized metadata management.
Scientific data management can be divided into several stages, including data collection, storing and querying, data processing, and publishing and sharing. A few data management systems have explored some of these stages. Storage Resource Broker (SRB), developed by the San Diego Supercomputer Center, is middleware that provides a uniform API for accessing heterogeneous storage resources. iRODS, the successor of SRB, provides a configurable rule engine for setting customized data management rules. While SRB and iRODS focus on data storage and sharing, Qserv and FastQuery focus on data querying to support ad hoc user queries over very large scientific datasets. These data management systems simplify data sharing but provide limited support for accelerating data processing.
To address the above challenges and accelerate data processing for coupled workflows on HPC systems, we intend to design the Hybrid Hierarchical Data Management System, which contains the following modules:
Tiered Data Management provides cross-tier data sharing on available storage devices to reduce data movement and customizes data management strategies based on workflow data access patterns to accelerate data processing.
Data-aware Job Scheduling aims to move computation to the data and makes job scheduling decisions based on storage tier and workflow data access patterns.
Distributed Metadata Management aims to remove the centralized metadata management bottleneck via Hash+SubTree namespace management. All metadata are stored as key-value pairs to satisfy the diverse metadata requirements of scientific data.
Domain Specific Optimizations Module provides specific optimizations for target applications, such as optimizations for the Hadoop/Spark shuffle and for workloads with many small files.
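
The abstract does not spell out how Hash+SubTree namespace management works, so the following Python sketch shows only one plausible reading, under stated assumptions: the namespace is cut into subtrees at a fixed directory depth, the subtree root is hashed to choose a metadata server, and each server keeps metadata as key-value pairs. The names `subtree_root`, `server_for`, and `MetadataCluster`, the fixed depth of 2, and the use of SHA-256 are all hypothetical choices for illustration, not part of the proposed system.

```python
import hashlib
from typing import Dict, List


def subtree_root(path: str, depth: int = 2) -> str:
    """Return the first `depth` components of an absolute path.

    Assumption: subtrees are cut at a fixed depth; a real system might
    split or merge subtrees dynamically based on load.
    """
    parts = [p for p in path.split("/") if p]
    return "/" + "/".join(parts[:depth])


def server_for(path: str, num_servers: int, depth: int = 2) -> int:
    """Hash the subtree root so every entry under it maps to one server."""
    digest = hashlib.sha256(subtree_root(path, depth).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


class MetadataCluster:
    """In-memory stand-in for a set of metadata servers (hypothetical)."""

    def __init__(self, num_servers: int, depth: int = 2) -> None:
        self.depth = depth
        # One key-value store per server: path -> {attribute: value}.
        self.stores: List[Dict[str, Dict[str, str]]] = [
            {} for _ in range(num_servers)
        ]

    def put(self, path: str, attrs: Dict[str, str]) -> int:
        """Store or update attributes for a path; return the server index."""
        idx = server_for(path, len(self.stores), self.depth)
        self.stores[idx].setdefault(path, {}).update(attrs)
        return idx

    def get(self, path: str) -> Dict[str, str]:
        """Look up attributes for a path on the server that owns its subtree."""
        idx = server_for(path, len(self.stores), self.depth)
        return self.stores[idx].get(path, {})


if __name__ == "__main__":
    cluster = MetadataCluster(num_servers=4)
    a = cluster.put("/proj/climate/run1/out.h5", {"size": "1024", "tier": "ssd"})
    b = cluster.put("/proj/climate/run2/out.h5", {"owner": "alice"})
    # Both paths share the subtree root /proj/climate, so they land together.
    print(a == b)
    print(cluster.get("/proj/climate/run1/out.h5"))
```

In this sketch, hashing spreads subtrees across servers to avoid a single metadata bottleneck, while keeping each subtree on one server preserves locality for directory-style operations. Free-form attribute dictionaries stand in for the key-value metadata model described above.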