(PhD04) Global Task Data Dependencies for PGAS Applications
Programming Models & Languages
System Software & Runtime Systems
TimeMonday, June 25th1:17pm - 1:21pm
LocationAnalog 1, 2
DescriptionThe poster introduces our approach towards fine-grained synchronization of tasks in applications using the Partitioned Global Address Space (PGAS) programming paradigm. We propose to extend the concept of task data dependencies known from OpenMP v4.0 and later to the global address space in order to coordinate task execution on a global scale. We believe that data dependencies are an elegant way to express the data flow between tasks accessing globally shared memory.
The PGAS programming model promises an improvement in both efficiency and programmer productivity by decoupling communication and synchronization. However, the implicit synchronization of traditional two-sided MPI communication has to be replaced by explicit synchronization operations to ensure correct memory access.
The DASH project provides an implementation of the PGAS programming paradigm through distributed data structures with random access capabilities, e.g., single- and multi-dimensional arrays. In its communication back-end, DASH uses the MPI-3 RMA interface and relies heavily on collective operations to ensure happens-before relations between distributed global memory accesses, most commonly barriers (Figure 1).
Different approaches have been proposed, ranging from hierarchical task creation and explicit synchronization using synchronization objects (futures in HPX, events in UPC) over the use of synchronization variables (Chapel, CAF) to replicated task graphs combined with affinity tags (PaRSEC DTD). In contrast, we propose an implicit synchronization scheme (avoiding synchronization objects) that supports distributed task creation to avoid scalability bottlenecks.
The challenge with distributed task creation is the absence of ordering information to reliably match input and output dependencies across process boundaries. In contrast to OpenMP, we cannot rely on the order in which tasks have been created but instead need to restore partial ordering of tasks from additional information provided by the user. We thus introduce the concept of task phases, which can be seen as a logical clock for tasks and allow us to restore happens-before relations. In our model, output dependencies always refer to the current phase while input dependencies refer to a matching output dependency in a previous phase.
The scheduler instances on different processes communicate all encountered remote dependencies to the scheduler owning the referenced memory. In a dedicated step, all schedulers match their local tasks against the received remote dependencies and resolve possible write-after-read hazards. As soon as all additional dependencies have been communicated, the task execution starts and schedulers communicate dependency releases if necessary (Figure 3).
Figure 4 demonstrates the use of our proposed C++ interface for implementing the Blocked Cholesky Factorization. The async function creates a task that executes the given lambda after all specified dependencies have been satisfied, regardless of whether they refer to local or remote memory. Figures 5 and 6 show the performance on both the Cray XC40 Hazel Hen and Oakforest PACS, a KNL-based system. These early results are encouraging and demonstrate the viability of our approach, which yields mostly comparable performance to PaRSEC DTD. On Oakforest PACS, our approach potentially outperforms PaRSEC DTD due to limited scalability of its replicated task graph.