The Programming Models group at Barcelona Supercomputing Center (BSC) has published a new release (version 2020.06) of the OmpSs-2 programming model. In this release, we have added several new major features such as a compiler based on LLVM, an integrated tracing tool, and support for OpenACC kernels. Moreover, we have optimized the scheduler infrastructure, the memory allocator and the discrete dependency system to improve performance and scalability of OmpSs-2 applications on many-core systems.
1. LLVM-based compiler
In this release we have included, for the very first time, a compiler based on the LLVM compiler infrastructure that complements the venerable Mercurium source-to-source compiler. This extended LLVM compiler is in a beta stage, but it already supports most of the OmpSs-2 features when targeting the Nanos6 runtime system. Moreover, the LLVM OpenMP runtime distributed with our extended LLVM compiler has been modified to support the TAMPI library, which allows seamless use of non-blocking MPI calls inside OpenMP tasks.
2. Enhanced support for accelerators
This release is the first to support kernels specified with OpenACC pragmas. To that end, the Mercurium source-to-source compiler has been extended to support a subset of the OpenACC pragmas, and the Nanos6 runtime has been extended to support the PGI runtime API. Moreover, the CUDA device has been refactored to include automatic data prefetching when CUDA Unified Memory is used. This version also includes support for cuBLAS and similar libraries.
3. General performance enhancements
In this release, we have modified the runtime to use the low-level API of jemalloc to improve the performance and scalability of small memory allocations inside the runtime. The CPU manager and scheduling infrastructure have been refactored to improve performance and scalability on many-core systems. The implementation of work-sharing tasks has been modified to better exploit data locality across task for instances. Finally, a new turbo variant of the runtime is available. This variant enables some processor floating-point optimizations as well as the discrete dependency system.
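As an illustration of the work-sharing tasks mentioned above, the following is a minimal sketch of two consecutive task for instances operating on the same array, which is the kind of pattern where data locality across instances matters. The function name scale_then_shift and its parameters are hypothetical and chosen only for this example. The pragmas follow the OmpSs-2 syntax as we understand it; a compiler without OmpSs-2 support simply ignores them and runs the loops serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of OmpSs-2 or Nanos6):
 * computes x[i] = x[i] * a + b using two work-sharing tasks. */
void scale_then_shift(double *x, int n, double a, double b)
{
    assert(x != NULL);

    /* First work-sharing task: its iterations are split among the
     * threads that collaborate on the task. */
    #pragma oss task for inout(x[0;n])
    for (int i = 0; i < n; ++i)
        x[i] *= a;

    /* Second work-sharing task over the same data. The inout dependency
     * on the same array orders it after the first one. */
    #pragma oss task for inout(x[0;n])
    for (int i = 0; i < n; ++i)
        x[i] += b;

    /* Wait for both tasks before returning to the caller. */
    #pragma oss taskwait
}
```

Calling scale_then_shift(v, 4, 2.0, 1.0) on an array holding 1, 2, 3, 4 leaves it holding 3, 5, 7, 9, whether the pragmas are honored or ignored.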
4. Integrated tracing library
Nanos6 has a new experimental lightweight tracing module that generates traces in the Common Trace Format (CTF). The module is lockless for most common cases and emits a minimal set of Nanos6 events with optional support for PAPI hardware counters. Future releases will support MPI and Linux kernel events. Nanos6 converts CTF to Paraver traces automatically, which can be inspected using the new set of Paraver configurations provided.
5. Enhanced implementation of the discrete dependency system
In this release we have extended the lock-free discrete dependency system to support weak, commutative and concurrent dependencies, so it now supports all the OmpSs-2 dependency types except regions.
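To make the three dependency types concrete, here is a minimal sketch of how each one might appear in user code. The function names (sum_commutative, sum_concurrent, increment_nested) are hypothetical and exist only for this example. The clause spellings follow the OmpSs-2 specification as we understand it, and every dependency starts at the same address so that the example does not rely on region support. A compiler without OmpSs-2 support ignores the pragmas and runs the code serially.

```c
#include <assert.h>
#include <stddef.h>

/* commutative(total): the tasks may run in any order,
 * but never two of them at the same time. */
long sum_commutative(const int *values, int n)
{
    assert(values != NULL);
    long total = 0;
    for (int i = 0; i < n; ++i) {
        #pragma oss task in(values[i]) commutative(total)
        total += values[i];
    }
    #pragma oss taskwait
    return total;
}

/* concurrent(total): the tasks may overlap in time, so the update
 * itself must be protected, here with an atomic construct. */
long sum_concurrent(const int *values, int n)
{
    assert(values != NULL);
    long total = 0;
    for (int i = 0; i < n; ++i) {
        #pragma oss task in(values[i]) concurrent(total)
        {
            #pragma oss atomic
            total += values[i];
        }
    }
    #pragma oss taskwait
    return total;
}

/* weakinout: the outer task does not touch x itself. It only declares
 * that its child task will, so the outer task can start early. */
void increment_nested(int *x, int n)
{
    assert(x != NULL);
    #pragma oss task weakinout(x[0;n])
    {
        #pragma oss task inout(x[0;n])
        for (int i = 0; i < n; ++i)
            x[i] += 1;
    }
    #pragma oss taskwait
}
```

Both sum functions return the same result for a given input. They differ only in how much overlap the runtime is allowed between the accumulating tasks.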
Most of the new features in this release have been developed in the context of the DEEP-EST and Lo-Sync (PRACE-6IP) projects. The support for OpenACC kernels has been developed in the context of the EPEEC project.