BSC releases OmpSs-2 version 2019.11

20 November 2019

This release includes a new scalable scheduling infrastructure.

The Programming Models group at BSC has published the sixth public release (version 2019.11) of the OmpSs-2 programming model.

OmpSs-2 is a data-flow programming model that supports both task nesting and fine-grained dependencies across different nesting levels. This enables the effective parallelization of applications using a top-down methodology.

In this release, we include several new features:

1. New scalable scheduling infrastructure

All the previous schedulers have been unified into a new scheduling infrastructure, so the end user can select all scheduler properties at execution time (e.g., the scheduling policy, or whether the immediate-successor mechanism is enabled). The new infrastructure has been optimized to be efficient and scalable on large many-core processors.

2. Work-sharing tasks

A new construct called task for has been added to efficiently exploit fine-grained structured parallelism. This new type of task can be collaboratively executed by several threads, reducing the number of ready tasks required to exploit large many-core processors.
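
As an illustrative sketch (the dependency and chunksize clauses are optional, and the function and variable names are placeholders), a loop can be turned into a single work-sharing task as follows:

    #include <stddef.h>

    void saxpy(size_t n, float a, const float *x, float *y)
    {
        // One work-sharing task whose iterations can be executed
        // collaboratively by several threads, instead of creating
        // one task per chunk of iterations.
        #pragma oss task for in(x[0;n]) inout(y[0;n]) chunksize(1024)
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }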

3. Support for Dynamic Load Balancing (DLB)

A new CPU manager has been implemented to support the DLB library (https://pm.bsc.es/dlb), which enables different runtime systems running on the same node to share compute resources efficiently.

4. Initial support for OmpSs-2@Linter

OmpSs-2 Linter (https://github.com/bsc-pm/ompss-2-linter) is a debugging/linting tool that traces the execution of an application parallelized with the OmpSs-2 programming model and reports any potential parallelization errors it encounters.

5. New discrete implementation of the dependency subsystem

The default dependency system of OmpSs-2 supports memory regions. Although this feature is expressive and compelling for many applications, in some situations it can introduce more overhead than the discrete dependency systems provided by OpenMP implementations. In this release we have implemented a new discrete dependency system with performance on par with state-of-the-art OpenMP runtimes.
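
The discrete implementation targets programs whose dependencies always refer to the same whole objects rather than partially overlapping ranges. A minimal, hedged sketch (the type and function names are hypothetical):

    typedef struct { double data[512]; } block_t;

    void process_block(block_t *b);

    void update_blocks(block_t *blocks, int nblocks)
    {
        for (int i = 0; i < nblocks; i++) {
            // Each task names a whole block, so the runtime can match
            // dependencies by address alone, with no region-overlap
            // analysis.
            #pragma oss task inout(blocks[i])
            process_block(&blocks[i]);
        }
        #pragma oss taskwait
    }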

Most of the new features in this release have been developed in the context of the DEEP-EST project.

OmpSs-2 main features

  • Advanced dependency system

Dependency system based on memory regions

Support for in/inout, concurrent, commutative, multideps, and scalar & array reductions, with dependencies expressed over memory regions that may partially overlap.
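
For instance (a hedged sketch using the OmpSs-2 array-section syntax a[first;count]; the function names are placeholders), two consumers may read partially overlapping regions written by a single producer:

    void initialize(double *a, long n);
    void consume(const double *a, long n);

    void regions(double *a, long n)
    {
        #pragma oss task out(a[0;n])      // producer writes the whole array
        initialize(a, n);

        #pragma oss task in(a[0;n/2])     // reads the first half
        consume(a, n / 2);

        #pragma oss task in(a[n/4;n/2])   // partially overlaps the region above
        consume(a + n / 4, n / 2);

        #pragma oss taskwait
    }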

Early release of dependencies

By default, once the body of a task has been executed, it releases all the dependencies that are not included in any unfinished descendant task. If the wait clause is specified in the task construct, all its dependencies are instead released at once when the task becomes deeply completed.
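
A minimal sketch of both behaviors (N and the function names are illustrative):

    #define N 1024

    void work_on(double *a);
    void scale(double *b);

    void parent(double *a, double *b)
    {
        #pragma oss task inout(a[0;N]) inout(b[0;N]) wait
        {
            #pragma oss task inout(a[0;N])   // descendant still using a
            work_on(a);

            scale(b);
            // Without "wait": when this body ends, the dependency on b is
            // released early (no unfinished descendant uses it), while a
            // stays held until work_on finishes.
            // With "wait": both a and b are released together once the
            // task is deeply completed.
        }
    }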

Weak dependencies

The weakin/weakout clauses specify potential dependencies that are only required by descendant tasks. These annotations do not delay the execution of the task and can be used to exploit fine-grained dependencies across nesting levels.
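
As a brief hedged sketch (function names are placeholders), an outer task that only accesses its data through subtasks can declare weak dependencies:

    void child_work(double *x, long n);

    void outer(double *a, double *b, long n)
    {
        // The outer task never touches a or b directly, so its weak
        // dependencies do not delay its execution.
        #pragma oss task weakinout(a[0;n]) weakinout(b[0;n])
        {
            #pragma oss task inout(a[0;n])   // strong, enforced dependency
            child_work(a, n);

            #pragma oss task inout(b[0;n])
            child_work(b, n);
        }
    }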

Task and array reductions

Extends the task construct with support for the reduction clause, which behaves like an inout dependency with respect to the early-release and weak-dependency semantics described above. It is also possible to specify expressions with array type in the reduction clause. In addition, the new weakreduction clause can be used to specify the memory region where reductions are going to be defined, enabling nested reductions as well as memory allocation optimizations.
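
A compact, hedged sketch of a scalar task reduction (the chunk size and names are illustrative):

    double sum_array(const double *v, long n)
    {
        double sum = 0.0;
        const long chunk = 1024;

        for (long i = 0; i < n; i += chunk) {
            long len = (n - i < chunk) ? (n - i) : chunk;
            // Each task accumulates into a privatized copy of sum; the
            // runtime combines the partial results when the reduction
            // finishes.
            #pragma oss task reduction(+: sum) in(v[i;len])
            for (long j = i; j < i + len; j++)
                sum += v[j];
        }
        #pragma oss taskwait
        return sum;
    }

An array reduction is expressed the same way with an array section, e.g. reduction(+: hist[0;nbins]).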

  • CUDA C Kernels

The OmpSs-2 tasking model has been extended to support tasks written in CUDA C. CUDA C kernels annotated with the OmpSs-2 task construct are invoked and scheduled like regular tasks, simplifying the development of heterogeneous applications. The current implementation relies on the Unified Memory feature provided by the latest NVIDIA GPUs to automatically move the required data between the host and the device.
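
A hedged sketch (the ndrange launch configuration and all names are illustrative; the kernel body itself lives in a .cu file):

    // Shared header: the CUDA kernel declaration is annotated as an
    // OmpSs-2 task targeting the cuda device; ndrange(1, n, 128) asks
    // for n threads in one dimension, in blocks of 128.
    #pragma oss task in(x[0;n]) inout(y[0;n]) device(cuda) ndrange(1, n, 128)
    __global__ void saxpy(long n, float a, const float *x, float *y);

    void host_side(long n, float a, float *x, float *y)
    {
        // Invoked like any other task: the runtime launches the kernel,
        // tracks its dependencies, and Unified Memory moves the data.
        saxpy(n, a, x, y);
        #pragma oss taskwait
    }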

  • Task-Aware MPI (TAMPI) library

This MPI library, developed in the context of the INTERTWinE project and available at github.com/bsc-pm/tampi, augments the interoperability features of MPI to enhance hybrid programming with tasking models such as OmpSs-2. It provides a new MPI threading level, called MPI_TASK_MULTIPLE, that enables the safe use of blocking and non-blocking MPI operations inside a task. The library relies on the Nanos6 pause/resume, external events, and polling services APIs to provide this enhanced interoperability between MPI and OmpSs-2.
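
A hedged sketch of the blocking mode (the header name and buffer size are assumptions; error handling is omitted):

    #include <mpi.h>
    #include <TAMPI.h>   // assumed header name from the TAMPI distribution

    int main(int argc, char **argv)
    {
        int provided;
        // Request the TAMPI-provided threading level.
        MPI_Init_thread(&argc, &argv, MPI_TASK_MULTIPLE, &provided);
        if (provided != MPI_TASK_MULTIPLE)
            MPI_Abort(MPI_COMM_WORLD, 1);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        static double buf[1024];

        if (rank == 0) {
            #pragma oss task in(buf[0;1024])
            MPI_Send(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            // A blocking call inside a task: instead of stalling the
            // worker thread, TAMPI pauses the task until the message
            // arrives, freeing the core for other ready tasks.
            #pragma oss task out(buf[0;1024])
            MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        #pragma oss taskwait
        MPI_Finalize();
        return 0;
    }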

  • Low-Level runtime APIs

Task Pause/Resume API

A new API that can be used to programmatically suspend and resume the execution of a task. It is currently used to improve the interoperability and performance of hybrid MPI and OmpSs-2 applications.
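
A minimal hedged sketch (the wait/signal pairing is illustrative; the three calls are taken from the Nanos6 blocking API):

    #include <nanos6.h>   // Nanos6 runtime API

    // Task side: obtain a blocking context, publish it, and suspend.
    void wait_for_signal(void **published_ctx)
    {
        void *ctx = nanos6_get_current_blocking_context();
        *published_ctx = ctx;           // make it visible to the resumer
        nanos6_block_current_task(ctx); // suspend; the core runs other tasks
    }

    // Another thread (or task) later resumes the suspended task.
    void send_signal(void *published_ctx)
    {
        nanos6_unblock_task(published_ctx);
    }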

Nanos6 generic polling services

A new API that makes it possible to coordinate the execution of tasks with the execution of arbitrary functions. Its purpose is to improve the interoperability with software that requires polling-like behavior, in a way that minimizes the interference on resource usage.
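
A hedged sketch (the struct and helper names are hypothetical; the register/unregister calls follow the Nanos6 polling services API):

    #include <nanos6.h>   // Nanos6 runtime API

    struct queue;                                 // hypothetical user data
    int poll_pending_entries(struct queue *q);    // hypothetical helper
    int queue_is_closed(struct queue *q);         // hypothetical helper

    // Invoked periodically by the runtime. Returning nonzero tells the
    // runtime that the service has served its purpose and can stop.
    static int drain_queue(void *data)
    {
        struct queue *q = data;
        poll_pending_entries(q);
        return queue_is_closed(q);
    }

    void start_draining(struct queue *q)
    {
        nanos6_register_polling_service("queue-drainer", drain_queue, q);
    }

    void stop_draining(struct queue *q)
    {
        nanos6_unregister_polling_service("queue-drainer", drain_queue, q);
    }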

External events API

This new API can be used to defer the release of the dependencies of a task until the task has been executed and a set of external events have been fulfilled (e.g., the completion of an asynchronous MPI operation). This API is used to implement the support for asynchronous MPI operations in the TAMPI library.
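
A hedged sketch (the asynchronous-operation plumbing is hypothetical; the counter calls are taken from the Nanos6 external-events API):

    #include <nanos6.h>   // Nanos6 runtime API

    typedef struct async_op async_op_t;                          // hypothetical handle
    void start_async_operation(async_op_t *op, void *counter);   // hypothetical

    // Task body: register one pending event so that the task's
    // dependencies are not released when the body finishes.
    void task_body(async_op_t *op)
    {
        void *counter = nanos6_get_current_event_counter();
        nanos6_increase_current_task_event_counter(counter, 1);
        start_async_operation(op, counter);   // completes asynchronously
    }

    // Completion callback (e.g., when the asynchronous MPI operation
    // finishes): fulfilling the event lets the dependencies be released.
    void on_completion(void *counter)
    {
        nanos6_decrease_task_event_counter(counter, 1);
    }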

Native offload API

An asynchronous API to execute OmpSs-2 kernels on a specified set of CPUs from any kind of application, including Java, Python, R, etc.
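
A hedged sketch built on the Nanos6 library-mode nanos6_spawn_function call (the argument struct and notification hook are hypothetical):

    #include <nanos6.h>   // Nanos6 runtime API (library mode)

    struct saxpy_args { long n; float a; const float *x; float *y; };

    void notify_caller(void *arg);   // hypothetical completion hook

    static void saxpy_body(void *arg)
    {
        struct saxpy_args *p = arg;
        for (long i = 0; i < p->n; i++)
            p->y[i] = p->a * p->x[i] + p->y[i];
    }

    static void saxpy_done(void *arg)
    {
        // Called by the runtime once the spawned task has completed.
        notify_caller(arg);
    }

    void offload_saxpy(struct saxpy_args *args)
    {
        // Spawns the function as a task inside the Nanos6 runtime; a
        // foreign host (Java, Python, R bindings) would call similarly.
        nanos6_spawn_function(saxpy_body, args, saxpy_done, args, "saxpy");
    }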