CellSs Runtime

The CellSs runtime is split into two parts: one runs on the PPU and the other runs on each of the SPUs. On the PPU, we distinguish between the master thread and the helper thread.

The most important change that the CellSs compiler makes to the original user code is to insert a call to the Execute function wherever a call to an annotated function appears. At runtime, these calls to Execute are responsible for the intended behavior of the application on the Cell BE processor. At each call to Execute, the master thread performs the following actions:
 

  • Addition of a node to the task graph to represent the called task.
  • Data dependency analysis of the new task with other previously called tasks.
  • Parameter renaming: similarly to register renaming, a technique from superscalar processors, the output and input/output parameters are renamed. For every function call with a parameter that will be written, a new memory location is used instead of the original parameter location; that is, a new instance of the parameter is created and replaces the original one, becoming a renaming of the original parameter location. This allows the function call to execute independently of any previous function call that writes or reads that parameter. The technique removes some data dependencies at the cost of additional storage, which improves the chances of extracting more parallelism.
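The three actions above can be illustrated with a minimal sketch. It is not the CellSs implementation: the names TaskGraph, Task, execute, version and producer are hypothetical, and parameters are identified by plain strings rather than memory addresses. Renaming is modeled by giving every written parameter a new version number, so only reads of a previously produced instance create an edge in the graph.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Task:
    tid: int
    name: str
    reads: list = field(default_factory=list)    # (param, version) instances read
    writes: list = field(default_factory=list)   # fresh instances written
    preds: set = field(default_factory=set)      # ids of tasks this one waits for


class TaskGraph:
    """Hypothetical master-thread bookkeeping, not the CellSs data structures."""

    def __init__(self):
        self.tasks = {}
        self.version = {}     # parameter name -> current version number
        self.producer = {}    # (param, version) -> id of the task that writes it
        self._ids = count()

    def _current(self, param):
        return (param, self.version.get(param, 0))

    def execute(self, name, inputs=(), outputs=(), inouts=()):
        """Register one call to an annotated function and return its node."""
        task = Task(next(self._ids), name)
        # Input and input/output parameters read the current instance, so the
        # task depends on whichever task produced it (a true dependency).
        for param in list(inputs) + list(inouts):
            instance = self._current(param)
            task.reads.append(instance)
            if instance in self.producer:
                task.preds.add(self.producer[instance])
        # Output and input/output parameters get a fresh instance (renaming),
        # so earlier readers and writers of the old instance impose no ordering.
        for param in list(outputs) + list(inouts):
            self.version[param] = self.version.get(param, 0) + 1
            instance = self._current(param)
            task.writes.append(instance)
            self.producer[instance] = task.tid
        self.tasks[task.tid] = task
        return task
```

In this sketch, registering init(out a), use(in a, out b), init(out a) and use(in a, out c) in that order leaves the second init with no predecessors: it writes a new instance of a, so it does not have to wait for the first use to finish reading the old one.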

The helper thread decides when a task should be executed and monitors the execution of the tasks on the SPUs.

Given a task graph, the helper thread schedules tasks for execution on the SPUs. The scheduling follows these guidelines:
 

  • A task can be scheduled once its predecessor tasks in the graph have finished executing.
  • To reduce the overhead of the DMA, strands of tasks are submitted to the same SPU.
  • Locality of data is exploited by keeping task outputs in the SPU local memory and scheduling tasks that reuse this data on the same SPU.
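One possible reading of these guidelines is sketched below. The text does not define how a strand is built, so the rule used here is an assumption: a strand is grown as a chain in which each task follows one of its predecessors and has all its other predecessors finished, so the SPU can run the chain in order and keep intermediate results local. The names build_strand, schedule and max_len are hypothetical; preds maps each task id to the set of ids it depends on.

```python
def build_strand(start, preds, succs, finished, submitted, max_len=4):
    """Grow a chain of tasks from a ready task (assumed strand rule).

    A successor may join the strand when every one of its predecessors is
    either finished or already in the strand.
    """
    strand = [start]
    in_strand = {start}
    while len(strand) < max_len:
        last = strand[-1]
        candidates = [
            s for s in sorted(succs.get(last, ()))
            if s not in submitted
            and s not in in_strand
            and preds[s] <= (finished | in_strand)
        ]
        if not candidates:
            break
        strand.append(candidates[0])
        in_strand.add(candidates[0])
    return strand


def schedule(preds, finished, idle_spus, max_len=4):
    """Assign one strand to each idle SPU; returns {spu id: [task ids]}."""
    # Invert the predecessor map so successors can be looked up directly.
    succs = {}
    for task, task_preds in preds.items():
        for p in task_preds:
            succs.setdefault(p, set()).add(task)

    submitted = set()
    plan = {}
    for spu in idle_spus:
        # A task is ready when all of its predecessors have finished.
        ready = [
            t for t in sorted(preds)
            if t not in finished and t not in submitted and preds[t] <= finished
        ]
        if not ready:
            break
        strand = build_strand(ready[0], preds, succs, finished, submitted, max_len)
        submitted.update(strand)
        plan[spu] = strand
    return plan
```

With preds = {0: set(), 1: {0}, 2: {1}, 3: set(), 4: {2, 3}}, nothing finished, and two idle SPUs, this sketch submits the chain 0, 1, 2 to the first SPU and task 3 to the second. Task 4 is held back because its predecessors end up on different SPUs.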

The helper thread synchronizes and communicates with the SPUs through a dedicated area of PPU main memory for each SPU. Through this area, the helper thread indicates the length of the strand of tasks to be executed and provides information about the input and output data of the tasks.

The SPUs execute a loop, waiting for tasks to execute. Whenever a strand of tasks is submitted for execution, the SPU starts the DMA transfer of the input data, processes the tasks, and writes the results back to PPU memory. The SPU then signals the end of the strand to the PPU using a specific area of PPU main memory.
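The handshake between the helper thread and an SPU can be simulated on an ordinary machine, as a rough sketch only. Everything here is a stand-in: a Python thread plays the SPU, a dictionary plays PPU main memory, another dictionary plays the SPU local memory, and threading.Event objects replace whatever flags the real runtime keeps in the per-SPU memory area. The names SpuArea, spu_loop, submit_strand and run_demo are hypothetical.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class SpuArea:
    """Hypothetical per-SPU communication area in PPU main memory."""
    strand_len: int = 0                          # number of tasks in the strand
    tasks: list = field(default_factory=list)    # (function, input keys, output key)
    submitted: threading.Event = field(default_factory=threading.Event)
    done: threading.Event = field(default_factory=threading.Event)
    shutdown: bool = False


def spu_loop(area, main_memory):
    """Stand-in for the SPU side: wait, fetch inputs, run the strand, write back."""
    while True:
        area.submitted.wait()              # wait for a strand to be submitted
        area.submitted.clear()
        if area.shutdown:
            return
        strand = area.tasks[:area.strand_len]
        local_store = {}                   # stand-in for the SPU local memory
        for func, in_keys, out_key in strand:
            for key in in_keys:            # "DMA in" only what is not local yet
                if key not in local_store:
                    local_store[key] = main_memory[key]
            local_store[out_key] = func(*(local_store[k] for k in in_keys))
        for _, _, out_key in strand:       # "DMA out" the results
            main_memory[out_key] = local_store[out_key]
        area.done.set()                    # signal the end of the strand


def submit_strand(area, tasks):
    """Stand-in for the helper thread: describe the strand, then wake the SPU."""
    area.tasks = list(tasks)
    area.strand_len = len(area.tasks)
    area.done.clear()
    area.submitted.set()


def run_demo():
    memory = {"a": 2, "b": 3}
    area = SpuArea()
    worker = threading.Thread(target=spu_loop, args=(area, memory))
    worker.start()
    submit_strand(area, [
        (lambda x, y: x + y, ["a", "b"], "c"),
        (lambda x: x * 10, ["c"], "d"),
    ])
    area.done.wait()                       # helper thread waits for completion
    area.shutdown = True                   # ask the simulated SPU to exit
    area.submitted.set()
    worker.join()
    return memory
```

In run_demo, the second task of the strand reads c from the simulated local memory instead of fetching it again from main memory, which mirrors the locality guideline described for the scheduler.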