Structure
The Cell Superscalar system is composed of two key components: a source-to-source compiler and a runtime library. The figure shows the process flow that a user application follows to generate an executable for the Cell BE. Given a sequential application written in C with CellSs annotations, the source-to-source compiler generates two different C files.
The first file corresponds to the main program of the application, and should be compiled with a PPE compiler to generate a PPE object.
The second file corresponds to the code that will be executed in the SPEs at the request of the main program. This file must be compiled with an SPE compiler to obtain an SPE object, which is then linked with the SPE libraries to obtain an SPE executable. However, for this program to be executed, it must be embedded in the binary that runs on the PPE. For this reason, the PPE embedder is used to generate a PPE object from the SPE executable. This object, together with the other PPE objects and the PPE libraries, is passed to the PPE linker, which finally generates the Cell executable. Apart from the CellSs compiler step, the process is the same as the one that must be followed to generate any binary for the Cell BE.
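To make the input of the source-to-source compiler more concrete, the following is a minimal sketch of what an annotated sequential C program might look like. The pragma spelling (`#pragma css task` with `input`, `output` and `inout` clauses) is an assumption based on the CellSs annotation style and should be checked against the CellSs manual. The function names `vector_add`, `accumulate` and `run_example` are hypothetical and exist only for illustration.

```c
#include <stdio.h>

#define N 4

/* Assumed CellSs-style annotation: marks the next function as a task and
   states the direction of each parameter. A regular C compiler ignores
   the unknown pragma (possibly with a warning), so the program below
   still builds and runs as ordinary sequential code. */
#pragma css task input(a, b) output(c)
void vector_add(float a[N], float b[N], float c[N])
{
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

/* Hypothetical second task: reads v and updates acc in place. */
#pragma css task input(v) inout(acc)
void accumulate(float v[N], float acc[1])
{
    for (int i = 0; i < N; i++)
        acc[0] += v[i];
}

/* Hypothetical driver: plain sequential calls. The second call reads the
   array written by the first, which is the kind of data dependence a
   runtime would have to detect. */
float run_example(void)
{
    float a[N] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[N] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[N];
    float acc[1] = {0.0f};

    vector_add(a, b, c);
    accumulate(c, acc);
    return acc[0];
}
```

Calling `run_example()` from a `main` function and printing the result with `printf` gives the purely sequential behaviour. Under the assumptions above, the annotated functions are the parts that would end up in the SPE file, and the driver would stay in the PPE file.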

The main program binary is normally started from the command line and begins its execution in the PPE. At the beginning of this program, the SPEs are initialized by uploading the SPE binary into the memory of each SPE used. These SPE programs remain idle until the main program starts dispatching work to them. Whenever the main program reaches a piece of work that can be run in an SPE (from here on, a task), it issues a request to the runtime library. The runtime creates a node representing this task in a task graph and looks for dependencies on previously issued tasks, adding edges between them. If the current task is ready for execution (it has no dependencies on other tasks) and there are SPEs available, the runtime asks an SPE to execute the task. The corresponding data transfers are performed by the runtime using the DMA engines. The call to the runtime is non-blocking: if the task is not ready or all the SPEs are busy, the main program simply continues its execution.
It is important to emphasize that all of this (task submission, data dependence analysis, data transfer) is transparent to the user code, which is basically a sequential application with user annotations that indicate which parts of the code will run in the SPEs. The system can dynamically change the number of SPEs used by the application, taking into account the maximum concurrency of the application at each stage.
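The task-graph bookkeeping described above can be sketched in a few lines of C. This is an illustrative model only, not the CellSs runtime. All type and function names (`task_graph`, `submit_task`, `task_is_ready`, `complete_task`) are hypothetical. The sketch conservatively treats any two accesses to the same address as a dependency when at least one of them is a write, and it leaves out SPE dispatch and DMA transfers entirely.

```c
#include <stdbool.h>

#define MAX_TASKS 16
#define MAX_ARGS  4

typedef enum { ARG_IN, ARG_OUT, ARG_INOUT } arg_dir;

typedef struct {
    const void *addr;   /* data touched by the task */
    arg_dir     dir;    /* how the task uses it */
} task_arg;

typedef struct {
    task_arg args[MAX_ARGS];
    int      nargs;
    int      pending;   /* number of unfinished predecessors */
    bool     done;
} task_node;

typedef struct {
    task_node nodes[MAX_TASKS];
    bool      edge[MAX_TASKS][MAX_TASKS]; /* edge[p][s]: s waits for p */
    int       count;
} task_graph;

static bool writes(arg_dir d) { return d != ARG_IN; }

/* Two tasks conflict if they share an address and at least one writes it. */
static bool conflicts(const task_node *a, const task_node *b)
{
    for (int i = 0; i < a->nargs; i++)
        for (int j = 0; j < b->nargs; j++)
            if (a->args[i].addr == b->args[j].addr &&
                (writes(a->args[i].dir) || writes(b->args[j].dir)))
                return true;
    return false;
}

/* Adds a node and its incoming edges, then returns at once (non-blocking).
   Returns the task id, or -1 if the fixed-size tables are full. */
int submit_task(task_graph *g, const task_arg *args, int nargs)
{
    if (g->count >= MAX_TASKS || nargs < 0 || nargs > MAX_ARGS)
        return -1;

    int id = g->count++;
    task_node *t = &g->nodes[id];
    t->nargs = nargs;
    t->pending = 0;
    t->done = false;
    for (int i = 0; i < nargs; i++)
        t->args[i] = args[i];

    for (int p = 0; p < id; p++) {
        g->edge[p][id] = !g->nodes[p].done && conflicts(&g->nodes[p], t);
        if (g->edge[p][id])
            t->pending++;
    }
    return id;
}

/* A task is ready when it has not run yet and no predecessor is pending. */
bool task_is_ready(const task_graph *g, int id)
{
    return !g->nodes[id].done && g->nodes[id].pending == 0;
}

/* Called when a task finishes: releases the tasks that were waiting on it. */
void complete_task(task_graph *g, int id)
{
    g->nodes[id].done = true;
    for (int s = id + 1; s < g->count; s++)
        if (g->edge[id][s])
            g->nodes[s].pending--;
}
```

In this model, a task that writes an array followed by a task that reads the same array produces one edge, so the second task becomes ready only after `complete_task` is called for the first. A task that only reads data nobody else writes is ready immediately, which is where the concurrency across SPEs would come from.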




