Strategies for Efficient Exploitation of Loop-Level Parallelism in Java. Concurrency and Computation: Practice and Experience 13 (8-9), 663-680 (2001).
CUBA: an Architecture for Efficient CPU/co-processor Data Communication. 22nd International Conference on Supercomputing (2008).
Memory Management on Chip-MultiProcessors with on-chip Memories. Workshop on the Interaction between Operating Systems and Computer Architecture (WIOSCA'08) 1-7 (2008).
On-Chip memories, the OS perspective. 5th HiPEAC Industrial Workshop (2008).
High-Performance Reverse Time Migration on GPU. XXVIII International Conference of the Chilean Computer Society - XIII Workshop on Parallel and Distributed Systems (WSDP) (2009).
An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems. ACM SIGARCH Computer Architecture News - ASPLOS '10 (2010). at <http://doi.acm.org/10.1145/1735970.1736059>
Decomposable and Responsive Power Models for Multicore Processors using Performance Counters. (2010). at <http://doi.acm.org/10.1145/1810085.1810108>
Assessing Accelerator-based HPC Reverse Time Migration. Transactions on Parallel and Distributed Systems, Special Issue on Accelerators 22(1), 147-162 (2011).
Design space exploration for aggressive core replication schemes in CMPs. Proceedings of the 20th International Symposium on High Performance Distributed Computing 269–270 (2011). at <http://doi.acm.org/10.1145/1996130.1996169>
DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory. Parallel Architectures and Compilation Techniques (PACT) (2011).
Energy accounting for shared virtualized environments under DVFS using PMC-based power models. Future Generation Computer Systems 28, 457-468 (2011).
FELI: HW/SW support for On-Chip Distributed Shared Memory in Multicores. Euro-Par (2011).
TARCAD: A template architecture for reconfigurable accelerator designs. IEEE Symposium on Application Specific Processors (SASP) 8-15 (2011).
Assessing the impact of network compression on Molecular Dynamics and Finite Element Methods. 14th International Conference on High-Performance Computing and Communications (HPCC-2012) (2012).
BSArc: Blacksmith Streaming Architecture for HPC Accelerators. ACM International Conference on Computing Frontiers (2012).
Counter-Based Power Modeling Methods: Top-Down vs. Bottom-Up. The Computer Journal (2012). doi:10.1093/comjnl/bxs116
Energy accounting for shared virtualized environments under DVFS using PMC-based power models. Future Generation Computer Systems 28, 457-468 (2012).
Hardware-software coherence protocol for the coexistence of caches and local memories. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis 89:1–89:11 (2012). at <http://dl.acm.org/citation.cfm?id=2388996.2389117>
PPMC: Hardware Scheduling and Memory Management Support for Multi Accelerators. 22nd International Conference on Field Programmable Logic and Applications (FPL-2012) (2012).
Comparison Based Sorting for Systems with Multiple GPUs. GPGPU-6 - Sixth Workshop on General Purpose Processing Using GPUs (2013). doi:10.1145/2458523.2458524
Counter-Based Power Modeling Methods: Top-Down vs. Bottom-Up. The Computer Journal 56, 198–213 (2013).
Hardware-Software Coherence Protocol for the Coexistence of Caches and Local Memories. IEEE Transactions on Computers 99, 1 (2013).
A Systematic Methodology to Generate Decomposable and Responsive Power Models for CMPs. IEEE Transactions on Computers 62, 1289-1302 (2013).
The TERAFLUX Project: Exploiting the DataFlow Paradigm in Next Generation Teradevices. Euromicro Conference on Digital System Design, DSD 2013 272–279 (2013).