Adaptive Runtime-Assisted Block Prefetching on Chip-Multiprocessors. OMHI Workshop 2014 (2014).
Analyzing performance improvements and energy savings in Infiniband Architecture using network compression. SBAC-PAD 2014 (2014).
Software-Managed Power Reduction in Infiniband Links. ICPP 2014 (2014).
Comparison Based Sorting for Systems with Multiple GPUs. GPGPU-6 - Sixth Workshop on General Purpose Processing Using GPUs (2013). doi:10.1145/2458523.2458524
Counter-Based Power Modeling Methods: Top-Down vs. Bottom-Up. Computer Journal 56, 198–213 (2013).
Hardware-Software Coherence Protocol for the Coexistence of Caches and Local Memories. IEEE Transactions on Computers 99, 1 (2013).
A Systematic Methodology to Generate Decomposable and Responsive Power Models for CMPs. IEEE Transactions on Computers 62, 1289-1302 (2013).
The TERAFLUX Project: Exploiting the DataFlow Paradigm in Next Generation Teradevices. Euromicro Conference on Digital System Design, DSD 2013 272–279 (2013).
Assessing the impact of network compression on Molecular Dynamics and Finite Element Methods. 14th International Conference on High-Performance Computing and Communications (HPCC-2012) (2012).
BSArc: Blacksmith Streaming Architecture for HPC Accelerators. ACM International Conference on Computing Frontiers (2012).
Counter-Based Power Modeling Methods: Top-Down vs. Bottom-Up. The Computer Journal (2012). doi:10.1093/comjnl/bxs116
Energy accounting for shared virtualized environments under DVFS using PMC-based power models. Future Generation Computer Systems 28, 457–468 (2012).
Hardware-software coherence protocol for the coexistence of caches and local memories. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis 89:1–89:11 (2012). at <http://dl.acm.org/citation.cfm?id=2388996.2389117>
PPMC: Hardware Scheduling and Memory Management Support for Multi Accelerators. 22nd International Conference on Field Programmable Logic and Applications (FPL-2012) (2012).
Assessing Accelerator-based HPC Reverse Time Migration. Transactions on Parallel and Distributed Systems, Special Issue on Accelerators 22(1), 147-162 (2011).
Design space exploration for aggressive core replication schemes in CMPs. Proceedings of the 20th international symposium on High performance distributed computing 269–270 (2011). doi:10.1145/1996130.1996169
DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory. Parallel Architectures and Compilation Techniques (PACT) (2011).
Energy accounting for shared virtualized environments under DVFS using PMC-based power models. Future Generation Computer Systems 28, 457–468 (2011).
FELI: HW/SW support for On-Chip Distributed Shared Memory in Multicores. Euro-Par (2011).
TARCAD: A template architecture for reconfigurable accelerator designs. IEEE Symposium on Application Specific Processors (SASP) 8-15 (2011).
An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems. ACM SIGARCH Computer Architecture News - ASPLOS '10 (2010). at <http://doi.acm.org/10.1145/1735970.1736059>
Decomposable and Responsive Power Models for Multicore Processors using Performance Counters. (2010). at <http://doi.acm.org/10.1145/1810085.1810108>
High-Performance Reverse Time Migration on GPU. XXVIII International Conference of the Chilean Computer Society - XIII Workshop on Parallel and Distributed Systems (WSDP) (2009).
CUBA: an Architecture for Efficient CPU/co-processor Data Communication. 22nd International Conference on Supercomputing (2008).
Memory Management on Chip-MultiProcessors with on-chip Memories. Workshop on the Interaction between Operating Systems and Computer Architecture (WIOSCA'08) 1-7 (2008).
On-Chip memories, the OS perspective. 5th HiPEAC Industrial Workshop (2008).
Strategies for Efficient Exploitation of Loop-Level Parallelism in Java. Concurrency and Computation: Practice and Experience 13(8-9), 663-680 (2001).