Abstract
Breadth First Search (BFS) plays a key role in computational science, networking, and artificial intelligence applications. Although the BFS approach has been extensively studied, particularly in its direction-optimized form, existing implementations still present three main issues: (1) high memory footprint; (2) the under-realized lightweight representations using bitmaps; and (3) the underuse of modern hardware such as Tensor Cores Units (TCUs).
In this paper, we propose BerryBees, an efficient algebraic BFS algorithm that leverages the Matrix Multiply-Accumulate (MMA) instructions of TCUs. The novelty of BerryBees lies in (1) the Binarized Row Slice (BRS) format, which encodes the adjacency matrix by using bitmaps to represent non-empty row segments; and (2) a warp-level algorithm that leverages TCUs for accelerating both SpMV and SpMSpV operations for enhanced BFS performance. The experimental results on three latest NVIDIA GPUs show that BerryBees outperforms five state-of-the-art BFS methods: GAP, Gunrock, Enterprise, GSWITCH, and GraphBLAST, and delivers average speedups of 1.42 ×, 1.97×, 5.05×, 1.24×, and 3.74× (up to 9.99×, 13.66×, 114.07×, 13.97×, and 24.74×), respectively.