From Threads to Tiles: T2T, a Compiler for CUDA-to-NPU Translation via 2D Vectorization
CUDA's programming model, exposing massive parallelism via fine-grained scalar threads, has become the de facto standard for GPU computing. Concurrently, NPUs are emerging as highly efficient accelerators, but their architecture is fundamentally different, relying on coarse-grained, explicit 2-D tile-based instructions. This creates a critical challenge: bridging the semantic gap "from threads to tiles." A direct translation is infeasible, as it requires lifting the implicit parallelism of CUDA's scalar model into the explicit, multi-dimensional vector space of NPUs, a problem we formalize as a lifting challenge.
This paper introduces T2T, a compiler framework that automates this "threads to tiles" translation via a 2-D vectorization technique. T2T first transforms a CUDA kernel's implicit SIMT parallelism into a structured, explicit loop nest via our Unified Parallelism Abstraction (UPA), making the parallelism analyzable. From this representation, T2T's core vectorization engine systematically selects optimal pairs of loops and maps them onto the NPU's 2-D tile instructions to maximize hardware utilization. To ensure correctness and handle performance-critical CUDA features, a final set of semantics-preserving optimizations is applied, including efficient control-flow management and vectorization of warp-level intrinsics.
We implement T2T based on Polygeist and evaluate it on representative NPU architectures. On a diverse set of benchmarks, kernels translated by T2T achieve up to 73% of native CUDA performance on an A100 GPU and outperform baseline translation approaches by up to 6.9×. Our work demonstrates that a systematic, compiler-driven approach to 2-D vectorization is a principled and high-performance path for porting the rich CUDA ecosystem to the evolving landscape of NPU accelerators.
Mon 2 Feb (displayed time zone: Hobart)
15:50 - 17:10

15:50 (20 min talk): Enabling Automatic Compiler-Driven Vectorization of Transformers. Main Conference.
Shreya Alladi, Alberto Ros, Alexandra Jimborean (University of Murcia). Pre-print; media attached.

16:10 (20 min talk): Unlocking Python Multithreading Capabilities using OpenMP-Based Programming with OMP4Py. Main Conference.
César Piñeiro, Juan C. Pichel (University of Santiago de Compostela). Pre-print; media attached.

16:30 (20 min talk): The Parallel-Semantics Program Dependence Graph for Parallel Optimization. Main Conference.
Yian Su (Northwestern University), Brian Homerding (Northwestern University), Haocheng Gao (Northwestern University), Federico Sossai (Northwestern University), Yebin Chon (Princeton University), David I. August (Princeton University), Simone Campanoni (Google / Northwestern University). Pre-print; media attached.

16:50 (20 min talk): From Threads to Tiles: T2T, a Compiler for CUDA-to-NPU Translation via 2D Vectorization. Main Conference.
Shuaijiang Li (Institute of Computing Technology, Chinese Academy of Sciences), Jiacheng Zhao (Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Zhongguancun Laboratory), Ying Liu (Institute of Computing Technology, Chinese Academy of Sciences), Shuoming Zhang (Institute of Computing Technology, Chinese Academy of Sciences), Lei Chen (University of Chinese Academy of Sciences), Yijin Li (Institute of Computing Technology, Chinese Academy of Sciences), Yangyu Zhang (Institute of Computing Technology, Chinese Academy of Sciences), Zhicheng Li (Institute of Computing Technology, Chinese Academy of Sciences), Runyu Zhou (Institute of Computing Technology, Chinese Academy of Sciences), Xiyu Shi (Institute of Computing Technology, Chinese Academy of Sciences), Chunwei Xia (University of Leeds), Yuan Wen (University of Aberdeen), Xiaobing Feng (Institute of Computing Technology, Chinese Academy of Sciences), Huimin Cui (Institute of Computing Technology, Chinese Academy of Sciences). Pre-print.