From Threads to Tiles: T2T, a Compiler for CUDA-to-NPU Translation via 2D Vectorization (CGO 2026 - Main Conference)

Who

Shuaijiang Li, Jiacheng Zhao, Ying Liu, Shuoming Zhang, Lei Chen, Yijin Li, Yangyu Zhang, lizhicheng, Runyu Zhou, Xiyu Shi, Chunwei Xia, Yuan Wen, Xiaobing Feng, Huimin Cui

Track

CGO 2026 Main Conference

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.

When

Mon 2 Feb 2026 16:50 - 17:10 at Bronte, in session Parallelization / Vectorization. Chair(s): V Krishna Nandivada

Abstract

CUDA's programming model, exposing massive parallelism via fine-grained scalar threads, has become the de facto standard for GPU computing. Concurrently, NPUs are emerging as highly efficient accelerators, but their architecture is fundamentally different, relying on coarse-grained, explicit 2-D tile-based instructions. This creates a critical challenge: bridging the semantic gap "From Threads to Tiles". A direct translation is infeasible, as it requires lifting the implicit parallelism of CUDA's scalar model into the explicit, multi-dimensional vector space of NPUs, a problem we formalize as a lifting challenge.

This paper introduces T2T, a compiler framework that automates this "Threads to Tiles" translation via the 2-D Vectorization technique. T2T first transforms a CUDA kernel's implicit SIMT parallelism into a structured, explicit loop nest via our Unified Parallelism Abstraction (UPA), making the parallelism analyzable. From this representation, T2T's core vectorization engine systematically selects optimal pairs of loops and maps them onto the NPU's 2-D tile instructions to maximize hardware utilization. To ensure correctness and handle performance-critical CUDA features, a final set of semantics-preserving optimizations is applied, including efficient control-flow management and vectorization of warp-level intrinsics.
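
The two steps described above can be pictured with a small sketch. It is only an illustration under stated assumptions, not T2T's implementation: the kernel, the grid and block sizes, and the helpers `tile_load` and `tile_fma` are all hypothetical stand-ins, and the "tile instruction" is emulated with plain Python lists. The first function writes a CUDA-style elementwise kernel as an explicit loop nest over block and thread indices; the second maps that pair of loops onto the two dimensions of a single tile operation.

```python
# Hypothetical illustration only; none of these names come from T2T.

GRID_DIM = 4    # number of thread blocks (assumed for the example)
BLOCK_DIM = 8   # threads per block (assumed for the example)


def kernel_body(a, b, out, block_idx, thread_idx):
    """Per-thread scalar work, as a CUDA kernel would express it."""
    i = block_idx * BLOCK_DIM + thread_idx
    out[i] = a[i] * 2.0 + b[i]


def run_as_loop_nest(a, b):
    """Implicit SIMT parallelism rewritten as an explicit two-level loop nest."""
    out = [0.0] * (GRID_DIM * BLOCK_DIM)
    for block_idx in range(GRID_DIM):
        for thread_idx in range(BLOCK_DIM):
            kernel_body(a, b, out, block_idx, thread_idx)
    return out


def tile_load(buf, rows, cols):
    """Stand-in for a 2-D tile load: view a flat buffer as rows x cols."""
    return [buf[r * cols:(r + 1) * cols] for r in range(rows)]


def tile_fma(tile_a, scale, tile_b):
    """Stand-in for one 2-D tile instruction: scale * A + B over a whole tile."""
    return [[x * scale + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(tile_a, tile_b)]


def run_as_tiles(a, b):
    """Both loops mapped onto the two tile dimensions; no scalar loop remains."""
    tile_a = tile_load(a, GRID_DIM, BLOCK_DIM)
    tile_b = tile_load(b, GRID_DIM, BLOCK_DIM)
    tile_out = tile_fma(tile_a, 2.0, tile_b)
    return [value for row in tile_out for value in row]
```

Both functions compute the same result for the same inputs; the difference is that the second expresses the work as one coarse-grained operation over a 2-D tile rather than as many scalar iterations. Which pair of loops to choose in a real kernel, and how to handle control flow and warp-level intrinsics, is what the paper addresses and is not modeled here.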

We implement T2T based on Polygeist and evaluate it on representative NPU architectures. On a diverse set of benchmarks, kernels translated by T2T achieve up to 73% of native CUDA performance on an A100 GPU and outperform baseline translation approaches by up to 6.9×. Our work demonstrates that a systematic, compiler-driven approach to 2-D vectorization is a principled and high-performance path for porting the rich CUDA ecosystem to the evolving landscape of NPU accelerators.

Link to Preprint

https://www.conference-publishing.com/Proc/CGO26/cgo26/cgo26main-p77-p

Shuaijiang Li

Institute of Computing Technology at Chinese Academy of Sciences

China

Jiacheng Zhao

Institute of Computing Technology at Chinese Academy of Sciences; University of Chinese Academy of Sciences; Zhongguancun Laboratory

Ying Liu

Institute of Computing Technology, Chinese Academy of Sciences

Shuoming Zhang

Institute of Computing Technology at Chinese Academy of Sciences

China

Lei Chen

University of Chinese Academy of Sciences

China

Yijin Li

Institute of Computing Technology at Chinese Academy of Sciences

China

Yangyu Zhang

Institute of Computing Technology, Chinese Academy of Sciences

China

lizhicheng

Institute of Computing Technology at Chinese Academy of Sciences

China

Runyu Zhou

Institute of Computing Technology at Chinese Academy of Sciences

China

Xiyu Shi

Institute of Computing Technology at Chinese Academy of Sciences

China

Chunwei Xia

University of Leeds

United Kingdom

Yuan Wen

University of Aberdeen

United Kingdom

Xiaobing Feng

ICT CAS

China

Huimin Cui

Institute of Computing Technology, Chinese Academy of Sciences

China

Session Program

Mon 2 Feb
Displayed time zone: Hobart

15:50 - 17:10	Parallelization / Vectorization (Main Conference) at Bronte. Chair(s): V Krishna Nandivada (IIT Madras)

15:50 20m Talk		Enabling Automatic Compiler-Driven Vectorization of Transformers (Main Conference). Shreya Alladi (University of Murcia), Alberto Ros (University of Murcia), Alexandra Jimborean (University of Murcia)
16:10 20m Talk		Unlocking Python Multithreading Capabilities using OpenMP-Based Programming with OMP4Py (Main Conference). César Piñeiro (University of Santiago de Compostela), Juan C. Pichel (University of Santiago de Compostela)
16:30 20m Talk		The Parallel-Semantics Program Dependence Graph for Parallel Optimization (Main Conference). Yian Su (Northwestern University), Brian Homerding (Northwestern University), Haocheng Gao (Northwestern University), Federico Sossai (Northwestern University), Yebin Chon (Princeton University), David I. August (Princeton University), Simone Campanoni (Google / Northwestern University)
16:50 20m Talk		From Threads to Tiles: T2T, a Compiler for CUDA-to-NPU Translation via 2D Vectorization (Main Conference). Shuaijiang Li (Institute of Computing Technology at Chinese Academy of Sciences), Jiacheng Zhao (Institute of Computing Technology at Chinese Academy of Sciences; University of Chinese Academy of Sciences; Zhongguancun Laboratory), Ying Liu (Institute of Computing Technology, Chinese Academy of Sciences), Shuoming Zhang (Institute of Computing Technology at Chinese Academy of Sciences), Lei Chen (University of Chinese Academy of Sciences), Yijin Li (Institute of Computing Technology at Chinese Academy of Sciences), Yangyu Zhang (Institute of Computing Technology, Chinese Academy of Sciences), lizhicheng (Institute of Computing Technology at Chinese Academy of Sciences), Runyu Zhou (Institute of Computing Technology at Chinese Academy of Sciences), Xiyu Shi (Institute of Computing Technology at Chinese Academy of Sciences), Chunwei Xia (University of Leeds), Yuan Wen (University of Aberdeen), Xiaobing Feng (ICT CAS), Huimin Cui (Institute of Computing Technology, Chinese Academy of Sciences)