Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References (CGO 2026 - Main Conference)

Who

Hongzheng Chen, Bin Fan, Alexander Collins, Bastian Hagedorn, Evghenii Gaburov, Masahiro Masuda, Matthew Brookhart, Chris Sullivan, Jason Knight, Zhiru Zhang, Vinod Grover

Track

CGO 2026 Main Conference

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.

When

Mon 2 Feb 2026 15:10 - 15:30 at Bronte - DSLs Chair(s): Olivia Hsu

Abstract

Modern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution. However, the conventional SIMT programming model is fundamentally misaligned with this task-parallel hardware, creating a significant programmability gap. While hardware-level warp specialization is the key to unlocking peak performance, it forces developers to manually orchestrate complex, low-level communication and software pipelines, a process that is labor-intensive, error-prone, and unsustainable. To address this challenge, we present Tawa, an automated compiler that systematically generates high-performance, warp-specialized code from a high-level, tile-based program. Central to our approach is a novel IR abstraction, asynchronous references (aref), which expresses warp-level communication without exposing low-level hardware details. Using this abstraction, Tawa automatically partitions programs into producer-consumer roles and manages the intricate dataflow pipeline, relieving developers of invasive kernel rewriting. Evaluation on NVIDIA H100 GPUs across representative LLM kernels shows that Tawa delivers high hardware utilization, achieving up to 1.1x speedup over highly optimized cuBLAS GEMM kernels. For attention workloads, Tawa attains 1.2x speedup over Triton and matches the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel with far less programming effort.
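
As a rough illustration of the producer-consumer structure the abstract describes, the sketch below uses CPU threads and a bounded queue as a stand-in for specialized warps and a small set of shared buffers. It is not Tawa's aref IR, its API, or its generated code; the names `run_pipeline` and `depth` and the sum-of-squares "compute" step are hypothetical and chosen only for illustration.

```python
import queue
import threading


def run_pipeline(tiles, depth=2):
    """Run a two-role pipeline over `tiles` using `depth` buffer slots.

    Illustrative assumption: one producer role "loads" tiles and one
    consumer role "computes" on them, communicating through a bounded buffer.
    """
    slots = queue.Queue(maxsize=depth)  # producer blocks when all slots are full
    results = []

    def producer():
        for tile in tiles:
            slots.put(tile)   # fill a free slot with the next tile
        slots.put(None)       # sentinel: no more tiles will arrive

    def consumer():
        while True:
            tile = slots.get()  # wait until a slot has been filled
            if tile is None:
                break
            # Stand-in for the per-tile computation.
            results.append(sum(x * x for x in tile))

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([[1, 2], [3, 4], [5]]))
```

With `depth` greater than one, the producer can run ahead of the consumer by up to `depth` tiles, which is the overlap of data movement and compute that a software pipeline aims for; writing the equivalent synchronization by hand on a GPU is the manual effort the abstract says Tawa automates.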

Link to Preprint

https://www.conference-publishing.com/Proc/CGO26/cgo26/cgo26main-p90-p

Hongzheng Chen

Cornell University

United States

Bin Fan

NVIDIA

United States

Alexander Collins

NVIDIA

United Kingdom

Bastian Hagedorn

NVIDIA

Germany

Evghenii Gaburov

NVIDIA

United States

Masahiro Masuda

NVIDIA

United States

Matthew Brookhart

NVIDIA

United States

Chris Sullivan

NVIDIA

United States

Jason Knight

NVIDIA

United States

Zhiru Zhang

Cornell University, USA

United States

Vinod Grover

NVIDIA

United States

Session Program

Mon 2 Feb
Displayed time zone: Hobart

14:10 - 15:30  DSLs (Main Conference) at Bronte. Chair(s): Olivia Hsu (Stanford University)

14:10, 20m, Talk: FORTE: Online DataFrame Query Optimizer. Yoonho Choi (POSTECH), Kyoungtae Lee (Seoul National University), Minji Kim (Ewha Womans University), Hyungsoo Jung (Seoul National University), Hyojin Sung (Seoul National University)
14:30, 20m, Talk: LEGO: A Layout Expression Language for Code Generation of Hierarchical Mapping. Amir Mohammad Tavakkoli (University of Utah), Cosmin E. Oancea (University of Copenhagen, Denmark), Mary Hall (University of Utah)
14:50, 20m, Talk: Pushing Tensor Accelerators beyond MatMul in a User-Schedulable Language. Yihong Zhang (University of Washington), Derek Gerstmann (Adobe), Andrew Adams (Adobe Research), Maaz Bin Safeer Ahmad (University of Washington, Seattle)
15:10, 20m, Talk: Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References. Hongzheng Chen (Cornell University), Bin Fan (NVIDIA), Alexander Collins (NVIDIA), Bastian Hagedorn (NVIDIA), Evghenii Gaburov (NVIDIA), Masahiro Masuda (NVIDIA), Matthew Brookhart (NVIDIA), Chris Sullivan (NVIDIA), Jason Knight (NVIDIA), Zhiru Zhang (Cornell University, USA), Vinod Grover (NVIDIA)