Pushing Tensor Accelerators beyond MatMul in a User-Schedulable Language (CGO 2026 - Main Conference)

Who

Yihong Zhang, Derek Gerstmann, Andrew Adams, Maaz Bin Safeer Ahmad

Track

CGO 2026 Main Conference

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Mon 2 Feb 2026, 14:50 - 15:10, at Bronte. Session: DSLs. Chair(s): Olivia Hsu

Abstract

Tensor accelerators now represent a growing share of compute resources in modern CPUs and GPUs. However, they are hard to program, leading developers to use vendor-provided kernel libraries that support tensor accelerators. As a result, the usage of tensor accelerators is limited to the provided interface, mainly designed for traditional ML and scientific computing workloads.

In this paper, we show that tensor accelerators can improve the performance of applications beyond simple variants of MatMul. For example, many image processing pipelines are linear transformations over matrices in disguise and can therefore utilize such specialized hardware. This is nonetheless hindered by the difficulties in programming tensor accelerators. We tackle this problem with compiler-based techniques. We use the Halide user-schedulable language and express operations as Halide algorithms succinctly. To this end, we implement a flexible tensor instruction selector based on equality saturation. The tensor instruction selector supports both CPU- and GPU-attached tensor accelerators and works with existing scheduling operations (e.g., producer-consumer fusion). Together, this enables developers to write diverse accelerator-leveraging applications in a few dozen lines.
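The claim that image processing pipelines can be linear transformations "in disguise" can be illustrated with a small sketch. It is not taken from the paper and does not use Halide or any tensor accelerator. It assumes a 2-tap box filter with stride 2 as the downsampling routine, and the helper names (`downsample_direct`, `downsample_matrix`, `matmul`) are hypothetical. The sketch shows only that the strided filter applied to every column of an image gives the same result as one matrix product, which is the shape of computation that MatMul hardware accepts.

```python
def downsample_direct(signal, taps, stride):
    """Strided filter written as an explicit loop over output samples."""
    n_out = (len(signal) - len(taps)) // stride + 1
    return [
        sum(taps[k] * signal[stride * i + k] for k in range(len(taps)))
        for i in range(n_out)
    ]


def downsample_matrix(n_in, taps, stride):
    """Build the banded matrix D with D[i][stride * i + k] = taps[k]."""
    n_out = (n_in - len(taps)) // stride + 1
    D = [[0.0] * n_in for _ in range(n_out)]
    for i in range(n_out):
        for k, t in enumerate(taps):
            D[i][stride * i + k] = t
    return D


def matmul(A, B):
    """Plain triple-loop matrix product (a stand-in for an accelerated MatMul)."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


# Illustrative 8x4 "image" (8 rows, 4 columns), downsampled vertically by 2.
image = [[float(r * 4 + c) for c in range(4)] for r in range(8)]
taps = [0.5, 0.5]  # assumed box filter, not the filter used in the paper

D = downsample_matrix(len(image), taps, stride=2)  # 4x8 banded matrix
via_matmul = matmul(D, image)                      # 4x4 result

# Reference: filter each column independently, then reassemble the rows.
columns = [[image[r][c] for r in range(8)] for c in range(4)]
direct_cols = [downsample_direct(col, taps, 2) for col in columns]
direct = [[direct_cols[c][r] for c in range(4)] for r in range(4)]

assert via_matmul == direct
```

In this formulation the filter taps live in the matrix `D`, so changing the kernel or the stride changes only how `D` is built. The product itself keeps the same form.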

Using our system, we demonstrate the potential of tensor accelerators beyond their traditional domains. We implement several image processing pipelines (e.g., filtering, resampling, and denoising) in our system and evaluate them against non-accelerator-leveraging baselines. We show that these pipelines can achieve significant speedups. For example, a downsampling routine is sped up by $6.1\times$ by utilizing Tensor Cores on an Nvidia RTX 4070 GPU.

Link to Preprint

https://www.conference-publishing.com/Proc/CGO26/cgo26/cgo26main-p331-p

Yihong Zhang

University of Washington

United States

Derek Gerstmann

Adobe

United States

Andrew Adams

Adobe Research

United States

Maaz Bin Safeer Ahmad

University of Washington, Seattle

United States

Session Program

Mon 2 Feb
Displayed time zone: Hobart

14:10 - 15:30	DSLs (Main Conference) at Bronte. Chair(s): Olivia Hsu (Stanford University)

14:10 (20m) Talk: FORTE: Online DataFrame Query Optimizer. Main Conference. Yoonho Choi (POSTECH), Kyoungtae Lee (Seoul National University), Minji Kim (Ewha Womans University), Hyungsoo Jung (Seoul National University), Hyojin Sung (Seoul National University)
14:30 (20m) Talk: LEGO: A Layout Expression Language for Code Generation of Hierarchical Mapping. Main Conference. Amir Mohammad Tavakkoli (University of Utah), Cosmin E. Oancea (University of Copenhagen, Denmark), Mary Hall (University of Utah)
14:50 (20m) Talk: Pushing Tensor Accelerators beyond MatMul in a User-Schedulable Language. Main Conference. Yihong Zhang (University of Washington), Derek Gerstmann (Adobe), Andrew Adams (Adobe Research), Maaz Bin Safeer Ahmad (University of Washington, Seattle)
15:10 (20m) Talk: Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References. Main Conference. Hongzheng Chen (Cornell University), Bin Fan (NVIDIA), Alexander Collins (NVIDIA), Bastian Hagedorn (NVIDIA), Evghenii Gaburov (NVIDIA), Masahiro Masuda (NVIDIA), Matthew Brookhart (NVIDIA), Chris Sullivan (NVIDIA), Jason Knight (NVIDIA), Zhiru Zhang (Cornell University, USA), Vinod Grover (NVIDIA)