CGO 2026
Sat 31 January – Wed 4 February 2026, Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026

Recent GPUs integrate specialized hardware for low-precision arithmetic (e.g., FP16, INT8), offering substantial speedups for tensor operations. However, existing methods typically rely on coarse, operator-level trial-and-error tuning, which restricts the performance–accuracy trade-off space and limits achievable gains.

We present Platensor, a progressive low-precision approximation framework that expands this trade-off space through fine-grained, tile-level strategies. The key idea is to exploit the tiled computation patterns of GPUs to enable flexible precision control and richer optimization opportunities. Platensor performs a two-phase exploration: a fast rule-based pass that selects promising tile-level configurations, followed by an evolutionary search that refines them. It then automatically generates optimized kernels that combine tiles of different precisions.
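The two-phase exploration can be illustrated with a small, self-contained sketch. This is not Platensor's actual API or cost model; the tile sizes, precision set, and throughput/error numbers below are hypothetical stand-ins, used only to show the shape of a rule-based seeding pass followed by an evolutionary refinement over per-tile precision assignments.

```python
# Illustrative sketch only -- not Platensor's implementation. Assumed:
# a GEMM split into 8 tiles, three candidate precisions, and toy
# per-precision throughput/error proxies.
import random

random.seed(0)

PRECISIONS = ["fp32", "fp16", "int8"]
SPEED = {"fp32": 1.0, "fp16": 2.0, "int8": 4.0}    # assumed relative throughput
ERROR = {"fp32": 0.0, "fp16": 1e-3, "int8": 1e-2}  # assumed per-tile error proxy

def rule_based(tile_ranges):
    """Phase 1: a fast rule -- tiles with a large dynamic range keep
    fp16, small-range tiles drop to int8."""
    return ["fp16" if r > 10.0 else "int8" for r in tile_ranges]

def fitness(cfg, err_budget=0.05):
    """Score a configuration: mean tile throughput, heavily penalized
    if the accumulated error proxy exceeds the budget."""
    err = sum(ERROR[p] for p in cfg)
    speed = sum(SPEED[p] for p in cfg) / len(cfg)
    return speed if err <= err_budget else speed - 100.0 * (err - err_budget)

def evolve(cfg, generations=50):
    """Phase 2: mutate one tile's precision per generation, keeping
    only configurations that improve the fitness."""
    best, best_f = cfg[:], fitness(cfg)
    for _ in range(generations):
        cand = best[:]
        cand[random.randrange(len(cand))] = random.choice(PRECISIONS)
        f = fitness(cand)
        if f > best_f:
            best, best_f = cand, f
    return best

tile_ranges = [random.uniform(0.0, 20.0) for _ in range(8)]  # per-tile dynamic ranges
seed_cfg = rule_based(tile_ranges)   # phase 1: promising starting point
final_cfg = evolve(seed_cfg)         # phase 2: evolutionary refinement
print(seed_cfg)
print(final_cfg)
```

A real system would replace the toy fitness function with measured kernel timings and numerical error on the target GPU, and would then emit a fused kernel whose tiles execute at the selected precisions.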

Experiments on GEMM operators and representative applications—including kNN, LLMs, and HPL-MxP—show that Platensor significantly broadens the attainable performance–accuracy trade-offs and more fully leverages low-precision arithmetic on modern GPUs compared to operator-level tuning.