CGO 2026
Sat 31 January - Wed 4 February 2026 Sydney, Australia
co-located with HPCA/PPoPP/CC 2026
Mon 2 Feb 2026 10:50 - 11:10 at Bronte - Compiling for ML 1 Chair(s): Albert Cohen

Modern AI accelerators adopt dataflow architectures to achieve both high peak throughput (TOPS) and energy efficiency (TOPS/W). These designs feature wide datapaths and hierarchical scratchpad memories that supply dense compute arrays with high-bandwidth data access and extensive operand reuse. Complementing the compute–memory subsystem is a lightweight control path that orchestrates data movement, program loading, and register initialization. To reduce energy and area overheads, conventional processor features—such as instruction caches, execution stacks, and branch speculation—are deliberately omitted. While this streamlined design maximizes efficiency, it shifts a critical responsibility onto the compiler: transforming complex kernels into highly compact instruction streams that must fit entirely within the limited instruction buffers (IBUFFs) of the accelerator’s programmable units.

In this paper, we introduce two novel compiler transformations, Loop Absorption (LA) and Loop Index Set Merging (LISM), for ultra-compact code generation. Loop Absorption merges isomorphic sibling operations into a single loop body, while LISM unifies adjacent loops with similar bodies into a single iteration space. Together, these complementary techniques eliminate redundant code patterns and produce compact hierarchical loop nests. We implement LA and LISM in the IBM Spyre compiler and evaluate them on diverse deep learning workloads, including ResNet-50, Inception-v3, SSD, and BERT-Large. Across these models, our combined approach achieves a geometric mean compression of 1.48× over the baseline, enabling layers that previously exceeded IBUFF capacity to compile successfully.
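The flavor of the two transformations can be conveyed with a toy sketch. The model below is purely illustrative and assumed for this example, not the Spyre compiler's IR or algorithms: instructions are plain strings, and a counted loop is a `(trip_count, body)` pair. `absorb` plays the role of LA, collapsing runs of isomorphic sibling operations into a single counted loop; `lism_merge` plays the role of LISM, fusing adjacent loops whose bodies match into one loop over the combined iteration space.

```python
def absorb(ops):
    """LA-like sketch: collapse runs of identical sibling ops
    into (trip_count, op) counted loops, shrinking the stream."""
    loops = []
    for op in ops:
        if loops and loops[-1][1] == op:
            count, body = loops[-1]
            loops[-1] = (count + 1, body)   # extend the current loop
        else:
            loops.append((1, op))           # start a new loop
    return loops

def lism_merge(loops):
    """LISM-like sketch: fuse adjacent loops with identical bodies
    into one loop whose trip count spans both iteration spaces."""
    merged = []
    for count, body in loops:
        if merged and merged[-1][1] == body:
            prev, _ = merged[-1]
            merged[-1] = (prev + count, body)  # unify iteration spaces
        else:
            merged.append((count, body))
    return merged

# Usage: a flat stream of 4 MACs and 2 loads becomes two loops,
# and two adjacent loops with the same body become one.
print(absorb(["mac", "mac", "mac", "mac", "load", "load"]))
print(lism_merge([(16, ("load", "mac")), (16, ("load", "mac")), (8, ("add",))]))
```

Real isomorphism checks would of course compare operand access patterns, not just opcodes, but the sketch captures why the two passes compose: LA first manufactures counted loops out of straight-line code, and LISM then merges neighboring loops into deeper, more compact nests.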

Mon 2 Feb

Displayed time zone: Hobart

09:50 - 11:10
Compiling for ML 1 (Main Conference) at Bronte
Chair(s): Albert Cohen Google DeepMind
09:50
20m
Talk
Enabling Spill-Free Compilation via Affine-Based Live Range Reduction Optimization
Main Conference
Pre-print
10:10
20m
Talk
GRANII: Selection and Ordering of Primitives in GRAph Neural Networks using Input Inspection
Main Conference
Damitha Lenadora University of Illinois at Urbana-Champaign, Vimarsh Sathia University of Illinois at Urbana-Champaign, Gerasimos Gerogiannis University of Illinois at Urbana-Champaign, Serif Yesil NVIDIA, Josep Torrellas University of Illinois at Urbana-Champaign, Charith Mendis University of Illinois at Urbana-Champaign
Pre-print
10:30
20m
Talk
Fast Autoscheduling for Sparse ML Frameworks
Main Conference
Bobby Yan Stanford University, Alexander J Root Stanford University, Trevor Gale Stanford University, David Broman KTH Royal Institute of Technology, Fredrik Kjolstad Stanford University
Pre-print
10:50
20m
Talk
Eliminating Redundancy: Ultra-compact Code Generation for Programmable Dataflow Accelerators
Main Conference
Prasanth Chatarasi IBM Research, Alex Gatea IBM, Bardia Mahjour IBM, Jintao Zhang Unaffiliated, Alberto Mannari IBM, Chris Bowler IBM, Shubham Jain IBM Research, Masoud Ataei Jaliseh IBM, Nicole Khoun IBM, Kamlesh Kumar Unaffiliated, Viji Srinivasan IBM Research, Swagath Venkataramani IBM Research
Pre-print