QIGen: A Kernel Generator for Inference on Nonuniformly Quantized Large Language Models
This program is tentative and subject to change.
Efficient inference on large language models (LLMs) has become a popular topic in both academia and industry. Roughly speaking, LLMs consist of a collection of weight matrices, and generative inference on these models essentially computes a sequence of matrix-vector products and thus can be heavily memory-bound. Consequently, much work has been devoted to reducing the size of the weights to lower bit-widths through various forms of quantization. In turn, these diverse precision formats complicate both the arithmetic and the implementation of optimized kernels.
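To make the core operation concrete, the following is a minimal pure-Python sketch (our own illustration, not QIGen's generated code) of a matrix-vector product over weights stored in symmetric 4-bit form with one scale per row; each quantized weight is dequantized on the fly during the dot product:

```python
def quantize_row(row, bits=4):
    """Symmetric uniform quantization of one weight row to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
    scale = max(abs(w) for w in row) / qmax or 1.0   # avoid zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale

def qmatvec(qrows, scales, x):
    """y = W x, where W is stored as quantized rows plus per-row scales."""
    return [scale * sum(q * xj for q, xj in zip(qrow, x))
            for qrow, scale in zip(qrows, scales)]

# Quantize a tiny 2x3 weight matrix, then multiply by a vector.
W = [[0.5, -1.0, 0.25], [2.0, 0.0, -2.0]]
qrows, scales = zip(*(quantize_row(r) for r in W))
y = qmatvec(qrows, scales, [1.0, 1.0, 1.0])
```

Since the compressed weights are read once per generated token, shrinking them from 16-bit floats to a few bits per entry directly reduces the memory traffic that bounds this computation.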
So far, the vast majority of implementation work for mixed-precision LLM computation has been done manually. Currently, one of the most powerful and complex scalar LLM compression techniques is \emph{nonuniform} quantization, in which a matrix is divided unevenly into parts that are quantized with different bit-widths, minimizing the output compression error. In this paper, we present QIGen, the first kernel generator for LLM inference on CPUs to support nonuniform quantization in full generality. Given a nonuniformly quantized LLM and target CPU characteristics, QIGen generates the diverse set of custom matrix-vector product kernels required and combines them with a suitable storage format.
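The idea behind nonuniform quantization can be illustrated with a toy example (a hypothetical sketch of the principle, not QIGen's actual partitioning or storage format): isolating the largest-magnitude weights in a small high-bit-width group keeps outliers from inflating the quantization scale of the remaining low-bit weights, reducing the output error:

```python
def quantize_group(values, bits):
    """Symmetric uniform quantization of a group sharing one scale; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def nonuniform_quantize_row(row, low_bits=3, high_bits=8):
    """Quantize the largest-magnitude quarter of a row at high_bits (with its
    own scale) and the rest at low_bits."""
    k = max(1, len(row) // 4)
    order = sorted(range(len(row)), key=lambda i: -abs(row[i]))
    deq = [0.0] * len(row)
    for idx, bits in ((order[:k], high_bits), (order[k:], low_bits)):
        if not idx:
            continue
        for i, v in zip(idx, quantize_group([row[i] for i in idx], bits)):
            deq[i] = v
    return deq

row = [0.1, -0.05, 3.0, 0.2]
err = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
e_uniform = err(row, quantize_group(row, 3))        # one 3-bit group: outlier sets the scale
e_nonuni = err(row, nonuniform_quantize_row(row))   # outlier isolated at 8 bits
```

In this toy row, quantizing everything uniformly at 3 bits forces the scale to cover the outlier 3.0, so the small weights round to zero; splitting the outlier into its own 8-bit group lowers the squared error by over an order of magnitude. Handling such mixed-bit-width layouts efficiently is exactly what makes hand-writing the corresponding kernels laborious.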
We benchmark and analyze QIGen-generated code in various experiments. In particular, we show that our code is Pareto-optimal in terms of performance and accuracy with respect to the most widely used open-source tool. We observe speedups of up to $1.3\times$ for matrix-vector and up to $3.4\times$ for matrix-matrix computations, even when using uniform quantization.
Tue 3 Feb (displayed time zone: Hobart)
15:50 - 17:10 | Compiling for ML 2 | Main Conference at Bronte | Chair(s): Fabrice Rastello (University Grenoble Alpes - Inria - CNRS - Grenoble INP - LIG)
15:50 (20m, Talk) | QIGen: A Kernel Generator for Inference on Nonuniformly Quantized Large Language Models | Main Conference | Pre-print, Media Attached
16:10 (20m, Talk) | DyPARS: Dynamic-Shape DNN Optimization via Pareto-Aware MCTS for Graph Variants | Main Conference | Hao Qian (University of New South Wales), Guangli Li (Institute of Computing Technology, Chinese Academy of Sciences), Qiuchu Yu (Institute of Computing Technology, Chinese Academy of Sciences), Xueying Wang (Beijing University of Posts and Telecommunications), Jingling Xue (University of New South Wales) | Pre-print, Media Attached
16:30 (20m, Talk) | Compiler-Runtime Co-operative Chain of Verification for LLM-Based Code Optimization | Main Conference | Hyunho Kwon (Yonsei University), Sanggyu Shin (SAIT), Ju Min Lee (Yonsei University), Hoyun Youm (Yonsei University), Seungbin Song (SAIT), Seongho Kim (Yonsei University), Hanwoong Jung (Samsung Advanced Institute of Technology), Seungwon Lee (Samsung Advanced Institute of Technology), Hanjun Kim (Yonsei University) | Pre-print
16:50 (20m, Talk) | Hexcute: A Compiler Framework for Automating Layout Synthesis in GPU Programs | Main Conference | Xiao Zhang (University of Toronto; NVIDIA), Yaoyao Ding (University of Toronto; Vector Institute; NVIDIA), Bolin Sun (University of Toronto; NVIDIA), Yang Hu (NVIDIA), Tatiana Shpeisman (Google), Gennady Pekhimenko (University of Toronto; Vector Institute) | Pre-print, Media Attached