QIGen: A Kernel Generator for Inference on Nonuniformly Quantized Large Language Models
This program is tentative and subject to change.
Efficient inference on large language models (LLMs) has become a popular topic in both academia and industry. Roughly speaking, LLMs consist of a collection of weight matrices, and generative inference on these models essentially computes a sequence of matrix-vector products and thus can be heavily memory-bound. Consequently, much work has been devoted to reducing the size of the weights to lower bit-widths through various forms of quantization. In turn, these diverse precision formats complicate both the arithmetic and the implementation of optimized kernels.
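To make the core operation concrete, the following is a minimal pure-Python sketch (our own illustration, not QIGen's generated code) of a matrix-vector product over weights stored in symmetric 4-bit form with one scale per row; each quantized weight is dequantized on the fly during the dot product:

```python
def quantize_row(row, bits=4):
    """Symmetric uniform quantization of one weight row to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
    scale = max(abs(w) for w in row) / qmax or 1.0   # avoid zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale

def qmatvec(qrows, scales, x):
    """y = W x, where W is stored as quantized rows plus per-row scales."""
    return [scale * sum(q * xj for q, xj in zip(qrow, x))
            for qrow, scale in zip(qrows, scales)]

# Quantize a tiny 2x3 weight matrix, then multiply by a vector.
W = [[0.5, -1.0, 0.25], [2.0, 0.0, -2.0]]
qrows, scales = zip(*(quantize_row(r) for r in W))
y = qmatvec(qrows, scales, [1.0, 1.0, 1.0])
```

Since the compressed weights are read once per generated token, shrinking them from 16-bit floats to a few bits per entry directly reduces the memory traffic that bounds this computation.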
So far, the vast majority of implementation work for mixed-precision LLM computation has been done manually. Currently, one of the most powerful and complex scalar LLM compression techniques is \emph{nonuniform} quantization, in which a matrix is divided unevenly into parts that are quantized with different bit-widths, minimizing the output compression error. In this paper, we present QIGen, the first kernel generator for LLM inference on CPUs to support nonuniform quantization in full generality. Given a nonuniformly quantized LLM and target CPU characteristics, QIGen generates the diverse set of custom matrix-vector product kernels required and combines them with a suitable storage format.
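The idea behind nonuniform quantization can be illustrated with a toy example (a hypothetical sketch of the principle, not QIGen's actual partitioning or storage format): isolating the largest-magnitude weights in a small high-bit-width group keeps outliers from inflating the quantization scale of the remaining low-bit weights, reducing the output error:

```python
def quantize_group(values, bits):
    """Symmetric uniform quantization of a group sharing one scale; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def nonuniform_quantize_row(row, low_bits=3, high_bits=8):
    """Quantize the largest-magnitude quarter of a row at high_bits (with its
    own scale) and the rest at low_bits."""
    k = max(1, len(row) // 4)
    order = sorted(range(len(row)), key=lambda i: -abs(row[i]))
    deq = [0.0] * len(row)
    for idx, bits in ((order[:k], high_bits), (order[k:], low_bits)):
        if not idx:
            continue
        for i, v in zip(idx, quantize_group([row[i] for i in idx], bits)):
            deq[i] = v
    return deq

row = [0.1, -0.05, 3.0, 0.2]
err = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
e_uniform = err(row, quantize_group(row, 3))        # one 3-bit group: outlier sets the scale
e_nonuni = err(row, nonuniform_quantize_row(row))   # outlier isolated at 8 bits
```

In this toy row, quantizing everything uniformly at 3 bits forces the scale to cover the outlier 3.0, so the small weights round to zero; splitting the outlier into its own 8-bit group lowers the squared error by over an order of magnitude. Handling such mixed-bit-width layouts efficiently is exactly what makes hand-writing the corresponding kernels laborious.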
We benchmark and analyze QIGen-generated code in various experiments. In particular, we show that our code is Pareto-optimal in terms of performance and accuracy with respect to the most widely used open-source tool. We observe speedups of up to $1.3\times$ for matrix-vector and up to $3.4\times$ for matrix-matrix computations, even when using uniform quantization.
Tue 3 Feb (displayed time zone: Hobart)
15:50 - 17:10 | Compiling for ML 2 | Main Conference at Bronte | Chair(s): Fabrice Rastello (University Grenoble Alpes - Inria - CNRS - Grenoble INP - LIG)
15:50 (20m, Talk) | QIGen: A Kernel Generator for Inference on Nonuniformly Quantized Large Language Models | Main Conference | Pre-print, Media Attached
16:10 (20m, Talk) | DyPARS: Dynamic-Shape DNN Optimization via Pareto-Aware MCTS for Graph Variants | Main Conference | Hao Qian (University of New South Wales), Guangli Li (Institute of Computing Technology, Chinese Academy of Sciences), Qiuchu Yu (Institute of Computing Technology, Chinese Academy of Sciences), Xueying Wang (Beijing University of Posts and Telecommunications), Jingling Xue (University of New South Wales) | Pre-print, Media Attached
16:30 (20m, Talk) | Compiler-Runtime Co-operative Chain of Verification for LLM-Based Code Optimization | Main Conference | Hyunho Kwon (Yonsei University), Sanggyu Shin (SAIT), Ju Min Lee (Yonsei University), Hoyun Youm (Yonsei University), Seungbin Song (SAIT), Seongho Kim (Yonsei University), Hanwoong Jung (Samsung Advanced Institute of Technology), Seungwon Lee (Samsung Advanced Institute of Technology), Hanjun Kim (Yonsei University) | Pre-print
16:50 (20m, Talk) | Hexcute: A Compiler Framework for Automating Layout Synthesis in GPU Programs | Main Conference | Xiao Zhang (University of Toronto; NVIDIA), Yaoyao Ding (University of Toronto; Vector Institute; NVIDIA), Bolin Sun (University of Toronto; NVIDIA), Yang Hu (NVIDIA), Tatiana Shpeisman (Google), Gennady Pekhimenko (University of Toronto; Vector Institute) | Pre-print, Media Attached