Enabling Automatic Compiler-Driven Vectorization of Transformers
This program is tentative and subject to change.
Compiling neural networks and Transformers for edge devices faces significant challenges, including resource constraints and a reliance on manually optimized operations for performance. These limitations hinder the scalability and portability of neural network applications on resource-constrained platforms, such as edge devices in the RISC-V ecosystem. To address these issues, this paper introduces techniques that overcome the inefficiencies of current compilation methods and reduce dependence on manual optimizations.
This work proposes a novel compilation flow, ONNX-MLIR-LLVM (OML), which leverages MLIR and LLVM IR to enable automatic optimizations and generate stand-alone RISC-V binaries. Through comprehensive analysis, we identify key barriers preventing the auto-vectorizer from handling vectorization-friendly operators, particularly reduction operations and vectorization-unfriendly data layouts. We address these through a versatile MLIR reduction detection pass and a compile-time transpose pass, respectively.
Our automatic transformations (OML-vect) unlock the capabilities of the MLIR affine super-vectorizer, reducing reliance on manual vectorization. Evaluations on x86 and RISC-V across eight neural network and Transformer models show that automatic vectorization via OML-vect achieves, on average, 94% and 91% on x86 and RISC-V, respectively, compared to the baseline. Compared to manually vectorized libraries, it delivers 2% and 8% more performance on x86 and RISC-V, respectively, offering an efficient and portable solution for edge device deployments.
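The two barriers named in the abstract, reductions and vectorization-unfriendly data layouts, can be illustrated with a small sketch. This is not the OML passes themselves; it is plain Python standing in for the loop shapes a compiler sees, and the names `VECTOR_WIDTH`, `scalar_sum`, `lane_sum`, `transpose`, and `column_sums` are hypothetical. Integers are used so that reordering the additions cannot change the result; with floating point, the reassociation shown here can alter rounding.

```python
VECTOR_WIDTH = 4  # hypothetical number of SIMD lanes, chosen for illustration


def scalar_sum(values):
    """Reduction with one accumulator: every iteration depends on the previous one."""
    acc = 0
    for v in values:
        acc += v
    return acc


def lane_sum(values, width=VECTOR_WIDTH):
    """The same reduction with one partial accumulator per lane."""
    lanes = [0] * width
    main = len(values) - len(values) % width
    for i in range(0, main, width):
        # The per-lane updates are independent, which is what a vector add needs.
        for lane in range(width):
            lanes[lane] += values[i + lane]
    total = sum(lanes)  # horizontal combine of the partial sums
    for v in values[main:]:  # scalar remainder loop for leftover elements
        total += v
    return total


def transpose(matrix):
    """Swap rows and columns so that a former column becomes a contiguous row."""
    return [list(col) for col in zip(*matrix)]


def column_sums(matrix):
    """Sum each column by first transposing, so each reduction walks one row."""
    return [lane_sum(row) for row in transpose(matrix)]
```

Here `lane_sum` shows the form a recognized reduction can take once it is split across lanes, and `column_sums` shows how a layout change turns a strided column walk into a contiguous row walk.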
Mon 2 Feb (displayed time zone: Hobart)
15:50 - 17:10
15:50 | 20m | Talk | Enabling Automatic Compiler-Driven Vectorization of Transformers | Main Conference | Shreya Alladi (University of Murcia), Alberto Ros (University of Murcia), Alexandra Jimborean (University of Murcia)
16:10 | 20m | Talk | Unlocking Python Multithreading Capabilities using OpenMP-Based Programming with OMP4Py | Main Conference | César Piñeiro (University of Santiago de Compostela), Juan C. Pichel (University of Santiago de Compostela)
16:30 | 20m | Talk | The Parallel-Semantics Program Dependence Graph for Parallel Optimization | Main Conference | Yian Su (Northwestern University), Brian Homerding (Northwestern University), Haocheng Gao (Northwestern University), Federico Sossai (Northwestern University), Yebin Chon (Princeton University), David I. August (Princeton University), Simone Campanoni (Google / Northwestern University)
16:50 | 20m | Talk | From Threads to Tiles: T2T, a Compiler for CUDA-to-NPU Translation via 2D Vectorization | Main Conference | Shuaijiang Li (Institute of Computing Technology at Chinese Academy of Sciences), Jiacheng Zhao (Institute of Computing Technology at Chinese Academy of Sciences; University of Chinese Academy of Sciences; Zhongguancun Laboratory), Ying Liu (Institute of Computing Technology, Chinese Academy of Sciences), Shuoming Zhang (Institute of Computing Technology at Chinese Academy of Sciences), Lei Chen (University of Chinese Academy of Sciences), Yijin Li (Institute of Computing Technology at Chinese Academy of Sciences), Yangyu Zhang (Institute of Computing Technology, Chinese Academy of Sciences), lizhicheng (Institute of Computing Technology at Chinese Academy of Sciences), Runyu Zhou (Institute of Computing Technology at Chinese Academy of Sciences), Xiyu Shi (Institute of Computing Technology at Chinese Academy of Sciences), Chunwei Xia (University of Leeds), Yuan Wen (University of Aberdeen), Xiaobing Feng (ICT CAS), Huimin Cui (Institute of Computing Technology, Chinese Academy of Sciences)