CGO 2026
Sat 31 January – Wed 4 February 2026, Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026

Recent GPUs integrate specialized hardware for low-precision arithmetic (e.g., FP16, INT8), offering substantial speedups for tensor operations. However, existing methods typically rely on coarse, operator-level trial-and-error tuning, which restricts the performance–accuracy trade-off space and limits achievable gains.

We present Platensor, a progressive low-precision approximation framework that expands this trade-off space through fine-grained, tile-level strategies. The key idea is to exploit the tiled computation patterns of GPUs to enable flexible precision control and richer optimization opportunities. Platensor performs a two-phase exploration: a fast rule-based pass that selects promising tile-level configurations, followed by an evolutionary search that refines them. It then automatically generates optimized kernels that combine tiles of different precisions.
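The two-phase exploration can be illustrated with a small, self-contained sketch. This is not Platensor's actual API or cost model; the tile sizes, precision set, and throughput/error numbers below are hypothetical stand-ins, used only to show the shape of a rule-based seeding pass followed by an evolutionary refinement over per-tile precision assignments.

```python
# Illustrative sketch only -- not Platensor's implementation. Assumed:
# a GEMM split into 8 tiles, three candidate precisions, and toy
# per-precision throughput/error proxies.
import random

random.seed(0)

PRECISIONS = ["fp32", "fp16", "int8"]
SPEED = {"fp32": 1.0, "fp16": 2.0, "int8": 4.0}    # assumed relative throughput
ERROR = {"fp32": 0.0, "fp16": 1e-3, "int8": 1e-2}  # assumed per-tile error proxy

def rule_based(tile_ranges):
    """Phase 1: a fast rule -- tiles with a large dynamic range keep
    fp16, small-range tiles drop to int8."""
    return ["fp16" if r > 10.0 else "int8" for r in tile_ranges]

def fitness(cfg, err_budget=0.05):
    """Score a configuration: mean tile throughput, heavily penalized
    if the accumulated error proxy exceeds the budget."""
    err = sum(ERROR[p] for p in cfg)
    speed = sum(SPEED[p] for p in cfg) / len(cfg)
    return speed if err <= err_budget else speed - 100.0 * (err - err_budget)

def evolve(cfg, generations=50):
    """Phase 2: mutate one tile's precision per generation, keeping
    only configurations that improve the fitness."""
    best, best_f = cfg[:], fitness(cfg)
    for _ in range(generations):
        cand = best[:]
        cand[random.randrange(len(cand))] = random.choice(PRECISIONS)
        f = fitness(cand)
        if f > best_f:
            best, best_f = cand, f
    return best

tile_ranges = [random.uniform(0.0, 20.0) for _ in range(8)]  # per-tile dynamic ranges
seed_cfg = rule_based(tile_ranges)   # phase 1: promising starting point
final_cfg = evolve(seed_cfg)         # phase 2: evolutionary refinement
print(seed_cfg)
print(final_cfg)
```

A real system would replace the toy fitness function with measured kernel timings and numerical error on the target GPU, and would then emit a fused kernel whose tiles execute at the selected precisions.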

Experiments on GEMM operators and representative applications—including kNN, LLMs, and HPL-MxP—show that Platensor significantly broadens the attainable performance–accuracy trade-offs and more fully leverages low-precision arithmetic on modern GPUs compared to operator-level tuning.