Effective Tiling for the Snitch Cluster
In this work, we explore tiling for a novel deep learning (DL) accelerator cluster called the Snitch Cluster. Unlike systolic-array-based accelerators, Snitch compute cores combine streaming registers, hardware loops, and pipelining to achieve high FPU utilization on loop-intensive computations such as matrix multiplication. To make informed decisions, schedulers for DL workloads must account for the low-level scheduling details of their target hardware; in the case of Snitch, these are custom RISC-V instructions. Currently, no cost models are readily available to guide scheduling on Snitch.
We present Myrtle, a tiling cost model for an 8-core Snitch Cluster, parameterized by three categories of input: application, hardware, and low-level scheduling details. We combine memory footprint calculation, streaming register configuration counts, identification of streaming versus regular register loads, and pruning heuristics to generate a promising search space and automatically select a near-optimal tile size. Building upon a Snitch-specific tile layout, we aim to take advantage of the regularity of the Snitch architecture to develop a highly interpretable cost model trained with Support Vector Regression (SVR) and Generalized Additive Models (GAMs).
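To make the footprint-and-pruning step concrete, the following is a minimal Python sketch of how a tile search space for matrix multiplication might be generated. It is not Myrtle itself. The function names, the double-buffering factor, the 8-byte element size, the rule that the row tile be divisible by the core count, and the capacity value used in the example are all assumptions chosen for illustration. Ranking by footprint is only a placeholder for the learned cost model.

```python
from itertools import product


def divisors(n):
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def tile_footprint_bytes(tm, tn, tk, elem_bytes=8, double_buffer=True):
    """Local-memory bytes for one A tile (tm x tk), B tile (tk x tn), C tile (tm x tn).

    elem_bytes=8 and the 2x double-buffering factor are illustrative assumptions.
    """
    elems = tm * tk + tk * tn + tm * tn
    factor = 2 if double_buffer else 1
    return elems * elem_bytes * factor


def candidate_tiles(M, N, K, capacity_bytes, num_cores=8):
    """Enumerate tile sizes that evenly divide the problem and survive pruning.

    Pruning rules (both hypothetical stand-ins for the paper's heuristics):
      1. the tile footprint must fit in local memory;
      2. the row tile must split evenly across the compute cores.
    """
    kept = []
    for tm, tn, tk in product(divisors(M), divisors(N), divisors(K)):
        if tm % num_cores != 0:
            continue
        footprint = tile_footprint_bytes(tm, tn, tk)
        if footprint > capacity_bytes:
            continue
        kept.append(((tm, tn, tk), footprint))
    # Placeholder ranking: prefer tiles that use more of the local memory.
    kept.sort(key=lambda c: c[1], reverse=True)
    return kept


if __name__ == "__main__":
    # capacity_bytes here is a made-up value, not a Snitch hardware parameter.
    for tile, footprint in candidate_tiles(16, 16, 16, capacity_bytes=4096)[:5]:
        print(tile, footprint)
```

In a full cost model, the surviving candidates would be scored with further features, such as streaming register configuration counts, rather than by footprint alone.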