Stimpack: Convertible Neural Processor Supporting Adaptive Quantization for Real-Time Neural Network
Description
This paper presents Stimpack, a non-deterministic approach that combines a convertible neural processing unit (NPU) with adaptive network quantization to mitigate tail latency under massive neural network (NN) inference load. The key idea of Stimpack is to choose the network quantization level and core conversion according to the current load. Stimpack normally performs like a conventional architecture, but when a service-level objective (SLO) violation is expected, it quantizes networks to halved data precision and computes the quantized networks at doubled throughput. Compared with a state-of-the-art NPU, Stimpack achieves 51.1% higher throughput and sustains 45.1% higher load on average while satisfying SLOs with near-ideal accuracy.
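The mode-switching policy described above can be illustrated with a minimal sketch. This is not Stimpack's actual controller; the function and parameter names are assumptions, and the latency model (pending requests divided by service rate) is a deliberately simple stand-in for whatever predictor the hardware uses.

```python
# Hypothetical sketch of a load-based precision decision (all names are
# assumptions, not Stimpack's real interface).
def choose_mode(queue_depth: int, full_prec_rate: float, slo_s: float) -> str:
    """Pick full- or half-precision execution from the expected queueing delay.

    queue_depth:     number of pending inference requests
    full_prec_rate:  requests per second at full precision
    slo_s:           service-level objective (seconds) for the last queued request
    """
    expected_latency = queue_depth / full_prec_rate
    if expected_latency <= slo_s:
        # Normal operation: conventional-architecture behavior, best accuracy.
        return "full-precision"
    # An SLO violation is expected: halving data precision doubles throughput,
    # halving the expected latency at the cost of some accuracy.
    return "half-precision"
```

Under this toy model, a queue of 50 requests at 100 req/s against a 0.2 s SLO would trigger the half-precision mode, while a queue of 10 would not.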
Time
Wednesday, July 12th, 6:00pm - 7:00pm PDT
Location
Level 2 Lobby