Efficient Transformer Inference with Statically Structured Sparse Attention
Description
Self-attention matrices of Transformers are sparse because the relevant context of each token is limited to just a few other tokens. To reduce the burden of self-attention on Transformer inference, we propose static, structured, sparse attention masks that split attention matrices into dense regions, skipping computations outside these regions while reducing computations inside them. To support the proposed masks, we design an entropy-aware finetuning algorithm that encourages attention sparsity. We further modify a deep learning accelerator to exploit our sparsity pattern. Compared to a dense baseline, we achieve 55% energy reduction with <1% accuracy loss and 5% area overhead.
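To illustrate the general idea of skipping computation outside dense regions of a static mask, here is a minimal sketch of block-sparse attention. This is an assumed toy example (diagonal blocks only), not the paper's actual mask structure, finetuning algorithm, or accelerator design:

```python
import numpy as np

def block_sparse_attention(Q, K, V, block=2):
    """Toy block-diagonal sparse attention: scores are computed only
    inside dense diagonal blocks of a static mask; all work outside
    these blocks is skipped entirely."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for start in range(0, n, block):
        end = min(start + block, n)
        # Attention scores only between queries and keys of the same block
        scores = Q[start:end] @ K[start:end].T / np.sqrt(d)
        # Numerically stable softmax over the block's keys
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[start:end] = w @ V[start:end]
    return out
```

With sequence length n and block size b, this performs n*b score computations instead of the dense n*n, which is where the energy savings of structured sparsity come from.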
Time
Wednesday, July 12, 2:10pm - 2:25pm PDT
Location
3012, 3rd Floor
AI/ML Architecture Design