Fast and Accurate GPU Quantization for Transformers

· 23 min read

As Transformer models increase in size, the computational cost of running inference grows with them. Many organisations now face the challenge of deploying state-of-the-art models cost-effectively.