Hugging Face SageMaker Workshop: Accelerate BERT Inference with Knowledge Distillation & AWS Inferentia
This workshop demonstrates how to accelerate BERT inference with optimization techniques such as knowledge distillation, making the model faster while preserving its accuracy. 🏎
In the workshop, you will learn how to apply knowledge distillation to compress a large model into a small one, and then compile the small model into an optimized Neuron model for AWS Inferentia. By the end of this process, the model's latency drops from 100ms+ to 5ms+ - a 20x improvement! 🤯 🏎
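As a rough illustration (not the workshop's exact training code), knowledge distillation trains the small student model to match the teacher's softened output distribution in addition to the ground-truth labels. The temperature and weighting below are assumed, illustrative values.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-label KL term (illustrative hyperparameters)."""
    # Soft targets: KL divergence between temperature-scaled distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```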
You will learn how to:
- Apply knowledge distillation with BERT-large as teacher and MiniLM as student
- Compile a Hugging Face Transformer model with AWS Neuron for AWS Inferentia (a compilation sketch follows this list)
- Deploy the distilled & optimized model to Amazon SageMaker for production-grade fast inference
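For the compilation step, a minimal sketch using the torch-neuron package might look like the following; the model name, sequence length, and example sentence are assumptions, and an AWS Neuron SDK environment is required.

```python
import torch
import torch.neuron  # from the AWS Neuron SDK (torch-neuron package for Inferentia/Inf1)
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "my-distilled-minilm"  # hypothetical name for the distilled student model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)

# Neuron compiles against fixed input shapes, so trace with a padded example input
example = tokenizer(
    "a sample sentence",
    max_length=128,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
example_inputs = (example["input_ids"], example["attention_mask"])

# Compile to a Neuron-optimized TorchScript graph that runs on Inferentia
neuron_model = torch.neuron.trace(model, example_inputs)
neuron_model.save("model_neuron.pt")
```

The saved artifact can then be packaged and deployed to a SageMaker endpoint backed by an Inf1 instance type, which is the production-deployment step the workshop walks through.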
This is a hands-on workshop: you will get temporary, free access to AWS accounts so you can participate and accelerate your own model.
Speaker and Presenter Information
Heiko Hotz
AI/ML Solutions Architect @ AWS
Lewis Tunstall
Machine Learning Engineer @ Hugging Face
Philipp Schmid
Tech Lead @ Hugging Face
Relevant Government Agencies
Other Federal Agencies, Federal Government, State & Local Government
Event Type
Webcast
This event has no exhibitor/sponsor opportunities
When
Wed, Apr 13, 2022, 12:00pm - 1:15pm ET
Cost
Complimentary: $0.00
Organizer
Hugging Face | Amazon Web Services (AWS)