QuantPipe: Applying Adaptive Post-Training Quantization for Distributed Transformer Pipelines in Dynamic Edge Environments
- Submitted by:
- Haonan Wang
- Last updated:
- 19 May 2023 - 2:22pm
- Document Type:
- Presentation Slides
- Document Year:
- 2023
- Presenters:
- Haonan Wang
- Paper Code:
- https://github.com/usc-isi/PipeEdge
Pipeline parallelism has achieved great success in deploying large-scale transformer models in cloud environments, but has received less attention in edge environments. Unlike cloud scenarios with high-speed and stable network interconnects, edge systems have dynamic bandwidth that can degrade distributed pipeline performance. We address this issue with QuantPipe, a communication-efficient distributed edge system that introduces post-training quantization (PTQ) to compress the communicated tensors. QuantPipe uses adaptive PTQ to change bitwidths in response to bandwidth dynamics, maintaining transformer pipeline performance while incurring limited inference accuracy loss. We further improve accuracy with a directed-search analytical clipping for integer quantization method (DS-ACIQ), which bridges the gap between estimated and real data distributions. Experimental results show that QuantPipe adapts to dynamic bandwidth to maintain pipeline performance while achieving practical model accuracy across a wide range of quantization bitwidths, e.g., improving accuracy under 2-bit quantization by 15.85% on ImageNet compared to naive quantization.
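The abstract names two mechanisms: adaptive PTQ, which lowers the bitwidth of the activations sent between pipeline stages as the measured link bandwidth drops, and DS-ACIQ, which chooses a clipping threshold by searching over the observed tensor rather than relying only on an analytic distribution fit. The sketch below is a minimal NumPy illustration of those ideas; the function names, coarse-to-fine search schedule, and bandwidth thresholds are assumptions for illustration, not the authors' implementation (see the PipeEdge repository linked above).

```python
# Minimal sketch: adaptive-bitwidth uniform PTQ of inter-stage activations,
# with a directed (coarse-to-fine) search for the clipping threshold.
# All names and thresholds here are illustrative assumptions.
import numpy as np

def quantize_uniform(x: np.ndarray, clip: float, bits: int) -> np.ndarray:
    """Symmetric uniform quantize-dequantize of x to `bits` bits within [-clip, clip]."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 levels per side for 8 bits
    scale = clip / levels
    q = np.clip(np.round(x / scale), -levels, levels)
    return q * scale                       # dequantized values, used to measure error

def directed_search_clip(x: np.ndarray, bits: int, steps: int = 3, grid: int = 8) -> float:
    """Coarse-to-fine search for the clipping threshold that minimizes quantization MSE
    on the actual tensor, instead of assuming a Laplace/Gaussian fit."""
    lo, hi = 0.0, float(np.abs(x).max())
    best_clip, best_err = hi, np.inf
    for _ in range(steps):
        for clip in np.linspace(lo + (hi - lo) / grid, hi, grid):
            err = float(np.mean((x - quantize_uniform(x, clip, bits)) ** 2))
            if err < best_err:
                best_clip, best_err = clip, err
        # narrow the search interval around the current best threshold and refine
        width = (hi - lo) / grid
        lo, hi = max(0.0, best_clip - width), min(float(np.abs(x).max()), best_clip + width)
    return best_clip

def choose_bitwidth(bandwidth_mbps: float) -> int:
    """Pick a smaller bitwidth as measured bandwidth drops (illustrative thresholds)."""
    if bandwidth_mbps > 100:
        return 16
    if bandwidth_mbps > 25:
        return 8
    if bandwidth_mbps > 10:
        return 4
    return 2

# Example: compress the activations sent from one pipeline stage to the next.
activations = np.random.randn(1, 197, 768).astype(np.float32)  # ViT-style hidden states
bits = choose_bitwidth(bandwidth_mbps=18.0)                     # -> 4 bits here
clip = directed_search_clip(activations, bits)
compressed = quantize_uniform(activations, clip, bits)
# A real pipeline would transmit the integer codes plus the scale factor,
# and the receiving stage would dequantize before running its layers.
```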