
QuantPipe: Applying Adaptive Post-Training Quantization for Distributed Transformer Pipelines in Dynamic Edge Environments

Citation Author(s):
Haonan Wang, Connor Imes, Souvik Kundu, Peter A. Beerel, Stephen P. Crago, John Paul Walters
Submitted by:
Haonan Wang
Last updated:
19 May 2023 - 2:22pm
Document Type:
Presentation Slides
Document Year:
Haonan Wang
Paper Code:

Pipeline parallelism has achieved great success in deploying large-scale transformer models in cloud environments, but has received less attention in edge environments. Unlike in cloud scenarios with high-speed and stable network interconnects, dynamic bandwidth in edge systems can degrade distributed pipeline performance. We address this issue with QuantPipe, a communication-efficient distributed edge system that introduces post-training quantization (PTQ) to compress the communicated tensors. QuantPipe uses adaptive PTQ to change bitwidths in response to bandwidth dynamics, maintaining transformer pipeline performance while incurring limited inference accuracy loss. We further improve the accuracy with a directed-search analytical clipping for integer quantization method (DS-ACIQ), which bridges the gap between estimated and real data distributions. Experimental results show that QuantPipe adapts to dynamic bandwidth to maintain pipeline performance while achieving a practical model accuracy using a wide range of quantization bitwidths, e.g., improving accuracy under 2-bit quantization by 15.85% on ImageNet compared to naive quantization.
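The adaptive idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the QuantPipe implementation: the function names, the candidate bitwidth set, and the latency-budget selection rule are all assumptions made for the example, and the clipping threshold is simply passed in as a parameter rather than computed by DS-ACIQ, whose search procedure is not described here.

```python
# Hypothetical sketch: pick a bitwidth from the measured bandwidth,
# then uniformly quantize a tensor (here a flat list) after clipping.

def choose_bitwidth(num_elems, bandwidth_bps, latency_budget_s,
                    candidates=(2, 4, 6, 8, 16, 32)):
    """Return the largest candidate bitwidth whose transfer time fits
    the latency budget; fall back to the smallest if none fits."""
    best = min(candidates)
    for bits in sorted(candidates):
        if num_elems * bits / bandwidth_bps <= latency_budget_s:
            best = bits
    return best

def quantize(values, bits, clip):
    """Clip values to [-clip, clip] and map them to integer codes
    in [0, 2**bits - 1] with a uniform step size."""
    levels = 2 ** bits - 1
    scale = 2 * clip / levels
    codes = [round((max(-clip, min(clip, v)) + clip) / scale) for v in values]
    return codes, scale

def dequantize(codes, scale, clip):
    """Map integer codes back to approximate real values."""
    return [q * scale - clip for q in codes]

if __name__ == "__main__":
    activations = [-1.0, -0.2, 0.35, 1.0, 3.0]  # 3.0 is an outlier that gets clipped
    for bandwidth in (100e6, 25e6):             # assumed link speeds in bits/s
        bits = choose_bitwidth(1_000_000, bandwidth, 0.1)
        codes, scale = quantize(activations, bits, clip=1.0)
        print(bits, codes, dequantize(codes, scale, clip=1.0))
```

When the assumed bandwidth drops, the selection rule falls back to a lower bitwidth, so the sender transmits fewer bits per element at the cost of a coarser quantization step. The choice of `clip` governs how much of that precision is spent on outliers, which is the quantity the abstract says DS-ACIQ tunes.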
