Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference

Feb 22, 2021

Benjamin Hawks, Javier Duarte, Nicholas J. Fraser, Alessandro Pappalardo, Nhan Tran, Yaman Umuroglu

Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latencies to higher data throughputs to more efficient energy consumption. Two popular techniques for reducing computation in neural networks are pruning, which removes insignificant synapses, and quantization, which reduces the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra-low-latency applications targeting high energy physics use cases, although the techniques developed for this study have potential application across many other domains. We study various configurations of pruning during quantization-aware training, which we term "quantization-aware pruning", and the effect of techniques like regularization, batch normalization, and different pruning schemes on multiple computational or neural efficiency metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, in terms of computational efficiency, quantization-aware pruning typically performs similarly to or better than standard neural architecture optimization techniques. While the accuracy for the benchmark application may be similar, the information content of the network can vary significantly based on the training configuration.
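As a rough illustration of the two techniques named in the abstract, the NumPy sketch below applies a magnitude-based pruning mask to a weight matrix and then a symmetric uniform quantizer to the surviving weights. It is not the authors' implementation: the function names, the global magnitude criterion, and the max-abs scaling are assumptions chosen for brevity, and a real quantization-aware training loop would typically apply these operations inside the forward pass at every training step.

```python
import numpy as np

def magnitude_prune_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of weights
    with the smallest absolute value (hypothetical helper)."""
    k = int(round(sparsity * weights.size))
    mask = np.ones(weights.size, dtype=weights.dtype)
    if k > 0:
        # Flat indices sorted by ascending |w|; drop the first k.
        order = np.argsort(np.abs(weights), axis=None)
        mask[order[:k]] = 0
    return mask.reshape(weights.shape)

def fake_quantize(weights, bits):
    """Symmetric uniform quantization to `bits` bits, returned as floats
    (assumed scheme: scale set by the largest absolute weight)."""
    levels = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(weights))
    if max_abs == 0:
        return weights.copy()
    scale = max_abs / levels
    return np.clip(np.round(weights / scale), -levels, levels) * scale

def quantized_pruned_weights(weights, mask, bits):
    """Prune first, then quantize, so pruned weights stay exactly zero."""
    return fake_quantize(weights * mask, bits)

weights = np.array([[0.1, -0.8], [0.05, 0.4]])
mask = magnitude_prune_mask(weights, sparsity=0.5)
effective = quantized_pruned_weights(weights, mask, bits=4)
```

Masking before quantizing keeps the pruned entries at exactly zero, since a symmetric quantizer maps zero to zero.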

* 22 pages, 7 Figures, 1 Table
