TY - JOUR
T1 - Performance Evaluation of INT8 Quantized Inference on Mobile GPUs
AU - Kim, Sumin
AU - Park, Gunju
AU - Yi, Youngmin
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
AB - During the past several years, the need for on-device deep learning has grown rapidly, and the performance of mobile GPUs has improved substantially. INT8 quantized inference has been actively studied as a viable approach to efficient on-device deep learning, but few frameworks currently support INT8 quantization on mobile GPUs. This paper presents a unified framework that integrates various INT8 quantization methods, such as symmetric, asymmetric, per-layer, and per-channel quantization, and discusses their impact on accuracy and efficiency on recent mobile GPUs. Moreover, we discuss the performance and accuracy of INT8 quantized Winograd convolution and propose an INT8 Winograd convolution with F(2×2, 3×3), where weight tensors are quantized to INT4 and input tensors to INT6. We evaluated the performance of the INT8 methods, including INT8 Winograd, for ResNet50, MobileNet-v1, and VGG16 on the Mali G52, G72, and G76 GPUs of the Odroid N2, Galaxy S9, and Galaxy Note 10+, respectively. INT8 quantized inference based on General Matrix Multiplication (GEMM) was 1.67× faster than FP32 GEMM for ResNet50 on the Mali G52, and was further accelerated by batch normalization folding and by the proposed INT8 Winograd convolution, achieving a 2.45× speedup in total with an accuracy loss of only 0.31 percentage points.
KW - INT8 Winograd convolution
KW - INT8 quantization
KW - On-device deep learning
KW - mobile GPU
UR - http://www.scopus.com/inward/record.url?scp=85121350023&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2021.3133100
DO - 10.1109/ACCESS.2021.3133100
M3 - Article
AN - SCOPUS:85121350023
SN - 2169-3536
VL - 9
SP - 164245
EP - 164255
JO - IEEE Access
JF - IEEE Access
ER -