Design of Efficient Floating-Point Convolution Module for Embedded System
Abstract
1. Introduction
- The principle of the convolution operation and the BF16 format are studied, and a dedicated FP32-to-BF16 quantization unit is proposed based on the minimum-error quantization algorithm, reducing memory and bandwidth demand while maintaining the accuracy of the results;
- An efficient serial-to-matrix conversion unit and a BF16 convolution operation unit are proposed. To allow the convolution module to run at a high frequency, we optimize the critical path by eliminating the overflow- and underflow-handling logic cells and optimizing the mantissa multiplication;
- An analysis of the data error distribution of different data formats (e.g., INT8, INT16, FP16, and BF16) is performed;
- A BF16 convolution module using the TSMC 90 nm library is synthesized to evaluate its area consumption, and implemented on the Xilinx PYNQ-Z2 FPGA board to evaluate its performance.
2. Related Work and Background
2.1. Related Work
2.2. Background
2.2.1. Convolution Operation
2.2.2. BF16
2.2.3. Quantization
- In comparison with the FP32 networks, the speed of the convolution operation is greatly improved after quantization;
- In comparison with the FP32 networks, the memory taken up by the weights of the BF16 networks is reduced by 50%, effectively improving the data-processing capability;
- BF16 can reduce the hardware overhead while largely preserving accuracy.
3. The Hardware Architecture of BF16 Convolution Module
3.1. Quantization Unit: FP32 to BF16
- If the absolute difference between a number and its nearest integer is less than 0.5, the number is rounded to that nearest integer;
- If the number lies exactly halfway between two integers (the difference is exactly 0.5), the result depends on the parity of its integer part: an even integer part is rounded towards zero, and an odd integer part is rounded away from zero. In either case, the rounded result is an even integer (round-half-to-even).
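This round-to-nearest-even policy can be modeled in software by operating directly on the FP32 bit pattern. The following is a behavioral Python sketch, not the RTL; the function name is chosen here for illustration, and exponent overflow/underflow handling is omitted:

```python
import struct

def fp32_to_bf16(x: float) -> int:
    """Quantize an FP32 value to a 16-bit BF16 pattern by applying
    round-to-nearest-even to the truncated lower 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 16) & 1            # LSB of the retained upper half
    # +0x7FFF rounds every non-tie case correctly; the extra +lsb
    # breaks exact ties toward the even mantissa.
    return ((bits + 0x7FFF + lsb) >> 16) & 0xFFFF
```

For example, 1.00390625 lies exactly halfway between two BF16 values and ties down to the even pattern 0x3F80, while 1.01171875 ties up to 0x3F82.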
3.2. Serial-to-Matrix Conversion Unit
- Data from INPUT_FEATURES are fed into the first row, first column of SHIFT_BUFFER;
- SHIFT_BUFFER performs a right-shift operation on each row;
- The data in the last column of SHIFT_BUFFER are passed to the first column of CONV_BUFFER;
- CONV_BUFFER performs a right-shift operation on each row;
- The data in the last column of CONV_BUFFER are passed to the SHIFT_BUFFER of the next row.
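Functionally, the two buffers turn a serial input stream into the k × k matrices consumed by the convolution unit. The window extraction they implement is equivalent to the following Python sketch (the function name and the 2-D list representation are illustrative assumptions, not the hardware structure):

```python
def sliding_windows(feature, k=3):
    """Yield every k x k window of a 2-D feature map, row by row:
    the same sequence of matrices the serial-to-matrix unit
    presents to the convolution operation unit."""
    rows, cols = len(feature), len(feature[0])
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            # Slice out the k x k sub-matrix anchored at (r, c).
            yield [row[c:c + k] for row in feature[r:r + k]]
```

For a 4 × 4 input and k = 3 this produces the expected (4 − 3 + 1)² = 4 windows.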
3.3. Convolution Operation Unit
3.3.1. Multiplication Unit
3.3.2. Addition Unit
3.3.3. Adder Tree
- If the number of addends is even, the structure is the same as a classic adder tree;
- If the number of addends is odd, the unpaired addend is passed directly to the next stage.
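This reduction with odd leftovers can be sketched behaviorally in Python (an illustrative model, not the pipelined hardware):

```python
def adder_tree_sum(addends):
    """Sum a list stage by stage, adding elements pairwise.
    When a stage holds an odd number of addends, the unpaired
    last addend is forwarded unchanged to the next stage."""
    stage = list(addends)
    while len(stage) > 1:
        nxt = [stage[i] + stage[i + 1] for i in range(0, len(stage) - 1, 2)]
        if len(stage) % 2 == 1:       # odd count: carry the leftover
            nxt.append(stage[-1])
        stage = nxt
    return stage[0]
```

For the nine products of a 3 × 3 convolution window, the stage widths are 9 → 5 → 3 → 2 → 1.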
4. Experiments and Results
4.1. Error Comparison
4.2. Resource Consumption Comparison
4.3. Performance Comparison
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
Conflicts of Interest
Data Format | u (Unit Roundoff) | x_min^s (Smallest Subnormal) | x_min (Smallest Normal) | x_max (Largest Finite)
---|---|---|---|---
BF16 | 3.91e−03 | 9.18e−41 | 1.18e−38 | 3.39e+38
FP16 | 4.88e−04 | 5.96e−08 | 6.10e−05 | 6.55e+04
FP32 | 5.96e−08 | 1.40e−45 | 1.18e−38 | 3.40e+38
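The unit roundoff values in the table follow directly from each format's precision: with t stored mantissa bits plus one implicit bit, u = 2^−(t+1) under round-to-nearest (t = 7 for BF16, 10 for FP16, 23 for FP32). A one-line check (helper name is illustrative):

```python
def unit_roundoff(stored_mantissa_bits: int) -> float:
    """u = 2^-(t+1): half the spacing of the floats in [1, 2) for a
    format with t stored mantissa bits and one implicit bit."""
    return 2.0 ** -(stored_mantissa_bits + 1)
```

This reproduces the table: 2^−8 ≈ 3.91e−03 (BF16), 2^−11 ≈ 4.88e−04 (FP16), 2^−24 ≈ 5.96e−08 (FP32).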
f [7:0] | p (3 bits) | v |
---|---|---|
1XXX XXXX | 000 | 1 (Valid) |
01XX XXXX | 001 | 1 (Valid) |
001X XXXX | 010 | 1 (Valid) |
0001 XXXX | 011 | 1 (Valid) |
0000 1XXX | 100 | 1 (Valid) |
0000 01XX | 101 | 1 (Valid) |
0000 001X | 110 | 1 (Valid) |
0000 0001 | 111 | 1 (Valid) |
0000 0000 | XXX | 0 (Invalid) |
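The table defines an 8-bit priority encoder (leading-zero detector) that locates the most significant set bit for normalization. A behavioral Python model of the truth table (not the gate-level encoder):

```python
def priority_encode(f: int) -> tuple:
    """8-bit priority encoder: p is the leading-zero count of f[7:0]
    (the 3-bit code) and v = 1 iff any bit of f is set, matching the
    truth table above."""
    for p in range(8):
        if f & (0x80 >> p):           # scan from the MSB downward
            return p, 1
    return 0, 0                       # all-zero input: v = 0 (invalid)
```

For the all-zero input the table marks p as don't-care (XXX); this model simply returns 0.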
Parameters | INT16 @ 400 MHz | FP16 @ 400 MHz | BF16 @ 400 MHz | INT16 @ 800 MHz | FP16 @ 800 MHz | BF16 @ 800 MHz | INT16 @ 1 GHz | FP16 @ 1 GHz | BF16 @ 1 GHz
---|---|---|---|---|---|---|---|---|---
Total Area (μm²) | 90,867.87 | 87,868.36 | 71,576.06 | 93,280.32 | 108,443.67 | 80,367.14 | - | - | 85,450.98
- Quantization | 599.76 | 161.58 | 163.70 | 599.76 | 161.58 | 163.70 | - | - | 163.70
- Serial-to-matrix conversion | 14,063.31 | 14,063.31 | 14,063.31 | 14,136.70 | 14,136.70 | 14,136.70 | - | - | 14,136.70
- Convolution operation | 76,204.80 | 73,643.47 | 57,349.05 | 78,543.86 | 94,145.39 | 66,066.74 | - | - | 71,150.58
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, J.; Zhou, X.; Wang, B.; Shen, H.; Ran, F. Design of Efficient Floating-Point Convolution Module for Embedded System. Electronics 2021, 10, 467. https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10040467