1. Introduction
In recent years, convolutional neural networks (CNNs) [1] have been among the leading choices for tasks such as image classification, object detection, and semantic segmentation [2,3,4,5]. Deep learning models face an inherent trade-off between accuracy and computational cost, and energy consumption has recently drawn attention in the deep learning community amid concerns about climate change and carbon emissions. In an effort to reduce the power consumption of neural network models, spiking neural networks (SNNs) [6,7,8] have attracted significant research interest. In artificial neural networks (ANNs), the neuron model is inspired by the behavior of biological neurons, but the two do not behave identically. A biological neuron receives spike signals through its dendrites via synapses, accumulates the received signals into its membrane potential, emits a spike through its axon only when the membrane potential reaches its inherent threshold, and resets the membrane potential to the resting potential once a spike is emitted [6]. Spiking neurons are neuron models that, like biological neurons, receive spikes, maintain a membrane potential, and emit spikes. SNNs are neural networks whose neurons are spiking neurons. In SNNs, all signals transmitted between neurons are spikes; hence, a hardware implementation only needs to emit spikes when needed, without maintaining a constant voltage continuously over a period of time. This helps reduce operational power compared to conventional neural networks. Hardware devices for executing SNNs are called neuromorphic devices [9,10,11]. Once such neuromorphic devices become widespread, SNNs are expected to be deployed in resource-limited settings such as IoT devices, embedded systems, and portable devices. With this expectation, SNNs are often referred to as the third generation of neural networks.
In ANNs, there is essentially one kind of neuron, whose behavior is a weighted sum of its inputs followed by an activation function; common activation functions include the sigmoid, hyperbolic tangent, ReLU, GeLU, and Swish. In SNNs, by contrast, there are many kinds of spiking neurons, such as the Hodgkin–Huxley model, the leaky integrate-and-fire (LIF) model, the integrate-and-fire (IF) model, the soft-reset IF model, the spike response model (SRM), Izhikevich’s model, and the FitzHugh–Nagumo (FHN) model [6,7]. Owing to this diversity of spiking neurons and their behavioral dynamics, SNNs are more difficult to train than ANNs, and various training algorithms have been developed for them [6,8].
The primary differences between CNNs (or ANNs in general) and SNNs lie in the data representation and in the number of forward passes required for inference. In ANNs, the input and output signals of neurons are real-valued, and only a single feed-forward pass is required for inference. In SNNs, by contrast, the input and output signals are sparse spike trains over a certain time period, and inference requires multiple feed-forward passes over that period, whose length is known as the inference latency.
Figure 1 shows the behaviors of an ANN and an SNN, where the ANN processes real values and the SNN processes spikes.
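To make the latency notion concrete, the following minimal Python sketch contrasts single-pass ANN inference with T-step SNN inference; snn_step (a stateful SNN forward step) and encode (a spike encoder) are hypothetical placeholders, not an API defined in this paper:

```python
import torch

def ann_infer(model, x):
    # A single forward pass over real-valued activations.
    return model(x).argmax(dim=1)

def snn_infer(snn_step, encode, x, T=64):
    # T feed-forward passes over binary spike tensors; the class with
    # the highest accumulated output spike count is the prediction.
    spike_count = None
    for t in range(T):
        out = snn_step(encode(x, t))      # one time step of spikes
        spike_count = out if spike_count is None else spike_count + out
    return spike_count.argmax(dim=1)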
In image recognition tasks, compared with the resounding successes achieved by CNNs over the past decade, SNN training algorithms have shown limited performance, yet they remain an active research field. SNN training algorithms can be categorized into three major approaches: the bio-inspired learning approach [12,13,14,15,16,17,18], the spike-based backpropagation approximation approach [19,20,21,22,23], and the ANN–SNN conversion approach [24,25,26,27,28]. The biologically plausible learning approach attempts to train SNNs by adjusting weights according to local learning rules for synaptic strength, either in an unsupervised manner [12,13,14] or in a semi-supervised manner [15,16,17,18]. It exhibits a trade-off between biological plausibility and performance.
The spike-based backpropagation approximation approach [19,20,21,22,23] directly trains SNNs by approximating the error backpropagation algorithm, widely used for training traditional ANNs, so that it becomes applicable to spikes. Compared with the biologically plausible learning approach, algorithms in this category have generally shown better accuracy, but they require a higher computational budget and are less biologically plausible.
The ANN–SNN conversion approach [24,25,26,27,28] has proven promising for training deep SNNs. It first trains an ANN under certain constraints on the given training dataset and then converts the trained ANN into an SNN consisting of spiking neurons with appropriate firing thresholds. CNN models have been widely used as the ANNs for image recognition tasks. Existing CNN–SNN conversion algorithms [24,25,26,27,28] require a rather long inference latency and exhibit a trade-off between inference latency and accuracy. Motivated by these observations, we propose a new CNN–SNN conversion algorithm that reduces the conversion loss from a trained CNN to an SNN at low inference latency.
The proposed CNN–SNN conversion algorithm uses a threshold balancing technique that pays particular attention to inference latency. Experimental results on the MNIST dataset [29] show that the proposed method produces an SNN model whose accuracy of 99.33% even exceeds the 99.31% accuracy of its corresponding CNN model, at a low inference latency of 64 time steps. In addition, experimental results on the Fashion-MNIST [30] and CIFAR-10 [31] datasets show that the converted SNNs incur less conversion loss at low latency than other CNN–SNN conversion methods. Specifically, at a latency of 64 time steps, the proposed threshold balancing method reduced the conversion loss by approximately 10% and 8% compared with the methods in [25] and [26], respectively; at a latency of 128 time steps, the corresponding reductions were 45% and 30%.
The rest of this paper is organized as follows: the next section provides the foundations of the CNN–SNN conversion methodology and related works. Section 3 presents a new CNN–SNN conversion method with the proposed threshold balancing technique. The experimental results and further discussion are described in Section 4 and Section 5, respectively. The last section draws conclusions.
2. Foundations of CNN–SNN Conversion and Related Works
This section first presents the foundations of the CNN–SNN conversion approach for image recognition tasks. Then, it briefly discusses previous works and their limitations, which motivated our work.
Algorithm 1 shows the basic CNN–SNN conversion procedure, which is illustrated in Figure 2. First, a CNN with the designated constraints is trained by gradient descent on the given training dataset. Next, an SNN with the same architecture as the trained CNN is constructed, and its weights are assigned from the corresponding weights of the trained CNN. After that, the firing thresholds of the spiking neurons in the SNN are determined by a threshold balancing technique. Lastly, for inference with the SNN, the input data are encoded into spike trains, i.e., sequences of spikes with timing information.
Algorithm 1: Basic CNN–SNN conversion procedure.
Step 1. CNN training: Train a CNN with the designated constraints.
Step 2. Weight transfer: Transfer the weights from the trained CNN to an SNN with the same architecture.
Step 3. Threshold balancing: Assign firing thresholds to the spiking neurons of the SNN.
Step 4. SNN inference preparation: Encode the input data into spike trains that are amenable to the SNN.
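A minimal PyTorch-style sketch of Steps 2 and 3 is given below; make_snn, threshold_balance, spiking_layers, and v_threshold are hypothetical names, and the state-dict compatibility of the two models is an assumption, not a fixed API:

```python
import copy
import torch

@torch.no_grad()
def convert_cnn_to_snn(cnn, make_snn, threshold_balance, train_loader):
    # Step 2: weight transfer. The SNN mirrors the CNN architecture,
    # so the state dicts are assumed to be key-compatible.
    snn = make_snn()
    snn.load_state_dict(copy.deepcopy(cnn.state_dict()), strict=False)
    # Step 3: threshold balancing, delegated to the chosen technique
    # (activation-based, spike-norm, or the proposed method).
    thresholds = threshold_balance(cnn, snn, train_loader)
    for layer, theta in zip(snn.spiking_layers, thresholds):
        layer.v_threshold = theta   # hypothetical per-layer attribute
    return snn
```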
Diehl et al.’s method [24] takes the CNN–SNN conversion approach. It first trains a CNN model with the rectified linear unit (ReLU) activation function [32]. It then organizes an SNN model with the same architecture as the trained CNN, whose neurons are integrate-and-fire (IF) neurons [33] (please refer to Appendix A for the details of the IF neuron model). It uses an activation-based threshold balancing technique to determine the firing thresholds of the spiking neurons: the technique finds the maximum activation value at each layer of the trained CNN model when the whole training set is fed into it and then uses these maximum activation values as the firing thresholds of the corresponding layers of the SNN model. This technique is also known as data-based normalization. It has been observed that such a conversion requires the converted SNN to have a long inference latency, e.g., more than 500 time steps, to achieve a loss comparable to that of the corresponding CNN model on benchmarks such as MNIST; that is, decreasing the inference latency significantly increases the conversion loss. To mitigate this problem, Burkitt’s method [28] first determines the firing thresholds of the spiking neurons with the activation-based threshold balancing technique and then scales them by an empirically selected ratio. The conversion loss of the CNN–SNN conversion method is attributed to the following factors [24]:
The first factor stems from the difference in input integration between the CNN model and the SNN model: in the CNN, the input values x are floating-point values, while in the SNN, the inputs are represented by binary values {0,1} at each time step (see the encoding sketch after this list).
The second factor comes from the difference in activation behavior between the neurons with the ReLU activation of the CNN model and the IF neurons of the SNN model.
The last factor lies in the threshold balancing technique itself. If the firing threshold at a layer of the SNN is set too high, most neurons fire at a low rate under low latency, and such low-firing neurons cannot adequately contribute to information transmission in the SNN model.
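To illustrate the first factor, a common rate-coding scheme converts each real-valued input x in [0, 1] into a Bernoulli (Poisson-like) spike train whose per-step firing probability equals x. The sketch below is a minimal example of such an encoder, not necessarily the exact encoder used in the cited methods:

```python
import torch

def poisson_encode(x, T):
    # x: tensor of input intensities scaled to [0, 1].
    # Returns a (T, *x.shape) tensor of {0, 1} spikes whose
    # time-averaged firing rate approximates x.
    return (torch.rand(T, *x.shape) < x).float()
```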
To reduce the conversion loss caused by the difference in input integration between the CNN and the SNN, a threshold balancing technique such as spike-based normalization (also known as spike-norm) [25,26] sets the firing threshold at each layer to the maximum weighted input summation obtained from Poisson input. However, spike-norm still requires the converted SNN to run for a sizeable number of time steps to reach a conversion loss comparable to that of the corresponding CNN model; the assigned thresholds remain so high that most neurons end up with a low firing rate at low latency. In addition, the Poisson characteristics of the input encoding impose the following limitations on spike-norm (a sketch of the procedure follows the list below):
The threshold at each layer may change across trials due to the probabilistic nature of the input Poisson spike trains. This change in firing thresholds can affect the performance of the converted SNN model; that is, the accuracy of the converted SNN differs from trial to trial.
For a very small input value, generating a spike train at low latency is challenging, which may cause information transmission loss in the SNN model.
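As an illustration only, the spike-norm idea can be sketched as follows; the weighted_sum accessor and v_threshold attribute are assumed names, and neuron-state bookkeeping over time steps is omitted for brevity:

```python
import torch

@torch.no_grad()
def spike_norm(snn_layers, poisson_batches):
    # Process layers front to back: earlier thresholds are frozen
    # before the next layer's maximum weighted input is estimated.
    for i, layer in enumerate(snn_layers):
        max_z = 0.0
        for spike_train in poisson_batches:    # Poisson-encoded inputs
            s = spike_train
            for prev in snn_layers[:i]:
                s = prev(s)                    # spikes from fixed layers
            z = layer.weighted_sum(s)          # pre-threshold input sum
            max_z = max(max_z, z.max().item())
        layer.v_threshold = max_z
```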
On the other hand, to reduce the conversion loss caused by the difference in activation behavior between the CNN model and the SNN model, Han et al.’s method [26] uses soft-reset IF neurons instead of hard-reset IF neurons in the SNN model.
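The behavioral difference is easiest to see in the per-time-step update rules. The following minimal sketch (an illustrative implementation, not Han et al.’s exact formulation) contrasts the hard-reset IF neuron with the soft-reset variant:

```python
def if_step(v, z, theta):
    # Hard-reset IF: on firing, the membrane potential is reset to 0,
    # discarding the surplus charge v - theta.
    v = v + z
    if v >= theta:
        return 0.0, 1   # (new potential, spike)
    return v, 0

def soft_if_step(v, z, theta):
    # Soft-reset IF: on firing, theta is subtracted, so the surplus
    # charge carries over -- a closer analogue of ReLU's linearity.
    v = v + z
    if v >= theta:
        return v - theta, 1
    return v, 0
```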
Although existing CNN–SNN conversion methods have made progress in minimizing the conversion loss from a trained CNN to an SNN, they still require a rather high inference latency, which is strongly affected by the adopted threshold balancing technique. We propose a CNN–SNN conversion method with a new threshold balancing technique that reduces the inference latency while maintaining performance.
5. Further Discussion
Over the past several years, SNNs have attracted significant research interest due to their energy efficiency. Recent work on training SNNs aims not only to improve accuracy but also to minimize power consumption. As mentioned in Section 1, SNN training algorithms can be categorized into the bio-inspired learning approach [12,13,14,15,16,17,18], the spike-based backpropagation approximation approach [19,20,21,22,23], and the ANN–SNN conversion approach [24,25,26,27]. The biologically plausible learning approach generally uses local learning rules on shallow networks, which restricts its scalability and expressive power. The spike-based backpropagation approximation approach uses variants of the error backpropagation algorithm that approximate the derivatives of spike signals with surrogate functions. Compared with the biologically plausible learning approach, the approximation approach has generally shown better accuracy, but it requires a higher computational cost and is difficult to apply to deeper SNNs. The ANN–SNN conversion approach, including the CNN–SNN conversion approach, trains SNNs indirectly by transferring the weights of a trained ANN with the same architecture. Because the weights are trained in the corresponding ANN or CNN model, the conversion approach is far less constrained by the number of layers than the bio-inspired and spike-based backpropagation approximation approaches. Hence, the ANN–SNN conversion approach scales well with the model architecture, yet it usually requires a rather long inference latency and exhibits a trade-off between inference latency and accuracy. In the conversion approach, the determination of the threshold values of the spiking neurons is one of the key factors that strongly affects the performance of the converted SNN. The proposed threshold balancing method determines a threshold value for each channel of the convolutional layers. Sengupta et al.’s method [25] takes a similar approach to ours, but it does not take the channels into account when determining the threshold values. The proposed threshold balancing method has shown good performance at low latency compared with the existing methods [24,25,26,27,28].
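As an illustration of the channel-wise idea (a simplified sketch that assumes the post-ReLU feature maps of one convolutional layer have been collected into a single tensor; this is not the exact procedure of Section 3):

```python
import torch

@torch.no_grad()
def channelwise_thresholds(acts):
    # acts: (N, C, H, W) post-ReLU activations of one convolutional
    # layer, gathered over the training set.
    # Returns one firing threshold per output channel, shape (C,),
    # instead of a single threshold for the whole layer.
    return acts.amax(dim=(0, 2, 3))
```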
From the experiments with a specific SNN architecture on the MNIST and Fashion-MNIST datasets, we observed that the proposed conversion method produces SNN models with better performance at low latency. The experiments with deep SNN models on the CIFAR-10 dataset showed that the conversion method generates deep SNNs comparable to those of other conversion techniques.
Table 5 shows the performance of the SNN models on the MNIST dataset surveyed in the literature. It reports the accuracies along with the allowed inference latency for the SNN models, which may have different architectures from one another. It also lists the neural encoding method, the training approach, and the learning type (supervised, unsupervised, or semi-supervised).
As observed in Table 5, the bio-inspired learning approach usually produces SNNs with lower accuracy than the other training approaches [12,16]. Although Lee et al.’s method [21] obtained an SNN model with better accuracy than ours, it requires a much higher training cost and a higher inference latency. One reason for our slightly inferior accuracy compared with their model is that the accuracy of our trained CNN model (99.31%) is lower than that of their trained SNN model (99.59%). At a latency of 64 time steps, our method produced an SNN model that outperforms all the other methods; even at a latency of only four time steps, it produced an SNN model with comparable performance.
To evaluate the effects of the threshold balancing techniques and the spiking neuron models, we conducted experiments on the MNIST dataset with the following ten combinations: the proposed balancing technique + soft-IF, spike-norm + soft-IF, spike-norm + IF, act-norm channelwise + IF, act-norm + IF, robust-norm + soft-IF, act-norm + soft-IF, robust-norm + IF, the proposed balancing technique + IF, and act-norm channelwise + soft-IF. Here, IF denotes the integrate-and-fire neuron shown in Figure A1; soft-IF denotes the soft-reset IF neuron shown in Figure A2; spike-norm denotes the spike-based normalization technique [25], which uses the maximum of the weighted sums of spikes over the latency; act-norm channelwise denotes the threshold balancing technique [27] that uses the channel-wise maximum activations of the ANN model; and robust-norm denotes the threshold balancing technique [28] that uses a scaled maximum activation of the ANN model.
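For reference, the layer-wise act-norm and robust-norm variants reduce to simple statistics of the same activation tensor; the sketch below is illustrative, and the robust-norm scale of 0.8 is an arbitrary placeholder, not the empirically selected ratio of [28]:

```python
import torch

@torch.no_grad()
def act_norm_threshold(acts):
    # Layer-wise maximum activation (one scalar threshold per layer).
    return acts.max()

@torch.no_grad()
def robust_norm_threshold(acts, scale=0.8):
    # Scaled maximum activation; the scale ratio is chosen empirically
    # (0.8 here is purely illustrative).
    return scale * acts.max()
```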
Figure 12 shows the performance of each threshold balancing technique and neuron model pair for the same SNN architecture on the MNIST dataset. Please refer to Appendix G for more detail.
As seen in Figure 12, most experiments showed better performance for the combinations with the soft-reset IF neuron model than for those with the IF neuron model. This seems attributable to the soft-reset IF neuron model approximating the ReLU activation of the CNN better than the IF neuron does. The combination of the proposed threshold balancing technique and the soft-reset IF model showed the best performance across the examined latencies.