1. Introduction
Neuromorphic computing, which imitates the principle behind biological synapses with a high degree of parallelism, has recently emerged as a very promising candidate for novel and sustainable computing technologies [1]. Among these technologies, neuromorphic systems based on hardware neural networks (HNNs) implemented with memristive devices stand out as a promising solution for building energy-efficient computing frameworks for solving most of the tasks carried out in machine learning [2,3,4]. This is because memristors (1) behave as resistors with memory that are electrically programmable and match the functionality of the connections in a software neural network and (2) are efficiently integrated thanks to the crossbar array structure (i.e., aggressive size scaling is possible) [1,5,6].
Focusing on specific implementations that use memristors based on the crossbar array structure, it is worth noting first that this is a common approach found in the literature [7]. By using this structure, vector-matrix multiplications, which are a fundamental building block in all types of neural networks, are efficiently implemented by following an analog approach (i.e., by adding current flows). The efficiency of the operation is both in terms of (1) power consumption, because the involved currents are small, and (2) computational time [8], because the whole operation is performed by reading the outputs of the array. Note that a vector-matrix operation in software has a computational time that scales with the product of the matrix dimensions (O(nm) for an n × m matrix), and that individual memristors in the crossbar array play the role of the matrix coefficients or, in terms of neural networks, the weights. The interested reader can find in [9] a specific sound localization application based on memristor arrays, where energy consumption is reduced by a factor of 184 with respect to an existing Application-Specific Integrated Circuit (ASIC) design.
Memristor-based networks can be trained by offline (or ex situ) or online (or in situ) learning methods. In the first case, which is the focus of this manuscript, the weights are calculated on a precursor software-based network and then imported sequentially into the crossbar circuit. In the second case, training is implemented in situ in hardware and only for small neural networks [10], so the weights are adjusted in parallel, which is significantly more demanding [5]. In both cases, a high-precision weight import is required to implement complex networks and achieve the expected performance when the network is operating. However, various properties of memristors are known to negatively affect the performance of neuromorphic systems [1]. Specifically, the conductance response of any real nonvolatile memory (NVM) device exhibits non-idealities that can surface in the form of unreliable network performance. These imperfections include non-linearity, stochasticity, varying maxima, asymmetry between increasing and decreasing responses, and unresponsive devices at low or high conductance [11,12,13,14]. For example, most memristive devices exhibit a nonlinear weight update, where the conductance gradually saturates [1]. In addition, from the perspective of high-performance computing, recent trends show a growing interest in hardware capable of accelerating both training and inference in neural networks, especially when dealing with deep learning schemes. That is the case, for example, with many Field-Programmable Gate Array (FPGA) implementations [15], which emphasize quantized neural network designs due to the nature of FPGA devices. In particular, binary [16] and also ternary [17] implementations have been raised as very interesting options. The main motivation of this alternative approach is the reduction of both power consumption and the required FPGA resources (area). Memristor-based neural networks can also benefit from these power and area reductions. However, the operational principles of memristors are completely different from those of FPGAs, and ultimately all HNN solutions require solving very specific challenges, since a straightforward conversion from the ideal (or software-based) model does not exist. As commented above, one of the challenges in memristor-based neural networks, which operate in the analog domain, is the development of reliable weight implementation due to the variability that is common to all nano-electronic devices but is particularly significant in memristors [18].
In that direction, the authors of [4] stated that many issues still need to be resolved at the material, device, and system levels to simultaneously achieve high accuracy, low variability, high speed, energy efficiency, a small area, low cost, and good reliability. Thus, the first step is to obtain memristor-based networks that are competitive with software-based networks. In order to achieve that, we need to cope with the hardware. This can be accomplished at the hardware level with more advanced mitigation techniques or at the algorithmic level by taking the non-idealities into account. In that sense, the authors of [10] presented a mask technique to capture the sneak path problem, stating that any kind of training incorporating knowledge of the crossbar array behavior will likely improve the accuracy of memristor-based networks significantly. This idea has been explored by several recent works following different strategies. In [19], for instance, a tailored training method was proposed to address the voltage drop due to the interconnect wire resistance: the voltage drop is estimated in order to recompute the weights at the forward propagation stage during training. In [20], the authors considered the mapping of neural network weights by analyzing the parasitic resistance effects at different areas of the crossbar array; by identifying the hardware cells providing higher accuracy as "safe zones", adaptive weight allocation was performed to properly map the weights to the hardware. In [21], the authors mathematically modeled the sensitivity of the neural network output with respect to hardware impairments and then adapted the cost function of the training algorithm to include this sensitivity as an additional term (i.e., the weights were calculated to minimize the impact of hardware impairments as well).
The aim of our work is also to consider hardware impairments during the design and training of the memristor-based neural network. To do so, we depart from software models that emulate the behavior of the memristor-based neural network. More specifically, this work is an extension of the work in [22], and we consider building ternary networks using crossbar arrays. The goal is to achieve performance close to that of the software models, even when we consider a simple configuration of the memristors operating as ON/OFF switches. It is worth noting that we adopt ternary weights because they have stronger expressive abilities than their binary counterparts [17]. As shown in Section 2, a ternary option does not modify the proposed crossbar array architecture, and the hardware remains the same (i.e., two conductance levels at the memristor weights).
The main contributions of this work are as follows:
The behavior of a ternary memristor-based HNN adopting crossbar arrays is emulated;
Practical configuration strategies to tune the crossbar array structure from a system-level designer point of view are proposed;
An offline (ex situ) training mechanism is derived to optimize the neural network’s weights by minimizing the impact of conductance imperfections in the memristors’ hardware.
In what follows, Section 2 defines the problem under study, including the crossbar array architecture that we consider to emulate ternary networks. Section 3 encompasses the configuration issues as well as the algorithm considered to fix the memristors to either the ON or the OFF status. Finally, Section 4 provides the experimental results, and Section 5 concludes the paper.
2. Scenario Description and Assumptions
Let us consider a generic feedforward neural network (FFNN) that is dedicated to a classification task, as depicted in
Figure 1. The network has inputs
, where
stands for the matrix transpose and the first input is manually set to 1 in order to accommodate the bias term. The FFNN operates as described next. First, the inputs are linearly combined by means of a matrix multiplication with
, thus generating the values
, i.e.,
(see
Figure 1). Superindex 1 here stands for the first layer of the network. The values in
go through a nonlinear function
f (typically the sigmoid, the hyperbolic tangent or the rectified linear unit) to generate the activations at the first layer (i.e.,
with
This process is repeated at the subsequent layers of the FFNN; for example, the output at the second layer is computed from
by first computing
and then transforming the values in
by using the non-linear function
f again. Finally, at the last layer, also called the output layer,
f is replaced by the softmax function. In this case, the output is normalized (i.e.,
), and the value of
indicates our confidence level in that
corresponds to the
ith class. Therefore, the network takes the output with the largest value as the resulting classification.
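As an illustration, the forward pass just described can be sketched in a few lines. This is a minimal sketch, not the authors' code: the function and variable names are ours, and only the first input carries the bias term, as in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

def ffnn_forward(x, weights):
    """Forward pass of the FFNN described above.

    x       : input vector; the constant 1 accommodating the bias term
              is prepended here
    weights : list of weight matrices, one per layer
    Hidden layers apply the nonlinear function f (here the sigmoid); the
    output layer uses softmax, so the outputs are normalized confidence
    levels per class.
    """
    a = np.concatenate(([1.0], x))
    for W in weights[:-1]:
        a = sigmoid(W @ a)          # linear combination + nonlinearity
    return softmax(weights[-1] @ a)

# The resulting classification is the output with the largest value:
# predicted_class = np.argmax(ffnn_forward(x, weights))
```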
All the operations described above are computed in floating-point arithmetic. We will refer to it as the software implementation. In this work, we employ the crossbar array to compute the vector-matrix multiplications at the neural network layers (i.e.,
).
Figure 2 depicts the operational principle of a crossbar array. Let us first consider the memristor in its linear zone, where it can be modeled simply as a resistor of conductance value
g (adjustable) so that the memristor current is
when the voltage
v is applied. By scaling this to a crossbar of the size
and arranging the conductance values in the matrix
, we have
with
and
(see
Figure 2). In other words, the collected currents at the output of the crossbar array are in fact a vector-matrix multiplication between the input voltages
and the memristor conductances in
.
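The crossbar read-out described above can be emulated numerically. The following is an illustrative sketch under the ideal, linear-regime assumption of this section (no wire resistance, no conductance noise); the function name is ours.

```python
import numpy as np

def crossbar_output_currents(v, G):
    """Ideal read-out of a crossbar array in the linear (ohmic) regime.

    v : input voltages applied to the rows, shape (n,)
    G : memristor conductances at the cross-points, shape (n, m)
    Each column wire collects the sum of the currents i = g * v flowing
    through its memristors, so the m output currents equal the
    vector-matrix product of the input voltages and the conductance
    matrix.
    """
    return v @ G  # the analog multiply-accumulate, computed numerically
```

For instance, v = [1, 2] (V) with G = [[1, 0], [0.5, 1]] (arbitrary conductance units) yields the output currents [2, 2].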
Let us briefly comment on the linearity of the memristor we are considering. According to the memdiode model [
23], the I–V characteristic of a memristor reads as
where
is an increasing function of the parameter
(the memory state),
R is the series resistance, and
is a fitting parameter. Notice that Equation (
1) is an implicit equation for the current
I. Let us consider two extreme cases. The first is the high-resistance state (HRS) regime (with
). In this case, for low voltages, we have
, and the potential drop across the series resistance can also be neglected such that
Second, for the low-resistance state (LRS) (with
), the difference is that the potential drop across
R cannot be disregarded, and so
which can be solved as
The linear regime of the memristor corresponds to a case in between these two extreme situations so that the corresponding conductance reads as
which is independent of the voltage (i.e., it behaves as a simple resistor). This is the regime we are considering in our paper.
From an FFNN application point of view, the weights in
in the software model are equivalent to the conductances in
. Usually, both positive and negative weights are represented, even when we consider only two possible values as in binary neural networks [
16]. We may add a third possible value, a zero, as in ternary networks so that a particular input or activation does not affect the net outputs. As shown below, this ternary option does not modify the proposed hardware architecture (based on two crossbar arrays), as the zero weight is built naturally by combining the same conductance levels with opposite polarization. We are also exploiting the advantage of having a higher granularity when compared with its binary counterpart, as proven in [
17]. Note, however, that some differences between the software model (i.e., complementary metal–oxide–semiconductor (CMOS)-based) and the memristor-based model arise. We next list the considerations in this paper:
We need to transform the output currents at the crossbar array to voltages by means of I-to-V converters. The scale factor of the I-to-V converters is defined as .
The input voltages to the different layers shall be in the linear zone of the memristor (i.e., in the range ). Therefore, we need to scale both the inputs and the activations, because these are the inputs to the next network layers.
We use the sigmoid as the nonlinear function f, which ranges from 0 to 1. Therefore, a scale factor of an amplitude equal to is required.
Memristors are set to either LRS, where the conductance is set to , or HRS, where the conductance is set to (ON/OFF).
Since the conductance values are strictly positive, a single crossbar array cannot emulate both positive and negative weights, as we have in the software model. To overcome this, we need a second crossbar array that considers the negative weights as depicted in
Figure 3. Equivalently, the value of each weight in the software model
is emulated by the combination
, where the superindexes + and − distinguish the first and second crossbar arrays at the
jth layer, respectively.
The memristors are programmed ex situ; that is, we first compute in the software the weights of the memristor-based neural network (considering non-idealities), and once obtained, we fix the conductances in the memristors. From that moment on, the crossbar arrays remain unchanged.
The memristors are programmed to either or , but the conductance values actually written include a random additive component. In particular, , where and . and represent the variances of the conductance in the HRS and LRS, respectively. We assume the random additive components are uncorrelated among the memristors.
We consider
to emulate the positive weight, say
,
to emulate the negative weight, say
, and
to emulate the null weight.
Table 1 shows the set-up of the memristors in the positive and negative crossbar arrays and the corresponding weights. Alternatively, the null weight can be
, too. Note that our first option reduces the current and thus the power consumption.
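The mapping of the three weight values onto the two crossbar arrays, following the set-up of Table 1, can be sketched as follows (an illustrative sketch with our own function names; the emulated weight is proportional to the difference of the paired conductances):

```python
import numpy as np

def map_ternary_weights(W_t, g_on, g_off):
    """Map ternary weights {-1, 0, +1} onto the two crossbar arrays.

    A positive weight programs the device in the positive array to g_on
    and its partner in the negative array to g_off; a negative weight
    does the opposite; a null weight leaves both devices at g_off (the
    low-current option mentioned in the text). The emulated weight is
    proportional to (g_plus - g_minus).
    """
    g_plus = np.where(W_t > 0, g_on, g_off)
    g_minus = np.where(W_t < 0, g_on, g_off)
    return g_plus, g_minus
```

A quick sanity check: for the weights [+1, -1, 0], the difference (g_plus - g_minus) is positive, negative and exactly zero, respectively.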
The goals in this work are the following:
To adjust the conductance values in and (i.e., to decide which memristors are set to and which are set to );
To adjust the value of , taking into account that memristors can be configured in the range . Note that all the memristors are programmed to the same value;
To adjust the value of ;
To consider conductance randomness in the training process.
Note that we consider devices operating in the linear regime (i.e., in the low-voltage region), and thus nonlinearities in the I–V characteristic can be disregarded [
23]. Beyond this point, the conductance of the devices may change as we move to the programming region, which is out of the scope of this work. Aside from that, line resistance, which does not affect the linearity of the devices, may also be considered, and the synaptic weights probably need to be recalculated because of the parasitic potential drops. If the devices operate in the low-voltage regime and the array is not too large, these voltage drops can be disregarded as well. This ultimately depends on the integration technology.
The next section describes the algorithm developed for the ex situ training of the memristor-based FFNN.
3. Proposed Algorithm
In this section, we consider the equivalent software model in
Figure 1 in order to train and configure our memristor-based FFNN, depicted in
Figure 3.
3.1. Training of Quantized Neural Networks
Training of the resulting quantized neural network is accomplished using the so-called backpropagation algorithm as described in [
16]. The idea is simple: the forward pass in the backpropagation applies the quantization, whereas the backward pass computes the gradients as usual in order to update the weights. Stochastic and efficient optimization is accomplished by randomly shuffling the data and by training consecutively on small subsets of the data, respectively. Algorithm 1 shows the steps of the training process.
Algorithm 1 Algorithm for training a quantized network.
Input: Batch of training examples and labels
Output:
Initialization
1: Randomly initialize the weights at the J layers in the FFNN
LOOP Process
2: for to do
3: for to do
4: (q is defined in Equation (6) below)
5: Forward propagation: compute network activations and outputs using
6: Backward propagation: use to compute gradients
7:
8: end for
9: end for
10: return
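The core idea of Algorithm 1 (after [16]) can be sketched for a single weight matrix as follows. This is an illustrative sketch with our own function names, not the authors' code: the forward and backward passes use the quantized weights, while the update is applied to the underlying full-precision weights, which are kept across iterations.

```python
import numpy as np

def train_step_quantized(W, x, target, quantize, grad_fn, lr=0.01):
    """One iteration of the quantized-training scheme of Algorithm 1.

    quantize : quantization function, e.g., the q of Equation (6)
    grad_fn  : returns dLoss/dW evaluated at the quantized weights
    """
    W_q = quantize(W)                 # forward pass sees quantized values
    grad = grad_fn(W_q, x, target)    # backward pass as usual
    return W - lr * grad              # update the full-precision weights
```

For example, with a linear model and squared loss, `quantize=np.sign` and `grad_fn = lambda Wq, x, t: 2 * (Wq @ x - t) * x` reproduce this straight-through behavior.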
3.2. Ternarization
We considered the following quantization function
, which is defined as
We considered two options to fix
. The first one was to set it to a fixed value. The second one was to try to optimize the value of
according to the current weights at time
t in
so that
was updated at each iteration of the algorithm. We followed the work in [
17] to adjust the value of
as
where
is the all-ones column vector and
is the total number of weights in the FFNN. The aim was to adapt the threshold to the current distribution of the weights. Note that
is the same for all network layers in our work, although different thresholds per layer could also be considered.
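The ternarization with the adaptive threshold can be sketched as below. The sketch is ours: the 0.7 factor is the constant suggested in [17], which we assume here, and a single threshold is shared by all layers, as in the text.

```python
import numpy as np

def ternarize(W, delta):
    """Quantization to {-1, 0, +1} with a symmetric threshold delta."""
    return np.where(W > delta, 1.0, np.where(W < -delta, -1.0, 0.0))

def adaptive_threshold(weights):
    """Threshold adapted to the current distribution of the weights.

    Computes a fraction of the mean absolute weight over the n weights
    of the whole FFNN (i.e., 1^T |w| / n), following [17].
    """
    magnitudes = np.concatenate([np.abs(W).ravel() for W in weights])
    return 0.7 * magnitudes.mean()
```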
3.3. Adaptation of
The proposed crossbar structure has two additional parameters to configure. Recall that we assumed an ON/OFF memristor model and that the conductances for the LRS and HRS were common to all memristors in the array. Notwithstanding, memristors can be programmed to different conductance values in the LRS. In this subsection, we develop the tuning of the conductance in
Recall that in Algorithm 1, we configured the memristors in our network to either the LRS or the HRS, relying on backpropagation. In particular, note that the unquantized weights that are written in the memristor network as the LRS will generally differ from
. In other words, usually we have
Therefore, we can use the values in
to also update
and reach a consensus value
. Consider the following update rule:
where
is the forgetting factor and
is a masking function that operates element-wise in order to consider only the weights that have influence in
(i.e., not the null weights). When
is applied to a scalar in
, say
, it produces the following output:
Additionally, is the total number of weights whose quantization is different from zero at iteration .
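The update rule above can be sketched as follows. This is an illustrative implementation under our assumptions (the exact expression is given in the paper): only weights whose quantization is non-zero pass the mask, and their magnitudes, referred back to conductance through the converter gain, are averaged into the running value with the forgetting factor.

```python
import numpy as np

def update_g_on(g_on, W, W_q, beta, rho):
    """Consensus update of the LRS conductance (sketch).

    W    : full-precision weights
    W_q  : their ternary quantization (the mask keeps W_q != 0)
    beta : I-to-V converter gain
    rho  : forgetting factor
    """
    mask = W_q != 0
    n_active = mask.sum()
    if n_active == 0:
        return g_on                      # nothing to average this iteration
    consensus = np.abs(W[mask]).sum() / (beta * n_active)
    return rho * g_on + (1.0 - rho) * consensus
```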
However, note that a single weight in the neural network, say
, once quantized, requires three elements in our hardware model to be represented: two memristors (one in
and one in
) and an I-to-V converter. In other words,
is represented in our physical model as
. Furthermore, we set
(assuming
when
) in order to map the weights
, as is the case in software-based ternary networks [
17]. However, the conductance variance
, which does not depend on the particular value of
, now plays an important role, and the best choice is to set
to the largest value allowed. Note that after division by
in the I-to-V converter, the resulting conductance variance is also downsized.
In short, the discussion above points out that the best strategy is to set as large as possible and then fine-tune our memristor-based neural network by adjusting the gain in the I-to-V converter, as we show next.
3.4. Adaptation of
Let us consider that each output at the crossbar array could be adjusted separately (i.e., we have
). In this case, it is not complicated to compute the gradients for these parameters. It is similar to the weight gradients in backpropagation. For example, consider the scores
, where ⊙ stands for the Hadamard product at the output layer of the neural network. If we train it using cross-entropy (assuming a classification task) (i.e.,
, where
is the number of classes,
(0 or 1) are the targets and
are the network outputs), the gradients are found as follows:
where
selects the
ith row of matrix
.
Let us analyze the effect of this gradient in the network, depicted in
Figure 4. Assume that
(so the current example belongs to the
ith class). Unless we get a perfect classification,
and the first term
of the gradient will be negative. Therefore, if
is positive (i.e., we are at the positive side of the sigmoid or softmax), the gradient is negative, and
should be increased according to Equation (
11). The effect is to shrink the sigmoid or softmax in order to increase the value of
. If
is negative, we are on the negative x-axis of the sigmoid or softmax, and the update of
stretches the curve. This increases the value of
and therefore diminishes the classification error. The reader can refer to
Figure 4 for a graphical visualization of the discussion above. The analysis for
is similar and not included here for the sake of brevity.
Having separate conversion gains at all crossbar outputs that are individually adapted is a real possibility. However, since we assume a common converter value, we must build a consensus gradient from all individual gradients such that
where
(i.e., the total number of outputs in the
J layers of the FFNN). We can now apply gradient-descent-based solutions to optimize
.
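Building the consensus gradient from the individual gradients can be sketched as below (our own naming; the normalization over the total number of outputs is assumed up to the exact constant used in the paper):

```python
import numpy as np

def consensus_beta_gradient(per_output_grads):
    """Average the individual I-to-V gain gradients into a single update.

    per_output_grads : one gradient (or array of gradients) per crossbar
    output across the J layers. Since a single converter value is shared
    by all outputs, the consensus gradient is the mean over all of them.
    """
    flat = np.concatenate([np.asarray(g).ravel() for g in per_output_grads])
    return flat.mean()
```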
Another option is to simply consider as a hyperparameter of the neural network (it is a scalar value) and optimize it accordingly.
3.5. Including Robustness in Perturbed Conductances
The last issue we consider is the perturbation of the conductances that are written to the memristors; that is, we want to set the memristor to a conductance level or , but the level we actually achieve differs by a Gaussian perturbation term (zero-mean and variances and , respectively).
In order to cope with this physical impairment, we adopted an approach that resembles the training of quantized networks. Specifically, in the backward pass of backpropagation, we added a Gaussian term to the weights. The variance of that random contribution was set to , which is a hyperparameter of the network. In other words, we considered the following approach (in algorithmic style). This step substitutes step 7 in Algorithm 1.
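A minimal sketch of this substitute step, under our assumptions (the paper's algorithm-style box gives the exact expression; whether the noise is injected before or after the gradient step is a design choice here):

```python
import numpy as np

def noisy_weight_update(W, grad, lr, sigma_train, rng=None):
    """Weight update with Gaussian noise injection (substitute for step 7).

    After the usual gradient step, a zero-mean Gaussian term of standard
    deviation sigma_train (a hyperparameter of the network) is added to
    the weights, so the learned model becomes insensitive to small
    conductance perturbations.
    """
    if rng is None:
        rng = np.random.default_rng()
    W_new = W - lr * grad
    return W_new + rng.normal(0.0, sigma_train, size=np.shape(W_new))
```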
The approach has a well-established foundation that connects to the regularization methods in neural networks. Primarily used in the context of recurrent neural networks, as described in [
24] (Ch. 7.5), noise injection (i.e., adding random values to the weights) adds robustness to the network in the sense that the model learned is somehow insensitive to small variations in the weights. In other words, our approach can be interpreted as a form of regularization.
4. Experimental Results
In this section, we experimented with the proposed ternary network in order to evaluate the effects of the different adaptation mechanisms (conductance at the LRS and the conversion factor at the I-to-V stage), as well as the effect of quantizing the weights and the incorporation of weight variability during training. We considered two datasets widely employed as benchmarks in machine learning: the Modified National Institute of Standards and Technology (MNIST) dataset [25] and the fashion MNIST dataset [26]. Both datasets consist of grayscale images of 28 × 28 pixels. The former contains images of handwritten numbers (from 0 to 9), whereas the latter contains different types (or classes) of images, all related to clothes (e.g., t-shirts, pullovers or sandals, among others). In both cases, an 80–20 random split for training and testing was conducted.
In terms of neural network architecture, we considered an FFNN with two hidden layers of 1000 units/neurons each. Taking into account 784 (=28 × 28) values at the input layer and 10 output classes, the whole architecture was 784–1000–1000–10, with a total of 7.85 × 10^5 weights/parameters to be trained (including bias terms) in a full-software implementation and 15.7 × 10^5 memristors to be set at either the LRS or HRS in the memristor-based neural network implementation. Training and evaluation were performed in Python using the TensorFlow library for deep neural networks [27]. The memristors were modelled in Python and TensorFlow according to the assumptions in Section 2.
Table 2 summarizes the electrical parameters considered in our experiments. We considered values that were in agreement with the state of the art of the memristive technology [
10,
28], but we also took into account larger conductance deviations in order to accommodate other fabrication technologies. Our goal here was to test the practical importance of synthesizing reliable devices in terms of conductance fluctuations. The conductance values are always relative to
, the quantum conductance.
Finally, ternarization was applied with the adaptive threshold in Equation (
7), and the performance metric employed was classification accuracy, which measured the number of examples (images in this case) correctly classified with respect to the total number of images (i.e., it was the percentage of images correctly classified).
4.1. Adaptation of
Our first experiment dealt with the adjustment of the conductance value for the LRS (i.e.,
). In
Figure 5 and
Figure 6, we considered violin plots that showed the distribution of accuracies obtained after 1000 realizations. Remember that memristor conductances incorporate a random term (i.e.,
and
). Although the weights computed during training remained unchanged, their mapping to memristor conductances changed from one realization to another due to the random term. That aside, we considered here
, the threshold used for the ternarization of the weights set as in Equation (
7), and
.
Figure 5 considers
, and
Figure 6 considers
.
As we can appreciate in
Figure 5, where the conductance perturbations were moderate, the average accuracy was above 95% with all the tested adjustments of
. However, the distribution of accuracy values was spread out significantly more for the case
(ranging from 0.845 to 0.976), whereas the dispersion diminished for the case of
(ranging from 0.939 to 0.977) and practically vanished for
(ranging from 0.965 to 0.978) and more notably for
(ranging from 0.976 to 0.979).
Figure 6 involves experiments with a severe conductance perturbation. In this case, the classification task was more complex, and with equal network configuration, the performance dropped. We appreciated the dispersion in the accuracy distributions for all cases of
, although the dispersion tended to reduce as
increased. In this best case, the accuracy ranged from 0.551 to 0.825, so the difference between the max and min value was 0.274. In this application, it is important to note the mean values for the accuracy. For the two lowest conductance values, the mean accuracy was 25.8% for
and 41.9% for
. This value grew to 61.1% for
and to 73% for
.
In conclusion, both experiments confirmed that should be adjusted to the highest possible value (depending on the available technology) in order to achieve the best possible performance.
4.2. Adaptation of and Robustness to Perturbed Conductance Values
In
Figure 7, we tested how sensitive the classification accuracy was to adjustment of the value in
and to the weight perturbance introduced during training (i.e.,
), assuming
. We considered both datasets under study (handwritten digits and fashion MNIST), and we plotted the classification accuracy as a function of
, testing different combinations of
and
, particularly
and
. As we can appreciate in the figure, setting
(i.e.,
) gave us a particularly good initial adjustment. In the applications tested, the plots show that the performance could be just slightly increased by choosing the optimal value of
, as long as the sensitivity around the initial adjustment was low. Note that the perturbance introduced during training (i.e., the value in
) had a larger influence on the classification accuracy (i.e., the different accuracy curves became more separated), especially for the fashion MNIST dataset when
and
. Note also that in general, the higher
was, the more variation in performance we observed. As a rule of thumb, setting
to a value in the range
provided a proper adjustment.
4.3. Comparison with the Software-Based Neural Network
We then tested how a memristor-based neural network (MBNN) compared to a software-based neural network (SBNN). On this occasion, we took into account ternarization of the weights as well as mitigation of the conductance perturbations and optimization of . Aside from the reference network architecture (i.e., 784–1000–1000–10), we also considered 784–500–500–10 and 784–100–100–10 for the MBNN and SBNN models. The results reported the empirical cumulative distribution functions (ECDFs). Note that the SBNN suffered no perturbation once the weights were fixed, but its performance also varied slightly due to the random initialization of the weights during training. In order to reflect this issue, we considered 1000 realizations in total for each model, comprising 10 different training processes. In other words, each set of weights was used to perform 100 inferences. Note that in the SBNN, all inferences that used the same set of weights produced the same results, whereas in the MBNN, this was not the case due to conductance fluctuations. We considered here the classification of the fashion MNIST dataset, which is a more complex task than handwritten digit classification, assuming and .
In
Figure 8, we next compare the following methods: (1) SBNNs with 1000, 500 and 100 units in the two hidden layers; (2) an MBNN (1000 units in the hidden layers) with
,
(i.e., we did not consider conductance fluctuations in training), an MBNN with
,
(i.e., a default set-up assuming fluctuations in training) and a fine-tuned MBNN (in this case requires increasing
to
); and (3) the fine-tuned MBNN version with 500 and 100 units in the hidden layers.
The results essentially show two issues when we considered 1000 units at each hidden layer. First of all, there was the importance of considering memristor fluctuations during training. Note the spreading in the ECDF for , which had a maximum value of 0.8724 and a minimum value of 0.6274 (i.e., the gap was 0.245). This gap was significantly reduced to 0.042 in the case of the default set-up and practically vanished in the tuned MBNN and also in the SBNN. Second, a properly tuned MBNN achieved a performance similar to that of its SBNN counterpart. If we look at the worst performance in all the set-ups, the SBNN achieved an accuracy of 0.8788, the MBNN with yielded 0.6274 (a 28.6% reduction with respect to the SBNN), and the MBNN with obtained 0.8905 (a 1.3% increase with respect to the SBNN). This slightly better performance might have been due to the regularization effect produced when we included robustness to perturbed conductance values by means of . For the cases of 500 and 100 hidden units per layer, the MBNN performed close to the SBNN for 500 units and suffered a reduction of about 3% in accuracy for 100 hidden units. Therefore, weight quantization affected the performance more as the complexity of the model was further constrained.
4.4. Summary and Extension of Results
To summarize the results so far, we saw that the performance in general depended on the task complexity, the network configuration as well as on the memristor quality, where the larger the
and the lower the
, the better. Regarding task complexity,
Regarding task complexity, Figure 7 plots the results of the exact same models applied to two different tasks. In the top row, the less complex task showed less variability among models, so that proper tuning was less critical. In the bottom row, the more complex task showed more variability and required careful tuning. To complete our analysis, we reproduced in Figure 9 the same experiments with a lower quality memristor. As the figure shows, the classification accuracy of the different models was then far more sensitive to proper tuning. The worst performing models in the classification of handwritten digits (top row) achieved accuracies around or below 20%, whereas the worst accuracy in Figure 7 was above 75%. Something similar occurred in the classification of clothes: the worst performing models achieved values around 20%, whereas in Figure 7 the worst performance was above 65%.
Finally, we tested our memristor-based solution as part of a convolutional neural network (CNN) implementation applied to the classification of the fashion MNIST dataset. The inputs in this case were grayscale images of 28 × 28 pixels (i.e., 2D data). The configuration of our CNN was as follows:
2D convolutional layer with 32 filters, 3 × 3 kernels and rectified linear unit (ReLU) activation;
2D max pooling 2 × 2 layer;
2D convolutional layer with 64 filters, 3 × 3 kernels and ReLU activation;
2D max pooling 2 × 2 layer;
2D convolutional layer with 128 filters, 3 × 3 kernels and ReLU activation;
Flatten layer (1152 values at output);
Fully connected layer (1000 values at output);
Fully connected layer (1000 values at output);
Fully connected layer (10 values at the output to identify each of the 10 classes in the dataset).
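As a sanity check of the configuration listed above, the 1152 values at the flatten layer follow from standard shape arithmetic, assuming "valid" convolutions with stride 1 and non-overlapping 2 × 2 pooling (the usual defaults, though the paper does not state them explicitly):

```python
def conv2d_out(size, kernel=3):
    # 'valid' convolution, stride 1: the spatial size shrinks by kernel - 1
    return size - (kernel - 1)

def pool2d_out(size, pool=2):
    # non-overlapping pooling divides the spatial size by the pool width
    return size // pool

side = 28
side = conv2d_out(side)   # 26 x 26 x 32  after the first convolution
side = pool2d_out(side)   # 13 x 13 x 32  after the first pooling
side = conv2d_out(side)   # 11 x 11 x 64  after the second convolution
side = pool2d_out(side)   #  5 x  5 x 64  after the second pooling
side = conv2d_out(side)   #  3 x  3 x 128 after the third convolution
flatten_size = side * side * 128  # 3 * 3 * 128 = 1152
```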
Convolutional, max pooling and flatten layers were implemented in software. This first block transformed each image into 1152 positive values at the output of the flatten layer (1D). The second block comprised the fully connected layers and was identical to the FFNN tested so far, except for the number of input values (784 before vs. 1152 now). This second block was implemented both in software and using the proposed memristor-based neural network. In the latter case, the outputs at the flatten layer were scaled to fit the allowed input range. Note that the memristors could also be considered in the first block, but this introduces additional complexity insofar as more peripheral circuitry is required; this point is beyond the scope of the paper. In Figure 10, we reproduce the results of Figure 8 using the described CNN approach.
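The scaling step mentioned above can be sketched as a simple normalization of the (non-negative) flatten outputs into an assumed input range [0, v_max]; both the function name and v_max are illustrative, not taken from the paper:

```python
def scale_to_range(values, v_max=0.5):
    """Map non-negative flatten outputs into [0, v_max].

    v_max is an assumed maximum input level for the memristor block,
    not a value reported in the paper.
    """
    peak = max(values)
    if peak == 0:
        return [0.0 for _ in values]
    return [v_max * v / peak for v in values]
```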
The results show that the performance, in terms of prediction accuracy, increased for all tested configurations with respect to the FFNN approach. This was due to the higher-level features extracted by the convolutional and max pooling layers. The second observation is that increasing the amount of conductance fluctuation considered during training made the system more robust (i.e., the accuracy values were less spread out), which is coherent with the results obtained so far. Finally, we observed that a properly configured MBNN was close in performance to the SBNN. These preliminary results encourage us to explore the application of memristors to more complex neural network architectures.
5. Conclusions
In this paper, we analyzed the implementation of deep neural networks using crossbar arrays of memristors, and more specifically, we considered the case where these devices can be configured in only two different states: a low-resistance state (LRS) and a high-resistance state (HRS). The natural usage of crossbar arrays in the context of neural networks is in performing vector-matrix multiplications in an analog fashion (i.e., by adding currents), thus reducing the power consumption and computational time. Our approach aims at emulating ternary neural networks, which set each weight in the neural network to a value in {−1, 0, +1}. In order to achieve this behavior, we need to implement two crossbar arrays for each feedforward layer in the network (i.e., one to represent the positive weights and the other one to represent the negative weights). Additionally, some other adaptation issues in relation to software-based neural networks arise: (1) the currents at the output of the crossbar arrays have to be converted to voltages for the next stage, resulting in a conversion factor that can potentially be tuned to boost network performance, and (2) memristor devices experience conductance fluctuations that also impinge on performance. Taking these issues into account, we designed an algorithm to train the weights in the network and later map these weights to the hardware, where memristors are programmed to either the LRS or the HRS.
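The mapping described above (one crossbar for the positive weights and one for the negative weights, followed by a current-to-voltage conversion) can be sketched as follows. The conductance values G_LRS and G_HRS and the conversion factor k are illustrative assumptions, not the paper's device parameters:

```python
G_LRS, G_HRS = 1e-4, 1e-7   # conductances in siemens (illustrative values)
k = 1e3                     # current-to-voltage conversion factor (illustrative)

def map_ternary_to_crossbars(W):
    """Split a ternary weight matrix (entries in {-1, 0, +1}) into two
    conductance matrices: +1 -> LRS in the positive array, -1 -> LRS in
    the negative array, 0 -> HRS in both arrays."""
    G_pos = [[G_LRS if w == 1 else G_HRS for w in row] for row in W]
    G_neg = [[G_LRS if w == -1 else G_HRS for w in row] for row in W]
    return G_pos, G_neg

def layer_output(v_in, G_pos, G_neg):
    """Analog vector-matrix product: each output current is a sum of
    voltage * conductance terms, and the differential current between
    the two arrays is converted back to a voltage via k."""
    n_out = len(G_pos[0])
    out = []
    for j in range(n_out):
        i_pos = sum(v * G_pos[i][j] for i, v in enumerate(v_in))
        i_neg = sum(v * G_neg[i][j] for i, v in enumerate(v_in))
        out.append(k * (i_pos - i_neg))
    return out
```

With these values, k * G_LRS = 0.1, so an ideal ternary dot product of 1 maps to an output of roughly 0.1, and opposite-signed weights cancel through the differential read-out.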
The results show that the proposed system design and offline training method represent a real alternative to traditional software-based (i.e., CMOS-based) neural networks. The lessons learned in this work are as follows: (1) the higher the conductance of the memristor in the LRS, the better the performance we can achieve; (2) the conversion factor that maps the output currents at one layer to input voltages at the next layer can be fine-tuned, but it is not a sensitive parameter; and (3) it is very important to mitigate conductance variability, as performance is very sensitive to it. In our experiments, we achieved accuracies similar to those of the software-based counterpart; however, without considering conductance variability during training, we observed large gaps in classification accuracy for the worst realizations. This gap could be above 50% in the 10-class classification tasks (handwritten digits and fashion MNIST data) we tested.
Future work could consider additional hardware issues such as nonlinearity, stochasticity, varying maxima, asymmetry between increasing and decreasing responses, non-responsive devices at low or high conductance, mixed time-varying delays or the sneak-path problem in crossbar arrays [10,11,12,13,14]. We also need to evaluate the performance using more complex and widely used neural network models, such as convolutional or recurrent networks; the preliminary results presented here for a CNN already show the potential of memristor-based approaches.