Next Article in Journal
A Flexible Thermocouple Film Sensor for Respiratory Monitoring
Next Article in Special Issue
Real-Time Detection of Drones Using Channel and Layer Pruning, Based on the YOLOv3-SPP3 Deep Learning Algorithm
Previous Article in Journal
Performance of Air-Conditioning System with Different Nanoparticle Composition Ratio of Hybrid Nanolubricant
Previous Article in Special Issue
Deep Learning Based Feature Selection Algorithm for Small Targets Based on mRMR
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

SCA-Safe Implementation of Modified SaMAL2R Algorithm in FPGA

by
José de Jesús Morales Romero
1,*,†,
Mario Alfredo Reyes Barranca
1,†,
David Tinoco Varela
2,†,
Luis Martin Flores Nava
1,† and
Emilio Rafael Espinosa Garcia
1,†
1
Department of Electrical Engineering, CINVESTAV, Mexico City 07360, Mexico
2
Engineering Department, Superior Studies Faculty-Cuautitlán, National Autonomous University of Mexico, UNAM, Cuautitlán Izcalli 54714, Mexico
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Micromachines 2022, 13(11), 1872; https://0-doi-org.brum.beds.ac.uk/10.3390/mi13111872
Submission received: 11 August 2022 / Revised: 12 October 2022 / Accepted: 26 October 2022 / Published: 30 October 2022
(This article belongs to the Special Issue Artificial Intelligence Integration with Micro-Nano Systems)

Abstract

:
Cryptographic algorithms (RSA, DSA, and ECC) use modular exponentiation as part of the principal operation. However, Non-profiled Side Channel Attacks such as Simple Power Analysis and Differential Power Analysis compromise cryptographic algorithms that use such operation. In this work, we present a modification of a modular exponentiation algorithm implemented in programmable devices, such as the Field Programmable Gate Array, for which we use Virtex-6 and Artix-7 evaluation boards. It is shown that this proposal is not vulnerable to the attacks mentioned previously. Further, a comparison was made with other related works, which use the same family of FPGAs. These comparisons show that this proposal not only defeats physical attack but also reduces the number of resources. For instance, the present work reduces the Look-Up Tables by 3550 and the number of Flip-Flops was decreased by 62,583 compared with other works. Besides, the number of memory blocks used is zero in the present work, in contrast with others that use a large number of blocks. Finally, the clock cycles (latency) are compared in different programmable devices to perform operations.

1. Introduction

Nowadays, most of the confidential information and data of people, governments, and companies, are exchanged via electronic media. Electronic media have the advantage of increasing the speed of information exchange. Moreover, they supply an elevated level of security, confidentiality, and data integrity.
Many researchers have developed various methods for encryption and decryption of secret information, including methods with artificial intelligence, as in Ref. [1]. However, attackers have engaged in these techniques.
Encryption and decryption of the information use a secret key. Cryptographic devices such as Smart Cards, Field Programmable Gate Arrays (FPGA), Application-Specific Integrated Circuits (ASICs), microcontrollers, and others operate storing this secret key.
A device that is very interesting to use for various applications is the FPGA. These devices can be used in various fields, such as artificial intelligence [2,3], power electronics [4], control [5], optimization [6], among others. Due to the versatility of these devices, we use them, in the present work, within the area of cryptography.
Kocher, in the 1990s, realized that anyone could get the secret key through physical measurements in cryptographic devices [7,8].
These attacks do not require highly specialized equipment since the attack can be carried out using accessible equipment, such as an oscilloscope and resistors.
Hence, Kocher proved that physical attacks are possible and not only at the software (or algorithm) level. This type of attack is aimed at devices that anyone uses, such as those used in Internet of Things (IoT) [9,10]. These kinds of devices hold confidential information of the owners, such as email, bank movements, personal identification, and any kind of sensitive and private information.
Therefore, works have been focused on the implementation of secure algorithms. In order to reject or impede access to secret information, approaches to reach this task can be either based on software or hardware [9,10]. This last approach has the disadvantage of being very demanding in resource consumption.
This paper deals with modular exponentiation and its implementation to protect information from some of the Non-Profiled Side Channel Attacks [11]. We propose an algorithm that shows robustness, security, and reliability, compared with other reported implementations, analyzing, in particular, its performance upon the N 1 A t t a c k , which is a type of Non-Profile Side Channel Attack.
We will show that the present work reduces a minimum of 3559 Look-Up Tables (LUT) and 62,583 Flip-Flops (FF) compared to other architectures. In addition, it will be shown that the present work does not require memory blocks compared to the rest of the works cited in this one.
The rest of this work is as follows: Section 2 reviews the Side-Channel Attacks (SCA) in the context of RSA implementations; next, in Section 3, an absorbing attack called the N 1 A t t a c k is reviewed; later, in Section 4, the proposed implementation of modular exponentiation and modular multiplication is reviewed; afterward, in Section 5, the implementation of the proposed modular exponentiation and the modular Multiplication implemented in an FPGA is shown; finally, Section 6 provides the results of the proposed approach in the FPGA.

2. Side-Channel Attacks

Since the first physical assault described by Kocher, researchers have developed many attacks against programmable devices. These attacks are known as Side-Channel Attacks (SCA). In the literature, there are two main categories for SCAs: the first category is related to profile attacks, and the second is related to non-profile attacks [11,12].

2.1. Profiled SCAs

Profile SCA corresponds to those made with machine learning. For this type of attack, a device with encryption algorithms is required, which allows for characterizing the vulnerabilities presented by it and thus obtaining a model.

2.2. Non-Profiled SCAs

Non-profiled SCAs require only the encrypted message. There are two major categories for this type of attack: Active Attacks and Passive Attacks.

2.2.1. Active Attacks

When an adversary performs an attack, it somehow exerts an influence over the behavior of the cryptographic device. Then, the attacker can notice a difference in the signals measured on the cryptographic device. From these waves, an attacker can extract secret information.
However, these attacks are considered invasive attacks. In an invasive attack, the attacker physically modifies the cryptographic device. For example, the attacker can induce a signal in the cryptographic device. This signal gives a change in the memory, inside the computation of the data, or the cryptographic algorithm. Fault Attacks (FA) are examples of these kinds of attacks
Due to the nature of these attacks, cryptographic devices are usually irreversibly damaged.

2.2.2. Passive Attacks

On the other hand, passive attacks are not invasive. These kinds of attacks do not interfere with the operation of the device. Hence, the device is not damaged.
Power Analysis Attacks (PAA) [8], Timing Analysis Attacks (TAA) [7], and Electromagnetic Analysis Attacks (EAA) [10] are examples of Passive Attacks.
In Power Analysis Attacks (PAAs), the attacker takes advantage of the fact that the power consumed is correlated to the algorithm and the data processed by the crypto device.
Besides, the PAA can be classified into two categories: Simple Power Analysis (SPA) and Differential Power Analysis (DPA) [8].
It is possible to get the secret key of a cryptosystem using SPA by making one or a few measurements of the power traces of the power consumption when the device is performing its cryptographic operations.
On the other hand, DPA requires many measurements of the power traces of the power consumption. To get these measures, an attacker or researcher uses different input messages on the cryptographic device. A statistical analysis of the measured power traces is carried out. It should be mentioned that DPA is more powerful and more difficult to prevent than SPA.

2.3. Countermeasures to Passive Attacks

Many cryptographic algorithms, such as RSA [13], use modular exponentiation as the core. Modular exponentiation uses the binary representation of the exponent. SCAs use the measured traces to find the binary form of the exponent.
Before Kocher published his work, the Square-and-multiply Algorithm was the most used for modular exponentiation. This approach has two forms Left-to-right (SaML2R) and Right-to-left (SaMR2L). Algorithm 1 shows the Square-and-Multiply, Left to Right algorithm.
Algorithm 1 Square-and-Multiply, Left to Right
Require:  m, d = ( d n 1 , , d 0 ) 2 , modulus N
Ensure:  S = m d mod N
 1:
S 1
 2:
fori from ( n 1 ) down to 0 do
 3:
    S S 2 mod N
 4:
   if ( d i = 1 ) then
 5:
      S S · m
 6:
   end if
 7:
end for
 8:
returnS
However, this technique is vulnerable to SPA and Timing Analysis Attacks. The SaM algorithm calculates modular multiplication and square if the exponent bit is 1, as shown in steps 4 and 5 in Algorithm 1; or it processes just one square when the exponent bit is 0.
Coron supplied the first technique to defeat the SPA and Timing Analysis Attack [14]. This regular technique is called Square-and-multiply Always (SaMA). Specifically, Algorithm 2 shows the Left to Right version of the SaMA algorithm. This algorithm, in contrast to the Square and Multiply Algorithm, calculates a square followed by a multiplication, independently of the value of the corresponding bit of the exponent. Namely, this algorithm calculates a square followed by a multiplication when the bit of the exponent is 0 or 1.
Algorithm 2 Square-and-Multiply Always, Left to Right
Require:  m, d = ( d n 1 , , d 0 ) 2 , modulus N
Ensure:  S = m d mod N
 1:
S 1
 2:
for i from ( n 1 ) down to 0 do
 3:
    R [ 0 ] S 2 mod N
 4:
    R [ 1 ] S [ 0 ] · m mod N
 5:
    S S [ d i ] mod N
 6:
end for
 7:
return  S
Table 1 shows a brief example of the SaMAL2R algorithm execution. In this example, the parameters are d = 97 10 = 1100001 2 as the exponent and m as the input message.
As shown in the example, in all the steps the same operations are always carried out. However, this algorithm uses dummy operations during its execution. These dummy operations stand for vulnerability because they can be attacked through DPA.
On the other side, Mayima et al. proposed an algorithm to defeat DPA [15], called Binary Expansion with the Random Initial Point (BRIP) algorithm. Algorithm 3. This algorithm protects the input message with a random value, and it does not use dummy operations during cryptographic operations.
Algorithm 3 BRIP
Require:  m, d = ( d n 1 , , d 0 ) 2 , modulus N
Ensure:  S = m d mod N
 1:
S [ 0 ] r
 2:
S [ 1 ] r 1
 3:
S [ 2 ] m · r 1
 4:
for i from ( n 1 ) down to 0 do
 5:
    S [ 0 ] S [ 0 ] 2 mod N
 6:
   if  d i = 0  then
 7:
      S [ 0 ] S [ 0 ] · S [ 1 ] mod N
 8:
   else
 9:
      S [ 0 ] S [ 0 ] · S [ 2 ] mod N
 10:
   end if
 11:
end for
 12:
S S [ 0 ] · S [ 1 ] mod N
 13:
return  S
Table 2 shows a brief example of the execution of the BRIP algorithm. In this example, the parameters are d = 97 10 = 1100001 2 as the exponent and m as the input message.
Nevertheless, this algorithm requires the calculation of random numbers for its execution, which stands for excessive consumption during implementation in programmable devices such as FPGA.
The previously mentioned algorithms were attacked by a special type of SCA.

3. N 1 A t t a c k

There are many kinds of passive attacks. An engrossing attack within passive attacks is the Chosen-message Attack. The Chosen-message Attack uses a specific input message. This input message produces a particular behavior during the performance of an algorithm. One of these kinds of attacks was developed by Yet et al. [16] and implemented by Miyamoto et al. [17]. This attack is called N 1 A t t a c k . This technique is used against the modular exponentiation of the SaMA and BRIP algorithms.
This N 1 A t t a c k s works, considering the two following observations in the RSA context with modulus N:
  • The first observation is that ( N 1 ) 2 1 mod N , this is extended to the fact that ( N 1 ) j 1 mod N for any even integer j;
  • The second observation is similar to the earlier one when ( N 1 ) k ( N 1 ) mod N for any odd integer k.
These two observations in modular exponentiation will give only two values in the intermediate operations: a 1 value when the bit processed of the exponent is 0 and N 1 when the bit is 1. Hence, an attacker will see only two patterns in the power traces of the measurements. Thus, the secret key, being the exponent, can be obtained.
An attacker or researcher may use the N 1 A t t a c k against the BRIP algorithm. In general, with this technique, anyone can attack algorithms that use a random value to protect the input message.

Countermeasures to N 1 A t t a c k

In the literature, there are proposals to defeat N 1 A t t a c k s . Some of these proposals are mentioned below.
The first proposal is to blind the message. The BRIP algorithm shows this countermeasure. The input message, m, is blind using a random value, r. A second proposal is to block the special input message. In this countermeasure, the input message is compared to the value N 1 and is closed up when both are equal. However, before the proposals are vulnerable to the Chosen-message Attack, this technique follows the same idea as the N 1 A t t a c k .
Finally, D. Tinoco [18] made two proposals to countermeasure the N 1 A t t a c k . These countermeasures do not require blinding the message, and they do not have to block the input message when it is N 1 .
To calculate modular exponentiation, Tinoco follows the property shown in (1).
m a = m 2 a + b = ( m a ) 2 · m b
Tinoco used (1) to apply it in modular exponentiation algorithms. Finally, he presented Equation (2). This equation is the representation that works with intermediate-even exponents.
m d = ( ( ( m 2 d n 1 ) 2 · m 2 d n 2 ) 2 m 2 d 2 ) 2 · m 2 d 1 · m d 0
The proposals by D. Tinoco follow the idea of SaMAL2R and BRIP algorithms, and his Equation (2). Algorithm 4 shows the Modified square-and-multiply always, left to right, while Algorithm 5 shows the Modified BRIP Algorithm.
Algorithm 4 Modified Square-and-Multiply Always, Left to Right
Require:  m, d = ( d n 1 , , d 0 ) 2 , modulus N
Ensure:  S = m d mod N
 1:
S 1
 2:
M m · m mod N
 3:
for i from ( n 1 ) down to 1 do
 4:
    R [ 0 ] S 2 mod N
 5:
    R [ 1 ] R [ 0 ] · M mod N
 6:
    S R [ d i ] mod N
 7:
end for
 8:
if  d 0 = 0  then
 9:
    return  S
 10:
else
 11:
    return  S · m mod N
 12:
end if
Algorithm 5 Modified BRIP
Require:  m, d = ( d n 1 , , d 0 ) 2 , modulus N
Ensure:  S = m d mod N
 1:
S [ 0 ] r
 2:
S [ 1 ] r 1
 3:
S [ 2 ] m · m · r 1 mod N
 4:
fori from ( n 1 ) down to 1 do
 5:
    S [ 0 ] S [ 0 ] 2 mod N
 6:
   if  d i = 0  then
 7:
      S [ 0 ] S [ 0 ] · S [ 1 ] mod N
 8:
   else
 9:
      S [ 0 ] S [ 0 ] · S [ 2 ] mod N
 10:
   end if
 11:
end for
 12:
if  d 0 = 1  then
 13:
    S [ 0 ] S [ 0 ] · S [ 1 ] · m mod N
 14:
else
 15:
    S [ 0 ] S [ 0 ] · S [ 1 ] mod N
 16:
end if
 17:
return  S [ 0 ]
Since the algorithms run from n 1 to 1, the last square and the last multiplication by m 2 d 0 are eliminated. To compute the correct value, an if statement replaces the multiplication by m d 0 .
When the exponentiation is executed by MSaMAL2R or BRIP algorithms, the two patterns are not shown. Conversely, only one regular and masked shape in the power traces is observable.
Although these algorithms are implemented in software by the author, they are not implemented in programmable devices such as FPGAs.

4. The Proposed Algorithm

Even when it is possible to implement Algorithm 5 in programmable devices such as FPGAs, it is necessary to generate a random value, r, and calculate its inverse, r 1 . At this point, it is possible to create random values in an FPGA. However, a reduction of the resources used by the FPGA is necessary. Moreover, the algorithm requires three registers to store intermediate values used in the loop.
On the other hand, Algorithm 4 does not require generating one random value. This advantage helps saving resources compared with the MBRIP algorithm. As stated in the MBRIP algorithm, three registers are needed to store intermediate values in the loop, going against the compromise to reduce the device’s resources.
It should be remembered that the main operation of modular exponentiation algorithms is modular multiplication. Modular multiplication was developed and synthesized in the FPGA.

4.1. Montgomery Multiplication

There are several proposals reported in the literature on modular multiplications. P. Montgomery developed an interesting modular multiplication [19]. This proposal, known as Montgomery multiplication, has been implemented in FPGAs [20,21,22]. Besides, there is a technique to improve Montgomery multiplication. This technique avoids the final subtraction of the original algorithm. Algorithm 6 shows the description steps suggested in this work for Montgomery multiplication.
Algorithm 6 Montgomery Multiplication
Require: N = ( N w 1 N 1 N 2 ) b , x = ( x w 1 x 1 x 0 ) b , y = ( y w 1 y 1 y 0 ) b with x , y < 2 N , 2 N < R = b w with gcd ( N , b ) = 1 , and N = N 1 mod b
Ensure:  A = x y R 1 mod N
 1:
A 0   { with A = ( A w 1 A 1 A 0 ) b }
 2:
for i from 0 to ( w 1 )  do
 3:
    u i ( A 0 + x i · y 0 ) · N mod b
 4:
    A ( A + x i · y + u i · N ) / b
 5:
end for
 6:
return  A
An engrossing proposal for implementing Montgomery multiplication is called Systolic Architecture [9,23]. Systolic Architecture uses regular blocks called Processing Elements (PEs), which is an advantage for FPGA implementation.
In this approach, the operands are split into words with size radix b because base b is in the power of 2 with k bits.
Taking this into account, Algorithm 7 shows the version of the Montgomery multiplication for implementation in FPGA.
Algorithm 7 Montgomery multiplication for FPGA
Require:  N, x and y with a length of b w bits, and N of k bits
Ensure:  x y R 1 mod N
 1:
A 0
 2:
for i from 0 to ( n 1 )  do
 3:
   Take the lowest k bits of u i ( A 0 + x i y 0 ) N
 4:
   for j from 0 to ( w 1 )  do
 5:
      ( c j , A j ) = ( A j + x i y + u i N + c j ) k
 6:
   end for
 7:
end for
 8:
return  A
The operation ≫ in step 5 of Algorithm 7 means the division by b in step 6 of Algorithm 6.
However, Montgomery multiplication has a disadvantage. To get the correct result, the x and y operands must be translated into the Montgomery domain. The operands translation is executed using the same technique as Montgomery. Then, x and R 2 mod N are used for x ˜ input and, y and R 2 mod N are used for y ˜ input. Then, the original values are x ˜ and y ˜ .
An added Montgomery multiplication must be done to retrieve the result from the Montgomery domain. It uses the A value and one value of 1 as the inputs. However, it is inexpensive for modular exponentiation. It is done only once during the exponentiation execution.

4.2. Proposed Modular Exponentiation

As previously mentioned, Algorithm 4 has some advantages.
First, we consider the requirements for the Montgomery multiplication. Using this way, we propose to change Algorithm 4. That proposal is suitable for implementation on FPGAs. Algorithm 8 shows this new proposal.
Algorithm 8 Modified Square-and-Multiply Always, Left to Right for FPGA
Require:  N = ( N w 1 N 1 N 0 ) b , m = ( m w 1 m 1 m 0 ) b , d = ( d n 1 d 0 ) 2 with m < 2 N , 2 N < R = b w with gcd ( N , b ) = 1 , and N = N 1 mod b
Ensure:  S = m d mod N
 1:
S R mod N
 2:
M [ 0 ] m · R 2 mod N
 3:
M [ 1 ] M [ 0 ] · M [ 0 ] mod N
 4:
for i from ( n 1 ) downt to 1 do
 5:
    R [ 0 ] S 2 mod N
 6:
    R [ 1 ] R [ 0 ] · M [ 1 ] mod N
 7:
    S R [ d i ] mod N
 8:
end for
 9:
S S · M [ 0 ] mod N
 10:
S S · 1 mod N
 10:
return  S
The values R mod N and R 2 mod N are inputs. As it is mandatory in the Montgomery multiplication algorithm, the input values are settled in the Montgomery domain; step 2 sets the message input in the Montgomery domain. The result is retrieved in step 10.
In the context of RSA, the exponent d is always an odd number. That is, d 0 will always be 1; considering this, step 9 in Algorithm 8 has replaced the last condition in Algorithm 4.
Table 3 shows a brief example of the execution of the proposed algorithm. In this example, the parameters are d = 97 10 = 1100001 2 as the exponent and ( N 1 ) as the input message.
Notice that the intermediate values are always the same. The above example proves the correct operation of the proposed technique.
The proposed algorithm requires one extra register compared with the MSaMAL2R algorithm; this is a disadvantage for programmable devices with reduced resources. However, the extra register is not a drawback, due to the number of resources in the FPGA.

5. Implementation of the Proposed Algorithm in FPGA

The implementation of modular multiplication is essential because, as previously mentioned, it is the core of modular exponentiation. Thus, the implementation of modular multiplication defines the number of resources used by the FPGA. Thus, an optimal design of the implementation can be achieved.

5.1. Implementation of Proposed Modular Multiplication for FPGA

First, we implement the Montgomery multiplication into FPGA. We used the Systolic Architecture with a radix of 16 bits; the operands were chosen accordingly to avoid the final subtraction, as previously mentioned.
With the radix of 16-bits and 1024-bit RSA, we split the operands into 64 words; thus, there are 64 PEs. This operation takes only one clock cycle.
All these PEs are regular, which means that only one PE was developed, and this PE was replicated 64 times.
Figure 1 shows the block diagram of the Montgomery multiplication architecture implemented in FPGA.
The value x is left-shifted 16 bits, and this value is sent to all PEs and the block called u i . The task of this last block is to calculate the value of step 3 in Algorithm 7. As previously mentioned, the least significant bits are the only ones considered.
Figure 2 shows the block diagram of the individual PE implemented. In the first PE, the input-carry value was set to zero. In the last PE, the output carry value was used as the final A w 1 .
We use Digital Signal Processors (DSPs) embedded in the FPGA to reduce resources for the implementation of modular multiplication. These DSPs have internally one pre-adder, one 18 × 25 bits multiplier, and an accumulator. All PEs were set in a pipeline fashion. The implementation of the DSPs, which we use for the multiplier and the pre-adder, is carried out as recommended by the manufacturer Xilinx in its documentation.
The pre-adder receives operands from the multiplexers 2 and 3, as shown in Figure 2. The inputs of multiplexer 2 are the values y j and u i . The inputs of the multiplexer 3 are the values A j and the carry of the earlier PE, c a r r y j 1 . The inputs of the multiplier are the values from the multiplexer 1 and the pre-adder. The accumulator of the DPS adds the earlier result with the value given for the multiplier. The purpose of this accumulator is to perform the additional operation given in step 5 of Algorithm 7.
A Finite State Machine (FSM) controls the operations of the Systolic Architecture. This FSM has five states to manage all PEs and to set the corresponding values into each PE.
Although five states are used and one state consumes one clock cycle, the first one is for idling, and the last state is for resetting. Thus, three clock cycles are used by the algorithm for performing all the operations required to perform the modular multiplication. So, considering that all the 64 PEs and each PE used three clock cycles for processing the data, plus one clock cycle to set up the beginning of the process and one to return the final result, the total clock cycles running for the modular multiplication is 194.
In the present work, we used a clock of 100 MHz for different platforms. Each PE uses three clock cycles, so the latency of each PE is 30 ns. Considering that we have 64 PEs, we have 1920 ns to operate. However, there are two extra clock cycles, one for idling and one for resetting, then the time for executing one modular multiplication is 1940 ns.

5.2. Implementation in FPGA of the Proposed Modular Exponentiation

For the implementation of the modular exponentiation proposed, an FSM, which follows Algorithm 8, was used. Figure 3 shows the block diagram of the proposed method.
All registers were set according to 1024-bit RSA and eight registers of 1024-bits were used for this implementation distributed as follows: one register to store the input message m; one register to store the value R mod N ; one register to store R 2 mod N ; two more registers to store the values of M; two registers to store the intermediate values R; and one register to store the output S.
External RAM blocks were not used to implement the proposed modular exponentiation. Instead of using an external block of RAM, registers were used to store the values.
The FSM controls the steps within the algorithm. This FSM passes the data into all registers of the block Montgomery multiplication. Passing the data into all registers is done considering the d value exponent, as was shown in Algorithm 8.
Figure 3 shows the block Montgomery multiplication.
Then, the output is stored in the register labeled S in the block diagram. In this proposal, the FSM saves the output value in a register instead of in an external RAM block.

6. Results

We used two different families of FPGAs. For the first one, we use a development board Nexys 4 DDR of Digilent®. The FPGA used is the Artix XC7A100T-CSG324 of Xilinx®, and it was implemented and synthesized using Vivado 2022.1. For the last one, we used the FPGA Virtex-6 XC6VLX75T-2FF484, which was only simulated using the post place-and-route results. For the synthesized one, we used ISE 14.7 Web Pack Edition. In both cases, we used a clock of 100 MHz.
One of the purposes of carrying out the implementation of our proposal using two different families of FPGAs is not only to compare it with other architectures, but also to show that the results obtained are similar in consumption and speed, as well.
Table 4 shows all the resources used for the implementation of the modular exponentiation, with N = 1024 bits, proposed in this work using the two families of FPGAs mentioned previously.
In the case of Artix-7, only 9803 Look-Up Tables (LUT) and 10,568 Flip-Flops (FF) were used, which is only 15.46 % and 8.33 % with respect to the total resources available. However, the most important parameter is the number of DSPs in both platforms. Both implementations used 66 DSPs.
These resources also include those used by the implementation of the Montgomery multiplication. These results show that only part of the FPGA resources is used for modular exponentiation.
Table 5 shows the resources used by modular exponentiation and Montgomery multiplication in the case of Artix-7.
Table 5 shows that the implementation of the modular multiplication used 9714 LUT, hence, only 89 extra LUT were needed for the implementation of the modular exponentiation. Similarly, the modular multiplication used 5338 FF, so, 5231 extra FF were needed for the implementation of the modular exponentiation.
As was previously mentioned, the Montgomery multiplication takes 194-clock cycles. Modular multiplication is performed by the proposed modular exponentiation. Algorithm 8 takes 2052 modular multiplications. The total multiplication takes 398,088 clock cycles. However, step 7 of Algorithm 8 requires 1 clock cycle to store the data into S. Besides, the entire loop of encryption or decryption of data takes a total of 399,113 clock cycles.
The value of 399,113 clock cycles is for the modular exponentiation. Nevertheless, the number of clock cycles needed to store and read the data from external communication was not considered.
Table 6 shows a comparison of the State-of-the-Art Modular Exponentiation Architectures. The exponent has a length of 1024 bits in all implementations.
Table 6 compares the different architectures used to perform modular exponentiation, which includes different platforms. All of them have their advantages and disadvantages. The results show that our work has a reduction in the number of LUT compared with the rest of the works.
Our architecture, using Virtex-6, has a reduction of 3550 LUTs compared to Ref. [24] and 115,528 compared to Ref. [21]. In the case of Artix-7, there is a reduction of 4.31 % of LUTs compared to Ref. [9].
The present work requires a smaller number of FFs. In the case of Virtex-6, there is a reduction of 62,583 of FFs compared to Ref. [21]. In the case of Artix-7, it requires a small increase of FFs, only of 1.38 % compared to Ref. [9].
An advantage of the present work is that it does not require the use of memory blocks (BRAM), in contrast with the rest of the works, since they require a large number of memory blocks. For Virtex-6, the work presented in Ref. [24] requires 32 blocks of 32 kb and 2 blocks of 18 kb; for the case of the work presented in Ref. [21] it requires 132 BRAMs from the total that the FPGA has. For Artix-7, our architecture does not require the use of BRAM, while the work presented by Somnath requires 7.81 % of BRAM.
Our proposal, in the case of the Virtex-6 platform, requires a slightly larger number of slices, using only 779, compared with those presented in Ref. [24], however, it has a reduction of 32,885 slices compared with the architectures presented in Ref. [21].
As previously mentioned, the number of DSPs is an important parameter to take into account. Our proposed implementation used only 66 DSP.
The latency or the number of clock cycles that our architecture takes to perform modular exponentiation is greater than those shown in other works [21,24]. However, our proposal shows a reduction in the number of resources.
Finally, earlier results show a great reduction in resources used for both families of FPGA. The foregoing, together with the proposal made in Algorithm 8, produces an increase in security.

7. Discussion

In the present work, we developed an implementation of the MSaMAL2R for FPGA, which is effective against non-profiled side channel attacks.
This implementation requires less consumption of resources used by the FPGA for its operation, as mentioned in the previous section. However, it requires more clock cycles to perform the modular exponentiation. To speed-up the performance, it is necessary to improve the architecture of the Montgomery modular multiplication presented in this work.
Our proposal not only requires a smaller number of DSPs but also a smaller number of slices. This reduction in resources is helpful due to the lower energy consumption during its operation, which makes it less susceptible to non-profiled SCA.
Although it has shown a reduction in resources for the encryption or decryption of information, in addition to being less vulnerable to side-channel attacks, it is necessary to study other types of attacks on this implementation. Some of these attacks may be profiled SCA.
Because we use VHDL for the hardware description, it is possible to scale or use the same code in other types of FPGAs from the same manufacturer, however, it can also be used in FPGAs from other manufacturers with the same characteristics as the FPGA used in the present work. To support the above statement, we implemented the same proposal in two different FPGAs families using the same VHDL code. The first one used a high-cost FPGA such as the Virtex and the second one implemented a low-cost FPGA such as the Artix.

8. Conclusions

This work presents a proposal to defeat the N 1 A t t a c k without cost penalties. One extra register is required, and no generations of random values are needed. Thus, it allows reducing the complexity of the application and reduces the resources in the implementation.
The resources shown in the results section demonstrated that the proposed implementation is a worthy possibility to be implemented in hardware using programmable devices such as FPGA. The proposal made in this paper shows a reduction in resource consumption; specifically, a consumption of LUTs by 3550 LUTs and 62,583 of FFs.
As was mentioned previously, we implemented our proposal using two different families of FPGAs. These two families are Virtex-6 and Artix-7. The former is one of the costliest in price and the last is one of the cheaper options. Both are from the Xilinx family. The implementation of our proposal in both types of FPGA presents similar results in terms of the number of resources used. This shows that the proposal can be implemented in low-cost and low-resource devices, as was the case with Artix-7. This allows it to be used in portable devices such as those used in IoT. It even allows other approaches such as Neural Networks to be implemented in the case of using high-resource FPGAs such as Virtex.
The running time of the modular exponentiation mostly depends on the modular multiplication. Even though there are implementations of modular multiplication with a faster running time than this proposal, nevertheless, the resources needed are higher than in our work.
Table 6 shows different techniques to implement modular exponentiation. All of them are designed to defeat SCAs. However, our proposal has the advantage that fewer resources are needed. This is an advantage over the rest, thus increasing the security due to requiring less power consumption.

Author Contributions

Conceptualization, J.d.J.M.R., D.T.V. and M.A.R.B.; methodology, J.d.J.M.R., D.T.V. and M.A.R.B.; software, L.M.F.N. and E.R.E.G.; validation, M.A.R.B. and L.M.F.N.; formal analysis, J.d.J.M.R. and D.T.V.; investigation, J.d.J.M.R.; writing—review and editing, J.d.J.M.R. and M.A.R.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by CINVESTAV—IPN and the project PAPIME PE105822 of the UNAM.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lin, J.; Luo, Y.; Liu, J.; Bi, J.; Qiu, S.; Cen, M.; Liao, Z. An Image Compression-Encryption Algorithm Based on Cellular Neural Network and Compressive Sensing. In Proceedings of the 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC), Chongqing, China, 27–29 June 2018; pp. 673–677. [Google Scholar] [CrossRef]
  2. Cabrera-Rufino, M.A.; Ramos-Arreguín, J.M.; Rodríguez-Reséndiz, J.; Gorrostieta-Hurtado, E.; Aceves-Fernandez, M.A. Implementation of ANN-Based Auto-Adjustable for a Pneumatic Servo System Embedded on FPGA. Micromachines 2022, 13, 890. [Google Scholar] [CrossRef] [PubMed]
  3. Morales-Romero, J.J.; Reyes-Barranca, M.; Flores-Nava, L. Improved Algorithm for Time-Multiplexing with Digital CNN’s Applied in Image Processing, Synthesized in a FPGA. In Proceedings of the 2019 16th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), Mexico City, Mexico, 11–13 September 2019; pp. 1–6. [Google Scholar] [CrossRef]
  4. Montalvo, V.; Estévez-Bén, A.A.; Rodríguez-Reséndiz, J.; Macias-Bobadilla, G.; Mendiola-Santíbañez, J.D.; Camarillo-Gómez, K.A. FPGA-Based Architecture for Sensing Power Consumption on Parabolic and Trapezoidal Motion Profiles. Electronics 2020, 9, 1301. [Google Scholar] [CrossRef]
  5. Cruz-Miguel, E.E.; García-Martínez, J.R.; Rodríguez-Reséndiz, J.; Carrillo-Serrano, R.V. A New Methodology for a Retrofitted Self-tuned Controller with Open-Source FPGA. Sensors 2020, 20, 6155. [Google Scholar] [CrossRef] [PubMed]
  6. Rodríguez-Abreo, O.; Rodríguez-Reséndiz, J.; Velásquez, F.A.C.; Ortiz Verdin, A.A.; Garcia-Guendulain, J.M.; Garduño-Aparicio, M. Estimation of Transfer Function Coefficients for Second-Order Systems via Metaheuristic Algorithms. Sensors 2021, 21, 4529. [Google Scholar] [CrossRef] [PubMed]
  7. Kocher, P.C. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. In Proceedings of the Advances in Cryptology—CRYPTO ’96; Koblitz, N., Ed.; Springer: Berlin/Heidelberg, Germany, 1996; pp. 104–113. [Google Scholar]
  8. Kocher, P.; Jaffe, J.; Jun, B. Differential Power Analysis. In Proceedings of the Advances in Cryptology—CRYPTO’ 99; Wiener, M., Ed.; Springer: Berlin/Heidelberg, Germany, 1999; pp. 388–397. [Google Scholar]
  9. Mondal, S.; Patkar, S. Hardware-Software co-implementation of a high performance and light-weight scalable Systolic-Montgomery based modified RSA for portable IoT devices. In Proceedings of the 2021 International Conference on Emerging Smart Computing and Informatics (ESCI), Pune, India, 5–7 March 2021; pp. 439–443. [Google Scholar]
  10. Chen, E.; Kan, J.; Yang, B.Y.; Zhu, J.; Chen, V. Intelligent Electromagnetic Sensors for Non-Invasive Trojan Detection. Sensors 2021, 21, 8288. [Google Scholar] [CrossRef] [PubMed]
  11. Tang, M.; Luo, M.; Zhou, J.; Yang, Z.; Guo, Z.; Yan, F.; Liu, L. Side-channel attacks in a real scenario. Tsinghua Sci. Technol. 2018, 23, 586–598. [Google Scholar] [CrossRef]
  12. Chang, L.; Wei, Y.; He, S.; Pan, X. Research on Side-Channel Analysis Based on Deep Learning with Different Sample Data. Appl. Sci. 2022, 12, 8246. [Google Scholar] [CrossRef]
  13. Rivest, R.L.; Shamir, A.; Adleman, L. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Commun. ACM 1978, 21, 120–126. [Google Scholar] [CrossRef] [Green Version]
  14. Coron, J.S. Resistance Against Differential Power Analysis For Elliptic Curve Cryptosystems. In Proceedings of the Cryptographic Hardware and Embedded Systems; Koç, Ç.K., Paar, C., Eds.; Springer: Berlin/Heidelberg, Germany, 1999; pp. 292–302. [Google Scholar]
  15. Mamiya, H.; Miyaji, A.; Morimoto, H. Efficient Countermeasures against RPA, DPA, and SPA. In Proceedings of the Cryptographic Hardware and Embedded Systems—CHES 2004; Joye, M., Quisquater, J.J., Eds.; Springer: Berlin/Heidelberg, Germany, 2004; pp. 343–356. [Google Scholar]
  16. Yen, S.M.; Lien, W.C.; Moon, S.; Ha, J. Power Analysis by Exploiting Chosen Message and Internal Collisions—Vulnerability of Checking Mechanism for RSA-Decryption. In Proceedings of the Progress in Cryptology—Mycrypt 2005; Dawson, E., Vaudenay, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 183–195. [Google Scholar]
  17. Miyamoto, A.; Homma, N.; Aoki, T.; Satoh, A. Enhanced power analysis attack using chosen message against RSA hardware implementations. In Proceedings of the 2008 IEEE International Symposium on Circuits and Systems, Seattle, WA, USA, 18–21 May 2008; pp. 3282–3285. [Google Scholar]
  18. Varela, D.T. How to avoid the N-1 attack without costly implementations. Int. J. Netw. Secur. Its Appl. 2012, 4, 109. [Google Scholar]
  19. Montgomery, P.L. Modular multiplication without trial division. Math. Comput. 1985, 44, 519–521. [Google Scholar] [CrossRef]
  20. Liu, R.; Li, S. A Design and Implementation of Montgomery Modular Multiplier. In Proceedings of the 2019 IEEE International Symposium on Circuits and Systems (ISCAS), Sapporo, Japan, 26–29 May 2019; pp. 1–4. [Google Scholar]
  21. Venkatesh, K.; Pratibha, K.; Annadurai, S.; Kuppusamy, L. Reconfigurable Architecture to Speed-up Modular Exponentiation. In Proceedings of the 2019 International Carnahan Conference on Security Technology (ICCST), Chennai, India, 1–3 October 2019; pp. 1–6. [Google Scholar] [CrossRef]
  22. Leelavathi, G.; Shaila, K.; Venugopal, K. Maximizing the Efficiency using Montgomery Multipliers on FPGA in RSA Cryptography for Wireless Sensor Networks. Int. J. Future Revolut. Comput. Sci. Commun. Eng. 2017, 3, 227–240. [Google Scholar]
  23. Li, S.; Tian, J.; Zhu, H.; Tian, Z.; Qiao, H.; Li, X.; Liu, J. Research in fast modular exponentiation algorithm based on FPGA. In Proceedings of the 2019 11th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), Qiqihar, China, 28–29 April 2019; pp. 79–82. [Google Scholar]
  24. Dai, W.; Chen, D.; Cheung, R.C.; Koc, C.K. FFT-based McLaughlin’s montgomery exponentiation without conditional selections. IEEE Trans. Comput. 2018, 67, 1301–1314. [Google Scholar] [CrossRef]
Figure 1. Block diagram of the systolic architecture for Montgomery multiplication in FPGA.
Figure 1. Block diagram of the systolic architecture for Montgomery multiplication in FPGA.
Micromachines 13 01872 g001
Figure 2. Block diagram of a processing element in FPGA.
Figure 2. Block diagram of a processing element in FPGA.
Micromachines 13 01872 g002
Figure 3. Block diagram of the proposed Modular Exponentiation.
Figure 3. Block diagram of the proposed Modular Exponentiation.
Micromachines 13 01872 g003
Table 1. Example of the execution of the Algorithm SaMAL2R.
Table 1. Example of the execution of the Algorithm SaMAL2R.
i d i R ValuesResult
61 R [ 0 ] = S 2 mod N = 1 mod N
R [ 1 ] = S [ 0 ] · m mod N = m mod N
S = m mod N
51 R [ 0 ] = m 2 mod N = m 2 mod N
R [ 1 ] = m 2 · m mod N = m 3 mod N
S = m 3 mod N
40 R [ 0 ] = ( m 3 ) 2 mod N = m 6 mod N
R [ 1 ] = m 6 · m mod N = m 7 mod N
S = m 6 mod N
30 R [ 0 ] = ( m 6 ) 2 mod N = m 12 mod N
R [ 1 ] = m 12 · m mod N = m 13 mod N
S = m 12 mod N
20 R [ 0 ] = ( m 12 ) 2 mod N = m 24 mod N
R [ 1 ] = m 24 · m mod N = m 25 mod N
S = m 24 mod N
10 R [ 0 ] = ( m 24 ) 2 mod N = m 48 mod N
R [ 1 ] = m 48 · m mod N = m 49 mod N
S = m 48 mod N
01 R [ 0 ] = ( m 48 ) 2 mod N = m 96 mod N
R [ 1 ] = m 96 · m mod N = m 97 mod N
S = m 97 mod N
Table 2. Example of the execution of the Algorithm BRIP.
Table 2. Example of the execution of the Algorithm BRIP.
i d i R ValuesResult
61 S [ 0 ] = r 2 mod N S [ 0 ] = r · m mod N
51 S [ 0 ] = r 2 · m 2 mod N S [ 0 ] = r · m 3 mod N
40 S [ 0 ] = r 2 · m 6 mod N S [ 0 ] = r · m 6 mod N
30 S [ 0 ] = r 2 · m 12 mod N S [ 0 ] = r · m 12 mod N
20 S [ 0 ] = r 2 · m 24 mod N S [ 0 ] = r · m 24 mod N
10 S [ 0 ] = r 2 · m 48 mod N S [ 0 ] = r · m 48 mod N
01 S [ 0 ] = r 2 · m 96 mod N S [ 0 ] = r · m 97 mod N
Retrieve result
S [ 0 ] = r · m 97 · r 1 mod N S [ 0 ] = m 97 mod N
Table 3. Example of the execution of the proposed algorithm.
Table 3. Example of the execution of the proposed algorithm.
i d i R ValuesResult
61 R [ 0 ] = R 2 mod N = R mod N
R [ 1 ] = R · R mod N = R mod N
S = R mod N
51 R [ 0 ] = R 2 mod N = R mod N
R [ 1 ] = R · R mod N = R mod N
S = R mod N
40 R [ 0 ] = R 2 mod N = R mod N
R [ 1 ] = R · R mod N = R mod N
S = R mod N
30 R [ 0 ] = R 2 mod N = R mod N
R [ 1 ] = R · R mod N = R mod N
S = R mod N
20 R [ 0 ] = R 2 mod N = R mod N
R [ 1 ] = R · R mod N = R mod N
S = R mod N
10 R [ 0 ] = R 2 mod N = R mod N
R [ 1 ] = R · R mod N = R mod N
S = R mod N
Last bit
01 S = S · M [ 0 ] mod N S = M [ 0 ] mod N
Retrieve result
S = S · 1 mod N S = ( N 1 ) mod N
Table 4. Resources used by the FPGA for the proposed Modular Exponentiation.
Table 4. Resources used by the FPGA for the proposed Modular Exponentiation.
Device Utilization
PlatformLUTsFFsSlicesDSPs
Virtex-67834/46,56013,286/93,1204249/11,64066/288
Artix-79803/63,40010,568/126,8004485/15,85066/240
Table 5. Resources used by the FPGA for the proposed Modular Multiplication.
Table 5. Resources used by the FPGA for the proposed Modular Multiplication.
Device Utilization
OperationLUT (63,400)FF (126,800)Slices (15,850)DSP (240)
Exponentiation980310,568448566
Multiplication97145337397066
Table 6. Performance comparison of the State-of-the-Art Modular Exponentiation with N = 1024 .
Table 6. Performance comparison of the State-of-the-Art Modular Exponentiation with N = 1024 .
WorkPlatformArchitectureLUTFFBRAMSlicesDSPsFreq
(MHz)
Latency
(Cycles)
Wangchen [24]Virtex-6FMLE11,834-32/2 1347018-822
per FMLM
Wangchen [24]Virtex-6FMLE
protected
11,884-32/2 1343318-822
per FMLM
Vankatesh [21]Virtex-7Pre-computation
(Montgomery)
128,814
( 42.43 % )
75,839
( 12.49 % )
132 2
( 12.82 % )
37,134
( 48.92 % )
-20014,813
Vankatesh [21]Virtex-7Pre-computation
(Montgomery)
128,814
( 42.43 % )
75,839
( 12.49 % )
132 2
( 12.82 % )
37,134
( 48.92 % )
-20014,813
Proposed
architecture
Virtex-6Systolic7834
( 16 % )
13,286
( 14 % )
0 % 4249
( 36 % )
66100399,113
Somnath [9]Artix-7Systolic 19.77 % 6.95 % 7.81 % --90.9-
Proposed
architecture
Artix-7Systolic9803
( 15.46 % )
10,568
( 8.33 % )
0 % 4485
( 28.3 % )
66100399,113
1 32 blocks of 32 kb and 2 blocks of 18 kb. 2 132 blocks of the total. - Not mentioned.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Morales Romero, J.d.J.; Reyes Barranca, M.A.; Tinoco Varela, D.; Flores Nava, L.M.; Espinosa Garcia, E.R. SCA-Safe Implementation of Modified SaMAL2R Algorithm in FPGA. Micromachines 2022, 13, 1872. https://0-doi-org.brum.beds.ac.uk/10.3390/mi13111872

AMA Style

Morales Romero JdJ, Reyes Barranca MA, Tinoco Varela D, Flores Nava LM, Espinosa Garcia ER. SCA-Safe Implementation of Modified SaMAL2R Algorithm in FPGA. Micromachines. 2022; 13(11):1872. https://0-doi-org.brum.beds.ac.uk/10.3390/mi13111872

Chicago/Turabian Style

Morales Romero, José de Jesús, Mario Alfredo Reyes Barranca, David Tinoco Varela, Luis Martin Flores Nava, and Emilio Rafael Espinosa Garcia. 2022. "SCA-Safe Implementation of Modified SaMAL2R Algorithm in FPGA" Micromachines 13, no. 11: 1872. https://0-doi-org.brum.beds.ac.uk/10.3390/mi13111872

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop