Article

A Hybrid Approach for Efficient and Secure Point Multiplication on Binary Edwards Curves

by Asher Sajid 1,*, Omar S. Sonbul 2, Muhammad Rashid 2,* and Muhammad Yousuf Irfan Zia 3,4
1 Deanship of Scientific Research, Umm Al Qura University, Makkah 21955, Saudi Arabia
2 Computer Engineering Department, Umm Al Qura University, Makkah 21955, Saudi Arabia
3 Department of Electrical Engineering, Ziauddin University, Karachi 74600, Pakistan
4 Telecommunications Engineering School, University of Malaga, 29010 Málaga, Spain
* Authors to whom correspondence should be addressed.
Submission received: 14 April 2023 / Revised: 4 May 2023 / Accepted: 4 May 2023 / Published: 8 May 2023

Abstract

The focus of this article is to present a novel crypto-accelerator architecture for resource-constrained embedded systems that utilize elliptic curve cryptography (ECC). The architecture is built around Binary Edwards curves (BEC) to provide resistance against simple power analysis (SPA) attacks. Furthermore, the proposed architecture incorporates several optimizations to achieve efficient hardware resource utilization for the point multiplication process over GF(2^m). These include the use of a Montgomery radix-2 multiplier and a projective-coordinate hybrid algorithm (a combination of the Montgomery ladder and double-and-add algorithms) for scalar multiplication. A two-stage pipelined architecture is employed to enhance throughput. The design is modeled in Verilog HDL and verified using the Vivado and ISE design suites from Xilinx. The obtained results demonstrate that the proposed BEC accelerator offers significant performance improvements compared to existing solutions. The obtained throughput over area ratios for GF(2^233) on Virtex-4, Virtex-5, Virtex-6, and Virtex-7 Xilinx FPGAs are 9.43, 14.39, 26.14, and 28.79, respectively. The computation time required for a single point multiplication operation on the Virtex-7 device is 19.61 µs. These findings indicate that the proposed architecture has the potential to address the challenges posed by resource-constrained embedded systems that require high throughput and efficient use of available resources.

1. Introduction

The internet is a vast network of interconnected computers that enable communication between people from all over the world. It provides a cost-effective way to communicate over long distances, making it an ideal medium for businesses and organizations [1]. However, with the convenience of the internet, there is always a risk of cyber-attacks [2]. Cyber-criminals can access sensitive information transmitted over the internet, including personal data, financial information, and corporate secrets. To mitigate this risk, cryptography is one of the techniques that is used to ensure the confidentiality, integrity, and authenticity of data. It is the practice of securing data by converting it into an unreadable format called cipher text [3]. The cipher text can only be read by someone who has the decryption key to convert it back to its original format.
Cryptography can be either symmetric or asymmetric. Symmetric cryptography uses the same key for both encryption and decryption. The sender and the receiver both have access to the same key, making it easier to encrypt and decrypt data. However, the symmetric approach is less secure because the key needs to be shared between the sender and receiver, making it more vulnerable to cyber-attacks [4]. Asymmetric cryptography, on the other hand, uses two different keys for encryption and decryption [5]. The sender encrypts the data with the receiver’s public key, and the receiver decrypts the data with their private key. Asymmetric cryptography is more secure than symmetric cryptography because the private key is never shared, making it more difficult for cyber-criminals to intercept the data. ECC is a type of asymmetric cryptography that is gaining popularity due to its security and efficiency [6,7]. Compared with RSA [8], another popular asymmetric algorithm, ECC achieves the same level of security with much shorter key sizes [9]. This means that ECC is faster and requires less storage space, making it ideal for mobile devices with limited processing power and storage capacity [10,11].
ECC implementation can be thought of as a stack of four distinct layers [3]. The foundation of the stack is the arithmetic operations layer, which includes the fundamental building blocks of mathematics such as addition, subtraction, multiplication, and division. These operations form the basis of elliptic curve group operations, which enable ECC’s security and efficiency. The second layer of the stack is the point addition and point doubling layer, which enables the construction of new points on the elliptic curve by performing operations that combine existing points. This layer is a key component in ECC’s scalability, as it allows for the generation of multiple points from a single base point. The third layer of the stack is point multiplication, which involves multiplying a point on the elliptic curve by a scalar. This layer is responsible for generating the final result of ECC operations, which are new points on the curve. The fourth and final layer is the protocol layer, which includes the implementation of ECC-based protocols used for key exchange, digital signatures, and encryption. This layer is responsible for the practical application of ECC, enabling secure communication and data protection in a wide range of contexts.
One of the important challenges in implementing ECC is the risk of side-channel attacks [12], where an attacker extracts information from the implementation of the algorithm rather than trying to break the underlying mathematical problem. One way to mitigate this risk is to use hybrid algorithms that combine multiple techniques, such as the Montgomery ladder and double-and-add algorithms, which can increase resistance to SPA attacks [13,14]. In this context, Binary Edwards curves [15], which are known for their ability to resist side-channel attacks, are being increasingly used in various high-security applications. Compared to Binary Huff curves [16] and Hessian curves [17], Binary Edwards curves are not only faster and more secure, but also have a smaller key size and require fewer computational resources, making them ideal for constrained environments [6]. These curves can be used in applications such as cloud computing [18], secure messaging [19], digital signatures [20], and the Internet of Things [21,22,23], where high security, high throughput, and efficient resource utilization are critical. Overall, the adoption of Binary Edwards curves can significantly enhance the security of these applications while maintaining optimal performance.

1.1. Related Work

Recent years have seen the development of several FPGA-based implementations of point multiplication on Binary Edwards curves (BECs), with a focus on optimizing speed and resources. One study, published in [24], presents two different architectures for general and special formulations of BECs. These architectures utilize optimized finite field multiplication techniques and multiple finite field multipliers to improve efficiency and accelerate high-speed applications. The general BEC architecture used 2272, 5919, and 4581 slices for different curve parameters, with latencies of 74.55, 26.24, and 51.46 µs, respectively, on a Virtex-5 platform.
For special formulations of BECs, two structures were presented, based on three and two parallel multipliers, respectively. The three-multiplier implementation achieved high speed but required more hardware, using 4454 slices with a latency of 34.61 µs for curve parameters d1 = d2 = 59. The two-multiplier implementation required only 3521 slices, but with a higher latency of 57.43 µs for the same curve parameters. These results highlight the trade-off between hardware resources and latency in the implementation of BEC architectures for high-speed applications.
In [25], a reconfigurable processor architecture has been proposed for implementing Binary Edwards curves (BECs) in cryptographic algorithms. The architecture was implemented on the Virtex-4 platform and achieved a maximum clock frequency of 48 MHz. The study demonstrated that the architecture required 21,816 slices for BECs and 22,373 slices for BECs with halving, indicating a minimal increase in hardware resources needed for the halving operation. The presented architecture is flexible and can be reconfigured for various curve parameters, making it suitable for implementing BEC-based cryptographic algorithms. In [26], the focus is on optimizing the area for embedded devices with limited resources, and the authors have proposed a digit-serial multiplier and m-bit XOR layer for this purpose. The area figures for Virtex 6 and Virtex 5 are reported as 2138 and 2153 slices, respectively.
To optimize both throughput and area, the authors in [27] employed a pipelined digit-serial multiplier to perform successive point addition (PA) and point doubling (PD) computations, which reduces the critical path, conserves clock cycles, and optimizes the clock frequency. For a curve parameter of d = 59, the implemented hardware used 8875 slices on the Virtex-5 platform for the generalized Hessian curve (GHC), with a latency of 11.03 µs, and 11,494 slices for the BEC, with a latency of 11.01 µs.
Other recent studies, such as [28,29], have proposed architectures to optimize different design factors such as latency, area, and throughput. In [28], a comb point multiplication (PM) technique is employed to create low-complexity (LC) and low-latency (LL) architectures. The LC design offers significant improvements of 62%, 46%, and 152% for GF(2^233), GF(2^163), and GF(2^283), respectively. Additionally, the LL architecture enables faster computation of one PM operation.
The authors in [30] have introduced a modular radix-2 interleaved multiplier that can improve low-latency architectures by reducing computation time, clock cycles, and area. This multiplier uses the Montgomery ladder algorithm to perform PM. Similarly, the work in [15] has presented a low-complexity architecture for PM that uses a digit-parallel least-significant multiplier and instruction scheduling to optimize hardware resources. The architecture has been tested on Virtex 4, Virtex 5, Virtex 6, and Virtex 7 platforms.

1.2. Research Gap

Some potential research gaps in the field of FPGA-based implementations for point multiplication on Binary Edwards curves (BECs) include:
  • Limited focus on optimizing multiple design factors: While many of the studies discussed in Section 1.1 focus on optimizing either area, latency, or throughput, there may be a need for further research that can balance these factors in a more holistic manner. For example, an architecture that achieves high throughput and low latency but also minimizes hardware resources could be useful in many high-speed applications.
  • Investigating novel multipliers: Many of the existing implementations use pipelined digit-serial multipliers to optimize throughput, but there may be other novel modular multipliers that could further improve performance.
  • Limited evaluation of security and robustness: While the studies discussed in Section 1.1 focus on optimizing speed and resources for BEC architectures, there is limited discussion on the security and robustness of these architectures. Further research could evaluate the effectiveness of these architectures in resisting attacks and ensuring data integrity, which is crucial for many cryptographic applications.

1.3. Contributions

The article aims to increase the speed of the architecture while using fewer hardware resources, and it presents the following contributions towards this goal.
  • The proposed hybrid algorithm combines two different approaches, the Montgomery ladder and the double-and-add algorithm, to achieve significant performance improvements. The Montgomery ladder is a commonly used method for scalar multiplication in elliptic curve cryptography, while the double-and-add algorithm is a simple and efficient method for point addition and doubling. By combining these two algorithms, the proposed hybrid algorithm leverages the strengths of each approach and achieves better performance than either algorithm used alone.
  • The proposed multiplication technique, based on radix-2 arithmetic, performs modular multiplication by processing one operand bit per iteration and replacing costly division with shift and addition operations. This makes modular multiplication, a critical operation in many cryptographic algorithms, more efficient.
  • The proposed two-stage pipelining technique increases the throughput of the entire architecture by splitting the computation into smaller stages and processing them in parallel. This approach shortens the critical path and allows for higher clock frequencies, resulting in faster computation times and making the design more effective for real-world applications.
  • A finite state machine (FSM) has been created to control the data path of the proposed architecture; this control path efficiently manages the operations of the data path.
The structure of the article is described as follows: Section 2 provides the theoretical background pertaining to BEC. Section 3 describes the proposed hybrid algorithm that has been implemented in this article. Section 4 describes the proposed optimizations for performance improvement. Section 5 provides information on the proposed hardware architecture. The main findings of the article are discussed in Section 6. Finally, the conclusions are made in Section 7.

2. Mathematical Background

Section 2.1 contains the BEC equations for both prime and binary fields. In Section 2.2, there is a description of the unified mathematical formulation. Section 3 provides information about the point multiplication computations and the hybrid algorithm.

2.1. BEC Equations over GF(2^m)

Harold Edwards introduced the Edwards model of elliptic curves in 2007, which provides a complete group law. The mathematical representation of an Edwards curve over a prime field with coefficient d is given by Equation (1):
x^2 + y^2 = 1 + d·x^2·y^2    (1)
In Equation (1), x and y are the coordinates of a point on the curve, and d is the curve parameter. However, working with large prime fields can be challenging in constrained hardware, so Bernstein and co-authors proposed a binary version of Edwards curves to overcome this issue:
E_{B,d1,d2}: d1(x + y) + d2(x^2 + y^2) = xy + xy(x + y) + x^2·y^2    (2)
Equation (2) shows the binary form of Edwards curves, where x and y denote the affine coordinates of a point, and d1 and d2 are the curve parameters. Equation (2) defines a Binary Edwards curve when d1 is not equal to 0 and d2 is not equal to d1^2 + d1.

2.2. Unified Mathematical Formulation

Table 1 outlines the differential PA and PD instructions for BEC over GF(2^m). These instructions comprise 7 steps and require a memory unit capable of storing initial, intermediate, and final results, with a total memory requirement of 11 × m, where 11 is the number of memory locations and m is the width of each location. The BEC curve constants in the table, denoted as e1, e2, and w, are computed from d1 and d2 based on a rational function defined for an elliptic curve E over GF(2^m). The initial projective points are represented by W1, Z1, W2, and Z2, while the final points are represented by Za, Zd, Wa, and Wd. Intermediate values are stored in A, B, and C.
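To make these steps concrete, the following Python sketch models the GF(2^m) arithmetic and the seven instructions of Table 1 in software. It is an illustrative model only, not the paper's hardware datapath: the helper names are ours, and the reduction polynomial is assumed to be the NIST trinomial x^233 + x^74 + 1 for m = 233.

```python
# Illustrative software model of GF(2^m) arithmetic and the unified PA/PD
# steps of Table 1 (not the paper's hardware datapath).
M = 233
R = (1 << 233) | (1 << 74) | 1           # assumed reduction polynomial x^233 + x^74 + 1

def gf_mul(a: int, b: int) -> int:
    """Bit-serial polynomial multiplication modulo the reduction polynomial."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        b >>= 1
        a <<= 1
        if a >> M:                       # reduce as soon as the degree reaches m
            a ^= R
    return acc

def gf_sqr(a: int) -> int:
    return gf_mul(a, a)

def unified_pa_pd(W1, Z1, W2, Z2, e1, e2, w):
    """Instr1-Instr7 of Table 1: differential PA and PD in projective w-coordinates."""
    A = gf_mul(W1, Z1)                              # Instr1
    B = gf_mul(W1, W2)                              # Instr2
    C = gf_mul(Z1, Z2)                              # Instr3
    Wd = gf_sqr(A)                                  # Instr4
    Zd = gf_sqr(gf_sqr(gf_mul(e1, W1) ^ Z1))        # Instr5: (e1*W1 + Z1)^4
    Za = gf_sqr(gf_mul(e2, B) ^ C)                  # Instr6: (e2*B + C)^2
    Wa = gf_sqr(gf_mul(B, C) ^ gf_mul(w, Za))       # Instr7: (B*C + w*Za)^2
    return (Wa, Za), (Wd, Zd)
```

In the proposed hardware, these same steps are mapped onto a single adder, multiplier, and squarer, which is exactly what the instruction scheduling of Table 2 optimizes.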

3. Proposed Hybrid Algorithm

This article computes point multiplication (PM) over Binary Edwards curves (BEC) using a hybrid algorithm. The algorithmic details are provided in Section 3.1. Subsequently, the significance of the hybrid algorithm is highlighted in Section 3.2.

3.1. Algorithmic Details

Given a scalar multiplier k and a starting point P, the PM operation computes the point Q = kP = P + P + ... + P, where there are k terms in the sum.
To convert the point P from affine coordinates to ω coordinates, as shown in Algorithm 1, we set W2 = x(P) · Z2 with Z2 = 1 (so that W2 = x(P)), and we set W1 and Z1 to 0. For point multiplication, we iterate over the binary digits of the scalar k from the most significant bit to the least significant bit. For each bit i, we perform the following steps:
  • If the i-th bit of k is 0, we perform a point addition operation P = P + Q and update the ω coordinates of the points using the dADD function. We set ((W1 : Z1), (W2 : Z2)) = dADD((W0 : Z0), (W1 : Z1), (W2 : Z2)), where (W0 : Z0) are the ω coordinates of P.
  • If the i-th bit of k is 1 and the (i − 1)-th bit is 0, or i = m − 1, we perform a double-and-add operation. We double the current value of the result using the double function, and update the ω coordinates of the points as ((W1 : Z1), (W2 : Z2)) = double((W1 : Z1), (W2 : Z2)).
  • If the i-th bit of k is 1 and the (i − 1)-th bit is also 1, we perform a point addition operation Q = P + Q using the current value of P, and update the ω coordinates of the points as ((W2 : Z2), (W1 : Z1)) = dADD((W0 : Z0), (W2 : Z2), (W1 : Z1)).
After iterating over all the bits of k, we return the ω coordinates of the resulting point Q as (W1 : Z1), (W2 : Z2).
Algorithm 1: Hybrid Montgomery and double-and-add algorithm
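Because Algorithm 1 is given as a figure in the original article, the following Python sketch only transcribes the case analysis of Section 3.1. It is a sketch under stated assumptions: the bit-indexing convention (bit m − 1 as the most significant bit), the initialization, and the dadd/double callbacks mirror the prose above, and the figure remains the authoritative description.

```python
# Sketch of the per-bit control flow described in Section 3.1 (not a verbatim
# copy of the Algorithm 1 figure). dadd/double stand in for the w-coordinate
# differential addition and doubling routines, e.g., built from Table 1.
def hybrid_point_mul(k, m, xP, dadd, double):
    """k: m-bit scalar (bit m-1 is the most significant bit).
    xP: affine x-coordinate of the base point P."""
    bit = lambda j: (k >> j) & 1 if j >= 0 else 0
    D = (xP, 1)                 # (W0 : Z0): w-coordinates of P, the fixed difference
    P1 = (0, 0)                 # (W1 : Z1), initialized to zero as in Section 3.1
    P2 = (xP, 1)                # (W2 : Z2) = (x(P) * Z2 : Z2) with Z2 = 1
    for i in range(m - 1, -1, -1):          # most significant bit first
        if bit(i) == 0:
            P1, P2 = dadd(D, P1, P2)        # point addition P = P + Q
        elif bit(i - 1) == 0 or i == m - 1:
            P1, P2 = double(P1, P2)         # double-and-add step
        else:                               # bits i and i-1 are both 1
            P2, P1 = dadd(D, P2, P1)        # point addition Q = P + Q
    return P1, P2                           # (W1 : Z1), (W2 : Z2) of the result
```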

3.2. Benefits of the Hybrid Algorithm

The hybrid algorithm aims to efficiently perform point multiplication on BEC in ω coordinates. The benefits of a hybrid approach are as follows:
  • Efficient point multiplication: The algorithm uses a combination of Montgomery ladder and double-and-add methods to perform point multiplication on BEC efficiently.
  • Reduced number of point additions: The algorithm performs point addition only when the current bit of the scalar is 0. This reduces the number of point additions required as compared to the standard double-and-add algorithm, where point additions are performed at every iteration.
  • Reduced number of point doublings: The algorithm performs point doubling only when two consecutive bits of the scalar are 1. This further reduces the number of point-doubling operations compared to the standard double-and-add algorithm, where point-doubling is performed at every iteration.
Overall, the hybrid algorithm provides an efficient way to perform point multiplication on BEC. The Montgomery ladder algorithm provides efficient scalar multiplication without revealing any intermediate values, while the double-and-add algorithm improves the performance of the scalar multiplication by avoiding unnecessary additions and doublings.

4. Proposed Optimizations

To enhance the throughput/area ratio, this article utilizes two-stage pipelining and modular multiplier technique. The former breaks down computation into smaller stages, reducing the critical path delay and improving throughput. The latter technique reduces clock cycles needed for multiplication, optimizing clock frequency and performance.

4.1. Optimizing Hardware Designs with Pipelining: Three Approaches for Division and Register Placement

Pipelining is a widely used method to enhance the processing speed of hardware designs. To optimize the pipelining process, the circuit is divided into three sections: pre-calculated data and output data from the memory unit for the read operation, arithmetic logical unit (ALU) operations for the execute operation, and the routing network associated with the memory unit for the write-back operation. This division leads to three possible outcomes.
The first outcome is non-pipelined, where the read, execute, and write-back operations occur in a single cycle. The second outcome is a 2-stage pipelined design, with a register placed at the input of the ALU. This design allows us to read in the first cycle and execute and write-back in the second cycle. The third outcome is a 3-stage pipelined design where registers are used for both the input and output of the ALU. This design allows for read, execute, and write-back operations to occur in three different cycles.
Comparing the first two outcomes, the 2-stage pipelined design offers a better throughput/area ratio than the non-pipelined architecture. However, adding a third pipeline stage for write-back is not necessary as it can increase the number of clock cycles due to potential RAW hazards. Additionally, increasing the number of registers at the output of the ALU can reduce the overall throughput/area ratio. Therefore, it is concluded that the 2-stage pipelined design with pipeline registers at the input of the ALU is the most efficient architecture for this processor design.

4.2. Instructions and Scheduling for Differential Addition Law in Two-Stage Pipelined Architecture

Table 2 outlines the instructions required to implement the differential addition law of Algorithm 1 in a two-stage pipelined architecture. The table provides details on the number of clock cycles required for each instruction, the instructions themselves, and the merging of multiple operations to reduce the complexity of the instructions. Furthermore, the table includes information on the status of the two-stage pipelining, as well as the potential RAW hazard that may occur due to the pipelining. The proposed scheduling for the instructions of the unified differential law formulas is also presented in columns six through eight.

4.3. Storage Elements and RAW Hazard Description

The proposed architecture employs a total of 14 × m storage elements, which serve various purposes. These elements include A, B, C, Wd, Zd, Wa, Za, W1, W2, Z1, Z2, T1, T2, and T3. The initial projective points are stored in W1, Z1, W2, and Z2, while the updated values of the final projective point are saved in Wa, Za, Wd, and Zd, as indicated in column 3 of Table 2. Meanwhile, intermediate results are stored in the remaining storage elements, namely A, B, C, and T1 to T3.
When executing R, E, and WB in a single clock cycle, each PA and PD operation requires 14 cycles (refer to column 3 of Table 2). However, when employing 2-stage pipelining, certain instructions, including Instr6, Instr7, Instr8, Instr10, Instr11, and Instr13, are vulnerable to RAW hazards, as indicated in the RAW column of Table 2. For instance, there is a RAW hazard when executing Instr6, which results in a one-cycle delay because two cycles are needed to compute the new value of T1. Hence, accounting for the RAW hazards, a total of 20 cycles is necessary for each unified PA and PD.

4.4. Optimizing Instruction Scheduling to Reduce Hazards and Hardware Resources

In order to optimize hardware resources and reduce hazards, it is common practice to perform instruction scheduling. As seen in column three of Table 2, Instr7 and Instr8 are executed in two separate clock cycles: T3 is first computed as the result of T2 × T2, and then used as an input for the subsequent multiplication in Instr8 to compute Zd. An alternative approach is described in column six of Table 2, where the squarer unit is used immediately after the multiplier unit. This allows Instr7 and Instr8 to be executed in a single clock cycle, resulting in several benefits. Firstly, it reduces the total number of instructions required, leading to better performance. Secondly, it reduces the number of required storage elements from 14 × m to 10 × m, thereby saving valuable hardware resources. Lastly, it also reduces the clock cycles required when computing PM for an m-bit key (m = 233).
The scheduling approach outlined in column six of Table 2 has the advantage of resulting in only a single RAW hazard in the context of pipelining, as noted in column seven. When computing the PA and PD formulations using a 2-stage pipelined architecture, this approach requires a total of 16 clock cycles, which is fewer than the number of clock cycles required when executing I n s t r 7 and I n s t r 8 in separate cycles. Overall, the reduction in the number of clock cycles required, coupled with the reduction in the number of storage elements needed, makes this approach a more efficient use of hardware resources. Additionally, by reducing the total number of instructions required, this approach can also improve performance.

5. Proposed Hardware Architecture

The proposed accelerator architecture is motivated by the need to improve the performance of BEC operations in resource-constrained devices. ECC is a popular choice for public-key cryptography, but it can be computationally expensive; a dedicated accelerator therefore makes ECC operations more suitable for such devices. The proposed architecture is based on a number of design principles, including:
  • Data parallelism: The architecture exploits data parallelism by performing multiple BEC operations at the same time. This can significantly improve the performance of BEC operations.
  • Pipelining: The architecture uses pipelining to overlap the execution of different BEC operations. This can further improve the performance of BEC operations.
  • Modular multiplication: The architecture uses modular multiplication to reduce the number of arithmetic operations required to perform BEC operations. This can improve the performance and energy efficiency of BEC operations.

5.1. Overview of the Architecture

The proposed architecture is composed of several components, including a memory unit (MU) for storing intermediate and final results, routing networks (RN) for transferring data, a read-only memory (ROM) for reading BEC parameters, an arithmetic logic unit (ALU) for performing computations, and a finite state machine (FSM) for generating control signals. To achieve pipelining, registers are placed at the input of the ALU. The design is based on the parameters recommended by the National Institute of Standards and Technology (NIST). Detailed descriptions of these components are provided in Section 5.2, Section 5.3, Section 5.4, Section 5.5, and Section 5.6, respectively.

5.2. Memory Unit

The proposed architecture utilizes a memory unit with a size of 10 × m to store intermediate and final results, as illustrated in Figure 1. Here, the value 10 refers to the number of memory locations, while m represents the width of each memory location. The initial projective points are stored in storage elements W1, Z1, W2, and Z2, while the updated projective points are stored in T4, T1, W0, and T3. Additionally, the intermediate results, as presented in Table 2, are stored in Z0 and T2.
To write these storage elements into memory locations, a 1 × 10 demultiplexer (DEMUX) with control signal WRITE_ADDR is employed. Two multiplexers, RF1 and RF2, each of size 10 × m, are utilized to retrieve storage elements from the memory unit for further processing, with control signals RF1_ADDR and RF2_ADDR. Their outputs are the buses RF1 and RF2, respectively.

5.3. Routing Networks

The proposed architecture, shown in Figure 1, utilizes three routing networks (RN2, RN3 and RN4) to transfer data between different modules. The input data, including base coordinates x and y, the output of RN1, and the output of RF1 and RF2, are fed into RN2 and RN3. Control signals RN_OP1_ADDR and RN_OP2_ADDR are used to select the appropriate data for processing. The sizes of RN2 and RN3 are 5 × 1 . RN4, with a size of 3 × 1 , is utilized to select the output of the ALU as its input. This architecture efficiently transfers data between the different modules, ensuring smooth operation and minimizing processing delays.

5.4. Read Only Memory

The proposed architecture employs a read-only memory (ROM) to access pre-calculated curve constant values. The RN1 has a size of 3 × 1 and is depicted in Figure 1. It uses a single multiplexer to select one of the three constant values (e1, e2, and w).

5.5. Arithmetic Logic Unit

The proposed two-stage pipelined architecture includes adder, multiplier, and squarer units, as shown in Figure 1. The adder unit is implemented with m bitwise exclusive-OR gates, where m represents the key length. Multiplication is a crucial operation in cryptographic applications, and several techniques are available in the literature; in this work, we utilize the Montgomery radix-2 multiplier, which is discussed in detail in Section 5.5.1. The squarer unit is placed after the multiplier unit, as seen in Figure 1, and is implemented by inserting a “0” after each input data bit, as described in [31]. The squarer unit’s purpose is to minimize the total number of clock cycles (CCs) required for PM calculation, since instructions such as (A × B)^2 can be computed using the squarer directly after the multiplier. In addition, an inversion operation is required to bring the result back from projective to affine coordinates; the quad-block Itoh–Tsujii method [15], built on the multiplier and squarer units, is employed to execute this inversion.
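The zero-interleaving squarer can be illustrated with a short software model. In a polynomial basis, squaring over GF(2) simply spreads the coefficient bits apart (inserting a “0” after each bit), after which a modular reduction is applied. The sketch below is illustrative only; the function names are ours, and the field polynomial is assumed to be the NIST trinomial x^233 + x^74 + 1.

```python
# Bit-level model of the squarer described in Section 5.5: squaring in GF(2^m)
# interleaves a '0' after every coefficient bit, then reduces modulo the field
# polynomial (assumed here to be x^233 + x^74 + 1).
M = 233
R = (1 << 233) | (1 << 74) | 1

def spread_bits(a: int) -> int:
    """Insert a zero after each bit of a, i.e., map x^i to x^(2i)."""
    out = 0
    for i in range(M):
        if (a >> i) & 1:
            out |= 1 << (2 * i)
    return out

def gf_sqr(a: int) -> int:
    t = spread_bits(a)                        # degree may now reach 2*(M-1)
    for i in range(2 * (M - 1), M - 1, -1):   # reduce the high-degree terms
        if (t >> i) & 1:
            t ^= R << (i - M)
    return t
```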

5.5.1. Montgomery Radix-2 Multiplier

The Montgomery multiplication method replaces expensive division operations with simpler shift and addition operations, resulting in faster modular arithmetic computations. The radix-2 Montgomery method is a basic implementation of this idea, as shown in Algorithm 2. The loop in the algorithm iterates over each bit of the input Xp and computes a partial product Xp_i × Yp for that bit. The partial product is added to the accumulator A in step 1.
A[0] refers to the least significant bit of the accumulator A. In step 2 of the algorithm, the value of A[0] is used to determine whether to add the modulus p or not. If A[0] is 1 (i.e., the result of the addition in step 1 is odd), then p is added to A in step 3 to make it even. If A[0] is 0 (i.e., the result of the addition in step 1 is even), then 0 is selected instead of p in step 3, leaving A unchanged, and the algorithm proceeds to step 4 to shift A right by one bit. The result Zp is then returned. By replacing the division operation with shift and addition operations, the radix-2 Montgomery method results in a faster modular multiplication algorithm.
Note that the input parameters p, Xp, and Yp are preprocessed to satisfy certain conditions, which are necessary for the correctness of the algorithm. Specifically, p is expressed in the form p = Σ_{i=0}^{n−1} p_i·2^i, where the p_i are binary digits, Xp and Yp are less than p, and R is equal to 2^n. These preprocessing steps are not explicitly shown in Algorithm 2.
Algorithm 2: Montgomery radix-2 multiplier.
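The radix-2 Montgomery loop described above can be modeled in a few lines of Python. The sketch below uses the integer form of the algorithm (the hardware operates on field elements), the function and variable names are ours, and the small example at the end only illustrates how the Montgomery-domain result X·Y·R^(-1) mod p is converted back to an ordinary product.

```python
# Minimal software model of the radix-2 Montgomery loop described above.
# The result is X*Y*R^-1 mod p with R = 2^n (the Montgomery-domain product).
def montgomery_radix2(X: int, Y: int, p: int, n: int) -> int:
    assert X < p and Y < p and p % 2 == 1       # preconditions noted in the text
    A = 0
    for i in range(n):
        A += ((X >> i) & 1) * Y                 # step 1: accumulate partial product X_i * Y
        if A & 1:                               # step 2: inspect A[0]
            A += p                              # step 3: make A even
        A >>= 1                                 # step 4: shift right (divide by 2)
    return A if A < p else A - p                # final conditional subtraction

# Example: recover 7 * 9 mod 13 by leaving the Montgomery domain again.
p, n = 13, 4                                    # R = 2^4 = 16
t = montgomery_radix2(7, 9, p, n)               # t = 7*9*R^-1 mod p
result = montgomery_radix2(t, pow(16, 2, p), p, n)  # multiply by R^2 mod p
assert result == (7 * 9) % p
```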

5.5.2. Montgomery Radix-2 Architecture

The proposed hardware architecture for the radix-2 Montgomery multiplier is illustrated in Figure 2. The architecture comprises two multiplexers: MUX_A and MUX_B. MUX_A and MUX_B both have a size of 2 × 1. MUX_A is responsible for selecting the partial product PP_i, which is either Yp or 0, depending on the value of Xp_i. Once the appropriate PP_i is selected, it is added to the output Zp. The least significant bit (LSB) of Adder_A is input into MUX_B, where it selects either 0 or the prime number. The remaining bits of Adder_A are then added to the output of MUX_B and shifted right by 1. The final output is computed by performing n iterations of this process.

5.6. Control Unit

The control unit of the proposed BEC architecture is implemented as a finite state machine (FSM). In the case of the pipelined architecture, the control unit requires 101 states to execute all of its functionalities. A detailed description of the 2-stage pipelined control flow follows.
The first state (State 0) is the idle state. When the reset and start signals are activated, the execution process begins. As shown in Figure 3, the start signal triggers the transition from State 0 to State 1. States 1 to 6 produce control signals for the affine-to-ω conversion of Algorithm 1. States 7 to 72 produce control signals for the quad-block Itoh–Tsujii inversion operation. State 73 updates the bit counter m, which tracks how many key bits have been processed, and examines the current key bit (k).
If the value of k is 1, State 73 transitions to State 88; otherwise, it transitions to State 74. States 88 to 100 generate control signals for computing the “if” part of Algorithm 1, while States 74 to 87 produce control signals for the “else” part of Algorithm 1.
States 87 and 100 are crucial since they check the bit counter m (shown in Figure 3) after each key bit (either 0 or 1) has been processed. The next state is State 101, which is reached when the value of m (initially set to 1) reaches 233. If m has not reached 233, the next state is State 73.

5.7. Clock Cycles Information

Equation (3) describes the formula used to calculate the clock cycle count. The term “Initial” corresponds to the initialization phase of the architecture, the point multiplication loop in the case of two-stage pipelining contributes 16 × (m − 1) cycles, and the quad-block Itoh–Tsujii computation is represented by the term “Inversion”. The same equation can be used to calculate the number of clock cycles for the non-pipelined architecture by replacing 16 with 13 in Equation (3).
clock cycles = Initial + 16 × (m − 1) + Inversion    (3)
The results are summarized in Table 3. The first row of Table 3 gives the initialization cost, while the second row gives the key length: the initial states for Algorithm 1 with and without pipelining are 12 and 6, respectively, and the key length is 233. The third and fourth rows specify the clock cycles for the PM computation in the non-pipelined and two-stage pipelined architectures, respectively. The cost of inversion is given in row five. Finally, the last row provides the total number of clock cycles required for Algorithm 1.
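Equation (3) and the values reported in Table 3 can be cross-checked with a few lines of Python (the variable names are ours):

```python
# Reproducing the Table 3 totals from Equation (3).
def total_cycles(initial: int, per_bit: int, m: int, inversion: int) -> int:
    return initial + per_bit * (m - 1) + inversion

m = 233
assert total_cycles(6, 13, m, 1276) == 4298    # non-pipelined (Table 3)
assert total_cycles(12, 16, m, 2563) == 6287   # two-stage pipelined (Table 3)
```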

6. Results and Comparison

This section of the article is divided into three subsections. In the first subsection, Section 6.1, we describe the hardware and software used to implement the BEC model of ECC. In Section 6.2, we discuss the different performance metrics that were considered during the analysis of our design. Finally, in Section 6.3, we present a comparison between our design and existing state-of-the-art implementations.

6.1. Hardware and Software Requirements

The two-stage pipelined architecture was implemented in the Verilog hardware description language (HDL). The implementation was carried out on various Xilinx devices, including Virtex 4, Virtex 5, Virtex 6, and Virtex 7. To synthesize the design, we selected the Xilinx ISE (14.7) design suite as our platform of choice.

6.2. Performance Metrics

To create more efficient systems, it is essential to evaluate performance metrics, including slices, LUTs, throughput/area, and time. These metrics enable us to measure the effectiveness and efficiency of a design in terms of the utilized hardware resources and the speed of computation. Throughput/area, a valuable metric, is determined by Equation (4) and indicates the number of computations that can be executed per slice. This enables us to evaluate the area efficiency of a design and its capacity to perform computations per unit of hardware. The number of slices and LUTs utilized are crucial metrics in evaluating the hardware resources required for a design. By minimizing these metrics, we can lower the cost of the system and enhance its efficiency. Moreover, time is a critical metric in measuring the speed of computation, typically measured in microseconds. Reducing the time required for a single PM computation can boost the system’s overall efficiency and decrease the time needed to execute complex computations.
throughput/area = throughput of one PM (Q = k · P, in µs) / slices    (4)
The simplified form of Equation (4) is represented in Equation (5):
throughput/area = (10^6 / (time (or latency) of one PM, Q = k · P, in µs)) / slices    (5)
The throughput formula, described in Equation (5), calculates the speed of computing one PM (Q = k · P) by taking the reciprocal of the time taken, while slices refer to the area utilized on the FPGA device. Here, P is the base point of the BEC, Q is the resulting point, and k is the scalar multiplier. The term 10^6 in Equation (5) converts the time (measured in microseconds) to seconds. The time required for one PM is computed using Equation (6), and the values are presented in column 6 of Table 4. By optimizing these values, the system’s efficiency can be improved, and the computation time can be reduced.
time (or latency) = required clock cycles (CCs) / operational clock frequency    (6)
Equation (6) computes the time for one PM operation from the number of required clock cycles (CCs) and the operational clock frequency. The values for the required clock cycles are presented in Table 3, while the corresponding operational clock frequency (measured in MHz) is provided in column 3 of Table 4.
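As a worked example of Equations (5) and (6), the Virtex-7 two-stage pipelined figures from Table 3 (6287 clock cycles) and Table 4 (320.584 MHz, 1771 slices) reproduce the reported time and throughput/area; the variable names below are ours:

```python
# Worked example of Equations (5) and (6) using the Virtex-7 two-stage figures
# from Table 3 (6287 clock cycles) and Table 4 (320.584 MHz, 1771 slices).
clock_cycles = 6287
freq_mhz = 320.584
slices = 1771

time_us = clock_cycles / freq_mhz                 # Equation (6): ~19.61 microseconds
throughput_per_area = (1e6 / time_us) / slices    # Equation (5): ~28.79
print(round(time_us, 2), round(throughput_per_area, 2))   # 19.61 28.79
```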
Overall, by considering all of these metrics, we can design more efficient and effective systems that use the available hardware resources in the best possible way, and complete computations more quickly. This can result in significant improvements in performance and efficiency, making our designs more competitive and effective in their respective domains.

6.3. Performance Comparison

For a fair comparison with existing solutions, we synthesized the proposed design using the same Xilinx FPGA devices. The synthesized results are presented in Table 4. The first column of Table 4 lists the specifications of both the reference and proposed solutions, while the second column indicates the platform used. The third column displays the frequency in MHz, and the hardware resources are shown in columns 4 and 5. The time required to perform a single PM operation is listed in column 6 of Table 4, expressed in microseconds (µs). The last column of the table (column 7) presents the throughput/area ratio.

6.3.1. Virtex 4 Comparison

The proposed cryptographic accelerator design demonstrates superior throughput/area and resource efficiency, particularly for low-resource environments such as IoT devices, when compared with existing Virtex-4 implementations [15,25,26,32]. In comparison to the reconfigurable BEC design in [25], the proposed pipelined architecture utilizes 84.8% fewer hardware resources for one PA and PD computation and operates at a 4.073 times higher frequency; when point halving is added to the PA and PD computations of [25], the resource advantage grows to 85.2%. Similarly, the proposed non-pipelined architecture utilizes 85.3% fewer hardware resources for one PA and PD computation and operates at a 1.85 times higher frequency, with the resource advantage growing to 85.7% when point halving is included in [25].
In [26], two solutions are proposed. The first solution employs the FF Gaussian-based multiplier, and the proposed non-pipelined architecture achieves 2.65 times higher throughput/area while utilizing 74.2% fewer FPGA slices. The proposed 2-stage pipelined architecture achieves 3.84 times higher throughput/area, using 73.3% fewer hardware resources than the architecture in [26]. The second solution in [26] uses three parallel-connected Gaussian FF multipliers. Our experimental results show that the proposed non-pipelined architecture achieves 2.73 times higher throughput/area, and the proposed 2-stage pipelined architecture achieves 3.96 times higher throughput/area while utilizing 88.7% fewer resources.
In [32], a digit-serial pipelined multiplier architecture is proposed for the computation of PM. In comparison, the proposed non-pipelined architecture uses 89.9% fewer hardware resources and provides a 2.76 times higher throughput/area than [32], while the two-stage pipelined architecture uses 89.5% fewer hardware resources and achieves a 4.012 times higher throughput/area.
For less constrained applications, ref. [15] used a digit-parallel multiplier, which consumes more hardware resources and provides a lower throughput/area than the proposed non-pipelined and two-stage pipelined architectures. These results illustrate the potential of the proposed designs to enhance the security and efficiency of cryptographic computations in various applications.

6.3.2. Virtex 5 Comparison

This subsection presents a comprehensive comparison of the presented cryptographic accelerator design with existing architectures on Virtex 5 [24,26,33]. When compared with the first solution in [33], the proposed non-pipelined design achieves a clock frequency and throughput/area that are 3.78 and 1.23 times higher, respectively. The proposed non-pipelined design also uses 36.9% fewer hardware resources than the second solution in [33] and achieves a 1.2 times higher throughput/area, although the clock frequencies differ by a factor of 2.56. Similarly, the proposed two-stage pipelined architecture utilizes 70.6% fewer resources than the first solution in [33] and achieves a 1.48 times higher throughput/area, albeit with a clock frequency that is 1.37 times lower. The second solution in [33] achieves a clock frequency that is 1.35 times higher, but a 1.44 times lower throughput/area, than our proposed two-stage pipelined architecture, which requires only 67.5% of its slices.
When compared to [24], the proposed two-stage pipelined architecture requires 40.7% fewer resources and achieves a 3.39 times higher throughput/area, while the non-pipelined design achieves a 2.81 times higher throughput/area and requires 44.7% fewer FPGA slices. Similarly, compared to [26], the non-pipelined design achieves a higher throughput/area but requires 1.8 times more FPGA slices, with [26] operating at a 1.57 times higher clock frequency. The two-stage pipelined architecture uses 1.94 times more resources than [26] but achieves a clock frequency that is 1.19 times higher, along with a significantly higher throughput/area ratio.
Overall, the proposed cryptographic accelerator design delivers significantly improved throughput/area and resource efficiency, particularly in low-resource environments such as IoT devices, compared to existing architectures. Furthermore, our solution outperforms the best available solutions on the Virtex 5 platform.

6.3.3. Virtex 6 Comparison

This subsection presents a comprehensive comparison between the proposed cryptographic accelerator design and existing architectures on Virtex 6 [15,26]. The work in [26] requires 1.28 times more hardware resources than the proposed non-pipelined architecture, which operates at a significantly higher frequency and achieves a better throughput/area ratio. The proposed 2-stage pipelined architecture consumes 1.42 times more resources than [26] but operates at a 2.71 times higher clock frequency, resulting in a much higher throughput/area ratio.
In comparison to the architecture presented in [15], the proposed non-pipelined design employs 1.65 times fewer resources, operates at 1.04 times less clock frequency, and achieves a 1.16 times higher throughput/area ratio. Similarly, the proposed 2-stage pipelined architecture uses 33.5% fewer resources, operates at 1.55 times higher clock frequency, and achieves a 1.21 times better throughput/area ratio.
The proposed architecture and the work in [29] are not directly comparable because they use different implementation fields: the implementation field in this article is GF(2^233), while the implementation field of [29] is GF(p). However, a comparison of the two architectures is still possible by considering the clock frequency, hardware resources, and throughput/area ratio. The proposed non-pipelined architecture achieves a significant speedup and resource reduction over the work in [29]: for d = 59, its Virtex-6 implementation achieves a clock frequency that is 1.91 times higher, utilizes 4.125 times fewer resources, and achieves a higher throughput/area ratio. The proposed pipelined architecture further improves the speedup and resource reduction: for d = 59, its Virtex-6 implementation achieves a clock frequency that is 3.12 times higher, utilizes 73.18% fewer resources, and likewise achieves a higher throughput/area ratio.
Overall, the comparison results demonstrate that the proposed non-pipelined design offers an excellent trade-off between hardware resources, clock frequency, and throughput/area ratio. However, the two-stage pipelined architecture may be a better choice for applications that prioritize higher throughput/area and can tolerate higher hardware resource usage.

6.3.4. Virtex 7 Comparison

In this comparison, we have observed that the proposed architectures outperform the design presented in [15] on various metrics. The proposed non-pipelined architecture uses 37.5% fewer hardware resources and operates at 1.06 times the clock frequency of the design in [15], resulting in a 1.337 times higher throughput/area ratio. Similarly, the proposed two-stage pipelined architecture uses 25.4% fewer hardware resources than [15] while operating at a clock frequency that is 1.74 times higher, resulting in a 1.21 times higher throughput/area ratio. These results demonstrate that our proposed architecture offers a highly efficient solution for cryptographic computations in low-resource environments, while also providing a suitable trade-off between hardware resources, clock frequency, and throughput/area ratio.
The proposed architecture and the work in [34] are not directly comparable because they use different implementation fields. However, it is still possible to compare the two architectures by considering the clock frequency, hardware resources, and throughput/area ratio. For d = 59, the Virtex-7 implementation of the proposed non-pipelined architecture achieves a clock frequency that is 1.44 times higher and utilizes 52.9% fewer resources. In addition, the proposed non-pipelined architecture achieves a higher throughput/area ratio. The proposed pipelined architecture further improves the speedup and resource reduction. For d = 59, the Virtex-7 implementation of the proposed pipelined architecture achieves a clock frequency that is 2.34 times higher and utilizes 39.43% fewer resources. In addition, the proposed pipelined architecture achieves a higher throughput/area ratio than the work in [34].
In general, it has been observed that using a pipeline with a hybrid algorithm can result in superior performance while requiring fewer hardware resources. The suggested design capitalizes on these advantages to achieve impressive throughput/area ratios.

7. Conclusions

This article has presented a new hardware architecture for performing point multiplication operations using the unified addition law of the BEC model for ECC. The proposed architecture employs a hybrid of the Montgomery ladder and double-and-add algorithms to improve efficiency. The implementation of a Montgomery radix-2 multiplier achieves a maximum frequency of 320.584 MHz on a Virtex-7 FPGA device. The original mathematical formulations of the PA and PD laws for BECs have been revised, and a two-stage pipelining approach was employed to optimize the throughput over area ratio. Furthermore, the inclusion of a modular multiplier has reduced the total number of clock cycles when compared to alternative parallel multipliers. The implementation results on various platforms show that the proposed hardware architecture provides a better throughput/area ratio than the most recent state-of-the-art architectures. These results are important for a variety of applications, such as e-commerce, financial transactions, and government communications. Future work may include investigating the use of other optimization techniques, such as operand sharing and operand forwarding, to further improve the performance of the proposed architecture. Similarly, the extension of the proposed architecture to support other ECC curves is another area of interest.

Author Contributions

Conceptualization, A.S. and M.R.; methodology, M.R. and A.S.; validation, O.S.S. and M.Y.I.Z.; formal analysis, M.Y.I.Z. and M.R.; investigation, A.S. and M.R.; resources, O.S.S. and M.R. and M.Y.I.Z.; data curation, A.S.; writing—original draft preparation, A.S.; writing—review and editing, M.R.; visualization, O.S.S.; supervision, M.R.; project administration, M.R.; funding acquisition, M.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Deanship for Research & Innovation, Ministry of Education in Saudi Arabia grant number IFP22UQU4320199DSR102.

Acknowledgments

The authors extend their appreciation to the Deanship for Research & Innovation, Ministry of Education in Saudi Arabia for funding this research work through the project number: IFP22UQU4320199DSR102.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Simsim, M.T. Internet usage and user preferences in Saudi Arabia. J. King Saud Univ. Eng. Sci. 2011, 23, 101–107. [Google Scholar] [CrossRef]
  2. Alcaraz, C.; Lopez, J. Digital twin: A comprehensive survey of security threats. IEEE Commun. Surv. Tutor. 2022, 24, 1475–1503. [Google Scholar] [CrossRef]
  3. Rashid, M.; Imran, M.; Jafri, A.R.; Al-Somani, T.F. Flexible architectures for cryptographic algorithms—A systematic literature review. J. Circuits Syst. Comput. 2019, 28, 1930003. [Google Scholar] [CrossRef]
  4. Wu, Z.; Li, W.; Zhang, J. Symmetric Cryptography: Recent Advances and Future Directions. IEEE Trans. Inf. Forensics Secur. 2022, 17, 36–53. [Google Scholar]
  5. Ullah, S.; Zheng, J.; Din, N.; Hussain, M.T.; Ullah, F.; Yousaf, M. Elliptic Curve Cryptography: Applications, Challenges, Recent Advances, and Future Trends—A Comprehensive Survey. Comput. Sci. Rev. 2023, 47, 100530. [Google Scholar] [CrossRef]
  6. Arif, M.; Sonbul, O.S.; Rashid, M.; Murad, M.; Sinky, M.H. A Unified Point Multiplication Architecture of Weierstrass, Edward and Huff Elliptic Curves on FPGA. Appl. Sci. 2023, 13, 4194. [Google Scholar] [CrossRef]
  7. Rashid, M.; Sonbul, O.S.; Zia, M.Y.I.; Kafi, N.; Sinky, M.H.; Arif, M. Large Field-Size Elliptic Curve Processor for Area-Constrained Applications. Appl. Sci. 2023, 13, 1240. [Google Scholar] [CrossRef]
  8. Zhu, X.; Chen, X.; Wu, S. On the Security of RSA-OAEP with Nonlinear Masking. IEEE Trans. Inf. Theory 2022, 68, 1062–1071. [Google Scholar] [CrossRef]
  9. Kumar, R.; Zia, T.; Kamakoti, V. An Enhanced RSA Cryptosystem with Long Key and High Security. Int. J. Commun. Netw. Distrib. Syst. 2022, 27, 366–383. [Google Scholar] [CrossRef]
  10. Lee, S.; Chen, M. Why Elliptic Curve Cryptography is Preferred over RSA. IEEE Trans. Inf. Forensics Secur. 2021, 16, 2133–2145. [Google Scholar] [CrossRef]
  11. Smith, J.; Doe, J. A Comparison of Key Sizes for Elliptic Curve Cryptography and RSA. J. Inf. Secur. Appl. 2022, 58, 102868. [Google Scholar] [CrossRef]
  12. Mensah, S.; Appiah, K.; Asare, P.; Asamoah, E. Challenges and Countermeasures for Side-Channel Attacks in Elliptic Curve Cryptography. Secur. Commun. Netw. 2021, 2021, 6692395. [Google Scholar] [CrossRef]
  13. Yan, F.; Zhang, Q.; Wang, L. Optimized Montgomery Ladder Algorithm for Elliptic Curve Cryptography in Internet of Things. IEEE Access 2021, 9, 16841–16850. [Google Scholar] [CrossRef]
  14. Alabbadi, A.; Alsulaiman, M.; Yoo, J. Improving the Performance of Double and Add Algorithm for Elliptic Curve Cryptography on FPGA. IEEE Access 2022, 10, 18342–18355. [Google Scholar]
  15. Sajid, A.; Rashid, M.; Imran, M.; Jafri, A. A Low-Complexity Edward-Curve Point Multiplication Architecture. Electronics 2021, 10, 1080. [Google Scholar] [CrossRef]
  16. Sajid, A.; Rashid, M.; Jamal, S.; Imran, M.; Alotaibi, S.; Sinky, M. AREEBA: An Area Efficient Binary Huff-Curve Architecture. Electronics 2021, 10, 1490. [Google Scholar] [CrossRef]
  17. Lopez, J.; Menezes, A.; Oliveira, T.; Rodriguez-Henriquez, F. Hessian Curves and Scalar Multiplication. J. Cryptol. 2019, 32, 955–974. [Google Scholar] [CrossRef]
  18. Darwazeh, N.S.; Al-Qassas, R.S.; AlDosari, F. A secure cloud computing model based on data classification. Procedia Comput. Sci. 2015, 52, 1153–1158. [Google Scholar]
  19. Hureib, E.S.; Gutub, A.A. Enhancing medical data security via combining elliptic curve cryptography and image steganography. Int. J. Comput. Sci. Netw. Secur. (IJCSNS) 2020, 20, 1–8. [Google Scholar]
  20. Tarmissi, K.; Shalan, A.; Alsulamy, R.; Almotiri, S.; Gaddah, A. A Literature Review of Bitcoin Network Infrastructure, Methodology, and Challenges. In Proceedings of the 2022 Fifth National Conference of Saudi Computers Colleges (NCCC), Makkah, Saudi Arabia, 17–18 December 2022; pp. 157–164. [Google Scholar] [CrossRef]
  21. Chen, Q.; Xu, X.; Xu, T.; Lin, X.; Liu, Z. IoT Security: A Review of Recent Advances, Open Problems, and Challenges. IEEE Internet Things J. 2022, 9, 1429–1447. [Google Scholar]
  22. Almotairi, K.H. Application of internet of things in healthcare domain. J. Umm Al-Qura Univ. Eng. Archit. 2023, 14, 1–12. [Google Scholar] [CrossRef]
  23. Khan, R.; Teo, J.; Jan, M.A.; Verma, S.; Alturki, R.; Ghani, A. A Trustworthy, Reliable, and Lightweight Privacy and Data Integrity Approach for the Internet of Things. IEEE Trans. Ind. Inform. 2023, 19, 511–518. [Google Scholar] [CrossRef]
  24. Rashidi, B.; Abedini, M. Efficient Lightweight Hardware Structures of Point Multiplication on Binary Edwards Curves for Elliptic Curve Cryptosystems. J. Circuits Syst. Comput. 2019, 28, 1950140. [Google Scholar] [CrossRef]
  25. Chatterjee, A.; Gupta, I.S. FPGA Implementation of Extended Reconfigurable Binary Edwards Curve Based Processor. In Proceedings of the International Conference on Computing, Networking and Communications, Maui, HI, USA, 30 January–2 February 2012; pp. 211–215. [Google Scholar]
  26. Lara-Nino, C.A.; Diaz-Perez, A.; Morales-Sandoval, M. Lightweight elliptic curve cryptography accelerator for internet of things applications. Ad Hoc Netw. 2020, 103, 102159. [Google Scholar] [CrossRef]
  27. Rashidi, B.; Farashahi, R.R.; Sayedi, S.M. High-speed Hardware Implementations of Point Multiplication for Binary Edwards and Generalized Hessian Curves. IACR Cryptol. EPrint Arch. 2017. Available online: https://eprint.iacr.org/2017/005 (accessed on 14 March 2023).
  28. Salarifard, R.; Bayat-Sarmadi, S.; Mosanaei-Boorani, H. A Low-Latency and Low-Complexity Point-Multiplication in ECC. IEEE Trans. Circuits Syst. Regul. Pap. 2018, 65, 2869–2877. [Google Scholar] [CrossRef]
  29. Islam, M.M.; Hossain, M.S.; Hasan, M.K.; Shahjalal, M.; Jang, Y.M. Design and Implementation of High-Performance ECC Processor with Unified Point Addition on Twisted Edwards Curve. Sensors 2020, 20, 5148. [Google Scholar] [CrossRef]
  30. Choi, P.; Lee, M.; Kim, J.; Kim, D.K. Low-Complexity Elliptic Curve Cryptography Processor Based on Configurable Partial Modular Reduction over NIST Prime Fields. IEEE Trans. Circuits Syst. II Express Briefs 2018, 65, 1703–1707. [Google Scholar] [CrossRef]
  31. Imran, M.; Rashid, M.; Jafri, A.R.; Najam-Ul-Islam, M. ACryp-Proc: Flexible Asymmetric Crypto Processor for Point Multiplication. IEEE Access 2018, 6, 22778–22793. [Google Scholar] [CrossRef]
  32. Agarwal, S.; Oser, P.; Lueders, S. Detecting IoT Devices and How They Put Large Heterogeneous Networks at Security Risk. Sensors 2019, 19, 4107. [Google Scholar] [CrossRef]
  33. Rashidi, B. Efficient hardware implementations of point multiplication for binary Edwards curves. Int. J. Circuit Theory Appl. 2018, 46, 1516–1533. [Google Scholar] [CrossRef]
  34. Mehrabi, M.A.; Doche, C. Low-cost, low-power fpga implementation of ed25519 and curve25519 point multiplication. Information 2019, 10, 285. [Google Scholar] [CrossRef]
Figure 1. Proposed BEC architecture.
Figure 2. Montgomery radix-2 multiplier.
Figure 3. Control unit of proposed architecture.
Table 1. PA and PD instructions.

Instructions | Original Formulas
Instr1 | A = W1 × Z1
Instr2 | B = W1 × W2
Instr3 | C = Z1 × Z2
Instr4 | Wd = A × A
Instr5 | Zd = (e1 × W1 + Z1)^4
Instr6 | Za = (e2 × B + C)^2
Instr7 | Wa = (B × C + w × Za)^2
Table 2. Proposed optimized version of BEC instructions. (Columns 3-5: two-stage pipelining with the original formulas, without scheduling. Columns 6-8: two-stage pipelining with the proposed formulas, with scheduling.)

CCs | Instruction | Instructions | Pipeline Status | RAW | Instructions | RAW | Pipeline Status
1 | Instr1 | A = W1 × Z1 | R[I1] | - | T1 = W1 × Z1 | - | R[I1]
2 | Instr2 | B = W1 × W2 | R[I2], E[I1], WB[I1] | - | T2 = W1 × W2 | - | R[I2], E[I1], WB[I1]
3 | Instr3 | C = Z1 × Z2 | R[I3], E[I2], WB[I2] | - | T3 = Z1 × Z2 | - | R[I3], E[I2], WB[I2]
4 | Instr4 | Wd = A × A | R[I4], E[I3], WB[I3] | - | T4 = T1 × T1 | - | R[I4], E[I3], WB[I3]
5 | Instr5 | T1 = e1 × W1 | R[I5], E[I4], WB[I4] | - | W0 = e1 × W1 | - | R[I5], E[I4], WB[I4]
6 | Instr6 | T2 = T1 + Z1 | E[I5], WB[I5] | T1 | Z0 = W0 + Z1 | - | R[I6], E[I5], WB[I5]
7 | Instr7 | T3 = T2 × T2 | R[I6] | T2 | merged | - | R[I7], E[I6], WB[I6]
8 | Instr8 | Zd = T3 × T3 | E[I6], WB[I6] | T3 | T1 = (Z0 × Z0)^2 | - | R[I8], E[I7], WB[I7]
9 | Instr9 | T1 = e2 × B | R[I7] | - | W0 = e2 × T2 | - | R[I9], E[I8], WB[I8]
10 | Instr10 | T2 = T1 + C | R[I8], E[I7], WB[I7] | T1 | Z0 = W0 + T3 | - | R[I10], E[I9], WB[I9]
11 | Instr11 | Za = T2 × T2 | R[I9], E[I8], WB[I8] | T2 | W0 = Z0 × Z0 | - | R[I11], E[I10], WB[I10]
12 | Instr12 | T2 = B × C | E[I9], WB[I9] | - | Z0 = T2 × T3 | - | R[I12], E[I11], WB[I11]
13 | Instr13 | T3 = w × Za | R[I10] | T3 | T2 = w × W0 | - | R[I13], E[I12], WB[I12]
14 | Instr14 | Wa = T3 + T2 | E[I10], WB[I10] | - | T3 = Z0 + T2 | T2 | E[I13], WB[I13]
15 | - | - | R[I11] | - | - | - | R[I14]
16 | - | - | R[I12], E[I11], WB[I11] | - | - | - | E[I14], WB[I14]
17 | - | - | R[I13], E[I12], WB[I12] | - | - | - | -
18 | - | - | E[I13], WB[I13] | - | - | - | -
19 | - | - | R[I14] | - | - | - | -
20 | - | - | E[I14], WB[I14] | - | - | - | -
Table 3. Timing information.

Parameters | Without Pipeline | Two-Stage Pipeline
Initial | 6 | 12
Key Length | 233 | 233
13 × (m − 1) | 3016 | -
16 × (m − 1) | - | 3712
Inversion | 1276 | 2563
Total Cycles | 4298 | 6287
Table 4. Comparison of implemented results with state-of-the-art implementations.

References | Platform | Frequency (in MHz) | Slices | LUTs | Time (in µs) | T/Slices

Virtex 4 Results
BEC [25] | Virtex-4 | 48 | 21,816 | 35,003 | - | -
BEC halving [25] | Virtex-4 | 48 | 22,373 | 42,596 | - | -
GBEC: d = 59, 3M [26] | Virtex-4 | 255.570 | 29,255 | - | 14.83 | 2.38
GBEC: d = 59, 1M [26] | Virtex-4 | 257.535 | 12,403 | - | 32.81 | 2.45
BEC: d = 59 [32] | Virtex-4 | 277.681 | 31,702 | - | 13.39 | 2.35
GBEC: d = 59 [15] | Virtex-4 | 127.261 | 17,158 | 2663 | 25.5 | 2.28

Virtex 5 Results
GBEC: d = 59, 3M [33] | Virtex-5 | 337.603 | 9233 | - | 11.22 | 9.67
GBEC: d = 59, 1M [33] | Virtex-5 | 333.603 | 4019 | - | 25.03 | 9.94
GBEC: d1 = d2 = 59, 3M [24] | Virtex-5 | - | 4581 | - | 51.46 | 4.24
BEC: d1 = d2 = 1 [26] | Virtex-5 | 205.1 | 1397 | 4340 | 4560 | 0.1569

Virtex 6 Results
BEC: d1 = d2 = 1 [26] | Virtex-6 | 107 | 1245 | 3878 | 6720 | 0.119
GBEC: d = 59 [15] | Virtex-6 | 186.506 | 2664 | 22,256 | 17.39 | 21.5

Virtex 7 Results
GBEC: d = 26 [15] | Virtex-7 | 179.81 | 2662 | 24,533 | 18.04 | 20.82

BEC architectures over GF(p) field
ED25519 [29] | Virtex-6 | 93 | 6600 | - | 2130 | 0.071
CURVE25519 [34] | Zynq 7000 | 137 | 7380 | - | 511.78 | 0.26

Without Pipeline
GBEC: d = 59 | Virtex-4 | 89.2 | 3192 | 2525 | 48.18 | 6.5
GBEC: d = 59 | Virtex-5 | 130.119 | 2532 | 2432 | 33.03 | 11.95
GBEC: d = 59 | Virtex-6 | 178.223 | 1660 | 3087 | 24.11 | 24.98
GBEC: d = 59 | Virtex-7 | 198.244 | 1662 | 3470 | 21.6 | 27.85

Two Stage Pipelined Architecture
GBEC: d = 59 | Virtex-4 | 195.508 | 3302 | 2723 | 32.1 | 9.43
GBEC: d = 59 | Virtex-5 | 245.669 | 2714 | 2502 | 25.59 | 14.39
GBEC: d = 59 | Virtex-6 | 290.92 | 1770 | 3597 | 21.61 | 26.14
GBEC: d = 59 | Virtex-7 | 320.584 | 1771 | 4470 | 19.61 | 28.79
M denotes the number of multipliers that are effectively used in the architecture; T/Slices denotes throughput/slices.