Joining Federated Learning to Blockchain for Digital Forensics in IoT

Almutairi, Wejdan; Moulahi, Tarek

doi:10.3390/computers12080157


by Wejdan Almutairi * and Tarek Moulahi *

Department of Information Technology, College of Computer, Qassim University, Buraidah 52571, Saudi Arabia

* Authors to whom correspondence should be addressed.

Computers 2023, 12(8), 157; https://0-doi-org.brum.beds.ac.uk/10.3390/computers12080157

Submission received: 19 June 2023 / Revised: 25 July 2023 / Accepted: 28 July 2023 / Published: 3 August 2023

(This article belongs to the Special Issue Using New Technologies on Cyber Security Solutions)


Abstract


The Internet of Things (IoT) is placing smart devices in every aspect of our lives. The number of smart devices in IoT environments is increasing, and these devices store large amounts of sensitive data, which attracts many cybersecurity threats. When attacks occur, digital forensics is needed to investigate when and where they happened and to acquire information that identifies the persons responsible. However, digital forensics in an IoT environment is a challenging area of research due to the multiple locations that contain data, the traceability of the collected evidence, the need to ensure integrity, the difficulty of accessing data from multiple sources, and the need for transparency in the process of collecting evidence. For this reason, we propose combining two promising technologies, federated learning and blockchain, to provide a suitable solution. We used federated learning to train models locally on data stored on the IoT devices, using a dataset designed to represent attacks on the IoT environment. Afterward, we performed aggregation via blockchain by collecting the parameters from the IoT gateway to keep the blockchain lightweight. The results of our framework are promising in terms of consumed gas in the blockchain and an accuracy of over 98% using MLP in the federated learning phase.

Keywords:

IoT; blockchain; digital forensics; federated learning; privacy-preservation

1. Introduction

The integration of technology into our lives has led to the era of the Internet of Things (IoT), which connects millions of physical devices to the Internet, including smart home environments, smart cities, and smart factories, and this number is still increasing, as mentioned in [1]. The number of IoT devices is likely to reach approximately 75 billion by 2025. However, with such technologies come major security risks. Data in IoT devices are hard to secure due to their dynamic nature. Due to this, the security of IoT has become an important area of research, with many solutions proposed to ensure a secure environment. However, due to the amount of sensitive data contained in these devices, this environment will attract cyber threats, and attackers will find ways to bypass such security solutions and access the sensitive data stored in IoT devices [2]. This leads to the need for digital forensics to investigate crime scenes and analyze such attacks.

Digital forensics (DF) is the process of collecting, analyzing, and presenting digital evidence that can be used to identify when an attack happened, who is responsible for it, and where it happened. The findings are later presented in a report that can be used in court or to prevent such incidents in the future. However, it is challenging to conduct a digital forensics investigation due to the availability of data in multiple locations, traceability, and integrity, as well as the difficulty of accessing data from multiple sources and the need for transparency in the process of collecting evidence [3].

To solve these challenges, blockchain technology can be used to build a forensics framework for the IoT environment that fulfills digital forensics requirements. Moreover, a lightweight blockchain can provide even better performance in IoT environments, ensuring that IoT devices save as much energy as possible while maintaining the same level of integrity and traceability [4]. Existing solutions that use blockchain technology fail to address the privacy issues related to blockchain, which may lead to the leakage of sensitive data during transmission or storage, causing problems beyond financial loss. Moreover, existing solutions addressing this challenge used methods such as k-anonymity, assuming that attackers have neither the knowledge nor the ability to compromise the system and gain access to the data [5].

To address the privacy issue related to blockchain, we suggest the use of federated learning, which is based on decentralized training of models across the IoT environment without sharing any raw data with other parties. The trained models can then be uploaded to the blockchain to be used in the digital forensics investigation while preserving privacy [6].

Our contribution is as follows:

This study provides a framework that combines the advantages of blockchain and federated learning to achieve a high level of privacy and performance for conducting a successful digital forensics investigation. The results of this study are, to the best of our knowledge, a novel advancement of the literature.

This paper is organized as follows: Section 2 covers the essential background of the IoT environment, federated learning, and blockchain and describes the problem statement of this study. Section 3 covers the literature review of recently proposed solutions in the same area of research. Section 4 shows the proposed framework. Section 5 describes the experimental environment and the results. Finally, Section 6 gives the conclusion of our study.

2. Background

This section is divided into three subsections describing the concepts of IoT forensics, federated learning, and blockchain. Finally, it presents the motivation and problem statement of this study.

2.1. IoT Forensics

The Internet of Things is an environment that connects different computing devices for consistent data transfer. IoT devices share information with many other devices in the environment through large-scale communication, which makes them vulnerable to many threats and cybercriminals. In 2017, there was an increase of approximately 600% in cyberattacks targeting devices in the IoT environment [7]. Moreover, multiple cases showed that cyberattacks were not targeting the IoT device itself but rather using it to attack other devices and websites. When IoT devices are manufactured, the focus is on cost, size, and usability, while aspects of security and forensics are ignored; because of this, the devices can be easily targeted and attacked. IoT forensics is a branch of Digital Forensics (DF). Both share the same purpose of legally identifying and extracting digital evidence while maintaining a chain of custody so that the evidence can be presented in court. Digital evidence can be any information that helps answer the 5 WH questions (five Ws and one H): who committed the criminal activities, where the crime took place, what the criminal did, when the crime happened, why the criminal committed the crime, and how it happened [8]. The main difference between digital forensics and IoT forensics is the evidence sources; in IoT forensics, digital evidence can be found in a wide range of devices such as monitoring systems, traffic lights, or even medical implants in humans [7]. Chain of custody is an important part of the digital forensics process that involves documenting every step in the investigation and the evidence life cycle to ensure evidence integrity in court. This documentation involves answering the 5 WH questions, including the individuals involved in any given phase of the investigation, time records, and the investigation laboratory that examined and analyzed the evidence. This principle supports the authenticity of digital evidence so that it can later be accepted in court [9].

2.2. Federated Learning and Blockchain

Over the years, Machine Learning (ML) has witnessed huge growth and success in many fields. Artificial Intelligence (AI) applications have also witnessed huge growth due to the massive amount of data used to power ML. With this data, ML can perform tasks that cannot be performed by humans. However, with the use of this big data comes data ownership and security concerns, which makes the traditional ML technologies—where data will be gathered in a central location to train ML models—unsuitable. With these concerns, the process of collecting and sharing data between organizations is becoming difficult considering the amount of sensitive data that must be maintained by its owners, for example, financial and medical data records [10].

As a solution to this issue, an ML model that did not require the data to be gathered in a central location was used to train and build the model. The training process was conducted at the location of the data; the trained model was then used at each location to train a global model. The communication process ensures the privacy and confidentiality of users’ data while the global model is built as if the data were gathered and combined. This is known as Federated Learning (FL) [10].

Blockchain is a decentralized platform technology that eliminates central authority in the system. Blockchain technology was first proposed by Satoshi Nakamoto in 2008 as a chain of blocks, secured by Proof-of-Work (PoW), that records the financial transactions of cryptocurrencies such as Bitcoin [11]. Although it was introduced in 2008, the concept of a distributed ledger goes back to the 1990s, when it was proposed by Haber and Stornetta, and it has been growing since [12].

Blockchain is a series of blocks connected with a hash function (SHA-256) without any central authority. Blocks are added after a miner confirms the hash in any given block; the block is added to the miner's copy of the blockchain and the updated version of the blockchain is broadcast to other miners on the network. Other miners validate the hash and add the block to their blockchains, and any blocks that cannot be validated are discarded [13]. Therefore, tampering with a block's contents will change the hash of that block, and eventually the tampered block will be discarded. The connection of blocks is shown in Figure 1. The first block in a blockchain is called a Genesis block. Each block contains a hash calculated from the contents of the block and the previous block's hash value [14].
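The hash linking described above can be illustrated with a minimal Python sketch. It is not a real blockchain client: there is no mining, consensus, or networking, and the block layout (the `index`, `data`, `prev_hash`, and `hash` fields) and the all-zero predecessor of the genesis block are assumptions made only for illustration.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # assumed placeholder predecessor for the genesis block


def block_hash(index, data, prev_hash):
    """SHA-256 over the block contents and the previous block's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "prev_hash": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(records):
    """Build a list of linked blocks, starting from a genesis block."""
    chain = []
    prev_hash = GENESIS_PREV
    for index, data in enumerate(["genesis"] + list(records)):
        current = block_hash(index, data, prev_hash)
        chain.append(
            {"index": index, "data": data, "prev_hash": prev_hash, "hash": current}
        )
        prev_hash = current
    return chain


def is_valid(chain):
    """Recompute every hash and check each link to the previous block."""
    prev_hash = GENESIS_PREV
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        recomputed = block_hash(block["index"], block["data"], block["prev_hash"])
        if recomputed != block["hash"]:
            return False
        prev_hash = block["hash"]
    return True


chain = build_chain(["evidence A", "evidence B"])
print(is_valid(chain))         # True
chain[1]["data"] = "tampered"  # modify a block's contents after the fact
print(is_valid(chain))         # False
```

Changing the contents of any block makes its recomputed hash differ from the stored one, which is the property that lets other nodes detect and discard a tampered block.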

The high potential associated with blockchain technology is due to its special characteristics, explained as follows: it is a decentralized platform that does not have a central authority; it has transparency, where all transactions in the blockchain are public; it has redundancy, where every miner stores a copy of digital transactions; and it has immutability, which ensures the integrity of the records [15].

2.3. Problem Statement

The IoT environment has attracted many cyber-criminals and cyber-attacks due to the large amounts of information it contains; this situation raises the need to focus on the digital forensics aspect to be able to find individuals responsible for attacks.

Blockchain has proved to have the desired characteristics to provide a suitable solution for digital forensics in this environment. However, this solution needs enhancements in terms of cost and performance, as the literature reviewed in Section 3 shows. Additionally, the privacy aspect has not been addressed properly.

3. Literature Review

This section is divided into two parts; the first part will address the privacy issue associated with the use of blockchain technology by combining it with federated learning and the second part will review some of the recent research that proposed a blockchain-based framework for solving digital forensics challenges in the IoT environment.

3.1. Blockchain and Federated Learning

The authors of [5] proposed a system for sharing data in the Industrial Internet of Things (IIoT) between data providers and data requesters while preserving privacy. The proposed system is divided into two sections: a private blockchain with a retrieval transaction and sharing data transaction and federated learning. The blockchain in this model is only used to retrieve data and manage permissions associated with accessing the data, alongside recording all data events. Instead of sharing the raw data, the model provides computed results by allowing all requesting participants to train a model based on a learning algorithm while preserving private data. This model is used by data recipients to obtain the results of their requests. To evaluate the effectiveness of this model, they tested it on two real-world datasets: the Reuters dataset and 20 newsgroups datasets.

The results of this experiment were based on the Receiver Operating Characteristic (ROC) curve and showed high accuracy with an Area Under the Curve (AUC) average of 0.918, leading to a high diagnostic ability of the proposed federated learning. The developed model eliminates centralized trust and improves the usage of computing resources.

Failure detection in devices is an essential part of the Industrial Internet of Things; however, to identify a problem reliably, users need to upload raw data, which may be vulnerable to disclosure. Based on this, the authors of [16] suggested a blockchain integrated with federated learning where each client creates a Merkle tree (a tree structure composed of different hashed blocks), and the root of the tree is stored on the blockchain. The proposed system consists of a central organization (the manufacturer) and client organizations that provide data for local model training. Client servers can download the global model from the central server and provide data that can help train the model; furthermore, updates applied to the model can be sent back to the central server to be received by the aggregator, which merges the models received from clients to provide them with an updated global model. The updates are forwarded to the detector on the clients to obtain detection results and make decisions based on them.

To test the proposed system, they implemented it in a real-world scenario to evaluate the prototype in terms of feasibility, accuracy in detecting failures, and overall performance. They used MySQL for the off-chain data and the Ethereum blockchain. The prototype showed promising results, with high accuracy and an acceptable execution time of 0.065 for all four clients in the prototype with approximately 1000 data records.
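The Merkle-tree idea used in [16] can be sketched as follows. This is a generic illustration rather than the scheme of [16]: the leaf encoding, the concatenation of hexadecimal digests, and the rule of duplicating the last node on odd-sized levels are assumptions (the last is one common convention among several).

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(records):
    """Hash the records pairwise, level by level, until one root remains."""
    if not records:
        raise ValueError("at least one record is required")
    level = [sha256_hex(r.encode("utf-8")) for r in records]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed convention: duplicate the last node
        level = [
            sha256_hex((level[i] + level[i + 1]).encode("utf-8"))
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Hypothetical client records; only the root would be stored on-chain.
root = merkle_root(["reading-1", "reading-2", "reading-3"])
tampered = merkle_root(["reading-1", "reading-2", "reading-X"])
print(root != tampered)  # True
```

Only the fixed-size root needs to be stored on the blockchain, while any later change to a record held off-chain produces a different root.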

On the other hand, the authors of [17] replaced the central aggregator mentioned in the previous research with a blockchain to ensure integrity and avoid risks caused by harmful clients or manufacturers. In addition, certain clients were chosen by the manufacturers to create a model based on the models received from them; one of the clients is selected as the leader of the rest of the miners and uploads the final model to the blockchain. In terms of privacy, to avoid leakage of private data in federated learning, the authors applied differential privacy to enforce confidentiality over participants' private data, and encrypted and signed the models to prevent malicious participants from stealing a model or gaining access to the data in it.

The results indicate that the privacy of the users was well preserved in this model. This attracted more users to participate in training the model. The proposed model achieved 97% accuracy.

The authors of [18] suggested a model for market trading of resources in decentralized edge companies. They used a hybrid blockchain that combines the characteristics of permissionless and consortium blockchains to lower the overhead costs by allowing participants to use permissionless or consortium blockchains, leading to enhanced system performance. Smart contracts applied in consortium blockchain included the use of Data Quality-Driven Reverse Auction (DQDRA), which ensures reverse auction, computation efficiency, and overall better performance compared with reverse auction mechanisms. According to the authors, IoT environments generate a large amount of sensitive data. Federated learning can be used to preserve privacy by requesting a model training service that will provide the requester with his task requirements without exposing private data.

Experimental results showed that the proposed model achieved truthfulness in terms of payment, along with budget feasibility and improved computation efficiency while preserving privacy in federated learning requests by using trained models.

The overall literature on combining blockchain with federated learning proves that federated learning can be used to complement blockchain and offer effective privacy solutions against cyber threats targeting IoT environments.

3.2. Blockchain in Digital Forensics

The authors of [3] used blockchain technology to create a digital forensics framework that takes advantage of the properties of blockchain to ensure the integrity of the evidence block through the hash functions used in blockchain. Blockchain is also immutable by nature, which makes it hard to tamper with evidence, because the evidence is collected and added to the blocks and then distributed to the blockchain network alongside timestamp logs created by IoTFC (Internet of Things Forensics Chain). This provides provenance for the exact location of each piece of evidence and the ability to trace the evidence until it is presented in court, with restricted access to the evidence chain in the blockchain. According to the authors, digital fingerprinting is used to restrict access during the evidence-collection process. The device involved in the case is highly effective at preventing even the slightest change to the evidence. The framework consists of users (including the owners and examiners), IoT devices involved in the case, the Merkle tree (the hash tree that ensures the integrity of transactional evidence), blocks, and the smart contract (which handles information exchange and data processing and provides automatic collection of related evidence by setting conditions, at a cheaper and faster rate, by excluding any middle party). Evidence gathering in this framework can be divided into five levels ranging from easy to identify to very difficult to identify. Each group of evidence is bookmarked to make the examination process easier.

The authors of [19] proposed a solution consisting of two blockchains to solve the challenges that digital forensics faces in an IoT environment. They chose blockchain technology because of the many benefits it provides, including integrity, transparency, and confidentiality, which make it a promising solution. The solution should be implemented on a Hyperledger and administrated by authorized personnel. It consists of four layers, starting from IoT devices in the Edge-IoF layer that can provide evidence related to real-world crimes with the proper access control and simultaneity between the data in storage and the reported evidence. It also provides anonymity for devices and the investigators handling the devices. Since this layer contains devices that have low power and storage, the authors used low-complexity encryption, which provides confidentiality, integrity, and non-repudiation without consuming many resources. The second layer is Fog-IoF, which includes fog devices and forensics tools. This layer can handle complicated duties during the forensics investigation and maintain the chain of custody (CoC) to ensure transparency through the CoC blockchain. The Consortium-IoF layer is responsible for building the blockchain in forensics investigations (Consortium blockchain). Finally, the Cloud Storage layer, which is used to store data, is connected to the blockchain.

The authors established a simulator for the framework using the Ethereum platform and found an average latency of 7.07%. Another experiment implemented to measure the throughput showed that the system can have an average of 4.5% throughput as evidence increases. In terms of CPU, gas, and energy consumption, the system showed that it has less consumption compared with other systems. However, increased memory usage was observed compared to the other systems due to the use of two blockchains in the framework.

According to the authors of [20], threats can occur internally or externally. This research addressed the internal threats when a disloyal internal entity tries to sabotage or add a piece of evidence that was not added by a witness or a victim or tries to obtain the identity of a witness or a member of the jury. Therefore, the authors proposed a model called LEChain that runs on a blockchain that consists of trusted authority such as a victim or a witness who will provide evidence, an investigator who will acquire the evidence for the police investigation, an analyzer of the crime scene, a judge, a trusted authority where specific entities from the government can access the blockchain, and other components that together will conduct a full criminal investigation. The proposed system will address the privacy problem mentioned above by providing privacy for the witness, jury, and data, authentication for any registered entity in the blockchain, proper access control for the entities, data integrity, traceability, and a sufficient level of efficiency with the assistance of a consortium blockchain.

The proposed solution is implemented on Ethereum and shows that the time required to process the data is almost less than 1 s. As the time cost and the evidence increased, the authors also tested the latency, and the average was nearly 2.5 ms. Typically, when evidence is no longer required or not related to the investigation, a disposition transaction can be used to discard that evidence. However, the proposed framework does not provide such a transaction. Instead, it will nullify the previous upload to delete data from the blockchain. On the downside, the system results showed that the scalability of the system is not sufficient and will need more focused work in the future.

In [21], the authors proposed a solution that consists of multiple blockchains to keep costs down. They used the Electro-Optical System (EOS) blockchain, which offers good efficiency and consumes less energy, alongside the Stellar blockchain, which provides scalability. Both are used in this framework to store the hash of the data for a certain time until the data is transferred to the Ethereum platform. This framework was designed as a solution for a company seeking to maintain the integrity of the data while it is being collected. To ensure low cost, the authors combine multiple blockchains that are less widely used or less secure than Ethereum. According to the authors, using multiple inexpensive blockchains increases the security of the framework despite the security issues associated with each of them.

The proposed system provides a low-cost solution that will stand against the 51% attack on blockchains. An experiment was conducted to calculate the cost, assuming that a company has 1000 bots and each bot sends 10 data pieces per day for a year. The total cost was estimated at $443, which is much lower than other frameworks.
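As a rough check of the scale of this estimate, the reported total can be converted into an approximate cost per data piece. The sketch below uses only the figures quoted above; the 365-day year is an assumption.

```python
bots = 1000
items_per_bot_per_day = 10
days = 365  # assumption: a non-leap year
total_cost_usd = 443  # total reported for the scenario above

total_items = bots * items_per_bot_per_day * days
cost_per_item_usd = total_cost_usd / total_items

print(total_items)                 # 3650000
print(f"{cost_per_item_usd:.6f}")  # 0.000121
```

Under these assumptions, the scenario corresponds to about 3.65 million data pieces per year at roughly USD 0.00012 each.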

Table 1 summarizes the research reviewed in this section, highlighting the advantages and limitations of each work to create a benchmark for the proposed solution.

One of the issues associated with blockchain is that it has many privacy concerns and sharing data as digital evidence may lead to leakage of sensitive data, which may cause problems beyond financial loss that may occur during the transmission or storing of the data. Existing solutions addressing this challenge assume that the attackers do not have the knowledge nor the ability to compromise the system and gain access to the data [5].

To address the privacy issue associated with blockchain, we suggest the use of federated learning, which is based on decentralized training of models to eliminate the need to share raw data. The trained models can be further uploaded to the blockchain [6].

4. Proposed Framework

The proposed framework consists of six phases, which are described in detail in the following subsections.

4.1. Phases of the Proposed Model

Phase 1: Selection of the dataset and the machine learning method. The first step is to select a dataset. We selected the NSL-KDD dataset [22], which is used to represent the attacks on the IoT environment. We used the Multilayer Perceptron (MLP) as the machine learning method.
Phase 2: Splitting the dataset. In this phase, we split the dataset into multiple parts to represent multiple clients. This will simulate the concept of federated learning and preserve privacy by avoiding the upload of any of the data directly to the blockchain.
Phase 3: Executing the artificial intelligence method. After preparing the dataset, we use Python to execute the training method for each client (part of the dataset) we have. This process is repeated for two, three, and four clients.
Phase 4: Investigation. In case of an attack, an investigation will be conducted to identify the individual responsible for the attack. Based on the investigation, a report that contains information on what actions took place, the time the action took place, who carried out the action, the status of the device, and the ID of the device is generated. The report is later uploaded to the blockchain in phase five.
Phase 5: Model aggregation. In this phase, the decision parameters of the models obtained from phase 3 and the reports obtained from phase 4 are aggregated via blockchain using the IoT gateway to make the blockchain lightweight.
Phase 6: Performance evaluation. The final phase analyzes the performance of the framework in terms of latency, gas consumption, consumption of resources, and preserving the privacy of the IoT environment. If the performance accuracy is less than expected, aggregation strategies found in phase five can be improved to meet the satisfaction level.

The phases of implementation in the IoT environment are shown in Figure 2.
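Phase 2 can be sketched in Python as follows. The function name `split_into_clients`, the shuffling step, and the near-equal partition sizes are assumptions for illustration; the source states only that the dataset is split into multiple parts, one per simulated client.

```python
import random


def split_into_clients(records, n_clients, seed=0):
    """Shuffle the records and deal them into n_clients disjoint parts."""
    if n_clients < 1:
        raise ValueError("n_clients must be at least 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    # Client i receives every n_clients-th record, starting at position i.
    return [shuffled[i::n_clients] for i in range(n_clients)]


# Placeholder record IDs stand in for dataset rows.
parts = split_into_clients(range(10), n_clients=3)
print([len(p) for p in parts])  # [4, 3, 3]
```

Each part is disjoint from the others, so every simulated client trains only on its own records.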

4.2. Multilayer Perceptron

Artificial Neural Networks (ANNs) are structured in a way that enables them to function similarly to the human brain [23]. ANNs are employed in many fields to generate generalizable models and have proved important in fields such as pattern recognition. The process of ANN training is concerned with finding the values of weights such that the output obtained is correlated to the input [24]. Multilayer perceptron (MLP) is one of the best-known and most powerful ANNs that implements a supervised training process using data with known inputs [23]. MLP is a layered neural network where data flows in a unidirectional manner from the input layer to the output layer while passing through the hidden layers in between. In MLP, each neuron’s connection with one another has its own weight, while perceptron in the same layer possesses the same activation function [23].

4.3. Dataset Description

The NSL-KDD dataset is derived from the KDDCUP 99 dataset, which has been widely used in network intrusion detection research since 1999. The KDDCUP 99 dataset includes data from a simulated military network environment containing network intrusion events [25]. However, this dataset has two major issues that affect performance evaluations. According to [26], the first issue associated with this dataset is the number of redundant records, which affects the learning algorithms negatively. The other issue is that many random parts of the train set were used in the test set, resulting in a classification rate of at least 86%.

Based on this, the authors suggested the newer NSL-KDD dataset, which solved the previously mentioned problems.

The new dataset has the following advantages:

It can eliminate the redundancy of records in the training set, hence the classifier will not be biased for frequent records.
It can eliminate the duplicated records in the testing set so the learners’ performance will not be biased towards the method with better detection rates.
The number of records chosen from each difficulty level is inversely proportional to the percentage of records from the KDD dataset.
The number of records from both the train and test sets is acceptable, making running experiments on the whole dataset affordable.

The dataset contains 42 attributes; the 42nd attribute is the label that classifies each network connection vector as normal or as one of four attack classes (DoS, Probe, R2L, and U2R). Table 2 shows the attack types and their correspondence with attack classes.

4.4. MLP Training Process

The MLP consists of an input layer, two hidden layers, and an output layer. The hidden layers have sizes of five and two, and the model is trained over 1000 epochs. The feedforward neural network uses the Rectified Linear Unit (ReLU) non-linear activation function. Table 3 shows the stored parameters for N clients on the blockchain after the MLP training process.
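To make the architecture concrete, the following sketch runs a forward pass through a network with hidden layers of five and two neurons and ReLU activations, and counts the trainable parameters. The 41 input features (42 attributes minus the label, before any encoding), the five output classes, and the random initialization are assumptions; the training step itself is omitted.

```python
import random


def relu(values):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer; weights[j][i] links input i to neuron j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (illustrative initialization only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(inputs, layers):
    """ReLU on the hidden layers; raw scores on the output layer."""
    for weights, biases in layers[:-1]:
        inputs = relu(dense(inputs, weights, biases))
    weights, biases = layers[-1]
    return dense(inputs, weights, biases)


def count_parameters(layers):
    """Total number of weights and biases across all layers."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


rng = random.Random(0)
sizes = [41, 5, 2, 5]  # assumed: 41 inputs, hidden layers of 5 and 2, 5 classes
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
scores = forward([0.1] * 41, layers)
print(len(scores), count_parameters(layers))  # 5 237
```

Under these assumptions, the model has only 237 parameters (210 + 12 + 15), which suggests why sending decision parameters rather than raw data keeps the on-chain payload small.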

Algorithm 1 describes the process for each client, starting from splitting the dataset into multiple parts to simulate the process of federated learning, then executing the MLP on each part of the dataset to extract the decision parameters that will be sent to the blockchain to execute the second algorithm.

Algorithm 1: MLP Execution

Input: dataset
Output: decision parameters

Split the dataset into multiple clients.
Each client reads its own part of the dataset.
Execute MLP within each client.
Print performance in terms of precision, accuracy, F1 score, and recall.
Extract decision parameters.
Send decision parameters to the smart contract.
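The client-side flow of Algorithm 1 can be sketched as follows. To keep the example short and self-contained, the trainer is a placeholder that predicts the majority label; it is explicitly not an MLP and only marks where MLP training would go. The function names, the payload fields, and evaluating on the client's own rows are all assumptions.

```python
import json
from collections import Counter


def train_placeholder(rows):
    """Stand-in for MLP training (NOT an MLP): learns the majority label."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return {"majority_label": majority}


def evaluate(params, rows):
    """Accuracy of the placeholder model on the given rows."""
    correct = sum(1 for _, label in rows if label == params["majority_label"])
    return correct / len(rows)


def run_client(client_id, rows, train_fn=train_placeholder):
    """Train locally, measure performance, and package only the parameters."""
    params = train_fn(rows)
    accuracy = evaluate(params, rows)
    # The payload carries parameters and metrics, never the raw rows.
    return json.dumps({"client": client_id, "accuracy": accuracy, "params": params})


# Hypothetical rows of (features, label) held by one client.
rows = [([0.1], "normal"), ([0.9], "DoS"), ([0.2], "normal")]
payload = json.loads(run_client("client-1", rows))
print(payload["params"])  # {'majority_label': 'normal'}
```

The point of the sketch is the data flow: the raw rows stay inside `run_client`, and only the serialized parameters and metrics would be passed on to the smart contract.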

Algorithm 2 describes how the smart contract handles the decision parameters by first calculating the average accuracy and checking whether the results are satisfactory before sending the updated decision parameters back to the clients. The smart contract will also create a report in case of an attack after receiving satisfactory results. However, if the results do not meet the satisfaction level, the aggregation strategies of the blockchain must be updated.

Algorithm 2: Smart Contract

Input: decision parameters
Output: report and updated decision parameters

Read MLP decision parameters.
Calculate the average accuracy for all clients.
Return updated decision parameters to clients.
If the results are satisfactory Then
Update decision parameters and create a report of the attack
Else
Improve aggregation strategies
End If
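The decision logic of Algorithm 2 can be illustrated off-chain in Python. This is a sketch of the control flow, not the deployed smart contract: the 0.97 threshold, the element-wise mean used as the aggregation rule, and the field names are assumptions, since the source does not specify them.

```python
def aggregate(client_updates, threshold=0.97):
    """Average client accuracy and parameters, then apply the acceptance check.

    client_updates: list of dicts with 'accuracy' (float) and
    'params' (list of floats of equal length).
    """
    if not client_updates:
        raise ValueError("no client updates received")
    n = len(client_updates)
    avg_accuracy = sum(u["accuracy"] for u in client_updates) / n
    # Assumed aggregation rule: element-wise mean of the flattened parameters.
    avg_params = [
        sum(values) / n for values in zip(*(u["params"] for u in client_updates))
    ]
    if avg_accuracy >= threshold:
        return {"status": "accepted", "avg_accuracy": avg_accuracy, "params": avg_params}
    # Otherwise, signal that the aggregation strategy should be improved.
    return {"status": "improve_aggregation", "avg_accuracy": avg_accuracy, "params": None}


# Hypothetical updates from two clients.
updates = [
    {"accuracy": 0.98, "params": [1.0, 2.0]},
    {"accuracy": 0.99, "params": [3.0, 4.0]},
]
result = aggregate(updates)
print(result["status"], result["params"])  # accepted [2.0, 3.0]
```

In an on-chain version, the same check would be executed by the contract, with the attack report created in the accepted branch.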

5. Experimental Study

This section describes the environment of the experiment while also showing the results.

The experiment is conducted as follows:

We split the dataset into multiple parts to simulate the federated learning process.
We used Python to execute MLP on each part of the dataset separately.
The smart contract receives the decision parameters via the gateway using Ethereum.
Using a smart contract, we check if the results are satisfactory.
In case of an attack, the smart contract will also create a report that includes details of the attack.

In this experiment, we used the NSL-KDD dataset to represent the attacks on the IoT environment. We also used Python 3.10 to execute the MLP code. To simulate the process of federated learning, we split the dataset into multiple parts and executed the same MLP code on each part, starting by simulating two clients to finally simulating four clients.

5.1. Environment

In Table 4, we give the environment setup used in our experiment study.

5.2. Federated Learning Results

We recorded the results of the experiment in terms of accuracy, precision, recall, and F1 score. These metrics are defined in the following equations, where TP stands for True Positive, FP stands for False Positive, TN stands for True Negative, and FN stands for False Negative.

The accuracy was recorded for each client individually, and the average accuracy was calculated afterward to show the overall accuracy of the experiment.

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Precision describes how precise the experiment was by determining how many of the predicted positives are actual positives.

\mathrm{Precision} = \frac{TP}{TP + FP}

Recall measures how many of the actual positives are captured by the model and is defined by the following equation:

\mathrm{Recall} = \frac{TP}{TP + FN}

The F1 score is used to find the balance between precision and recall. It is defined by the following equation:

\mathrm{F1\ Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
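
The four definitions above translate directly into code. The sketch below is illustrative: the function name and the confusion-matrix counts are made up for the example and are not results from the experiment.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts.

    Assumes tp + fp > 0 and tp + fn > 0; zero denominators are not handled.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that precision and recall share the numerator TP and differ only in whether false positives or false negatives appear in the denominator.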

The experiment was performed with two clients, then three clients, and finally four clients. The results are shown in Tables 5, 6, and 7 and in Figures 3, 4, and 5.

For two clients, the average accuracy, precision, recall, and F1 score are all above 97% (Table 5 and Figure 3). These results are high considering that privacy is preserved by implementing federated learning.

When we repeated the experiment using three clients, the average results remained above 97% in terms of accuracy, precision, recall, and F1 score (Table 6 and Figure 4).

Finally, we repeated the experiment by splitting the dataset among four clients. The results remained above 97% for every client (Table 7 and Figure 5).

The average accuracy was 98.03% with two clients, 97.88% with three clients, and 98.20% with four clients, which is considered high. The overall results remained high and promising across all clients. The use of federated learning in this experiment kept the raw data local without substantially affecting accuracy.

5.3. Ethereum Aggregation

The blockchain aggregation results focus on estimating the cost of deploying the smart contract and of aggregating the parameters. Every transaction on the blockchain consumes gas, and gas fees are expressed in gwei, a denomination of ether (1 gwei = 10^-9 ETH). The costs for N clients in this experiment are shown in Table 8.

During the cost calculation, we noticed that the cost of parameter aggregation increased in proportion to the number of clients. The cost comparison is shown in Figure 6.
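
The growth in cost can be checked with a short calculation over the values in Table 8. The variable and function names below are illustrative, and the conversion relies only on the standard relation 1 ETH = 10^9 gwei.

```python
# Execution costs copied from Table 8 (gwei).
DEPLOYMENT_COST = 4_324_313
AGGREGATION_COST = {2: 5_698_860, 3: 8_548_290, 4: 11_397_720}
GWEI_PER_ETH = 10**9


def gwei_to_eth(gwei):
    """Convert an amount in gwei to ether."""
    return gwei / GWEI_PER_ETH


# Aggregation cost divided by the number of clients, with the remainder
# kept to show that the division is exact.
per_client = {n: divmod(cost, n) for n, cost in AGGREGATION_COST.items()}

for n, (quotient, remainder) in per_client.items():
    total = DEPLOYMENT_COST + AGGREGATION_COST[n]
    print(f"{n} clients: {quotient} gwei per client "
          f"(remainder {remainder}), total with deployment "
          f"{gwei_to_eth(total):.9f} ETH")
```

The per-client cost is the same in all three rows, 2,849,430 gwei, so the aggregation cost in Table 8 grows linearly with the number of clients.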

5.4. Overall Comparisons

The overall results of the experiment address gaps in the existing literature on blockchain in digital forensics. One of the main issues that we focused on was providing security and privacy by preventing data sharing in the blockchain. From the literature review, we noticed that there was no efficient solution for this problem that also maintained decent performance. For example, the work in [3] was cost-effective but vulnerable to a 51% attack on the blockchain (Table 1). Our framework addresses this problem by integrating federated learning to train models locally and avoid uploading raw data to the blockchain, thus reducing the risk of leaking sensitive data and providing a trusted framework for sensitive data environments. Other frameworks presented in [19,20] used private blockchains, which are considered more secure but suffer from limited scalability. Our framework used a modified lightweight blockchain to improve scalability and reduce resource usage.

In terms of FL results, we compared our results with a centralized ML approach using the same dataset [27]. Figure 7 shows a comparison of the accuracy of centralized deep learning and federated learning on this dataset.

Compared with the highest centralized accuracy, our results show a slight decrease in accuracy of nearly 0.70%. The accuracy obtained in our experiment is still considered high, and the slight decrease is compensated for by the privacy gained using FL.

Privacy is supported in our model by combining these two technologies, and we believe that a secure model will encourage clients to adopt and use it in an IoT environment.

6. Conclusions

The IoT environment holds large amounts of sensitive data and is targeted by many cybersecurity threats. Based on this, a well-formed investigation is required to answer the 5WH questions: who committed the criminal activities, where the crime took place, what the criminal did, when the crime happened, why the criminal committed the crime, and how it happened. Blockchain technology is a promising solution for this issue; however, as the literature showed, it has disadvantages associated with privacy and performance. We proposed a solution to this problem by joining federated learning with a lightweight blockchain to provide better performance with less gas consumption while maintaining a decent level of security to prevent possible data leakage. First, we used MLP as the machine learning method and NSL-KDD as the dataset in our model. Second, we split the dataset to simulate federated learning. Finally, we performed aggregation of the decision parameters via blockchain. The results of our experiments were good in terms of federated learning accuracy, precision, recall, and F1 score. In terms of blockchain performance, we used a modified lightweight blockchain to improve scalability and reduce resource usage. In conclusion, our work provided a solution to the security problem identified in the literature while maintaining good performance.

In the future, our work can be improved by adopting different datasets and employing other learning techniques to improve the federated learning results. We can also try multiple aggregation techniques to ensure that the results always meet the satisfaction level. Differential privacy may also be adopted during the aggregation process.

Author Contributions

Conceptualization, W.A. and T.M.; methodology, W.A.; investigation, W.A.; resources, W.A.; data curation, W.A.; writing—original draft preparation, W.A.; writing—review and editing, T.M.; visualization, T.M.; supervision, T.M.; project administration, T.M.; funding acquisition, T.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Qassim University, represented by the Deanship of Scientific Research, grant number COC-2022-1-1-J-25803.

Data Availability Statement

The dataset used to support the findings of this study is: NSL-KDD|Datasets|Research|Canadian Institute for Cybersecurity|UNB. Available online: https://www.unb.ca/cic/datasets/nsl.html (accessed on 29 December 2022).

Acknowledgments

The authors gratefully acknowledge Qassim University, represented by the Deanship of Scientific Research, for the financial support for this research under the number (COC-2022-1-1-J-25803) during the academic year 1444 AH/2022 AD.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. IoT Devices Installed Base Worldwide 2015–2025|Statista. Available online: https://0-www-statista-com.brum.beds.ac.uk/statistics/471264/iot-number-of-connected-devices-worldwide/ (accessed on 29 December 2022).
2. Xu, L.; Jurcut, A.D.; Ranaweera, P. Introduction to IoT Security; Wiley: Hoboken, NJ, USA, 2019.
3. Li, S.; Qin, T.; Min, G. Blockchain-Based Digital Forensics Investigation Framework in the Internet of Things and Social Systems. IEEE Trans. Comput. Soc. Syst. 2019, 6, 1433–1441.
4. Hanggoro, D.; Sari, R.F. A Review of Lightweight Blockchain Technology Implementation to the Internet of Things. Available online: https://0-ieeexplore-ieee-org.brum.beds.ac.uk/abstract/document/9042431/ (accessed on 29 December 2022).
5. Lu, Y.; Huang, X.; Dai, Y.; Maharjan, S.; Zhang, Y. Blockchain and Federated Learning for Privacy-Preserved Data Sharing in Industrial IoT. IEEE Trans. Ind. Inform. 2020, 16, 4177–4186.
6. Truex, S.; Baracaldo, N.; Anwar, A.; Steinke, T.; Ludwig, H.; Zhang, R.; Zhou, Y. A Hybrid Approach to Privacy-Preserving Federated Learning. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, London, UK, 15 November 2019; pp. 1–11.
7. Stoyanova, M.; Nikoloudakis, Y.; Panagiotakis, S.; Pallis, E.; Markakis, E.K. A Survey on the Internet of Things (IoT) Forensics: Challenges, Approaches, and Open Issues; IEEE Communications Surveys and Tutorials; Institute of Electrical and Electronics Engineers Inc.: Interlaken, Switzerland, 2020; Volume 22, pp. 1191–1221.
8. Flaglien, A.O. The Digital Forensics Process. In Digital Forensics; Wiley: Hoboken, NJ, USA, 2017; pp. 13–49.
9. Prayudi, Y.; Sn, A. Digital Chain of Custody: State of The Art. Int. J. Comput. Appl. 2015, 114, 1–9.
10. Yang, Q.; Liu, Y.; Cheng, Y.; Kang, Y.; Chen, T.; Yu, H. Federated Learning. 2020. Available online: https://0-link-springer-com.brum.beds.ac.uk/book/10.1007/978-3-031-01585-4 (accessed on 29 December 2022).
11. Zou, Y.; Meng, T.; Zhang, P.; Zhang, W.; Li, H. Focus on blockchain: A comprehensive survey on academic and application. IEEE Access 2020, 8, 187182–187201.
12. Bhutta, M.N.M.; Khwaja, A.A.; Nadeem, A.; Ahmad, H.F.; Khan, M.K.; Hanif, M.A.; Song, H.; Alshamari, M.; Cao, Y. A Survey on Blockchain Technology: Evolution, Architecture and Security. IEEE Access 2021, 9, 61048–61073.
13. Vokerla, R.R.; Shanmugam, B.; Azam, S.; Karim, A.; De Boer, F.; Jonkman, M.; Faisal, F. An Overview of Blockchain Applications and Attacks. In Proceedings of the 2019 International Conference on Vision Towards Emerging Trends in Communication and Networking (ViTECoN), Vellore, India, 30–31 March 2019; pp. 1–6.
14. Panda, S.K.; Jena, A.K.; Swain, S.K.; Satapathy, S.C. Blockchain Technology: Applications and Challenges; Intelligent Systems Reference Library: Berlin, Germany, 2021.
15. Namasudra, S.; Deka, G.C.; Johri, P.; Hosseinpour, M.; Gandomi, A.H. The Revolution of Blockchain: State-of-the-Art and Research Challenges. Arch. Comput. Methods Eng. 2021, 28, 1497–1515.
16. Zhang, W.; Lu, Q.; Yu, Q.; Li, Z.; Liu, Y.; Lo, S.K.; Chen, S.; Xu, X.; Zhu, L. Blockchain-based Federated Learning for Device Failure Detection in Industrial IoT. IEEE Internet Things J. 2020, 8, 5926–5937.
17. Zhao, Y.; Zhao, J.; Jiang, L.; Tan, R.; Niyato, D.; Li, Z.; Lyu, L.; Liu, Y. Privacy-Preserving Blockchain-Based Federated Learning for IoT Devices. IEEE Internet Things J. 2020, 8, 1817–1829.
18. Fan, S.; Zhang, H.; Zeng, Y.; Cai, W. Hybrid Blockchain-Based Resource Trading System for Federated Learning in Edge Computing. IEEE Internet Things J. 2021, 8, 2252–2264.
19. Kumar, G.; Saha, R.; Lal, C.; Conti, M. Internet-of-Forensic (IoF): A blockchain based digital forensics framework for IoT applications. Future Gener. Comput. Syst. 2021, 120, 13–25.
20. Li, M.; Lal, C.; Conti, M.; Hu, D. LEChain: A blockchain-based lawful evidence management scheme for digital forensics. Future Gener. Comput. Syst. 2021, 115, 406–420.
21. Mercan, S.; Cebe, M.; Tekiner, E.; Akkaya, K.; Chang, M.; Uluagac, S. A cost-efficient iot forensics framework with blockchain. In Proceedings of the 2020 IEEE International Conference on Blockchain and Cryptocurrency (ICBC), Toronto, ON, Canada, 2–6 May 2020; pp. 1–5.
22. NSL-KDD|Datasets|Research|Canadian Institute for Cybersecurity|UNB. Available online: https://www.unb.ca/cic/datasets/nsl.html (accessed on 29 December 2022).
23. Olmedo, M.T.C.; Paegelow, M.; Mas, J.F.; Escobar, F. Geomatic Approaches for Modeling Land Change Scenarios. An Introduction; Springer International Publishing: Porto, Portugal, 2018.
24. Ramchoun, H.; Amine, M.; Idrissi, J.; Ghanou, Y.; Ettaouil, M. Multilayer Perceptron: Architecture Optimization and Training. Int. J. Interact. Multimed. Artif. Intell. 2016, 4, 26.
25. Meena, G.; Choudhary, R.R. A review paper on IDS classification using KDD 99 and NSL KDD dataset in WEKA. In Proceedings of the 2017 International Conference on Computer, Communications and Electronics (Comptelix), Jaipur, India, 1–2 July 2017.
26. Ghorbani, A.A.; Tavallaee, M.; Bagheri, E.; Lu, W. A detailed analysis of the KDD CUP 99 data set. In Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications, Ottawa, ON, Canada, 8–10 July 2009.
27. Yahyaoui, A.; Abdellatif, T.; Yangui, S.; Attia, R. READ-IoT: Reliable Event and Anomaly Detection Framework for the Internet of Things. IEEE Access 2021, 9, 24168–24186.

Figure 1. Blocks Connected in a Blockchain [14].

Figure 2. Proposed Framework.

Figure 3. Experimental results for two clients.

Figure 4. Experimental results for three clients.

Figure 5. Experimental results for four clients.

Figure 6. Comparison of experiment costs in gwei.

Figure 7. Centralized deep learning (READ-IoT) vs. federated learning.

Table 1. Summary of Related Work.

Ref	Year	Description	No. of Blockchains	Public/Private	Advantages	Limitations
[3]	2019	- IoTFC creates timestamps for each block. - Consists of users, IoT devices, Merkle tree, blocks, and smart contracts. - 5 levels of evidence-gathering depending on the difficulty.	One	Public	- Classification of evidence based on its relation to the case. - Cost effective.	- Vulnerable to 51% attack. - No experiments to test the performance.
[19]	2021	- Edge-IoF layer consists of IoT devices that can provide evidence. - Fog-IoF includes fog devices and forensics tools. - Consortium-IoF layer is responsible for building the blockchain in forensics investigations. - Cloud Storage layer connected to the blockchain is used to store data.	Two	Private	- Average latency is 7.07%. - Average throughput of 4.5%. - Less CPU and gas consumption.	- Increased memory usage.
[20]	2021	- Addressed the internal threats. - Consists of victim or a witness, an investigator, an analyzer, a judge, and a trusted authority (TA). - Provides privacy for the witness, jury, and data.	One	Private	- Less than 1 s to process data. - Average of 2.5 ms latency. - Nullify unwanted evidence.	- Insufficient scalability.
[21]	2020	- EOS blockchain that can ensure good efficiency and consume less energy. - Stellar blockchain that can provide scalability. - Ethereum blockchain to store the data permanently.	Three	Public	- Provides low-cost solution that will stand against the 51% attack.	- The framework was not tested in an IoT environment.

Table 2. Attack Classes Associated with Attack Types.

Class	Type
DoS	Worms, smurf, pod
Probe	IP sweep, port sweep, Nmap
R2L	Send mail, guess the password, snoop
U2R	Rootkit, buffer overflow, load module

Table 3. MLP parameters for N clients, where N = [2, 3, 4].

Name	Description	Size	Data Type
coefs_	Neuron’s inputs weight in three layers (Two input layers and the output layer)	5 × 39 × N; 2 × 5; 1 × 2	Decimal matrices
intercepts_	Biases of each neuron in three layers	1 × 5; 1 × 2; 1 × 1	Decimal arrays

Table 4. Environment setup.

Item	Specification
OS	Windows 10
CPU	Intel Core i7-8700K @ 3.7 GHz
RAM	16 GB
Ganache	v2.5.4
MetaMask	10.18.0
Python	v3.10

Table 5. Experimental results for two clients.

	Accuracy	Precision	Recall	F1 Score
Client 1	98.57%	98.22%	98.75%	98.48%
Client 2	97.49%	98.35%	96.18%	97.25%
Average	98.03%	98.28%	97.47%	97.87%

Table 6. Experimental results for three clients.

	Accuracy	Precision	Recall	F1 Score
Client 1	97.36%	98.69%	95.63%	97.14%
Client 2	97.81%	97.06%	98.25%	97.65%
Client 3	98.48%	97.82%	98.92%	98.37%
Average	97.88%	97.86%	97.60%	97.72%

Table 7. Experimental results for four clients.

	Accuracy	Precision	Recall	F1 Score
Client 1	98.32%	98.04%	98.37%	98.21%
Client 2	98.35%	98.11%	98.38%	98.25%
Client 3	98.22%	97.93%	98.20%	98.06%
Client 4	97.90%	98.08%	97.41%	97.74%
Average	98.20%	98.04%	98.09%	98.06%

Table 8. Experimental costs (in gwei) for N clients.

Procedure	Execution Cost (Gwei)
Smart Contract Deployment	4,324,313
Parameters Aggregation for 2 Clients	5,698,860
Parameters Aggregation for 3 Clients	8,548,290
Parameters Aggregation for 4 Clients	11,397,720

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
