1. Introduction
The Internet of Things (IoT) combines the data-gathering capacity of sensors and smart devices with the power of cloud computing and data analytics. This data communication drives understanding and interactions between people and products in specific environments. Research from the 2017 World Economic Forum, titled “Technology and Innovation for the Future of Production”, identifies five technologies that will predominate in the years to come: the Internet of Things, Artificial Intelligence, Advanced Robotics, Wearables and 3D printing [1]. The IoT has already begun to show, and will continue to show, a great impact on many fields, such as smart homes, smart cities, healthcare, autonomous cars, smart grids, smart retail, industrial automation, inventory management, and quality control.
One of the most promising IoT applications is healthcare [2]. Networked healthcare devices create an Internet of Healthcare Things, which is aimed at health monitoring and preventive care to create better conditions for patients who require constant medical supervision and/or preventive intervention. However, applications of new technologies often bring certain risks, including failures of components, devices, and infrastructure, which may cause disastrous results for patients. Hence, to minimize such risks and assure the required system availability, techniques for modeling, evaluating, and improving the performance of healthcare IoT systems are worth studying.
Several research works in the literature address modeling and performance evaluation techniques for IoT systems [3,4,5,6,7,8,9,10,11,12,13,14]; they are summarized as follows. In [3], the authors surveyed advances in IoT-based healthcare technologies and reviewed the state-of-the-art network architectures/platforms, applications, and industrial trends in IoT-based healthcare solutions. In [4], the economic, technological, security and application aspects of applying IoT in e-health were discussed, and two e-health solutions based on Raspberry Pi components were presented on an affordable prototyping platform. In [5], a framework for security modeling and assessment of the IoT was proposed to construct graphical security models for the IoT. The benefits of the framework were presented via a study of two examples of IoT networks. In [6], reliability and security issues of an IoT-based smart business center (SBC) network were discussed and a Markov model was developed to show a means of protection against hacker attacks with a high degree of security. In [7], the impact of factors affecting the performance of IoT networks was analyzed using simulation-based models, and an analytical framework was developed to model the impact of individual node behavior on overall performance using Markov chains.
In [8], a healthcare IoT infrastructure was briefly described and a case study of the considered system was modeled using queueing theory. In [9], several types of queueing models were presented to represent different quality of service (QoS) settings of IoT interactions, such as intermittent mobile connectivity, message drop probabilities, message availability/validity and resource-constrained devices. The models were simulated using a simulator called MobileJINQS, and the results demonstrated a significant effect on response times and message success rates when varying QoS settings. In [10], a theoretical approach to performance evaluation for IoT services was proposed to provide a mathematical prediction of performance metrics at the design phase, before system implementation; its effectiveness was validated by simulation experiments based on real-world data. In [11], a simulation model was presented for an IoT network mediator to study the capacity of the mediator through performance analysis of the traffic generated by devices connected via the IoT network. The simulation model is based on both discrete event and Random Waypoint simulations.
In [12], a Markov model was proposed for a healthcare IoT infrastructure that takes safety and security issues into account. The model considers basic states of the IoT system, including the normal state, different attacking states, and failure states with constant failure/attack and recovery/repulse rates. The steady-state probabilities and availability function were obtained through simulations, though no analytical solution was provided. In [13], a systematic review of studies from the past five years was conducted to present the current advancements in wearable sensors and IoT-based monitoring applications to support independent living for older adults. The investigation found that most studies focused on the system aspects of wearable sensors and IoT monitoring solutions, including advanced sensors, wireless data collection, communication platforms and usability. Another recent review in [14] studied the state-of-the-art works in IoT-based distributed healthcare systems, where all available medical resources are interconnected to provide effective and efficient healthcare services to those in need of medical assistance. Based on the study, a taxonomy of these systems was proposed considering various aspects, such as monitoring methods, communication technologies, computing techniques and low-power protocols.
In this paper, we propose an availability model of a healthcare IoT system that consists of two groups of structures with component failure events incorporated. The two groups of structures are described by separate Markov state-space models and integrated to implement the whole IoT system modeling. The system steady-state probabilities are solved recursively for a general number of biosensors. An explicit solution of the system is also provided given a specific number of biosensors. Based on the analytical solution, we derive some performance metrics of interest. We present a numerical evaluation of selected metrics with detailed analysis of the obtained results. We also propose an availability performance improving (API) method for increasing the probability of system full service and decreasing the system unavailability.
The remainder of the paper is organized as follows. Section 2 describes a typical healthcare IoT system infrastructure. Section 3 develops the Markov models of individual groups and the whole IoT system. Section 4 derives some system performance metrics of interest. Section 5 proposes the API method with three possible schemes. Section 6 presents a detailed numerical evaluation. Finally, the paper is concluded in Section 7.
2. A Healthcare IoT System Infrastructure
The main components of the healthcare IoT system include a wireless body area network (WBAN) [15] consisting of sensor nodes and a portable gateway device, cloud server(s), and healthcare providers, as shown in Figure 1. A brief description of each component is given as follows.
The WBAN consists of a number of biosensors installed on different parts of the human body and the portable gateway device. The biosensors are used to sense physiological data and send them to the portable gateway device in an appropriate format. There are different types of medical sensors and sensor devices, including a heart rate monitoring sensor (monitoring heart rate on a real-time basis), body temperature sensor (monitoring body temperature), blood pressure (BP) sensor (collecting BP data regularly), electroencephalogram (EEG) sensor module (detecting minute electrical activity of brain cells), oxygen saturation monitoring sensor (calculating the pulse rate and SpO2), electrocardiogram (ECG) sensor module (using a sensor array to record the electrical activity generated by the heart muscle and send the converted signals to a computing device for display), global positioning system (GPS) sensor, accelerometer sensor (capturing the intensity of physical activity for human movement by attaching the sensor to a person’s wrist or ankle or to a person’s waist with a belt clip), electromyography (EMG) sensor (diagnosing a wide variety of neuromuscular diseases, motor neuron problems, nerve injuries, or degenerative conditions), cough detection sensor (a built-in microphone audio system in a sensor module), diabetes sensor (a non-invasive opto-physiological sensor to track blood glucose), and pedometer sensor (recording the number of steps the wearer takes). Data collected by biosensors are transmitted to the portable gateway device using, e.g., Bluetooth or ZigBee [16] protocols.
The gateway device is designed to collect data from the biosensors and send the data to the cloud servers; it also monitors the sensor status, changes settings, and updates software. The gateway connects to the cloud servers through WiFi or cellular wireless technologies and to the sensor nodes through ZigBee or Bluetooth technology. The gateway can communicate with multiple sensor nodes simultaneously.
Cloud servers are virtual servers running in a cloud computing environment over the Internet. They provide a web-based information infrastructure, including computing power, storage space and network technology. Health providers can easily process, exchange, secure and manage data in either web-based or location-independent styles. Health providers use an infrastructure-as-a-service (IaaS) [17] model to process workloads and store information. They perform comprehensive data analysis and send commands, such as software updates, from the cloud to the gateway device or, through the gateway, to specific sensor nodes. Health providers can also generate an analytic report on the patient’s illness status, send it to the patient via email, and advise the patient to make a doctor’s appointment if needed.
The communication between the sensor nodes and the gateway device uses ZigBee technology because of its low power consumption and long battery life. ZigBee is a low-power wireless specification that uses physical (PHY) and media access control (MAC) layers based on the IEEE 802.15.4 standard [18]. The ZigBee technology operates in the 2.4 GHz ISM (industrial, scientific and medical) band. The protocol allows sensor nodes to communicate in a variety of network topologies and the battery life to last for a long time. The communication between the gateway and cloud servers uses WiFi/cellular wireless technologies, which provide a longer connection distance and more reliable communication links. The communication between the cloud servers and the healthcare providers uses wireless (e.g., WiFi or cellular technologies) or wired technologies (e.g., fiber-optic cable or copper cable).
For an operational IoT system with reliable data transmission, all three segments of communication links must be working: the links between the sensors and the gateway device; the link between the gateway and the cloud; and the link between the cloud and the healthcare providers, as shown in Figure 1.
Link failures are likely to occur, especially in the WBAN, because of the resource constraints of biosensors (e.g., limited power, limited signal transmission range) and the limited area of the network. Many factors can cause wireless connection failures in the WBAN. Sensor nodes in a WBAN consist of hardware components integrated into a small module, such as the transceiver, memory and microcontroller, as well as software components, such as the application programs and MAC protocols. When a sensor node has insufficient battery power, it will fail to collect and send data. A sensor node may malfunction from a physical hardware defect due to an aging hardware component. Similarly, a sensor node may fail to offer the desired functions due to a software fault such as a software bug. Body motion can cause frequent changes in the network topology. When the sensor signals are affected by reflection, diffraction, and shadowing due to rapid body movement, body structure and posture, the resulting channel fading or channel impairment adversely affects signal propagation. When the patient moves to a place with excessive noise, the wireless channel may also be subject to failure. Many other factors can also cause connection failures.
Similarly, the communication links from the gateway to the cloud and from the cloud to the healthcare providers can also be prone to failure due to the inherent vulnerability of wireless channels. In general, different link connection failures may need different recovery times. In the following analysis, we will assume different link failure arrival rates and link recovery times for different segments of communication channels.
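As a rough illustration of why every segment matters, the sketch below computes the steady-state availability of a single repairable link, μ/(λ + μ), and multiplies the segment availabilities for a chain in which all segments must be up. It assumes that each link alternates between up and down with constant failure and recovery rates and that the segments fail independently. Both are simplifying assumptions made here, not the Markov model developed in this paper, and the function names and rate values are hypothetical.

```python
from typing import Iterable


def link_availability(failure_rate: float, recovery_rate: float) -> float:
    """Steady-state availability of a two-state (up/down) link."""
    return recovery_rate / (failure_rate + recovery_rate)


def series_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain that needs every segment up (independence assumed)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical rates per hour, chosen only for illustration.
sensor_link = link_availability(failure_rate=0.01, recovery_rate=0.5)
gc_link = link_availability(failure_rate=0.02, recovery_rate=1.0)
ch_link = link_availability(failure_rate=0.02, recovery_rate=1.0)
end_to_end = series_availability([sensor_link, gc_link, ch_link])
```

Under these assumptions the end-to-end availability is always lower than that of the weakest segment, which is why the recovery time of each segment is treated as a separate parameter.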
4. System Performance Metrics
Now, we derive some performance metrics of interest after obtaining the system steady-state probabilities. Let us define the following four events:
E1 = (G1_UP, G2_UP), E2 = (G1_DG, G2_UP), E3 = (G1_DN, G2_UP), E4 = (G1_XX, G2_DN),
where E1 represents the event that the IoT system is fully up, where G1_UP and G2_UP denote that both the Group 1 and Group 2 structures are in the Up state; E2 is the event that the system is partially up (degraded service), where G1_DG denotes that the Group 1 structure provides degraded service; E3 represents the event that the system cannot provide service due to the connection failure between all sensors and the gateway (which is referred to as SG link failure), where (G1_DN, G2_UP) denotes that the Group 1 structure is in the Down state while the Group 2 structure is in the Up state; and E4 represents the event that the system is down due to the cloud-side (CS) connection failure in the Group 2 series structure including GC and CH links (which is referred to as CS link failure), where (G1_XX, G2_DN) denotes that the Group 1 structure may be in any state while the Group 2 structure is in the Down state (i.e., the failure of the Group 2 structure directly causes the failure of the IoT system). Thus, the probability that the IoT system provides full service is:
The probability that the system provides degraded service is (here, degraded service means that j working sensors are available, 1 ≤ j ≤ N − 1):
The probability that the system cannot provide service due to SG link failure is:
The probability that the system cannot provide service due to CS link failure is:
The IoT system unavailability, denoted by U, is the total probability that the system is down.
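The four event probabilities and U can be read off the steady-state distribution once it is available. The sketch below assumes the state labeling used in Section 5, where state (0, j) has the CS links up with j working sensors and state (1, j) has the CS link down. The dictionary `pi`, the function name, and the sample numbers are hypothetical and are not results from this paper.

```python
from typing import Dict, Tuple


def event_probabilities(pi: Dict[Tuple[int, int], float], n: int) -> Dict[str, float]:
    """Map steady-state probabilities pi[(i, j)] to P(E1)..P(E4) and U."""
    p_e1 = pi[(0, n)]                                 # all n sensors up, CS up
    p_e2 = sum(pi[(0, j)] for j in range(1, n))       # 1..n-1 sensors up, CS up
    p_e3 = pi[(0, 0)]                                 # no sensor up, CS up
    p_e4 = sum(pi[(1, j)] for j in range(0, n + 1))   # CS down, any sensor state
    return {"E1": p_e1, "E2": p_e2, "E3": p_e3, "E4": p_e4, "U": p_e3 + p_e4}


# Hypothetical distribution for n = 2 sensors (values sum to 1).
sample_pi = {
    (0, 2): 0.80, (0, 1): 0.10, (0, 0): 0.02,
    (1, 0): 0.01, (1, 1): 0.02, (1, 2): 0.05,
}
metrics = event_probabilities(sample_pi, n=2)
```

Because the four events partition the state space, P(E1) + P(E2) + P(E3) + P(E4) = 1, which gives a quick consistency check on any computed distribution.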
The frequency of an event [20] is defined by the product of the transition rates departing from the event and the state probabilities, or the product of the transition rates arriving at the event and their respective state probabilities where these transitions start. Thus, the frequencies of the events E1, E2, E3, and E4 are derived as follows:
The mean duration of an event is obtained by the ratio of the event probability to the event frequency. The mean durations of events E1, E2, and E3 are:
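The frequency and duration definitions can be sketched for any continuous-time Markov chain. The function below sums, over the transitions that leave an event, the probability of the source state times the transition rate, and the mean duration divides the event probability by that frequency. The two-state up/down chain and its rates are hypothetical stand-ins used only to check the arithmetic; they are not the IoT model itself.

```python
from typing import Dict, Hashable, Set, Tuple

State = Hashable


def event_frequency(pi: Dict[State, float],
                    rates: Dict[Tuple[State, State], float],
                    event: Set[State]) -> float:
    """Sum of pi[src] * rate over transitions that leave the event."""
    freq = 0.0
    for (src, dst), rate in rates.items():
        if src in event and dst not in event:
            freq += pi[src] * rate
    return freq


def mean_duration(pi: Dict[State, float],
                  rates: Dict[Tuple[State, State], float],
                  event: Set[State]) -> float:
    """Event probability divided by event frequency."""
    prob = sum(pi[s] for s in event)
    return prob / event_frequency(pi, rates, event)


# Hypothetical two-state chain: failure rate lam, recovery rate mu (per hour).
lam, mu = 0.1, 0.5
pi = {"up": mu / (lam + mu), "down": lam / (lam + mu)}
rates = {("up", "down"): lam, ("down", "up"): mu}

freq_up = event_frequency(pi, rates, {"up"})
dur_up = mean_duration(pi, rates, {"up"})  # equals 1 / lam for this chain
```

In steady state the frequency of leaving an event equals the frequency of entering it, so computing the frequency from departing or from arriving transitions gives the same value.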
5. Improving Availability of the IoT System
In this section, we discuss how to improve the availability of the IoT system. In general, one way is to reduce the failure rate of the SG, GC and CH links; the other is to reduce the recovery time of these links. The link failure rate is often not controllable; however, the link recovery rate can be improved in several ways. For example, if the failure of a biosensor in a WBAN could be found earlier via software/hardware detection or alarm mechanisms, then the corresponding sensor recovery time could be reduced. As another example, if the failure of a communication link is detected, an emergency recovery mechanism may be used instead of the regular link recovery mechanism to achieve a shorter link recovery time. Of course, achieving this goal would require more human, technical and equipment resources, which will not be discussed here. In the following, our main focus is to study the performance-improving method for the IoT system, given that an emergency link recovery mechanism is provided.
In Figure 5, for a given state (0, j), 0 ≤ j ≤ N − 1, the number of failed SG links (or failed biosensors) is N − j and the recovery rate is (N − j)μ1. For the state (1, j), 0 ≤ j ≤ N, the CS link fails. Normally, either a failed sensor or a failed CS link triggers a respective regular recovery process with its corresponding rate. Now, instead of a regular recovery process, an emergency recovery process can be defined to improve performance (at the cost of more resources), which leads to a larger recovery rate (or a shorter recovery time). Assume that when an emergency link recovery mechanism is triggered, a link recovery rate β times larger can be achieved. The factor β can be adjusted depending on the system requirements and the resources invested in the system. Based on this idea, we propose an availability performance improving (API) method for the IoT system, given that an emergency link recovery triggering mechanism is provided. The API method is described as follows:
Two conditions may trigger an emergency link recovery mechanism and achieve the availability performance improving (API). Condition 1 is when the number of failed biosensors in the system reaches an integer threshold, K, 1 ≤ K ≤ N; Condition 2 is when the CS link failure event happens. When an emergency link recovery mechanism is triggered, a link recovery rate β times larger is applied to the failed links.
In the following numerical evaluation, we will present three schemes to validate the proposed API method with the threshold K set to 50% of N.
Scheme 1: Both Condition 1 and Condition 2 are required to trigger the emergency recovery mechanism.
Scheme 2: Only Condition 2 is required to trigger the emergency recovery mechanism.
Scheme 3: Either Condition 1 or Condition 2 is required to trigger the emergency recovery mechanism.
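The trigger logic of the three schemes can be sketched as a small predicate. This is one reading of the conditions above: Condition 1 is taken to hold when the number of failed biosensors is at least K, and the function and parameter names are hypothetical.

```python
def emergency_triggered(scheme: int, failed_sensors: int,
                        cs_link_failed: bool, k: int) -> bool:
    """Return True if the given API scheme triggers emergency recovery."""
    cond1 = failed_sensors >= k   # Condition 1: failed sensors reach threshold K
    cond2 = cs_link_failed        # Condition 2: CS link failure
    if scheme == 1:
        return cond1 and cond2
    if scheme == 2:
        return cond2
    if scheme == 3:
        return cond1 or cond2
    raise ValueError("scheme must be 1, 2, or 3")


def effective_recovery_rate(regular_rate: float, beta: float,
                            triggered: bool) -> float:
    """Recovery rate is scaled by beta while emergency recovery is active."""
    return beta * regular_rate if triggered else regular_rate


# Example with N = 4 sensors and K = 2: two sensors failed, CS link still up.
decisions = {s: emergency_triggered(s, failed_sensors=2, cs_link_failed=False, k=2)
             for s in (1, 2, 3)}
```

In this example only Scheme 3 triggers the emergency mechanism, which illustrates why Scheme 3 is the most easily triggered and Scheme 1 the least.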
6. Numerical Evaluation
In this section, we present a numerical evaluation of the IoT system by studying selected metrics with respect to various parameters, particularly the individual sensor failure rate and recovery time as well as the failure rates of the GC link and CH link. The typical parameter settings for the numerical evaluation are given in Table 1. Other values of related parameters are set separately in the figures to study the performance of the relevant metrics.
Figure 6 shows the probability of the full service of the system P(E1) with the change of biosensor failure rate λ1 and other parameters. As expected, when λ1 is increased, the probability P(E1) decreases. An increase in sensor failure rate causes more sensors to leave the full service state. Similarly, an increase in sensor recovery time (i.e., a decrease in μ1) delays the return to the full service state, leading to a lower value of P(E1). It can also be observed that a larger value of the GC/CH link failure rate (e.g., 20 failures/month) decreases P(E1) more than a smaller one, since GC/CH link failures directly cause the system failure.
Figure 7 shows the probability of the degraded service of the system P(E2) with the change of biosensor failure rate λ1 and other parameters. We observed that when λ1 is increased, the probability P(E2) increases. An increase in sensor failure rate causes more degraded service states to be reached. Similarly, an increase in sensor recovery time leads to a longer time spent in degraded service states. We also observed that a larger value of the GC/CH link failure rate decreases the probability P(E2), as more degraded service states are left.
Figure 8 shows the probability of no service due to SG link failure P(E3) with the change of biosensor failure rate λ1 and other parameters. We observed that P(E3) increases when either λ1 or 1/μ1 is increased; the reasoning given for Figure 7 applies here as well, since the event E3 is a special case of E2 with the state (0, j) replaced by (0, 0).
Figure 9 shows the probability of no service due to CS link failure P(E4) with the change of biosensor failure rate λ1 and other parameters. As expected, P(E4) increases when the GC/CH link failure rate is increased. We observed that when the sensor recovery time is increased, P(E4) increases, since the equivalent recovery time 1/μ(j) is state dependent at state (1, j), 0 ≤ j ≤ N − 1. The larger the sensor recovery time, the larger the equivalent recovery time required, leading to an increase in P(E4). We also observed that when the sensor failure rate is increased, P(E4) tends to decrease slightly. This may be explained by the fact that an increase in sensor failure rate increases the chance of leaving the no service states (1, j) (to maintain the system traffic balance).
Figure 10 shows the system unavailability U with the change of biosensor failure rate λ1 and other parameters. We observed that U increases when the sensor recovery time or the GC/CH link failure rate is increased. A longer sensor recovery time increases the time spent in unavailable states and also increases the equivalent recovery time 1/μ(j), while a GC/CH link failure event directly causes the system to go down completely.
Next, we evaluate the API method using the three schemes compared with the benchmark (which does not apply any API scheme). The settings of the benchmark and the API schemes are as follows: 1/μ1 = 2 h, β = 2 (other values can also be used depending on the emergency recovery capability of the system), λgc = 20 failures/week, and λch = 20 failures/week. Other parameters are shown in Table 1.
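To use these settings together in a rate-based model, the quantities need a common time unit. The sketch below converts them to rates per hour, assuming a 168-hour week; the variable names are hypothetical.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours

# Sensor recovery: mean recovery time of 2 h gives a rate of 0.5 per hour.
mean_sensor_recovery_hours = 2.0
mu1 = 1.0 / mean_sensor_recovery_hours

# Emergency recovery multiplies the recovery rate by beta.
beta = 2.0
mu1_emergency = beta * mu1                    # 1.0 per hour
mean_emergency_recovery_hours = 1.0 / mu1_emergency

# GC and CH link failure rates: 20 failures/week converted to failures/hour.
lambda_gc = 20 / HOURS_PER_WEEK
lambda_ch = 20 / HOURS_PER_WEEK
```

With β = 2, the emergency mechanism halves the mean sensor recovery time from 2 h to 1 h.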
Figure 11 shows the probability of the full service of the system P(E1) with different API schemes. As expected, P(E1) improves the most using Scheme 3 among the three API schemes and the least using Scheme 1. In Scheme 1, triggering the emergency link recovery mechanism requires the simultaneous occurrence of a CS link failure and the failure of 50% of the N sensors, while in Scheme 3 the trigger condition becomes either the former or the latter. Scheme 2 lies between Schemes 1 and 3 based on its triggering condition. Note that, in Figure 11, Scheme 1 only slightly improves the system performance of full service, which is because the total number of sensors N is small (N = 4) in our evaluation; for a larger value of N, the performance difference will be more significant.
Figure 12 shows the probability of the degraded service of the system P(E2) with different API schemes. We generally observed that the three API schemes keep a higher P(E2) than the benchmark. We also observed that Scheme 3 has a lower value of P(E2) than Scheme 2; this is because Scheme 3 triggers the emergency recovery mechanism more easily than the other two, causing more degraded service states to transition to the full service state.
Figure 13 shows the probability of no service due to SG link failure P(E3) with different API schemes and the benchmark case. Similar to Figure 12, we observed that Scheme 3 has a lower value of P(E3) than Schemes 1 and 2, as E3 is a special case of E2. We also observed that Schemes 1 and 2 even have a slightly higher P(E3) than the benchmark case; this may be because of the impact of one of the triggering conditions (i.e., the CS link failure).
Figure 14 shows the probability of no service due to CS link failure P(E4) with different API schemes. We observed that all three schemes perform better than the benchmark case, and Scheme 3 is better than Scheme 2, which is better than Scheme 1. Finally, Figure 15 shows the system unavailability U, which follows a trend similar to that in Figure 14. We observed that U is lower under the three API schemes than under the benchmark case. Similar to Figure 14, Scheme 3 achieves a lower system unavailability than Scheme 2, which in turn achieves a lower unavailability than Scheme 1.