Article

Classical Failure Modes and Effects Analysis in the Context of Smart Grid Cyber-Physical Systems

by Andrés A. Zúñiga 1, Alexandre Baleia 2, João Fernandes 1 and Paulo Jose Da Costa Branco 1,*
1 Institute of Mechanical Engineering, Associated Laboratory for Energy, Transports, and Aeronautics, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001 Lisboa, Portugal
2 Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001 Lisboa, Portugal
* Author to whom correspondence should be addressed.
Submission received: 15 February 2020 / Revised: 29 February 2020 / Accepted: 2 March 2020 / Published: 6 March 2020
(This article belongs to the Special Issue Cyber-Physical Systems for Smart Grids)

Abstract

Reliability assessment in traditional power distribution systems has played a key role in power system planning, design, and operation. Recently, new information and communication technologies have been introduced in power systems automation and asset management, making the distribution network even more complex. In order to achieve efficient energy management, the distribution grid has to adopt a new configuration and operational conditions that are changing the paradigm of the current electrical system. Therefore, the emergence of the cyber-physical systems concept to face future energy needs requires alternative approaches for evaluating the reliability of modern distribution systems, especially in the smart grid environment. In this paper, a reliability approach that makes use of the failure modes of the main power and cyber network components is proposed to perform risk analysis in smart electrical distribution systems. We introduce the application of the Failure Modes and Effects Analysis (FMEA) method in future smart grid systems in order to establish the impact of different failure modes on their performance. A smart grid test system is defined, and failure modes and their effects for both the power and the cyber components are presented. Preventive maintenance tasks are proposed and systematized to minimize the impact of high-risk failures and increase reliability.

1. Introduction

Electric energy plays a crucial role in today’s society. It is the most versatile and easily controlled form of energy and it is involved in almost all aspects of society’s daily routine.
In recent years, several new challenges have been emerging due to the expansion of renewable energy sources (intermittent sources) in the electrical grid, due to the electrification of new industrial sectors and due to the new huge volume of online data generated from electrical systems. Moreover, in the future smart grids, it is expected that energy becomes available everywhere from dispersed sources associated with the growth of mobile loads and the increasing number of energy storage equipment [1,2]. With this, new technological functionalities are required to provide energy management in a more reliable, effective and secure way.
The conventional electric grid is a passive and rigid grid characterized by predictable power flow directions, conventional energy sources, and expected load profiles. In contrast, a smart grid can be described as an active grid, with constant fluctuations due to the intermittent operation of renewable energy sources such as solar or wind, unexpected load profiles, and unpredictable power flow directions, which make the grid more dynamic. Consumers’ participation in demand response and in electricity markets is also expected to play an important role in energy efficiency [2,3]. However, many new problems are arising, such as:
  • The increased complexity of the electrical system, creating a considerable number of barriers to the development of such systems, regarding technical and non-technical challenges;
  • The correct operation of every single piece of equipment in a smart grid will be directly or, at least, indirectly dependent on the correct behavior of other equipment;
  • The need to endow the grid with “intelligence” and robustness due to the interdependence between cyber and power systems. A malfunction of even minor equipment can endanger the efficiency and reliability of the electrical grid.
In this framework, the lack of adequate control and monitoring/management strategies can lead to a power outage in parts of the grid. In addition, if the smart grid is not capable of handling cyber threats, a partial outage of the system may occur.
The increased complexity that results from the interconnection and interoperability between different types of systems is an additional reliability aspect to consider in the smart grid context, and it motivates the exploration of new approaches to assess the smart grid operational reliability. In this context, Reliability Centered Maintenance (RCM) has been a solution proposed in recent years and applied to electric power systems.
In [4], a four-step “rational RCM” approach is applied to a distribution system, where the key is to combine a risk assessment (using FMECA) and a dependability analysis to optimize maintenance in the RCM context. In [5], an RCM methodology that provides a quantitative relationship between the preventive maintenance (PM) of assets and the total cost of maintenance is applied to distribution systems. In [6], an RCM approach is proposed to minimize the total customer interruption cost and the total energy not supplied cost in a distribution feeder. In [7], an RCM approach that considers the hidden functions of the protection system is applied to a distance protection system. In [8], a hybrid methodology that combines quantitative and qualitative analysis is used to identify the Maintenance Significant Items in an RCM context. In [9], a game theory method is applied in the RCM context to assess the contribution of every single piece of equipment to the whole system reliability under different loading conditions and high-order contingencies. In [10], a weighted importance reliability index is proposed to prioritize the elements of an electric distribution system in an RCM framework. In [11], the lifetime of the main components involved in a smart distribution grid is analyzed and a probability model for the lifetime is proposed. In [12], a quantitative maintenance optimization to improve RCM in power systems with renewable energy is proposed. In [13], variance-based fairness indices are proposed to summarize the inequality and inequity between power system stakeholders in order to measure the impact of reliability actions. In [14], an analytical method to quantify the effect of cyber faults on reliability is proposed. In [15], a quantitative method to analyze the impact of cyber equipment and communication quality in a cyber-physical power distribution system is proposed. RCM combines reliability evaluation and maintenance planning [5,11,12,13,14,15].
When considering reliability assessment tools, RCM arises as one of the most important methods. The benefits of an RCM approach far exceed those of any other type of maintenance program, and it has been widely used by the aircraft, oil and gas, spacecraft, and nuclear industries [16]. However, the use of RCM is relatively new in industries outside of these areas [4,5,6,7,8,9,10,16]. RCM strategies and their impact on electrical power systems are not yet well established due to the lack of evidence and current applications.
As an important part of any RCM evaluation, the Failure Modes and Effects Analysis (FMEA) method is a structured method used to define, identify, and eliminate known and/or potential failures, problems, and errors in the system, process, and/or service [17,18,19,20]. FMEA can be considered a team-oriented engineering analysis that draws on personnel’s knowledge and experience and is generally supported by drawings, diagrams, charts, and failure statistics, among other useful information; the intellect and ideas of the expert team are a necessary element for meaningful discussion and successful outcomes [21].
It has been proven that FMEA is a good tool for exhaustively identifying and recording the local effects that arise from component failures, and thus for inferring the effects of those failures at the system level. Nowadays, in the context of smart grids, which combine a large number of components of different types and functions, FMEA can help identify the failure modes that cause critical component failures and, consequently, lead to a critical system failure.
Among the failure modes that can be identified using FMEA are the technical and non-technical threats to smart grid security: technical threats are related to infrastructure security, technical operational security, and the system’s data management security, while non-technical threats are related to environmental security and government regulatory policies and implementations [22]. FMEA can also help to identify preventive actions aimed at mitigating the occurrence of specific failure modes, like the PMU-based monitoring system applied to HV transformers [23]. In addition, artificial intelligence-based FMEA can improve the prioritization of failure modes according to their criticality and other criteria [24].
Once the most critical failure modes have been identified using FMEA, they can be used to perform a quantitative reliability assessment to determine the impact that each identified failure mode would have on the system.
This paper has two purposes: (1) to review FMEA applications in the smart grid environment, and (2) to present the results of an FMEA conducted on a smart grid test system.
Section 2 reviews FMEA applications in electrical power equipment maintenance and testing. Section 3 summarizes the classical FMEA procedure. The cyber and power architecture of a smart grid test system is introduced in Section 4, where failure rate values are assigned to each piece of cyber-control equipment. Section 5 then categorizes the failure modes for power and cyber-control equipment, including the network links. The FMEA is conducted in Section 6. In Section 7, we analyze how much significant information is lost when using the classical FMEA procedure in a smart grid structure. As described in Section 8, the main conclusion is that maintenance tasks cannot be efficiently prioritized. The classical FMEA is successful in assembling failure modes and their causes for a smart grid; however, it needs to be modified to improve risk prioritization for the smart grid’s reliability assessment and risk analysis.

2. FMEA Applications in Electrical Power Equipment: A Brief Overview

Most of the applications of FMEA in electrical power equipment were developed at the component level, that is, without considering the effect of equipment failures on systems’ performance.
For example, concerning wind power technology, a classical FMEA approach is applied in [25] to assess the reliability of a 2 MW wind turbine using three commercial software packages: XFMEA from Reliasoft, Reliability Workbench from Isograph, and Relex Reliability Studio 2007 from Crimson Quality. The authors divided each of the three FMEA risk factors (Severity, Occurrence, and Detection) into four risk categories. Eight mechanical failure modes, five electrical failure modes, and three structural failure modes related to the turbine were identified. Results show that when using the product of the Occurrence and Detection risk factors, FMEA underestimates the operational field’s failure rates in new turbine designs. The authors also propose that a procedure for failure prioritization using the risk priority number (RPN) value could be a useful tool for designers to identify weaknesses in new wind turbine designs.
Another FMEA application and analysis on wind power is shown in [26], where onshore and offshore wind turbines were considered. The classical FMEA is compared with the authors’ modified FMEA, which uses the probability of occurrence instead of a ranking for occurrence as in [12], the cost of the failure mode instead of a severity ranking, and a non-detection possibility based on failure data instead of a detection ranking. The paper also proposes a priority number called the cost-priority number (CPN), obtained by multiplying the three new risk factors [26]. Their results show that, in general, the priority numbers from both approaches, the RPN and the CPN, produce very similar prioritization for most of the major components considered.
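As a rough illustration of how the two priority numbers differ, the sketch below computes an RPN from ordinal ratings and a CPN from a probability of occurrence, a failure cost, and a non-detection probability. The function names and all numeric values are hypothetical and are not taken from [26]; only the multiplicative form follows the description above.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Classical risk priority number: product of three ordinal ratings."""
    return severity * occurrence * detection


def cpn(p_occurrence: float, failure_cost: float, p_non_detection: float) -> float:
    """Cost-priority number: occurrence probability x failure cost x
    probability that the failure goes undetected."""
    if not 0.0 <= p_occurrence <= 1.0 or not 0.0 <= p_non_detection <= 1.0:
        raise ValueError("probabilities must lie in [0, 1]")
    return p_occurrence * failure_cost * p_non_detection


# Hypothetical failure modes: (name, SEV, OCC, DET, P_occ, cost, P_non_det)
modes = [
    ("gearbox bearing wear", 7, 4, 5, 0.10, 200_000.0, 0.5),
    ("generator winding fault", 8, 2, 4, 0.04, 150_000.0, 0.4),
]

for name, sev, occ, det, p_occ, cost, p_nd in modes:
    print(f"{name}: RPN={rpn(sev, occ, det)}, CPN={cpn(p_occ, cost, p_nd):.0f}")
```

Because the CPN carries monetary units, it can be compared directly against maintenance budgets, whereas the RPN only supports ordinal comparison between failure modes.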
A non-electrical system is studied in [27], where an FMEA is conducted to assess the reliability of hydraulic turbines and to compare FMEA with the Fault Tree Analysis (FTA) method. Seven main hydraulic turbine components were considered for both analyses. This work indicates that FMEA and FTA are complementary risk analysis methodologies capable of identifying failures and tracking their possible consequences. While FMEA makes an exhaustive analysis of each failure mode, FTA provides a general view of the system and the relations between different components.
In [28], the authors assess the condition of the electric and mechanical components of a hydropower plant (Angara-Yenisei hydropower station). The FMECA method (FMEA plus criticality assessment) was applied to face the lack of statistical information about failures. Results achieved show that FMECA allows evaluating the possible effects of the failure modes even when there is a gap in the failure statistics.
Another example of FMEA application is in photovoltaic (PV) systems. In [29], FMEA is applied to a simple test system composed of four PV strings, string combiner devices, an inverter, a cable system (aerial and underground), a three-phase transformer, and its connection to the power grid. Five risk categories were defined for the ranking of each of the FMEA risk factors. The author clearly shows that FMEA can improve the early detection of some hidden failures that might not immediately affect the PV system but would induce a degradation if no action was taken.
Another FMEA application in PV systems can be found in [30]. The authors used relevant criteria and practical experience provided by personnel working in a PV power plant instead of those of theoretical and office technicians. Ninety-four failure modes were identified, 16 of which had an RPN greater than 100 and were considered the most critical failure modes for prioritization. The authors’ conclusions establish substantial differences between FMEA results obtained using criteria from practical personnel, like maintenance operators, and those obtained using the criteria of office and management technicians, like engineers.
Risk analysis of an energy storage system (ESS) was developed in [31], where the failure modes that affect lead-acid batteries (LAB) were reviewed. The analysis focuses on three aspects: (i) positive active material degradation with loss of adherence to the metallic grid, and positive electrode grid corrosion; (ii) irreversible sulfating of the negative active material; and (iii) the electrolyte, separator, charge-discharge regime, and other elements that contribute to battery failure. This work shows the importance of identifying the failure modes and their associated mechanisms in lead-acid batteries and in lead-carbon batteries (LCB) because they have great potential for innovation and extensive applications in solar power integration projects.
Another extensive analysis of failure modes in batteries, now in lithium-based batteries (LIB), is presented in [32]. Lithium batteries are one of the most popular energy storage technologies for several applications, including electric cars. This paper covers several experimental and simulation results to characterize different failure modes and their respective mechanisms in LIB technology. Most importantly, the authors stress the urgent need to develop direct computational simulation techniques for LIB based on chemo-mechanical models to gain a better perspective on possible material failures [32].
FMEA has also been applied to electrical components of power systems. For example, in [33] an FMEA is conducted to assess the reliability of capacitor banks used in the distribution power system of the Sultanate of Oman. Four risk categories (catastrophic, critical, marginal, and insignificant) were defined for each FMEA risk factor ranking, and seventeen main failure modes were identified and analyzed. Some of the failure modes considered were capacitor element short-circuit, open circuit, insulating liquid leakage, and leakage current in support insulators. In [34], FMEA was used to identify the main failure modes to be used as input for a probabilistic method to assess the reliability of a 400 kV transmission system at the substation equipment level.
In [35], a modified FMEA based on Fuzzy Logic was developed. Three FMEA risk factor categories were represented by fuzzy sets and based on three continuity indexes: the loss of power in distribution transformers when a failure mode occurs, the frequency of interruption in each consumer unit, and the duration of interruption in each customer unit. Results show that the FMEA based on fuzzy logic achieves better prioritization results for the analyzed equipment.
Power transformers’ failures have been extensively analyzed through the FMEA method because of their high impact in terms of security and cost in electric power grids. Three recent applications are presented in the following. In [36], an FMEA including criticality analysis is performed on 92 power transformers, identifying three critical components: windings with high criticality, and the on-load tap changer (OLTC) and bushings with medium criticality. In [37], FMEA with criticality was applied to 384 non-failed distribution transformers in India. Results show that component insulation failures have the highest RPN and are caused by corrosion, moisture, high acidity, hot spots due to overloading, and/or low quantity of oil. The second priority corresponds to winding failures, which may be due to manufacturing defects, transient overvoltage, lightning, short-circuit, and faulty connections. The third example is described in [38], where a general FMECA is applied to assess the risk of failure of 220 kV in-service power transformers, considering the failures that can result in transformer service interruption. The authors classified the failures as minor and major ones, performing an FMECA for each of the two types of failure; the minor failures have no significant effects on transformer performance, while major failures are related to the degradation of the transformer’s components and would be irreversible. Results show that outages caused by overcurrent have the highest RPN in the minor failure analysis. Failures due to insulation deterioration have the highest RPN in the major failure analysis, followed by load tap changer failures.
In electric power distribution systems, one can distinguish three lines of work: (1) a “local” one represented by micro-grids; (2) a “global and classical” one exemplified by distribution power systems; and (3) the smart grid, which is also “global” but incorporates the cyber-physical component. Some research can be pointed out. In [39], an FMEA is conducted to identify the failure modes in micro-grid equipment including different generation technologies. In [40], a classical FMECA was applied to a power distribution system located in the region of Relizane, in northwestern Algeria; the authors conducted the FMECA according to the IEC 60812 standard [19]. Results show that it is necessary to replace most of the equipment, especially transformers and transmission lines; the analysis also made it possible to identify the critical components that must be taken into account to improve the maintenance plans. More recently, the authors showed in [24] an FMEA for a smart grid framework. A comparison with a modified FMEA that combines the classical FMEA with a fuzzy inference system was studied to improve the prioritization of failure modes. Results clearly showed that fuzzy-based FMEA obtains better prioritization criteria for the analyzed failure modes when compared with classical FMEA applied to a smart-grid framework.
Overall, several studies focused on RCM and alternative approaches to evaluating reliability assessment in smart grid systems, but none of them have considered FMEA as a reliable tool for risk assessment.

3. The Classical Failure Modes and Effect Analysis (FMEA): Main Concept and Procedure

FMEA is a systematic methodology designed to identify known and potential failure modes and their causes and effects on system performance [17,19,21,33]. It was originally used by the US Armed Forces in 1949 [41] to classify failures “according to their impact on mission success that was related to the personnel and equipment safety”. Later, it gained momentum through its use in the Apollo program in the 1960s, following its application in the aerospace industry. As defined in [21], FMEA is a method designed to:
  • Identify and fully understand potential failure modes and their causes, and the effects of failure on the system or end users, for a given product or process.
  • Assess the risk associated with the identified failure modes, effects, and causes, and prioritize issues for corrective action.
  • Identify and carry out corrective actions to address the most serious concerns.
FMEA can be viewed as a proactive procedure for evaluating a process by identifying where and how it might fail and assessing the relative impact of different failures [42,43]. Although FMEA’s primary objective is to improve the system design, it can be applied at any stage of a project to mitigate potential future risks produced by failure modes. FMEA is conducted by a cross-functional team of subject matter experts that analyzes the system to identify weaknesses and propose corrective actions that prevent a negative impact on the system’s performance [21]. At this point, it is important to note that FMEA’s objective is not to predict failures. Its aim is to identify existing and potential failures through a subjective and systematic assessment and to classify those failures according to a risk measure.
The FMEA method is supported by a set of standards and procedures among other technical documents. The main standards and procedures are:
  • SAE J1739:2000, Potential Failure Mode and Effects Analysis in Design (Design FMEA), Potential Failure Mode and Effects Analysis in Manufacturing and Assembly Processes (Process FMEA), and Potential Failure Mode and Effects Analysis for Machinery (2000), Society of Automotive Engineers;
  • Military Standard MIL-STD-1629A, Procedures for Performing a Failure Mode Effects and Criticality Analysis (1980), U.S. Army;
  • IEC 60812:2006, Analysis techniques for system reliability - Procedure for failure mode and effects analysis (FMEA) (2006), International Electrotechnical Commission;
  • ECSS-Q-ST-30-02C, Failure modes, effects (and criticality) analysis (FMEA/FMECA), European Cooperation for Space Standardization;
  • AIAG and VDA FMEA Handbook (2019), Automotive Industry Action Group and Verband der Automobilindustrie;
  • SAE ARP5580, Recommended Failure Modes and Effects Analysis (FMEA) Practices for Non-Automobile Applications, Society of Automotive Engineers; and
  • NASA RCM Guide, Reliability centered maintenance guide for facilities and collateral equipment (2008), National Aeronautics and Space Administration.
The classical FMEA approach considers three factors, called risk factors, to characterize each failure mode [19]:
  • Severity (SEV): assesses the significance of the failure mode’s effect on system operation;
  • Frequency of Occurrence (OCC): represents the number of times the failure mode occurs. This risk factor is related to the failure rate, which can be defined as the frequency with which an engineering system or component fails, expressed in failures per unit of time [43];
  • Detectability (DET): represents how easily a given failure can be detected before it happens.
Risk factors OCC, SEV, and DET are divided into categories. In the classical FMEA, each of these categories is rated by an integer number, usually on a scale from 1 to 10 as in [17], or from 1 to 5 as used in [24]. The categories and ratings for SEV, OCC, and DET can be the same as those proposed in standards related to classical FMEA, like IEC 60812:2006 [19], or can be specially defined depending on the problem characteristics.
The risk factor categories and ratings used in this work are listed in Table 1 (Frequency of occurrence), Table 2 (Severity), and Table 3 (Detectability).
Based on these three risk factors, a risk priority number (RPN) is calculated as
RPN = SEV × OCC × DET, (1)
that is, the product of SEV, OCC, and DET, which serves as a metric for evaluating each failure mode in the FMEA, as proposed in [19]. Because the RPN calculation in the classical FMEA approach is a single arithmetic product of three integers, its computational cost is negligible.
The higher the RPN of a failure mode, the greater the risk for the system reliability. Hence, proper actions should be taken preferentially on the high-risk failure modes so that the system availability increases. As will be shown in the discussion section, the RPN calculation is an important issue for FMEA. However, classical FMEA is still an important tool for reliability and risk assessment in highly complex industries such as aerospace, nuclear, and petrochemical.
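The RPN calculation and the resulting prioritization can be sketched in a few lines. The class name, the components chosen, and all ratings below are hypothetical and serve only to illustrate the product SEV × OCC × DET on a 1 to 10 scale; they are not the ratings used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str
    description: str
    sev: int  # Severity rating
    occ: int  # Frequency-of-occurrence rating
    det: int  # Detectability rating

    def rpn(self, scale_max: int = 10) -> int:
        """Return SEV x OCC x DET after checking each rating is on the scale."""
        for rating in (self.sev, self.occ, self.det):
            if not 1 <= rating <= scale_max:
                raise ValueError(f"rating {rating} outside 1..{scale_max}")
        return self.sev * self.occ * self.det


# Hypothetical ratings, for illustration only.
modes = [
    FailureMode("Ethernet switch", "port failure", 5, 4, 2),
    FailureMode("power transformer", "insulation deterioration", 8, 3, 6),
    FailureMode("circuit breaker", "fails to open on command", 9, 2, 4),
]

# Highest RPN first: these failure modes are addressed preferentially.
ranked = sorted(modes, key=lambda m: m.rpn(), reverse=True)
for m in ranked:
    print(f"{m.rpn():4d}  {m.component}: {m.description}")
```

Note that the circuit breaker has the highest severity but only the second-highest RPN, which shows how the product can mask a single dominant risk factor.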

FMEA Procedure

One can summarize the FMEA procedure in 10 main steps:
1. Review all subsystems to make a list of the main items and their components;
2. Brainstorm potential failure modes associated with each component;
3. List the probable failure effects of each component on each subsystem;
For each failure mode:
4. Assign an Occurrence (OCC) rating using Table 1;
5. Assign a Detectability (DET) rating using Table 3;
6. Assign a Severity (SEV) rating using Table 2;
7. Calculate the failure mode RPN using Equation (1);
8. Develop an action plan to address the highest RPN;
9. Take corrective action; and
10. Reevaluate the RPN after the actions are completed.
The flowchart in Figure 1a shows how the 10 steps are linked for conducting a classical FMEA [17,19]. Once the three risk factors OCC, DET, and SEV have been assigned in steps 4, 5, and 6, the failure mode RPN is computed in step 7 using Equation (1). Next, it is necessary to develop an action plan to reduce the highest RPN, which forms step 8 as indicated in Figure 1a. Following that, step 9 consists of executing some corrective action to decrease the RPN value. This can be accomplished by reducing the failure mode occurrence (OCC), implementing actions to reduce the failure mode consequences (severity, SEV), or improving the detection methods already implemented. Notice the main loop in the FMEA flowchart in Figure 1a: once all RPNs have been computed, the recommended corrective actions must be implemented, and the RPN must then be reevaluated to verify whether these corrective actions reduced the risk in the system.
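The reevaluation in steps 8 to 10 amounts to recomputing the RPN with the ratings expected after a corrective action. The sketch below uses hypothetical ratings and assumes, as is common, that a lower DET rating means the failure is easier to detect; the corrective action itself is also only an example.

```python
def rpn(sev: int, occ: int, det: int) -> int:
    """Risk priority number as the product of the three ratings."""
    return sev * occ * det


# Hypothetical ratings before any corrective action.
before = {"sev": 8, "occ": 5, "det": 6}

# Assumed corrective action: adding online monitoring, which is taken to
# lower both the occurrence rating and the detectability rating.
after = {**before, "occ": 3, "det": 2}

rpn_before = rpn(**before)
rpn_after = rpn(**after)
reduction = 1 - rpn_after / rpn_before

print(f"RPN before: {rpn_before}, after: {rpn_after}, reduction: {reduction:.0%}")
```

In this example the severity rating is unchanged, since monitoring does not alter the consequences of the failure once it occurs.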
The final FMEA report must contain all the failure modes ordered by their RPN ranking and registered in a worksheet whose columns (Figure 1b) contain information about the component, associated failure mode(s), causes, consequences, detection methods, recommended actions, and the corresponding ratings for the SEV, OCC, and DET risk factors. Every FMEA report should include a section detailing all assumptions made for the FMEA performed.
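One possible way to hold such a worksheet is as a list of records exported to CSV. The column identifiers, the helper name build_worksheet, and the two entries below are hypothetical; only the set of columns follows the description above.

```python
import csv
import io

COLUMNS = [
    "component", "failure_mode", "causes", "consequences",
    "detection_methods", "recommended_actions", "SEV", "OCC", "DET", "RPN",
]

# Hypothetical worksheet entries, for illustration only.
entries = [
    {"component": "Ethernet switch", "failure_mode": "port failure",
     "causes": "hardware ageing", "consequences": "loss of IED communication",
     "detection_methods": "network monitoring",
     "recommended_actions": "periodic port diagnostics",
     "SEV": 5, "OCC": 4, "DET": 2},
    {"component": "circuit breaker", "failure_mode": "fails to open",
     "causes": "mechanism wear", "consequences": "fault not cleared",
     "detection_methods": "periodic trip test",
     "recommended_actions": "scheduled mechanism inspection",
     "SEV": 9, "OCC": 2, "DET": 4},
]


def build_worksheet(rows):
    """Return the worksheet as CSV text, ordered by decreasing RPN."""
    completed = [{**r, "RPN": r["SEV"] * r["OCC"] * r["DET"]} for r in rows]
    completed.sort(key=lambda r: r["RPN"], reverse=True)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(completed)
    return buffer.getvalue()


worksheet = build_worksheet(entries)
print(worksheet)
```

A plain CSV layout keeps the report readable in any spreadsheet tool; the assumptions section required in the report would be kept alongside it as free text.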

4. The Test System Architecture for a Smart Grid

In this section, a smart grid test system is presented for studying how the cyber-power interdependencies coupled with different failure modes will disturb the grid performance. Failure modes will be identified for both the power and the cyber components, and a complete FMEA analysis will be applied.
Figure 2 shows the cyber and power architecture of the smart grid test system. This system was specially designed to be used in [43] and to include the main power and cyber equipment that can be found in a smart grid. The 30 kV power network, depicted in black lines in the lower part of Figure 2, is a meshed grid consisting of four 30 kV substations. The 30 kV grid is redundant, i.e., there are different paths for energy transport between busbar Nº.1 (B1), busbar Nº.2 (B2), busbar Nº.3 (B3), and busbar Nº.4 (B4).
A 110 MW conventional generation station (CG) is connected to busbar B1. Distributed renewable generation stations are connected to busbar B2 (130 MW wind energy park, WE) and busbar B4 (100 MW photovoltaic power plant, PV). The power network also has a 50 MW energy storage system (ES) connected to busbar B3. A total of four power transformers (TR1, TR2, TR3, and TR4) and fifteen circuit breakers (CB1, CB2, …, CB15) are also included in the power network.
Consumers in Figure 2 are represented as three load points named LPB2, LPB3, and LPB4, connected to busbars B2, B3, and B4, respectively. Load LPB2 represents a 20 MW residential area, while LPB3 and LPB4 represent industrial and commercial areas with loads of 85 MW and 40 MW, respectively.
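The ratings given above can be collected in a small data structure to check the overall capacity balance of the test system. The dictionary keys are naming conventions chosen for this sketch, and the totals refer to installed capacity, not to simultaneous output.

```python
# Installed capacities and loads of the test system, in MW (values from the text).
generation_mw = {
    "CG@B1": 110,  # conventional generation station
    "WE@B2": 130,  # wind energy park
    "PV@B4": 100,  # photovoltaic power plant
}
storage_mw = {"ES@B3": 50}  # energy storage system
load_mw = {"LPB2": 20, "LPB3": 85, "LPB4": 40}

total_generation = sum(generation_mw.values())
total_load = sum(load_mw.values())
renewable_share = (generation_mw["WE@B2"] + generation_mw["PV@B4"]) / total_generation

print(f"Installed generation: {total_generation} MW")
print(f"Total load:           {total_load} MW")
print(f"Renewable share of installed generation: {renewable_share:.1%}")
```

Roughly two thirds of the installed generation is intermittent, which is consistent with the emphasis placed on fluctuating sources in the smart grid context.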
Regarding the power equipment, only busbars, power cables (aerial lines L1, L4), circuit breakers (CB), and power transformers are considered in this FMEA. The storage facility and generation stations were not considered.
Failure rates for each component have been collected from two main sources: statistical data obtained from the Portuguese electrical utility, and a set of specialized databases and manufacturer datasheets [44,45]. Table 4 lists the failure rates used in our research for each power component. Note that, for aerial cables and for simplification purposes, it was assumed that the substations are equally distant from each other (about 2.5 km). Additionally, notice that the transformers’ failure rate is not always higher than that of other electrical equipment, as stated in references [44,45] and listed in Table 4. For example, transformers between 300 and 10,000 kVA have a failure rate of around 0.0059 failures per unit-year, while transformers whose nominal power is greater than 10,000 kVA have failure rates of around 0.0153 failures per unit-year.
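To give a sense of scale for these failure rates, the sketch below converts them into a probability of at least one failure within a year. It assumes a constant failure rate (exponential time to failure) and independent, identical transformers; both are simplifying assumptions of this example and are not stated in the source, and the function name is hypothetical.

```python
import math


def failure_probability(rate_per_year: float, years: float = 1.0) -> float:
    """P(at least one failure within `years`) for a constant failure rate."""
    if rate_per_year < 0 or years < 0:
        raise ValueError("rate and time must be non-negative")
    # 1 - exp(-lambda * t), computed with expm1 for accuracy at small rates
    return -math.expm1(-rate_per_year * years)


# Failure rates quoted in the text, in failures per unit-year.
RATE_300_TO_10000_KVA = 0.0059
RATE_ABOVE_10000_KVA = 0.0153

N_TRANSFORMERS = 4  # TR1..TR4 in the test system

for label, rate in [("300-10,000 kVA", RATE_300_TO_10000_KVA),
                    ("> 10,000 kVA", RATE_ABOVE_10000_KVA)]:
    p_single = failure_probability(rate)
    # Probability that at least one of the four (assumed independent and
    # identical) transformers fails within one year.
    p_any = 1.0 - (1.0 - p_single) ** N_TRANSFORMERS
    print(f"{label}: single unit {p_single:.4%}, any of {N_TRANSFORMERS} {p_any:.4%}")
```

For rates this small, the one-year failure probability is numerically close to the failure rate itself, which is why the two are often used interchangeably when assigning occurrence ratings.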

Description of the Cyber Network of the Smart Grid Test System

Included in the power network in Figure 2 is a cyber network that monitors, protects, and controls the power system. Among all possible cyber network topologies, a cyber-ring topology was selected due to its elementary architecture, providing an acceptable level of reliability with a redundant path for data transmission. The cyber-control network shown in Figure 2 is a bus topology LAN-Ethernet and WAN-optical fiber network consisting of human-machine interfaces (HMIs), Ethernet switches (SWs), servers (SVs), energy boxes (EBs), intelligent electronic devices (IEDs), and Ethernet and optical fiber links (all marked in blue, red, and green lines in Figure 2).
The metering infrastructure is composed of smart meters, designated in Figure 2 as energy boxes (EBs), which are linked to load points in order to collect data about energy consumption. Note that, in practice, each customer is assumed to be connected to a single EB. However, for simplification, we consider only one main EB for all customers at each load point.
IEDs act as interface devices between power and communication networks, including measuring units, protective relays, and controllers. Each IED is responsible for monitoring and for executing the commands received from HMIs. Table 5 lists the cyber-power links between each IED controller in the Figure 2 network and its corresponding power elements (buses and circuit breakers).
As indicated in Figure 2, each IED or EB element is connected to an Ethernet switch (SW) through a LAN-Ethernet communication, which is then responsible for redirecting information through the corresponding communication links. Ethernet switches are all connected through a ring topology towards WAN-optical fiber network links (green lines in Figure 2). At last, a central Ethernet device (MAIN SW) is responsible for gathering information from all points of the communication network, sending it to the corporate and control centers (up blue blocks in Figure 2).
In the control center, all data concerning the power system status is available for monitoring, analysis, and decision-making. The control center is responsible for scheduling power generation to meet consumers’ demand, also managing major system problems by executing automatic procedures or manual instructions through the HMIs. Real-time data gathered from the power system are also displayed on the HMI, which allows real-time intelligent data handling and network status monitoring. As also shown in Figure 2 up, left in red, an Inter-Control Center Communications Protocol server (ICCP server) is specified to provide data exchange over WANs between utility control centers and substations. As also indicated in Figure 2 up, an APPLICATIONS SERVER and an ENGINEERING SERVER manage a big amount of data and information that are stored in an ENGINEERING DATABASE.
The CORPORATE CENTER (Figure 2 up, right) is responsible for managing a high number of energy market players that will compete to provide the best power quality at the best price. Cost fluctuations on energy generation (due to different penetration levels of distributed generation and dynamic energy demand) are managed in the BUSINESS SERVER in order to optimize cost-effectiveness operations and optimize the balance between energy demand, storage, and production. A CORPORATE DATABASE is responsible for collecting and storing all energy market information in the corporate center, while E-MAIL SERVER, WEB APPS SERVER, and FILE TRANSFER PROTOCOL (FTP) servers make it accessible for all market stakeholders.
The reliability values of each cyber equipment described in the anterior paragraphs and used in this work are listed in Table 6. All values were obtained from datasheets and reliability statistics [45,46,47,48,49,50,51], and all derived using reliability theory about failure rates [43]. For the Ethernet links, however, reliability data was not found explicitly in literature. To surpass this, one assumed a very low failure rate value. Concerning the optical fiber links, it was assumed a total length of 10 km in the communication network.

5. Identifying Potential Failure Modes in the Smart Grid Test System

The potential failure modes that can occur in the smart grid test system in Figure 2 need to be evaluated in terms of their causes and their influence on the system. With this objective, this section summarizes the potential failure modes of each piece of equipment considered in our smart grid test system.
Each piece of equipment was first categorized according to its type and function in the system. Several failure modes were then defined and described for each piece of power equipment.
The assessment considers two assumptions:
  • The analysis focused on the identification of single failures for smart grid components, and;
  • Complex interdependences or cascading failures are out of scope for the current analysis.
The power equipment included in our analysis comprises four components: busbar, power cable, circuit breaker, and power transformer. For each one, a set of failure modes and associated criteria was identified, as listed in Table 7.
The failure modes of the cyber-control equipment are listed in Table 8. The list covers the five cyber-control devices considered: Intelligent Electronic Device (IED), server (SV), Human-Machine Interface (HMI), Ethernet switch (SW), and Energy Box (EB).
Security failure and power failure were considered for all devices. Security failure relates to the susceptibility of cyber equipment to losing its integrity, while power failure relates to an interruption of the power supply that affects the normal operation of the cyber network.
The IED defective communication failure mode is associated with damaged transducers or a poor signal, which cause intermittent communication between the IED and the rest of the cyber network.
The server (SV) data overload failure mode is associated with insufficient storage capacity or an unexpectedly large amount of data to store, which can result in defective data storage. Hardware crash is another failure mode, related to physical damage caused by overheating or humidity that leads to a hard drive crash and thus to loss of data. Finally, any software error that corrupts stored data results in an operational failure mode.
An HMI data error is a failure mode generally associated with inherent problems in HMI operation that compromise its normal functioning.
Two of the failure modes attributed in Table 8 to an Ethernet switch (SW) are related to cyber-attacks: the Performance decrease and the Network/Cyber storm failure modes. Packet congestion and/or the uncontrolled broadcast of an excessive number of messages in a communication network can decrease the SW operational performance or even congest SW operation. Finally, an SW Operational failure caused by a bad SW configuration or a module failure can halt its operation.
The Energy Box has a Catastrophic failure mode associated with temperature stresses that can severely damage the EB. Power consumption misreading and Operational failure are two failure modes related to incorrect data acquisition. Manual manipulation, significant measurement error, improper EB programming, and defective installation all result in incorrect data acquisition.
Regarding network links, two types were considered: optical fiber links for long-distance communications, and Ethernet links for short distances. Their inherent characteristics result in the different failure modes described in Table 9. Optical fiber links have a set of failure modes that are all related to their physical integrity: fracture, lead-bond degradation, and humidity-induced failure modes. Ethernet link failures degrade network performance by decreasing the available capacity and disturbing IP-packet forwarding. Hardware or software failures can happen at the network protocol layers. Integrity defects such as manufacturing imperfections, incorrect connections, or degradation of the RJ45 connectors, for example, may lead to loss of physical connectivity in the network hardware or to link breakdown. Superposition of events usually occurs when electromagnetic coupling between adjacent pairs of wires causes signal interference. This is referred to as Crosstalk and becomes more frequent as the signal frequency increases.

6. FMEA Analysis and Its Results

A complete FMEA was carried out on the smart grid test system in Figure 2, which represents a typical cyber-power network. Using the failure modes systematized in the previous section, we searched for the causes and potential impacts of each power and cyber equipment failure on the smart grid. The FMEA not only takes into account the main interdependencies between the power and cyber system topologies, but also proposes mechanisms that prevent the cause of each failure mode from occurring (current controls).
The three risk factors (Severity (SEV) in Table 1, Occurrence (OCC) in Table 2, and Detection (DET) in Table 3) were first assigned for each failure mode:
  • For Severity (SEV) rating, the seriousness of the failure and its effects in the system is taken into consideration;
  • For the Detection (DET) rating, the ability to detect the failure before it can affect the system is considered, and;
  • For the Occurrence (OCC) rating, its value is stated according to equipment’s failure rates, as specified in Table 4 and.
All ratings are assigned according to the expert criteria of the FMEA evaluators. Even the Occurrence (OCC) rating, which could in principle be assigned accurately from failure rates, can be revised for a specific cause of failure that seems more or less likely to occur according to the evaluators’ criteria.
In general, any failure mode is expected to be assigned different Detection (DET) and Occurrence (OCC) ratings depending on the cause that triggered it, whereas the Severity (SEV) rating is unique for each failure mode. Since the priority of each failure mode is evaluated by its RPN value, Equation (1), and each cause of failure has its own RPN value, the same failure mode may have several different RPNs.
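The relation between causes and RPNs can be sketched in a few lines of Python. The sketch assumes the classical product form RPN = SEV × OCC × DET for Equation (1) and integer ratings on a 1 to 10 scale. The failure mode, cause names, and ratings are hypothetical placeholders and are not taken from Table 10.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cause:
    name: str
    occ: int  # Occurrence rating of this cause
    det: int  # Detection rating of this cause

def rpn(sev: int, occ: int, det: int) -> int:
    """Classical risk priority number: product of the three ratings."""
    for rating in (sev, occ, det):
        if not 1 <= rating <= 10:
            raise ValueError("ratings are assumed to lie on a 1-10 scale")
    return sev * occ * det

# Hypothetical failure mode with a single severity and two causes.
SEVERITY = 8
causes = [
    Cause("hypothetical cause A", occ=5, det=8),
    Cause("hypothetical cause B", occ=4, det=8),
]

# One failure mode, one SEV, but one RPN per cause.
rpn_by_cause = {c.name: rpn(SEVERITY, c.occ, c.det) for c in causes}
for name, value in sorted(rpn_by_cause.items(), key=lambda kv: -kv[1]):
    print(f"{name}: RPN = {value}")
```

Because the severity is shared, any difference between the two RPNs comes only from the occurrence and detection ratings of each cause.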
Our research identified and analyzed a total of 107 failure modes associated with the smart grid test system; the complete FMEA table can be found in [43]. For this paper, we selected the 42 highest-risk failure modes, listed in Table 10 and ordered from the most to the least risky. Table 10 is the result of the FMEA conducted using the OCC, DET, and SEV ratings and the respective RPN values from Equation (1). Table 10 also includes the potential failure cause and the recommended actions to minimize the impact of those failure modes on the smart grid.
Examining the costs and causes of power and cyber incidents using the 42 highest-risk failure modes in Table 10, we drew the following conclusions:
Power equipment incidents:
(1) Servers (SVs) and transformers have the most critical failure modes, reaching RPNs of 480 and 450, respectively. Their high-risk failures compromise the correct operation of the smart grid;
(2) Busbar failure modes were also identified as critical (ranks 5, 8, and 9), in the sense that their impact on the smart grid is significant, mainly because several of the associated failure modes have high RPNs;
Cyber equipment incidents:
(1) Among cyber equipment, the failure modes with the highest RPNs are those related to operational failures in Human-Machine Interfaces (HMIs), with RPN = 400, Ethernet switch (SW) shutdown, with RPN = 360, and control failure in Intelligent Electronic Devices (IEDs), with RPN = 392;
(2) Ethernet links, optical fiber links, and Energy Boxes (EBs) proved to be the least critical equipment in the cyber system, mainly due to their low failure rates;
(3) Failure modes related to security, despite the enormous impact that cyberattacks can cause, were not indicated by the FMEA as high-risk failures. For example, the security failure of servers (SVs) reached an RPN of only 200. This is explained by the low occurrence ratings: in spite of the expected increase in cyberattack attempts in future years, these attempts will not necessarily be successful;
(4) Power outages in the power supply of cyber equipment are expected to be less frequent and therefore appear in Table 10 with lower RPN values.
In fact, a general look at the outcomes in Table 10 gives two important indications:
(1) Although all ratings are treated as equals, the Occurrence (OCC) rating varies little between failure modes with high and low RPNs. Hence, it is not a decisive rating for high-risk failures;
(2) Failure modes characterized by high levels of unpredictability are likely to be more critical. These modes occur without early warning and are difficult to prevent, while their strong negative impacts on smart grid operation are also reflected in high Severity (SEV) ratings.
Finally, a conclusion regarding human interference in future smart grids must be pointed out. HMI operational failure due to human error proves to have negative impacts on the grid. This human error is unintentional, and its high probability of occurrence and unpredictability (as seen in Table 10) make it a high-risk failure cause. We therefore expect that one of the main weaknesses of future smart grids will be related to tasks that demand human intervention.

7. Discussion

In order to use the FMEA results obtained, it is important to account for the significant information loss that occurs during a classical FMEA procedure. This loss can compromise important conclusions concerning high-risk failure modes and their impact on the reliability of the system. Table 10 shows the FMEA result, in which high-risk failure modes are prioritized (based on their RPN value) according to their high-risk causes of failure. This means that, according to FMEA, maintenance strategies should be prioritized from the highest RPN to the lowest in order to increase the smart grid’s reliability. It also implies that the causes of failure must receive special attention in any maintenance task. Doing so decreases or eliminates the risk of a failure in the system, thus reducing the impact of the failure mode on the smart grid.
Such tasks should be established to decrease the number of times the respective failure manifests itself, so that system reliability increases as intended. However, this also means that numerous failure causes are overlooked whenever the high-risk causes of each failure mode are not taken into account in the final FMEA analysis. In fact, some failure modes with critical causes sometimes have lower RPN values than certain less critical failure modes, and the latter are nevertheless prioritized because of their higher RPN. In these situations, maintenance strategies for the failure modes with lower RPN values may be ignored if the FMEA approach is used. For instance, Table 11 contains selected failure modes extracted from Table 10 whose causes have equal Severity (SEV) ratings but different RPNs. The busbar’s electrical disturbances failure mode can be caused by short circuits between bars of different phases (RPN of 320) or by harmonics (RPN of 256). Although harmonics still have a high RPN, meaning that they are a high-risk cause of failure, their importance could be neglected because they rank 7th among the 10 failure mode–cause pairs shown in Table 11; therefore, maintenance strategies would not be recommended for this failure mode–cause pair [43].
We point out that maintenance tasks cannot be efficiently prioritized when the classical FMEA approach is applied in a smart grid framework, which has implications for the ratio of maintenance cost to risk reduction. In addition, the relative importance of the OCC, SEV, and DET risk factors is not taken into account in classical FMEA: the three factors are treated as equals, with the same weight in the RPN computation. This may not be adequate for a practical application of FMEA in smart grids.
As an illustration, Table 10 indicates that software errors in IED control applications have a more severe negative impact on system performance than unintentional human error in HMI operations (the SEV rating is 8 for the IED and 5 for the HMI). Nevertheless, HMI operational failure due to human error is ranked as a higher-risk failure mode than IED control failure. The severity of the failure thus seems to be suppressed here.
Similarly, different combinations of OCC, SEV, and DET values may result in the same RPN, but with different hidden risk implications. For example, wrong operation of a circuit breaker (CB) due to overloads and magnetic-core delamination in transformers have the same RPN, namely 168, but their ratings are different. Their impacts on the system could be different, but FMEA cannot distinguish them. This clearly shows that FMEA is limited in the prioritization of maintenance tasks: it cannot assign different weights to its ratings, which can lead to a misreading of the risk of a failure mode. For an adequate application of FMEA, it is of utmost importance to assemble subject experts with a high level of knowledge of smart grid operation, because failure modes and failure causes must be enumerated, exhaustively detailed, and discussed in order to evaluate, as accurately as possible, the impacts of a failure on the smart grid.
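How easily such ties arise can be checked by brute force. The sketch below assumes that each of the three ratings is an integer on a 1 to 10 scale and that the RPN is their product, and it enumerates every (SEV, OCC, DET) combination that yields an RPN of 168. The function name `triples_with_rpn` is illustrative, and the output does not reveal which combinations were actually assigned in Table 10.

```python
from itertools import product

RATING_SCALE = range(1, 11)  # assumed 1-10 scale for SEV, OCC, and DET

def triples_with_rpn(target: int) -> list[tuple[int, int, int]]:
    """All (SEV, OCC, DET) combinations whose product equals `target`."""
    return [
        (sev, occ, det)
        for sev, occ, det in product(RATING_SCALE, repeat=3)
        if sev * occ * det == target
    ]

matches = triples_with_rpn(168)
print(f"{len(matches)} rating combinations give RPN = 168")
# Sort by severity to show how different the tied combinations can be.
for sev, occ, det in sorted(matches, reverse=True):
    print(f"SEV={sev}, OCC={occ}, DET={det}")
```

Under these assumptions, the tied combinations span severities from 3 to 8, which is the kind of difference the RPN alone cannot show.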
In the literature, we found a lack of failure rate information broken down by failure mode, for both power and cyber equipment. Even the data obtained from the Portuguese electric energy utility (EDP Distribuição), a large company with an interest in cost-effective maintenance methodologies, were inconclusive. In our research, the failure rates of individual failure modes were estimated subjectively from the equipment failure rates, which may have introduced some errors in the final RPN calculation, especially in the OCC rating, which, as obtained, appeared to have little impact on the RPN.
For FMEA to be applied correctly, experimental failure rates must be detailed for each failure mode. If possible, extensive research would be useful to obtain experimental rates for each cause of failure. For a deeper understanding of the criticality of a given failure, collecting data on the frequency of failure of each piece of power and cyber equipment, with failure rates specified for each failure mode and its causes, would therefore be valuable for reliability purposes. Knowing the frequency of a given failure, while keeping in mind the real impact that the failure triggers in the smart grid, would make FMEA more efficient (a more reliable OCC rating) and maintenance strategies more precise (strategies based on adjusting the maintenance frequency are improved).
Finally, in order to ensure a high level of system reliability, a cost-effective maintenance strategy must be achieved by prioritizing failure modes from the most critical to the least critical, while taking into consideration the maintenance costs for each piece of equipment and each failure mode. Regarding the level of risk of the analyzed smart grid test system (the economic side is not evaluated in the present study), it is of utmost importance to establish maintenance strategies according to their risk number.
Three types of strategy must be pursued in order to increase the reliability of a smart grid topology: (i) mitigating or eliminating failure modes in order to decrease the OCC rating, (ii) increasing failure detectability in order to lower the DET rating, and (iii) minimizing losses or negative impacts when a failure occurs in order to reduce the SEV rating.

8. Conclusions

This paper analyzes the application of classical FMEA in a smart grid environment. A simple smart grid test system with power and cyber components was defined. A qualitative reliability assessment was performed, and a critical analysis of the FMEA results was carried out. From the results achieved, the following critical conclusions can be drawn, which highlight the advantages and disadvantages of the practical implementation of classical FMEA in a smart grid cyber-physical system:
(1) Advantage: it allowed the identification of the ten highest-risk failure modes, which are related to servers, transformers, HMIs, IEDs, busbars, power cables, and Ethernet switches;
(2) Advantage: short circuits were identified as the causes of the riskiest failure modes in power equipment;
(3) Advantage: concerning the cyber equipment, human and software errors (associated with HMIs and servers) were identified as causes of high-risk failure modes;
(4) Disadvantage: the RPN value is highly sensitive to small variations in the three risk factors SEV, OCC, and DET;
(5) Disadvantage: the prioritization of failure modes based on the RPN value is not adequate for applications in a smart grid cyber-physical system. It does not take into account the relative importance of the three risk factors, which differs for each team analyzing the system. For example, the relative importance of the severity factor (SEV) for a transformer is different from that of the SEV factor for an Ethernet switch;
(6) Disadvantage: there is a lack of information regarding the failure rates associated with each failure mode analyzed. This happens because the occurrence of a failure is recorded without identifying which failure mode it is related to.
It is important to highlight that classical FMEA is successful in assembling the failure modes and their causes in a given smart grid. However, for a better reliability assessment and risk analysis of a smart grid, FMEA needs to be modified to improve risk prioritization. Since power system reliability assessment is usually conducted considering component failures as a whole, that is, without differentiating the failure modes that drive the component failure, FMEA can be used first to identify the criticality of the failure modes, and these critical failure modes can then be used as inputs for a quantitative reliability analysis instead of a single failure rate for each component.
The component failure rate used in reliability analysis is a composition of the failure probability functions of each of the failure modes identified for that component. This implies that reliability analysis considers both the critical and the non-critical failure modes. Therefore, by taking the most critical failure modes of each piece of equipment and using them as inputs for the quantitative reliability analysis, it would be possible to improve the understanding of the failure mechanisms that reduce the system’s reliability and to focus maintenance efforts on reducing the impact of these specific failure modes. For this reason, it is important to record failure statistics at the failure-mode level and not only at the component level.
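One simple way to picture this composition is a competing-risks model. The sketch below assumes independent failure modes, each with a constant failure rate, so that the component survives only if no mode occurs and the mode rates add up to the component rate. This model, the mode names, and the split of the 0.0059 failures per unit-year transformer figure into modes are illustrative assumptions, not data from this study.

```python
import math

def component_failure_rate(mode_rates: dict[str, float]) -> float:
    """Component rate as the sum of independent, constant mode rates."""
    return sum(mode_rates.values())

def component_reliability(mode_rates: dict[str, float], years: float) -> float:
    """Product of per-mode survival probabilities exp(-rate * t)."""
    return math.prod(math.exp(-rate * years) for rate in mode_rates.values())

def rate_without_modes(mode_rates: dict[str, float], mitigated: set[str]) -> float:
    """Remaining component rate if the listed modes were fully eliminated."""
    return sum(r for mode, r in mode_rates.items() if mode not in mitigated)

# Hypothetical split of a 0.0059 failures/unit-year component rate
# into three failure modes (shares are invented for illustration).
modes = {
    "hypothetical mode A": 0.0030,
    "hypothetical mode B": 0.0020,
    "hypothetical mode C": 0.0009,
}

total = component_failure_rate(modes)
print(f"component rate: {total:.4f} failures per unit-year")
print(f"R(10 y): {component_reliability(modes, 10):.4f}")
print(f"rate without mode A: {rate_without_modes(modes, {'hypothetical mode A'}):.4f}")
```

With statistics recorded per failure mode, a split of this kind could be estimated from data instead of assumed, and the effect of a maintenance action targeting one mode could be read off directly.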

Author Contributions

Conceptualization, A.A.Z., J.F., and P.J.D.C.B.; methodology, A.A.Z. and A.B.; validation, A.A.Z. and A.B.; formal analysis, A.A.Z. and A.B.; investigation, A.A.Z. and A.B.; writing—original draft preparation, A.A.Z.; writing—review and editing, A.A.Z., J.F. and P.J.D.C.B.; supervision, P.J.D.C.B.; funding acquisition, P.J.D.C.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially supported by national funds through the Fundação para a Ciência e a Tecnologia (FCT), through IDMEC, under LAETA project UID/EMS/50022/2020, and by Secretaría Nacional de Educación Superior, Ciencia, Tecnología e Innovación (SENESCYT) of the Ecuadorian Government through fellowship CZ05-000291-2017.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature and Abbreviations

FMEA: Failure Modes and Effects Analysis
RPN: Risk priority number
RCM: Reliability Centered Maintenance
FMECA: Failure Modes, Effects, and Criticality Analysis
PMU: Phasor Measuring Unit
CPN: Cost-priority number
FTA: Fault Tree Analysis
PV: Photovoltaic Power System
ESS: Energy Storage System
LAB: Lead-acid batteries
LCB: Lead–carbon batteries
OLTC: On-load tap changer
SAE: Society of Automotive Engineers
IEC: International Electrotechnical Commission
ECSS: European Cooperation for Space Standardization
NASA: National Aeronautics and Space Administration
SEV: Severity
OCC: Occurrence
DET: Detection
LAN: Local area network
WAN: Wide area network
HMI: Human-machine interface
SW: Ethernet switch
SV: Server
EB: Energy box
IED: Intelligent electronic device
ICCP: Inter-control center communications protocol
FTP: File transfer protocol

References

  1. Moslehi, K.; Kumar, R. A reliability perspective of the smart grid. IEEE Trans. Smart Grid 2010, 1, 57–64.
  2. Santacana, E.; Rackliffe, G.; Tang, L.; Feng, X. Getting smart. IEEE Power Energy Mag. 2010, 8, 41–48.
  3. Farhangi, H. The path of the smart grid. IEEE Power Energy Mag. 2010, 8, 18–28.
  4. Yssaad, B.; Abene, A. Rational reliability centered maintenance optimization for power distribution systems. Int. J. Electr. Power Energy Syst. 2015, 73, 350–360.
  5. Bertling, L. Reliability Centred Maintenance for Electric Power Distribution Systems. Ph.D. Thesis, Royal Institute of Technology KTH, Stockholm, Sweden, 2002.
  6. Mirsaeedi, H.; Fereidunian, A.; Mohammadi-Hosseininejad, S.M.; Lesani, H. Electricity distribution system maintenance budgeting: A reliability-centered approach. IEEE Trans. Power Deliv. 2018, 33, 1599–1610.
  7. Rafiei, M.; Khooban, M.-H.; Igder, M.A.; Boudjadar, J. A novel approach to overcome the limitations of reliability centered maintenance implementation on the smart grid distance protection system. IEEE Trans. Circuits Syst. II Express Briefs 2019, 67, 320–324.
  8. Tang, Y.; Liu, Q.; Jing, J.; Yang, Y.; Zou, Z. A framework for identification of maintenance significant items in reliability centered maintenance. Energy 2017, 118, 1295–1303.
  9. Pourahmadi, F.; Fotuhi-Firuzabad, M.; Dehghanian, P. Application of game theory in reliability-centered maintenance of electric power systems. IEEE Trans. Ind. Appl. 2017, 53, 936–946.
  10. Afzali, P.; Keynia, F.; Rashidinejad, M. A new model for reliability-centered maintenance prioritization of distribution feeders. Energy 2019, 171, 701–709.
  11. Sanz-Bobi, M.A.; Mazidi, P.; Sanchez, A.; Gonzalez, A.; Bachiller, R. Lifetime analysis of the main communication components in a smart distribution grid. In Proceedings of the International Telecommunications Energy Conference (INTELEC), Sentosa, Singapore, 13–17 October 2019.
  12. Shayesteh, E.; Yu, J.; Hilber, P. Maintenance optimization of power systems with renewable energy sources integrated. Energy 2018, 149, 577–586.
  13. Heylen, E.; Ovaere, M.; Proost, S.; Deconinck, G.; Van Hertem, D. Fairness and inequality in power system reliability: Summarizing indices. Electr. Power Syst. Res. 2019, 168, 313–323.
  14. Liu, W.; Gong, Q.; Han, H.; Wang, Z.; Wang, L. Reliability modeling and evaluation of active cyber-physical distribution system. IEEE Trans. on Power Syst. 2018, 33, 7096–7108.
  15. Wang, S.; Wu, Z.; Su, A.; Jin, S.; Xia, Y.; Zhao, D. Reliability modeling and simulation of cyber-physical power distribution system considering the impacts of cyber components and transmission quality. In Proceedings of the Chinese Control Conference (CCC), Wuhan, China, 25–27 July 2018.
  16. National Aeronautics and Space Administration (NASA). Reliability Centered Maintenance Guide for Facilities and Collateral Equipment; National Aeronautics and Space Administration: Washington, DC, USA, 2000.
  17. Liu, H. FMEA Using Uncertainty Theories and MCDM Methods; Springer Science & Business Media: Singapore, 2016.
  18. Leimeister, M.; Kolios, A. A review of reliability-based methods for risk analysis and their application in the offshore wind industry. Renew. Sustain. Energy Rev. 2018, 91, 1065–1076.
  19. International Electrotechnical Commission. IEC 60812 – Analysis Techniques for System Reliability – Procedure for Failure Mode and Effects Analysis (FMEA); International Electrotechnical Commission: Geneva, Switzerland, 2006.
  20. Bertling, L. Infrastructure Asset Management with Power System Application; CRC Press: Boca Raton, FL, USA, 2018.
  21. Carlson, C.S. Effective FMEAs: Achieving Safe, Reliable, and Economical Products and Processes Using Failure Mode and Effects Analysis; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2002.
  22. Otuoze, A.O.; Mustafa, M.W.; Larik, R.M. Smart grids security challenges: Classification by sources of threats. J. Electr Syst. Inf. Technol. 2018, 5, 468–483.
  23. Basu, K.; Padhee, M.; Roy, S.; Pal, A.; Sen, A.; Rhodes, M.; Keel, B. Health monitoring of critical power system equipment using identifying codes. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); LNCS; Springer: Cham, Switzerland, 2019; Volume 11260, pp. 29–41.
  24. Zúñiga, A.A.; Fernandes, J.F.P.; Branco, P.J.C. A Fuzzy-based failure modes and effects analysis (FMEA) in smart grids. In Advances in Intelligent Systems and Computing; Rocha, Á., Ferrás, C., Paredes, M., Eds.; International Conference on Information Technology & Systems ICITS 2019; Springer: Cham, Switzerland, 2019; Volume 918, pp. 507–516.
  25. Arabian-Hoseynabadi, H.; Oraee, H.; Tavner, P. Failure modes and effects analysis (FMEA) for wind turbines. Int. J. Electr. Power Energy Syst. 2010, 32, 817–824.
  26. Shafiee, M.; Dinmohammadi, F. An FMEA-based risk assessment approach for wind turbine systems: A comparative study of onshore and offshore. Energies 2014, 7, 619–642.
  27. Souza, R.D.Q. Fmea and fta analysis for application of the reliability-centered maintenance methodology: Case study on hydraulic turbines. ABCM Symp. Ser. Mechatron. 2008, 3, 803–812.
  28. Lifar, A.S.; Brom, A.E. FMECA use for the equipment reliability analysis in hydro-power engineering. In Proceedings of the IOP Conference Series: Earth and Environmental Science, Ota, Nigeria, 18–20 June 2019; Volume 72, pp. 1–6.
  29. Colli, A. Failure mode and effect analysis for photovoltaic systems. Renew. Sustain. Energy Rev. 2015, 50, 804–809.
  30. Villarini, M.; Cesarotti, V.; Alfonsi, L.; Introna, V. Optimization of photovoltaic maintenance plan by means of an FMEA approach based on real data. Energy Convers. Manag. 2017, 152, 1–12.
  31. Yang, J.; Hu, C.; Wang, H.; Yang, K.; Liu, J.B.; Yan, H. Review on the research of failure modes and mechanism for lead-acid batteries. Int. J. Energy Res. 2017, 41, 336–352.
  32. Lyu, D.; Ren, B.; Li, S. Failure modes and mechanisms for rechargeable Lithium-based batteries: A state-of-the-art review. Acta Mech. 2019, 230, 701–727.
  33. Pourramazan, A.; Saffari, S.; Barghandan, A. Study of Failure Mode and Effect Analysis (FMEA) on capacitor bank used in distribution power systems. Int. J. Innov. Res. Electr. Electron. Instrum. Control Eng. 2017, 5, 113–118.
  34. Pottonen, L.; Pulkkinen, U.; Koskinen, M. A method for analysing the effect of protection system failures on power system reliability. In Proceedings of the 15th Power Systems Computation Conference, Liege, Belgium, 22–26 August 2005.
  35. Araújo, W.P.P. Metodologia Fmea-Fuzzy Aplicada à Gestão de Indicadores de Continuidade Individuais de Sistemas de Distribuição de Energia Elétrica. Master’s Thesis, Universidade Federal de Santa Catarina, Florianopolis, Brazil, 2008.
  36. Saraswati, D.; Marie, I.A.; Witonohadi, A. Power transformer failures evaluation using Failure Mode Effect and Criticality Analysis (FMECA) method. Asian J. Eng. Technol. 2014, 2, 484–489.
  37. Singh, J.; Singh, S.; Singh, A. Distribution transformer failure modes, effects and criticality analysis (FMECA). Eng. Fail. Anal. 2019, 99, 180–191.
  38. Khalil, M.M.; Cristaldi, L.; Faifer, M. FMECA analysis for the assessing of maintenance activity for power transformers. In Proceedings of the Maintenance Performance Measurement and Management (MPMM) Conference; Imprensa da Universidade de Coimbra: Coimbra, Portugal, 2014; pp. 21–26.
  39. Hare, J.; Shi, X.; Gupta, S.; Bazzi, A. Fault diagnostics in smart micro-grids: A survey. Renew. Sustain. Energy Rev. 2016, 60, 1114–1124.
  40. Yssaad, B.; Khiat, M.; Chaker, A. Maintenance optimization for equipment of power distribution system based on FMECA method. Acta Electrotech. 2012, 53, 218–223.
  41. US Department of Defense. Mil.-Std-1629, Procedures for Performing a Failure Mode, Effects and Criticality Analysis; US Department of Defense: Arlington, VA, USA, 1949.
  42. Akbari, M.; Khazaee, P.; Sabetghadam, I.; Karimifard, O. Failure Modes and Effects Analysis (FMEA) for power transformers. In Proceedings of the 28th International Power System Conference, Tehran, Iran, 13 November 2013; pp. 1–7.
  43. Baleia, A.N.S. Failure Modes and Effects Analysis (FMEA) for Smart Electrical Distribution Systems. Master’s Thesis, Universidade de Lisboa, Lisboa, Portugal, 2018.
  44. Institute of Electrical and Electronics Engineers. IEEE Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems; IEEE: Piscataway, NJ, USA, 2007.
  45. Swedpower, A.B. T-Book Reliability Data of Components in Nordic Nuclear Power Plants; The TUD Office: Stockholm, Sweden, 2005.
  46. Backblaze. Hard Drive Stats for Q3 2017. 2018. Available online: https://www.backblaze.com/blog/hard-drive-failure-rates-q3-2017/ (accessed on 25 September 2019).
  47. CISCO. Cisco Industrial Ethernet 2000 Series Switches—Data Sheet; CISCO: San Jose, CA, USA, 2016; pp. 1–12.
  48. EKE-Electronics. Human-Machine Interface Technical Specifications N.D. Available online: https://www.eke-electronics.com/human-machine-interface (accessed on 30 September 2019).
  49. Scheer, G.W.; Dolezilek, D.J. Comparing the reliability of Ethernet network topologies in substation control and monitoring networks. In Proceedings of the 2nd Annual Western Power Delivery Automation Conference, Spokane, WA, USA, 27–29 March 2002; pp. 1–15.
  50. Frontier Economics Ltd. Research into the Costs of Smart Meters for Electricity and Gas DSOs; Frontier Economics Ltd.: London, UK, 2008.
  51. Berghmans, F.; Eve, S.; Held, M. An introduction to reliability of optical components and fiber optics sensors. In NATO Science for Peace and Security Series B: Physics and Biophysics; Bock, W.J., Gannot, I., Tanev, S., Eds.; Optical Waveguide Sensing and Imaging; Springer Science & Business Media: Dordrecht, The Netherlands, 2008; pp. 73–100.
Figure 1. (a) FMEA flowchart from [17,19]. (b) Example of an FMEA worksheet line and respective columns.
Figure 2. Schematic diagram of the cyber-power network representing the smart grid test system [43].
Table 1. Traditional ratings for failure mode’s occurrence (OCC) [17].

| Rating | Occurrence (OCC) | Possible Failure Rate |
|---|---|---|
| 10 | Extremely high | ≥1 in 2 |
| 9 | Very high | 1 in 3 |
| 8 | Repeated failures | 1 in 8 |
| 7 | High | 1 in 20 |
| 6 | Moderately high | 1 in 80 |
| 5 | Moderate | 1 in 400 |
| 4 | Relatively low | 1 in 2000 |
| 3 | Low | 1 in 15,000 |
| 2 | Remote | 1 in 150,000 |
| 1 | Nearly impossible | ≤1 in 150,000 |

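
The occurrence scale in Table 1 can be read as a lookup from an estimated failure probability to an OCC rating. The sketch below is one possible reading, not the authors' procedure: it assumes each "1 in N" entry is the lowest probability that still earns that rating, and it resolves the overlap between ratings 2 and 1 at exactly 1 in 150,000 in favour of rating 2. The names `OCC_THRESHOLDS` and `occ_rating` are illustrative.

```python
from fractions import Fraction

# (rating, threshold) pairs taken from Table 1, highest rating first.
# Assumption: a probability at or above the threshold earns the rating.
OCC_THRESHOLDS = [
    (10, Fraction(1, 2)),
    (9, Fraction(1, 3)),
    (8, Fraction(1, 8)),
    (7, Fraction(1, 20)),
    (6, Fraction(1, 80)),
    (5, Fraction(1, 400)),
    (4, Fraction(1, 2000)),
    (3, Fraction(1, 15000)),
    (2, Fraction(1, 150000)),
]


def occ_rating(failure_probability):
    """Return the OCC rating (1 to 10) for a failure probability in [0, 1]."""
    p = Fraction(failure_probability)
    if not 0 <= p <= 1:
        raise ValueError("failure probability must lie between 0 and 1")
    for rating, threshold in OCC_THRESHOLDS:
        if p >= threshold:
            return rating
    # Anything rarer than 1 in 150,000 falls in the lowest class.
    return 1


if __name__ == "__main__":
    for p in (Fraction(1, 2), Fraction(1, 100), Fraction(1, 10**6)):
        print(p, "->", occ_rating(p))
```

Exact fractions are used so that a value sitting right on a boundary, such as 1 in 20, is not pushed into the neighbouring class by floating-point rounding.
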
Table 2. Traditional ratings for failure mode’s severity (SEV) [17].

| Rating | Effect | Severity of Effect |
|---|---|---|
| 10 | Hazardous without warning | The highest severity ranking of a failure mode, occurring without warning and with consequent hazard. |
| 9 | Hazardous with warning | A higher severity ranking of a failure mode, occurring with a warning and with consequent hazard. |
| 8 | Very high | Operation of the system is broken down without compromising safety. |
| 7 | High | Operation of the system may be continued, but its performance is affected. |
| 6 | Moderate | Operation of the system is continued, but its performance is degraded. |
| 5 | Low | Performance of the system is seriously affected, and maintenance is needed. |
| 4 | Very low | Performance of the system is less affected, and maintenance may not be needed. |
| 3 | Minor | Minor effect on system performance and satisfaction. |
| 2 | Very minor | Slight effect on system performance and satisfaction. |
| 1 | None | No effect. |

Table 3. Traditional ratings for failure mode’s detection (DET) [17].

| Rating | Detection | Criteria |
|---|---|---|
| 10 | Absolutely impossible | Design control does not detect a potential cause of failure mode, or there is no design control |
| 9 | Very remote | Very remote chance the design control will detect a potential cause of the failure or subsequent failure mode |
| 8 | Remote | Remote chance the design control will detect a potential cause of the failure or subsequent failure mode |
| 7 | Very low | Very low chance the design control will detect a potential cause of the failure or subsequent failure mode |
| 6 | Low | Low chance the design control will detect a potential cause of the failure or subsequent failure mode |
| 5 | Moderate | Moderate chance the design control will detect a potential cause of the failure or subsequent failure mode |
| 4 | Moderately high | Moderately high chance the design control will detect a potential cause of the failure or subsequent failure mode |
| 3 | High | High chance the design control will detect a potential cause of the failure or subsequent failure mode |
| 2 | Very high | Very high chance the design control will detect a potential cause of the failure or subsequent failure mode |
| 1 | Almost certain | Design control will almost certainly detect a potential cause of the failure or subsequent failure mode |

Table 4. Failure rates attributed to each item of power equipment.

| Equipment | Failure Rate [(f/year)/km] | Length [km] | Failure Rate [f/year] | Source |
|---|---|---|---|---|
| Busbar 30 kV | - | - | 0.01 | [45] |
| Cable 30 kV | 0.054 | 2.5 | 0.135 | EDP Distribuição |
| Circuit breaker 30 kV | - | - | 0.023 | EDP Distribuição |
| Transformer | - | - | 0.01 | EDP Distribuição |

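
The cable entry in Table 4 is the only one given per unit length, so its annual failure rate follows from multiplying the per-kilometre rate by the cable length. A minimal sketch of that arithmetic, with the hypothetical helper name `line_failure_rate`:

```python
def line_failure_rate(rate_per_km_year, length_km):
    """Annual failure rate of a line from its per-km rate and its length."""
    if rate_per_km_year < 0 or length_km < 0:
        raise ValueError("rate and length must be non-negative")
    return rate_per_km_year * length_km


# Cable 30 kV values from Table 4: 0.054 (f/year)/km over 2.5 km.
cable_rate = line_failure_rate(0.054, 2.5)

if __name__ == "__main__":
    print(f"Cable 30 kV: {cable_rate:.3f} f/year")
```

The result agrees with the 0.135 f/year listed in the table.
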
Table 5. Cyber-power links between power and cyber network.

| Link | Linked Equipment |
|---|---|
| 1 | (IED1:BUS1), (IED1:CB2), (IED1:CB3) |
| 2 | (IED2:BUS2), (IED2:CB5), (IED2:CB6) |
| 3 | (IED3:BUS3), (IED3:CB8), (IED3:CB9) |
| 4 | (IED4:BUS4), (IED4:CB12), (IED4:CB13) |
| 5 | (IED5:CG), (IED5:CB1) |
| 6 | (IED6:WE), (IED6:CB4) |
| 7 | (IED7:ES), (IED7:CB11) |
| 8 | (IED8:PV), (IED8:CB14) |

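
The links in Table 5 amount to a mapping from each IED to the power equipment it monitors or controls. The sketch below stores that mapping as a dictionary and adds a reverse lookup. It reads both pairs of link 5 as belonging to IED5, and the names `CYBER_POWER_LINKS`, `power_equipment_for` and `ied_for` are illustrative rather than taken from the paper.

```python
# IED -> linked power equipment, transcribed from Table 5.
# Assumption: both pairs of link 5 belong to IED5.
CYBER_POWER_LINKS = {
    "IED1": ["BUS1", "CB2", "CB3"],
    "IED2": ["BUS2", "CB5", "CB6"],
    "IED3": ["BUS3", "CB8", "CB9"],
    "IED4": ["BUS4", "CB12", "CB13"],
    "IED5": ["CG", "CB1"],
    "IED6": ["WE", "CB4"],
    "IED7": ["ES", "CB11"],
    "IED8": ["PV", "CB14"],
}


def power_equipment_for(ied):
    """Power equipment that loses supervision if the given IED fails."""
    return list(CYBER_POWER_LINKS.get(ied, []))


def ied_for(equipment):
    """IED linked to a power equipment tag, or None if it is not listed."""
    for ied, linked in CYBER_POWER_LINKS.items():
        if equipment in linked:
            return ied
    return None


if __name__ == "__main__":
    print(power_equipment_for("IED1"))
    print(ied_for("CB11"))
```

A lookup of this kind makes it easy to trace which power components are touched by a cyber failure mode, and the reverse.
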
Table 6. Reliability values used in this work for each item of cyber-control equipment.

| Equipment | Failure Rate [f/year] | Source |
|---|---|---|
| HMI | 0.1720 | EKE-electronics [48] |
| SW | 0.0225 | Cisco [47] |
| SV | 0.0270 | Backblaze [46] |
| IED | 0.0526 | Schweitzer Engineering Laboratories [43] |
| EB | 0.0050 | Frontier Economics [50] |
| Ethernet link | 10⁻⁶ | - |
| Optical fiber link | 0.0438 | [51] |

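
To give the annual failure rates in Table 6 a more intuitive scale, they can be converted into a mean time to failure and a survival probability. The sketch below assumes a constant failure rate (the exponential model), which Table 6 itself does not state, so the outputs are indicative only. The Ethernet link is left out, and the names `FAILURE_RATES`, `mttf_years` and `reliability` are illustrative.

```python
import math

# Annual failure rates [f/year] transcribed from Table 6.
FAILURE_RATES = {
    "HMI": 0.1720,
    "SW": 0.0225,
    "SV": 0.0270,
    "IED": 0.0526,
    "EB": 0.0050,
    "Optical fiber link": 0.0438,
}


def mttf_years(rate_per_year):
    """Mean time to failure in years, assuming a constant failure rate."""
    if rate_per_year <= 0:
        raise ValueError("failure rate must be positive")
    return 1.0 / rate_per_year


def reliability(rate_per_year, years):
    """Probability of surviving `years` without failure: R(t) = exp(-rate * t)."""
    return math.exp(-rate_per_year * years)


if __name__ == "__main__":
    for name, rate in FAILURE_RATES.items():
        print(f"{name}: MTTF {mttf_years(rate):.1f} years, "
              f"R(1 year) = {reliability(rate, 1.0):.3f}")
```
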
Table 7. Failure modes for power equipment considered for analysis.

| Power Equipment | Failure Mode | Criteria |
|---|---|---|
| Busbar | Loss of structural integrity | The metallic strip can lose its mechanical integrity due to breakdown of the support insulators, cracking of welds and fracture of the copper bar. |
| Busbar | Loss of electrical continuity | The occurrence of arc flashes degrades the copper bar. |
| Busbar | Loss of electrical efficiency | Moisture and humidity can lead to short circuits. |
| Busbar | Electrical operation failure | Short circuits between buses and harmonics can lead to ohmic heating. |
| Power cable | Insulation failure | The aging process results in the eventual failure of the insulating and sheathing materials. |
| Power cable | Cable integrity defect | Manufacturing imperfections, incorrect installation or hostile environments can result in cable breakdown. |
| Power cable | Electrical operation failure | Moisture, shield damage, overloads or short circuits can damage the cable. |
| Circuit breaker | Insulation failure | Loss of dielectric properties can damage the CB. |
| Circuit breaker | Wrong operation | Improper parameterization or manual installation leads to spurious openings or closures. |
| Circuit breaker | Bushing breakdown | Lightning or external short circuits can damage the bushing. |
| Circuit breaker | Bushing terminal hot spot | Moisture can increase ohmic resistance in bushing terminals, resulting in bushing damage. |
| Circuit breaker | Loss of dielectric strength in bushings | Heat, oxidation, acidity, and moisture can lead to bushing degradation. |
| Circuit breaker | Mechanical failure in the operating mechanism | Lack of lubrication, contamination or corrosion prevents the CB from acting when necessary. |
| Circuit breaker | Contacts degradation | Contact wear and electrical treeing can damage the equipment. |
| Power transformer | Bushing breakdown | Lightning or external short circuits can damage transformer bushings. |
| Power transformer | Bushing terminal hot spot | Moisture can increase ohmic resistance in bushing terminals, resulting in bushing damage. |
| Power transformer | Loss of dielectric strength in bushings | Heat, oxidation, acidity, and moisture can lead to bushing degradation. |
| Power transformer | Magnetic-core delamination | Harmonics or corrosion can induce core degradation. |
| Power transformer | Tap changer mechanical failure in the drive mechanism | Corrosion, friction or contamination can lead to unstable transformer operation. |
| Power transformer | Tap changer contacts degradation | Contact wear and electrical treeing can lead to unstable transformer operation. |
| Power transformer | Tank rupture | Vibration-induced damage, corrosion or cracking of welds results in oil leakage and a possible catastrophic event. |
| Power transformer | Winding insulation degradation or breakdown | Oil contamination, oil moisture, short circuits or overloads can damage transformer windings. |
| Power transformer | Distortion, loosening or displacement of the windings | Can lead to short circuits. |
| Power transformer | Transformer explosion | Internal short circuits or human sabotage can lead to catastrophic events. |
| Power transformer | Cooling system failure | Damaged fans or obstructed cooling pipes can also lead to catastrophic events. |

Table 8. Failure modes for cyber-control equipment considered for analysis.

| Cyber Equipment | Failure Mode | Criteria |
|---|---|---|
| IED | Security failure | Related to the susceptibility of cyber equipment to lose its integrity. |
| IED | Power failure | Related to the remote disconnection of power, which affects the normal operation of the cyber network. |
| IED | Defective communication | Damaged transducers or a poor signal can lead to intermittent communication between the IED and the remaining cyber network. |
| SV | Security failure | Related to the susceptibility of cyber equipment to lose its integrity. |
| SV | Power failure | Related to the remote disconnection of power, which affects the normal operation of the cyber network. |
| SV | Data overload | Low storage capacity or an unexpectedly large amount of data to store results in defective data storage. |
| SV | Hardware crash | Physical damage, overheating, humidity or a hard drive crash all result in loss of data. |
| SV | Operational failure | Inherent software errors can corrupt stored data. |
| HMI | Security failure | Related to the susceptibility of cyber equipment to lose its integrity. |
| HMI | Power failure | Related to the remote disconnection of power, which affects the normal operation of the cyber network. |
| HMI | Data error | Related to inherent problems in the HMI operation that compromise its function. |
| SW | Security failure | Related to the susceptibility of cyber equipment to lose its integrity. |
| SW | Power failure | Related to the remote disconnection of power, which affects the normal operation of the cyber network. |
| SW | Performance decrease | The congestion of packets in the communication network can decrease SW operational performance. |
| SW | Network/Cyber storm | Uncontrolled broadcast of an excessive number of messages can congest SW operation. |
| SW | Operational failure | Inherent problems in SW configuration or a module failure can black out the SW. |
| EB | Security failure | Related to the susceptibility of cyber equipment to lose its integrity. |
| EB | Power failure | Related to the remote disconnection of power, which affects the normal operation of the cyber network. |
| EB | Communication error | A poor signal with the SV leads to no data transmission. |
| EB | Power consumption misreading | Manual manipulation or a significant measurement error leads to incorrect data acquisition. |
| EB | Operational failure | Improper EB programming or defective installation results in incorrect data acquisition. |
| EB | Catastrophic failure | Temperature stress can severely damage the EB. |

Table 9. Failure modes for network links considered for analysis.

| Network Link | Failure Mode | Criteria |
|---|---|---|
| Optical fiber link | Fracture | Stress, corrosion or fatigue can lead to microcracks, resulting in cable breakdown. |
| Optical fiber link | Lead–bonds degradation | Temperature stress can damage plated contacts. |
| Optical fiber link | Humidity induced | Electrochemical oxidation in transmitters and receivers. |
| Ethernet link | Crosstalk | Excessive traffic of packets results in congestion and overload of data. |
| Ethernet link | Integrity defect | Manufacturing imperfections, incorrect installation or degradation of RJ45 connectors result in delays in data transmission, or even its interruption. |
| Ethernet link | Link breakdown | Cable breakdown due to external physical damage. |

Table 10. Final RPN obtained for the selected highest-risk failure modes.

| Rank | Equipment | Failure Mode(s) | Failure Cause(s) | OCC | DET | SEV | RPN | Recommended Action(s) |
|---|---|---|---|---|---|---|---|---|
| 1 | Transformer | Transformer explosion | Internal short circuit | 5 | 10 | 9 | 450 | Real-time signal analysis |
| 2 | HMI | Operational failure | Human error | 5 | 10 | 5 | 400 | Hire or educate qualified employees |
| 3 | IED | Control failure | Defective data processing (software error) | 7 | 7 | 8 | 392 | Periodic software update |
| 4 | Busbar | Loss of structural integrity | Break of the support insulators | 6 | 9 | 7 | 378 | Implement hot spot alert strategies |
| 5 | Cable | Electrical operation failure | Short-circuit transients | 6 | 10 | 6 | 360 | Real-time current analysis |
| 6 | SW | Operational failure (SW blackout) | SW is locked up | 6 | 10 | 6 | 360 | Periodic reboot |
| 7 | Busbar | Loss of electrical continuity | Arc flash | 4 | 10 | 8 | 320 | Improve preventive maintenance actions |
| 8 | Busbar | Electrical disturbances | Short circuits between busbars | 4 | 10 | 8 | 320 | Real-time current analysis |
| 9 | Transformer | Distortion, loosening or displacement of the winding | Short circuits | 5 | 9 | 7 | 315 | Real-time current analysis |
| 10 | Busbar | Loss of structural integrity | Fracture of the copper bar | 5 | 9 | 7 | 315 | Implement hot spot alert strategies |
| 11 | Busbar | Loss of structural integrity | Cracking of connection welds | 5 | 9 | 7 | 315 | Implement hot spot alert strategies |
| 12 | CB | Bushing breakdown | External short circuit | 5 | 10 | 6 | 300 | Real-time current analysis |
| 13 | SV | Data errors | Software malfunction | 5 | 10 | 6 | 300 | Periodic software update; periodic data backup |
| 14 | SW | Operational failure (SW blackout) | Module failure | 5 | 10 | 6 | 300 | |
| 15 | Transformer | Winding overheating | Overload | 6 | 7 | 7 | 294 | Real-time signal analysis |
| 16 | Cable | Cable integrity defect | Lightning | 7 | 5 | 8 | 280 | Use of active lightning protection equipment |
| 17 | CB | CB contacts degradation | Electrical treeing (partial discharges) | 5 | 9 | 6 | 270 | Implement hot spot strategies |
| 18 | Busbar | Electrical disturbances | Harmonics | 4 | 8 | 8 | 256 | Real-time current analysis |
| 19 | SW | Performance decrease | Mististatic traffic | 7 | 6 | 6 | 252 | Establish optimized communication network topology for better performance; SW replacement |
| 20 | IED | Communication failure | Poor communication between IED and remaining cyber network | 5 | 8 | 6 | 240 | Establish alternative paths for communication |
| 21 | Transformer | Winding insulation degradation or breakdown | Short circuits and overloads | 4 | 10 | 6 | 240 | Real-time current analysis |
| 22 | Transformer | Bushing breakdown | External short circuit | 4 | 10 | 6 | 240 | Real-time current analysis |
| 23 | Transformer | Tank rupture | Cracking of welds | 3 | 9 | 8 | 215 | Implement hot spot strategies |
| 24 | IED | Power outages | Remote disconnection of power | 3 | 10 | 7 | 210 | Install an external backup battery (UPS) |
| 25 | SV | Power outages | Remote disconnection of power | 3 | 10 | 7 | 210 | Install an external backup battery (UPS) |
| 26 | CB | Insulation failure | Loss of dielectric properties | 5 | 7 | 6 | 210 | Signal analysis optimization in order to find opening patterns |
| 27 | IED | Communication failure | Network/Cyber storm | 5 | 7 | 6 | 210 | Establish an optimized communication network topology for better performance |
| 28 | SV | Security failure | Denial of Service attacks (DoS) | 2 | 10 | 10 | 200 | Enforce appropriate security policies |
| 29 | CB | Bushing terminal hot spot | Mechanical stress due to external short circuit conditions | 4 | 8 | 6 | 192 | Establish preventive cleaning and terminal squeeze routines |
| 30 | IED | Communication failure | Signal processing error (corrupted data) | 4 | 8 | 6 | 192 | |
| 31 | IED | Security failure | Faulty information injection (cyberattack) | 3 | 7 | 9 | 169 | Enforce appropriate security policies and configuration |
| 32 | IED | Monitoring failure | Significant measurement error | 5 | 6 | 6 | 180 | Cross data with other monitored data in the grid |
| 33 | HMI | Security failure | Human retaliation | 2 | 10 | 9 | 180 | Restrict access to specialist personnel and controlled by a security check |
| 34 | SW | Power outage | Remote disconnection of power | 3 | 10 | 6 | 180 | Install an external backup battery (UPS); install a PLC system |
| 35 | SW | Network/Cyber storm | Uncontrolled broadcast of an excessive number of messages (misleading information) | 4 | 7 | 6 | 168 | Install higher-performance SWs; establish communication network topology for better performance |
| 36 | Transformer | Cooling system failure | Cooling pipes obstruction | 3 | 7 | 8 | 168 | Periodic cooling system maintenance (check for leaks, rust or accumulation of dirt) |
| 37 | CB | Wrong operation (spurious opening and closure) | Overload | 6 | 4 | 7 | 168 | Real-time current analysis |
| 38 | Transformer | Magnetic-core delamination | Harmonics | 4 | 7 | 6 | 168 | Real-time current analysis |
| 39 | Transformer | Bushing terminal hot spot | Mechanical stress due to external short circuit conditions | 4 | 7 | 6 | 168 | Establish preventive cleaning and terminal squeeze routines |
| 40 | Transformer | Tap changer contacts degradation | Electrical treeing (partial discharges) | 3 | 9 | 6 | 162 | Implement hot spot alert strategies |
| 41 | Optical fiber link | Fracture | Stress, corrosion or fatigue leading to microcracks | 3 | 10 | 4 | 120 | Increase cable robustness |
| 42 | Optical fiber link | Humidity induced | Electrochemical oxidation of transmitter and receivers | 3 | 10 | 4 | 120 | Use of hermetically sealed package |

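
In classical FMEA the risk priority number is the product of the three ratings, RPN = OCC × DET × SEV, which is how the RPN column of Table 10 relates to the three columns before it. The sketch below recomputes the RPN for four rows transcribed from the table. The `FailureMode` class and the `EXAMPLES` list are illustrative names, not part of the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    equipment: str
    cause: str
    occ: int  # occurrence rating, 1 to 10 (Table 1)
    det: int  # detection rating, 1 to 10 (Table 3)
    sev: int  # severity rating, 1 to 10 (Table 2)

    def __post_init__(self):
        for name in ("occ", "det", "sev"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self):
        """Risk priority number: product of the three ratings."""
        return self.occ * self.det * self.sev


# Four rows transcribed from Table 10 (ranks 1, 3, 4 and 28).
EXAMPLES = [
    FailureMode("Transformer", "Internal short circuit", 5, 10, 9),
    FailureMode("IED", "Defective data processing (software error)", 7, 7, 8),
    FailureMode("Busbar", "Break of the support insulators", 6, 9, 7),
    FailureMode("SV", "Denial of Service attacks (DoS)", 2, 10, 10),
]

if __name__ == "__main__":
    for fm in EXAMPLES:
        print(f"{fm.equipment}: {fm.cause} -> RPN {fm.rpn}")
```

Because each rating runs from 1 to 10, the RPN ranges from 1 to 1000, and quite different rating combinations can produce the same value, as the several rows at 168 in Table 10 show.
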
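Table 11. Selected failure modes for analysis and discussion.

| Equipment | Failure Mode(s) | Failure Cause | OCC | DET | SEV | RPN | Rank |
|---|---|---|---|---|---|---|---|
| Busbar | Loss of structural integrity | Fracture of the copper bar | 5 | 9 | 7 | 315 | 4 |
| Busbar | Loss of structural integrity | Break of the support insulators | 6 | 9 | 7 | 378 | 1 |
| Busbar | Loss of structural integrity | Cracking of connection welds | 5 | 9 | 7 | 315 | 5 |
| Busbar | Electrical disturbances | Short circuit between bus bars | 4 | 10 | 8 | 320 | 3 |
| Busbar | Electrical disturbances | Harmonics | 4 | 8 | 8 | 256 | 7 |
| SW | Operational failure (SW blackout) | SW is locked up | 6 | 10 | 6 | 360 | 2 |
| SW | Operational failure (SW blackout) | Module failure | 5 | 10 | 6 | 300 | 6 |
| IED | Communication failure | Poor communication between IED and remaining cyber network | 5 | 8 | 6 | 240 | 8 |
| IED | Communication failure | Signal processing error (corrupted data) | 4 | 8 | 6 | 192 | 10 |
| IED | Communication failure | Network/Cyber storm | 5 | 7 | 6 | 210 | 9 |
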
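
The Rank column in Table 11 orders the ten selected causes by descending RPN. The sketch below reproduces that ordering from the RPN values alone. It assumes ties are broken by the order in which the rows appear in the table, which matches the two 315 entries but is not stated in the paper, and the names `SELECTED` and `rank_by_rpn` are illustrative.

```python
# (failure cause, RPN) pairs in the row order of Table 11.
SELECTED = [
    ("Fracture of the copper bar", 315),
    ("Break of the support insulators", 378),
    ("Cracking of connection welds", 315),
    ("Short circuit between bus bars", 320),
    ("Harmonics", 256),
    ("SW is locked up", 360),
    ("Module failure", 300),
    ("Poor communication between IED and remaining cyber network", 240),
    ("Signal processing error (corrupted data)", 192),
    ("Network/Cyber storm", 210),
]


def rank_by_rpn(entries):
    """Map each cause to its rank, with the highest RPN ranked first.

    sorted() is stable, so equal RPN values keep their input order.
    """
    ordered = sorted(entries, key=lambda entry: entry[1], reverse=True)
    return {cause: rank for rank, (cause, _) in enumerate(ordered, start=1)}


if __name__ == "__main__":
    for cause, rank in sorted(rank_by_rpn(SELECTED).items(), key=lambda kv: kv[1]):
        print(rank, cause)
```
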

Share and Cite

MDPI and ACS Style

Zúñiga, A.A.; Baleia, A.; Fernandes, J.; Branco, P.J.D.C. Classical Failure Modes and Effects Analysis in the Context of Smart Grid Cyber-Physical Systems. Energies 2020, 13, 1215. https://0-doi-org.brum.beds.ac.uk/10.3390/en13051215

