Correction published on 12 March 2021, see Algorithms 2021, 14(3), 86.
Article

Relaxed Rule-Based Learning for Automated Predictive Maintenance: Proof of Concept

Department of Computer Science, Brunel University, Uxbridge, London UB8 3PH, UK
* Author to whom correspondence should be addressed.
Submission received: 12 July 2020 / Revised: 25 August 2020 / Accepted: 28 August 2020 / Published: 3 September 2020

Abstract: In this paper we propose a novel rule learning approach called Relaxed Separate-and-Conquer (RSC): a modification of the standard Separate-and-Conquer (SeCo) methodology that does not require the elimination of covered rows. The method can be seen as a generalization of SeCo and weighted covering that does not suffer from fragmentation. We present an empirical investigation of the proposed RSC approach in the area of Predictive Maintenance (PdM) of complex manufacturing machines, with the goal of predicting forthcoming failures of these machines. In particular, our experiments use a real industrial case study of a machine that manufactures plastic bottle caps. We compare the RSC approach with Decision Tree (DT) based and SeCo rule learners and demonstrate that RSC significantly outperforms both. We conclude that the proposed RSC approach is promising for PdM guided by rule learning.

1. Introduction

Rule Learning (RL) is a well-known methodology of Machine Learning (ML). According to the Occam's Razor principle [1], smaller models tend to make more accurate predictions. Based on this principle, RL algorithms should be designed to create small sets of small rules.
Arguably, the best-known specific method of RL is the Decision Tree (DT) algorithm. However, the existing literature highlights the fact that if we insist on the 'tree-likeness' of the rule set, the rules become prohibitively long and complicated (Section 1.5.3 of [2]). This is due to the effect of fragmentation. A detailed discussion of this phenomenon is provided in Example 2; see also a related discussion in [3].
Therefore, an alternative approach has been actively investigated in which rules are created one by one, without insisting that the resulting set fits a DT structure. In order to design an algorithm based on this approach, the following two questions must be answered.
  • How to create a single rule?
  • How to create a collection of rules?
According to the Occam's Razor principle, an algorithm for composing a single rule must endeavor to make the rule as small and as precise as possible. Thus, the task of rule creation can be envisaged as an optimization problem whose objective function expresses a combination of these two criteria (plus a high-coverage criterion to avoid overfitting). For a reasonably complex domain such an optimization problem is intractable [1], hence there is no hope of obtaining an 'optimal' rule in a reasonable time. Consequently, the main methodology for obtaining a single rule is greedy local search (mainly Hill Climbing) [2]. The common feature of these algorithms is the absence of backtracking. In other words, a local search algorithm grows a rule by adding constraints on attributes one by one and cannot remove a constraint once it has been added.
In this local search framework, the main effort concentrates on the heuristic for choosing the next attribute constraint to be added to a rule. Moreover, the constraints have the special form outlined in Example 1 below. Throughout, we identify a dataset with a table, an attribute (attr) with a column of the dataset, and an instance with a row of the dataset.
Example 1.
Assume that our learning task is to learn a concept depending on 5 attributes attr_1, …, attr_5 of a dataset, and that these attributes have integer values ranging between 0 and 50. Then the rules are of the form attr_1 ∈ [10, 40] ∧ attr_3 ∈ [20, 35] → true. The above rule states that every instance of the dataset whose value of attr_1 lies between 10 and 40 and whose value of attr_3 lies between 20 and 35 'belongs' to the concept.
Note that the constraints in Example 1 are given in the form of intervals. Moreover, the same attribute can occur more than once; in this case, the actual constraint on the attribute is the intersection of the intervals of all occurrences of this attribute. For example, the rule of Example 1 can be rewritten as attr_1 ∈ [0, 40] ∧ attr_3 ∈ [20, 35] ∧ attr_1 ∈ [10, 50] → true.
The rule growth procedure starts from an empty rule and performs a number of iterations. Each iteration chooses an attribute and an interval and adds the respective constraint to the rule being formed, stating that the value of the chosen attribute belongs to the chosen interval. The procedure also needs a terminating condition. An obvious one is that all the instances covered by the current rule are invariant w.r.t. the concept being studied (all belong or all do not belong to the concept). However, such a terminating condition may lead to long rules that are potentially prone to overfitting. To avoid this situation, there are terminating conditions that stop the rule growth procedure even when full invariance has not been reached.
Let us now discuss approaches to forming a collection of rules. The main issue that needs to be addressed is the handling of conflicting predictions. Indeed, suppose that the same instance is covered by two rules, one of which states that the instance belongs to the concept while the other states that it does not. Another matter to address is the terminating condition for the procedure of forming a rule collection: we add rules to the collection one by one, so when do we stop?
Both of the above issues can be addressed relatively straightforwardly by the methodology of Separate-and-Conquer (SeCo) [2,3]. According to this methodology, when a new rule is formed, the instances covered by this rule are removed. So, the newly formed rules are guaranteed to cover new instances and the process stops when there are no new instances (of course, the terminating condition can be relaxed to avoid overfitting). Unlike in the case of DT, the rules may overlap. Indeed, suppose a rule R_1 has been formed and the instances covered by it removed. When a new rule R_2 is being formed, the procedure growing it does not 'see' the removed instances, but this does not mean that these instances cannot be covered. However, the rules formed by a SeCo procedure are ordered chronologically (according to the time at which they were formed). When a prediction is about to be made for some instance x, the prediction is made by the first rule covering this instance. To understand the intuition, assume that the instance x is covered by the 5th rule. Then rules 1 to 4 do not cover the instance, hence there is no point in using them for making the prediction. Rules 6 onward may also cover the instance; however, rule 5 was formed as a result of analysing a larger training set, so it is rational to assume that it will be more precise than the subsequent rules on the instances it covers.
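To make the ordered-rule ('first rule that fires wins') semantics concrete, here is a minimal Python sketch of prediction with a SeCo-style rule list. The rule and instance representations (dictionaries keyed by attribute name) are illustrative only, not the representation used later in the paper.

```python
# Minimal sketch of prediction with an ordered (SeCo-style) rule list.
# A rule is a hypothetical (conditions, outcome) pair, where conditions maps
# an attribute name to an allowed (low, high) interval.

def covers(conditions, instance):
    """True if the instance satisfies every interval constraint of the rule."""
    return all(low <= instance[attr] <= high
               for attr, (low, high) in conditions.items())

def predict(ordered_rules, instance, default=0):
    """Return the outcome of the first (earliest learned) rule covering the instance."""
    for conditions, outcome in ordered_rules:
        if covers(conditions, instance):
            return outcome
    return default  # no rule fires

# Example: two rules, listed in the order in which they were learned.
rules = [({"attr1": (10, 40), "attr3": (20, 35)}, 1),
         ({"attr2": (0, 5)}, 0)]
print(predict(rules, {"attr1": 15, "attr2": 3, "attr3": 25}))  # -> 1 (first rule wins)
```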
Both the DT and SeCo methods suffer from fragmentation, though for the SeCo algorithm the effect is milder, as demonstrated by Example 2 below.
Example 2.
[Fragmentation] Consider the same settings of the attributes as in Example 1. Suppose that the concept can be described as the disjunction of the following two rules.
1. R_1: attr_1 ∈ [10, 20] ∧ attr_2 ∈ [15, 40] → true.
2. R_2: attr_3 ∈ [20, 35] ∧ attr_4 ∈ [25, 50] → true.
The sets of instances covered by R_1 and R_2 clearly overlap. Therefore, after a SeCo procedure discovers the first rule, it is likely to be more difficult to discover the second one.
Indeed, suppose such a procedure discovers R_1. After that, the rule learner will have to look at the part of the dataset that is not covered by R_1 and discover rule R_2 there.
A rule learner that learns non-overlapping rules (e.g., a DT algorithm) will have to discover rules that are covered by R_2 but not covered by R_1. There are several ways to present the corresponding set of rules; the most compact of them looks as follows.
1. R_3: attr_1 ∈ [0, 9] ∧ attr_3 ∈ [20, 35] ∧ attr_4 ∈ [25, 50] → true.
2. R_4: attr_1 ∈ [21, 50] ∧ attr_3 ∈ [20, 35] ∧ attr_4 ∈ [25, 50] → true.
3. R_5: attr_2 ∈ [0, 14] ∧ attr_3 ∈ [20, 35] ∧ attr_4 ∈ [25, 50] → true.
4. R_6: attr_2 ∈ [41, 50] ∧ attr_3 ∈ [20, 35] ∧ attr_4 ∈ [25, 50] → true.
That is, instead of a single rule of length 2, the rule learner has to learn 4 rules of length 3. In the case of k short rules, the number of rules to be learned grows exponentially with the number of rules already learned. We informally refer to this effect as fragmentation.
In the case of the more general SeCo method, the situation is not as acute as in the case of non-overlapping rule learning, because there is no need to avoid overlapping with R_1. However, the dataset resulting from the removal of the rows covered by R_1 is smaller than the original dataset and, more importantly, is 'distorted' by a non-uniform removal of instances. As a result, learning R_2 in this distorted dataset becomes more difficult than in the original one. In the case of more than two rules to be learned, this difficulty becomes even more pronounced.
One way to address the above deficiency is to assign weights that direct the RL algorithm towards instances that have not yet been covered. The two main approaches of this kind are weighted covering [2] and boosting [4].
Weighted covering [2] attempts to generalize the SeCo method as follows. We can see SeCo as a method that assigns weight 1 to the instances not yet covered by the existing rules and 0 to the covered instances; new rules are then sought over instances of non-zero weight. Weighted covering uses more flexible methods of weight assignment. The related heuristics are organized so as to prefer heavier instances, which creates a 'fuzzy' version of SeCo. It is important to note that, whatever way the weights are assigned, the part of the dataset covered by the existing rules will be discriminated against relative to the instances that are not yet covered. In other words, the distortion of the dataset presented in Example 2 will still be present in the case of weighted covering.
The boosting method [5] is a theoretical approach whose purpose is to show that a reasonable but not very accurate learning algorithm can undergo several rounds of retraining in order to learn a concept with an arbitrary degree of accuracy. Applied to RL [4], the idea is that the learning algorithm first produces a single rule and then, as a result of boosting, new rules are added to the collection.
In this paper we propose an alternative rule learning approach that also considers instances already covered by the previous rules. We call this approach Relaxed Separate-and-Conquer (RSC). In particular, when a new rule is formed, it is only required to cover at least one instance not covered by the previous rules. This means that already covered instances are neither excluded (as in the case of Separate-and-Conquer) nor discriminated against (as in the case of weighted covering). However, the algorithm is forced to look not only at the already covered instances, but also elsewhere.
The proposed RSC approach generalizes both SeCo and weighted covering. In particular, any reasonable rule growing heuristic for SeCo or weighted covering can be simulated by an appropriate rule growing heuristic for the RSC method. Also, the rule growing heuristic can control the 'tree-likeness' of the rules and hence can simulate any DT algorithm. More technical details related to this generalization are provided in Section 2.1 (Theorem 1). In addition, in that subsection we propose Conjecture 1, formalizing the intuition outlined in Example 2 regarding the advantage of RSC over SeCo.
In this paper we apply the RSC method to failure prediction for complex manufacturing machines. These sophisticated machines are equipped with a series of sensors and actuators that provide a combination of real-time data about the state of the machine (performance) and the product state (quality) during the production process. The attributes of the dataset correspond to sensors and the attribute values are the respective sensor readings. The binary outcome column is interpreted as an alarm (outcome 1) or no alarm (outcome 0). The purpose of failure prediction is to notify the operator of a forthcoming failure. Therefore, there is no point in learning rules whose outcome is 0; all the rules learned by the method we present have 1 (an alarm) as the outcome. This allows us to introduce the following two simplifications.
  • The outcome can be omitted. Therefore, each rule can be presented as a conjunction of (attribute, interval) pairs.
  • Since all the rules have outcome 1, conflicting predictions between overlapping rules cannot occur.
The failure prediction task explored in this paper is an important method of Predictive Maintenance (PdM). PdM is a set of techniques helping engineers to organize maintenance based on actual information about forthcoming failures [6]. The main aim of PdM is to reduce the operating costs of two other maintenance strategies [6]: (1) Run-to-Failure (R2F), where corrective maintenance is performed only after the occurrence of failures; and (2) Preventive Maintenance (PvM), where equipment checks are performed at fixed periods of time. PdM also prolongs the useful life of the equipment [7] and optimizes the use and management of assets [8]. PdM uses predictive techniques, based on continuous machine monitoring, to decide when maintenance needs to be performed. The two main approaches to the design of PdM software are Discrete-Event Simulation [9] and ML.
The ML approach is based on predicting future performance from historical data. Large volumes of past performance data have been collected in large enterprises; with the advent of modern ML approaches, the analysis of these data can provide very useful information about future performance. There are many results applying ML to past performance data of equipment; see the surveys [10,11,12] for comprehensive overviews. The existing ML methodologies for PdM are based on methods such as Support Vector Machines [6,13,14,15,16,17,18], k-Nearest Neighbors [6,13,16], Artificial Neural Networks and Deep Learning [16,19,20], stochastic processes [21], K-means [13,16,22], and Bayesian reasoning [23]. Ensemble methodologies, where several methods are used and the weighted average of their predictions is taken, are reported in, e.g., [24,25,26].
Rule-based methods are rather under-represented in PdM. DT based methods have been proposed in, e.g., [16,27,28]. Random Forest (RF) based methods have been used in, e.g., [29,30,31]. The use of more generic rule learning such as SeCo is even more limited in the area of PdM: we are only aware of [32] (we thank the anonymous reviewer for bringing this paper to our attention). Our paper reports progress towards further exploration of the potential of rule learning in the area of PdM.
In the context of failure prediction, we report the following technical results.
  • We present a generic framework for forming a collection of rules according to the RSC approach. In particular, this framework allows the implementation of a wide range of heuristics.
  • We present one particular heuristic that aims to maximize the precision of the newly formed rule as well as the coverage of the positive instances that are not covered by the previous rules.
  • We present an empirical investigation of the resulting rule learner. In particular, we compare the RSC approach with DT based and SeCo rule learners on two domains:
    (a) a randomly generated dataset simulating alarms caused by a small number of factors;
    (b) a real industrial dataset collected from a machine that manufactures plastic bottle caps, recording the alarms that occurred in this machine and the associated sensor readings.
    In both cases the RSC algorithm significantly outperforms the DT based rule learner and the SeCo method using the same heuristic. RSC produces a set of rules that is smaller and much more accurate than those produced by the DT based and SeCo rule learners. We conclude that RSC is a promising approach deserving further investigation.
The rest of the paper is organized as follows. In Section 2 we describe the Relaxed Separate-and-Conquer (RSC) rule learning approach and provide its theoretical justification. In Section 3 we report the experiments. Section 4 concludes the paper.

2. Relaxed Separate-and-Conquer Rule Learning Approach

In this section, we describe the Relaxed Separate-and-Conquer (RSC) method of rule learning. We emphasize that, like Separate-and-Conquer (SeCo), this is an approach rather than a single algorithm: several heuristic choices need to be made in order to turn this approach into an algorithm. We present the approach equipped with a quite straightforward heuristic based on common sense. We also demonstrate that the RSC approach generalizes the SeCo and weighted covering methods, the latter under a mild restriction.
In order to present the RSC approach, we first introduce the related terminology. Our dataset is presented as a table called DATA having n + 1 columns. The first n columns are referred to as attributes attr_1, …, attr_n. The values of attr_i are integers between 0 and some maximum possible value max_i. The last column is called the outcome and is denoted by out. The out column is binary with possible values 1 (interpreted as 'alarm') and 0 ('no alarm'). Our aim is to learn rules predicting alarms. We assume that no two distinct rows have the same tuple of attribute values, so that the dataset represents a function.
Definition 1.
An attribute-value pair (AVP) is a pair (j, [a, b]) where 1 ≤ j ≤ n and 0 ≤ a ≤ b ≤ max_j. A row DATA[i] of DATA is covered by (j, [a, b]) if a ≤ DATA[i][j] ≤ b. In other words, an AVP (j, [a, b]) restricts the values of attr_j to [a, b].
Definition 2.
A rule is a set of AVPs. A row of DATA is covered by the rule if it is covered by all of its AVPs. In other words, we can see a rule as a conjunction of AVPs.
Definition 3.
A collection of rules is a set of rules. A row is covered by a collection of rules if it is covered by at least one rule of the collection.
Thus, a collection of rules can be seen as a monotone (negation-free) Disjunctive Normal Form (DNF) with AVPs used instead of Boolean variables.
The algorithm consists of a generic function for forming a collection of rules and a function for growing a single rule; the latter needs a heuristic to choose the next AVP (if any) to add to the current rule.
The main function is called RSC, an abbreviation of Relaxed Separate-and-Conquer. It starts with an empty collection of rules and repeatedly runs a function FormRule that returns a rule. If this rule is not empty, it is added to the collection. If the rule returned by FormRule is empty, the algorithm stops and the collection formed so far is returned. The pseudocode of the RSC and FormRule functions is given in Algorithm 1.
Note that it is the responsibility of the function FormRule to ensure that the loop of the function RSC stops: FormRule must eventually return an empty rule. FormRule runs a heuristic function called ChooseNext. ChooseNext either returns an AVP, which is added to the rule being formed, or returns nil, meaning that the heuristic has determined that the current rule should not be grown further. In this case FormRule returns the current rule.
The ChooseNext heuristic is, as mentioned above, central to turning the approach into an algorithm. The heuristic chooses whether to return an AVP and, if so, which one. The RSC approach does not prescribe a particular algorithm for ChooseNext; however, it imposes one important constraint: the returned AVP must cover a row not covered by the current collection of rules. The particular algorithm for ChooseNext presented below is just one possible variant fitting this pattern. First of all, for the sake of speed, rather than running through all AVPs, the heuristic runs only through half-intervals of the attributes, as defined below.
Algorithm 1 Relaxed Separate-and-Conquer (RSC) Rule Learning approach.
function RSC()
    Collection ← ∅
    loop
        Rule ← FormRule(Collection)
        if Rule = ∅ then
            return Collection
        end if
        Collection ← Collection ∪ {Rule}
    end loop
end function
function FormRule(Collection)
    Rule ← ∅
    loop
        AVP ← ChooseNext(Collection, Rule)
        if AVP = nil then
            return Rule
        end if
        Rule ← Rule ∪ {AVP}
    end loop
end function
Definition 4.
An AVP (j, [a, b]) is a half-interval if either a = 0 or b = max_j.
In other words, (j, [a, b]) is a half-interval if either a is the smallest value of attr_j or b is the largest value of this attribute. For attribute j there are 2·max_j half-intervals and O(max_j^2) AVPs in general; therefore, going through half-intervals only significantly saves runtime. Note that the expressive power is not affected because any AVP can be seen as a rule consisting of two half-intervals.
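As a small illustration, the following sketch enumerates the half-intervals that such a heuristic would iterate over for an attribute with values 0, …, max_j; the function name and representation are ours, not part of the original algorithm.

```python
# Sketch: enumerate all half-intervals (j, [a, b]) with a = 0 or b = max_j.
def half_intervals(max_vals):
    """max_vals[j] is the largest value of attribute j; yields AVPs (j, (a, b))."""
    for j, mx in enumerate(max_vals):
        for b in range(mx):            # lower half-intervals [0, b] with b < max_j
            yield (j, (0, b))
        for a in range(1, mx + 1):     # upper half-intervals [a, max_j] with a > 0
            yield (j, (a, mx))

print(len(list(half_intervals([3]))))  # 2 * max_j = 6 half-intervals for max_j = 3
```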
The pseudocode of the ChooseNext heuristic is provided in Algorithm 2. ChooseNext uses two auxiliary functions: IsChosen and IsReplaced. The function IsChosen operates when no AVP has yet been chosen to add to the current rule; it decides whether the currently considered interval is a viable (though possibly not the best) candidate for rule growth. The function IsReplaced operates when there is already a candidate AVP to be returned and a new one is considered; it decides whether the new AVP is preferable to the current favorite.
It is the responsibility of IsChosen to ensure that the whole algorithm does not enter an infinite loop. In particular, when all the rows with outcome 1 have been covered by the current collection of rules, IsChosen must reject all candidate AVPs. Then an empty rule will be returned by the function FormRule and the run of the main function RSC will terminate.
In order to describe the functions IsChosen and IsReplaced, we need to introduce further terminology. First of all, each row of the table DATA is associated with its index (as usual: row 1, row 2, and so on). When we refer to a set of rows, we mean the set of their respective indices.
Let R be a rule. We denote by POS(R) and NEG(R) the sets of rows covered by R that have positive and negative outcomes, respectively. That is, POS(R) ∪ NEG(R) is the total set of rows covered by R.
Definition 5.
The precision prec(R) of a rule R is defined as follows. If POS(R) ∪ NEG(R) = ∅ then prec(R) = 0. Otherwise, prec(R) = |POS(R)| / (|POS(R)| + |NEG(R)|).
Algorithm 2 ChooseNext heuristic for RSC approach.
function ChooseNext(Collection, Rule)
    CurAVP ← nil
    for each half-interval AVP do
        if CurAVP = nil then
            if IsChosen(Collection, Rule, AVP) then
                CurAVP ← AVP
            end if
        else
            if IsReplaced(Collection, Rule, CurAVP, AVP) then
                CurAVP ← AVP
            end if
        end if
    end for
    return CurAVP
end function
Let C be a collection of rules. Then POS(C) = ⋃_{R ∈ C} POS(R). In other words, the set of positive rows covered by the collection is the union of the positive rows covered by the rules of the collection.
Definition 6.
Let C be a collection of rules and let R be a rule such that R ∉ C. Then the free coverage of R w.r.t. C is POS(R) ∖ POS(C), denoted by Free(R, C). In other words, the free coverage consists of the positive rows that are covered by the new rule R being formed but are not covered by the current collection C of rules.
The pseudocode of the function IsChosen is given in Algorithm 3. IsChosen uses two parameters (thresholds), init_free and init_prec. They are not specified by the algorithm and their right values are determined experimentally. IsChosen decides not to grow the rule with AVP if the result of adding AVP to the current rule covers fewer 'new' positive rows than the specified threshold. For this condition to prevent the whole algorithm from running into an infinite loop, init_free must be at least 1. Setting the parameter to a larger value will force new rules to cover more new positive rows and, as a result, may decrease the total number of rules needed. The initial precision threshold init_prec is not necessary for a properly functioning algorithm. However, by making sure that the initial precision is sufficiently high, the algorithm potentially avoids creating overly long rules.
The pseudocode of the function IsReplaced is also provided in Algorithm 3. IsReplaced compares two different AVPs to be added to the current rule. In order to compare them, IsReplaced forms two new rules, Rule1 and Rule2: Rule1 with the current best candidate added to the current rule and Rule2 with the new AVP added. If Rule2 covers fewer new rows than init_free then the new AVP is immediately discarded. The new AVP replaces the current one if the precision of Rule2 is greater than the precision of Rule1. Another reason to prefer the new AVP is that Rule2 and Rule1 have the same precision but Rule2 has a greater free coverage. In fact, the function is ready to sacrifice a little precision for the sake of greater coverage. In particular, we introduce a parameter prec_loss and consider Rule2 preferable to Rule1 if the precision of Rule2 is at least the precision of Rule1 minus prec_loss but its free coverage is larger.
Algorithm 3 Two auxiliary functions for ChooseNext heuristic.
function IsChosen(Collection, Rule, AVP)
    Rule1 ← Rule ∪ {AVP}
    if |Free(Rule1, Collection)| < init_free then
        return false
    end if
    if prec(Rule1) < init_prec then
        return false
    end if
    return true
end function
function IsReplaced(Collection, Rule, CurAVP, AVP)
    Rule1 ← Rule ∪ {CurAVP}
    Rule2 ← Rule ∪ {AVP}
    if |Free(Rule2, Collection)| < init_free then
        return false
    end if
    if prec(Rule2) > prec(Rule1) then
        return true
    end if
    if prec(Rule2) < prec(Rule1) − prec_loss then
        return false
    else if |Free(Rule2, Collection)| > |Free(Rule1, Collection)| then
        return true
    end if
    return false
end function
The purpose of the parameters. init_prec allows the whole algorithm to stop even if a small percentage of rows with outcome 1 remains uncovered; in particular, this parameter is used to fight off noise. The parameter prec_loss helps to create rules that are possibly not 100% accurate but have good coverage. Changing these parameters can affect (positively or negatively) the quality of rule learning. An extensive study of the right choice of parameters for SeCo has been performed in [33,34]; studying the interplay of these parameters for RSC is left for future work.
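For concreteness, the following Python sketch puts Algorithms 1–3 together on a small in-memory dataset. It is only a proof-of-concept transcription of the pseudocode under stated assumptions (rows are (values, outcome) pairs, rules are sets of AVPs, and the parameter defaults follow the values used in Section 3). The guard against re-adding an AVP already in the rule and the early stop once a rule becomes 100% precise are our additions to keep the sketch terminating, since the pseudocode delegates termination to the heuristic; this is not the authors' implementation.

```python
# Proof-of-concept transcription of Algorithms 1-3 (a sketch, not the authors' code).
# A row of DATA is a pair (values, out); a rule is a frozenset of AVPs (j, (lo, hi)).

def covers(rule, values):
    return all(lo <= values[j] <= hi for j, (lo, hi) in rule)

def pos_neg(rule, data):
    pos = {i for i, (v, out) in enumerate(data) if out == 1 and covers(rule, v)}
    neg = {i for i, (v, out) in enumerate(data) if out == 0 and covers(rule, v)}
    return pos, neg

def prec(rule, data):                              # Definition 5
    pos, neg = pos_neg(rule, data)
    return len(pos) / (len(pos) + len(neg)) if pos or neg else 0.0

def free(rule, covered_pos, data):                 # Definition 6 (positive rows only)
    return pos_neg(rule, data)[0] - covered_pos

def half_intervals(max_vals):
    for j, mx in enumerate(max_vals):
        for b in range(mx):
            yield (j, (0, b))
        for a in range(1, mx + 1):
            yield (j, (a, mx))

def choose_next(covered_pos, rule, data, max_vals,
                init_free=1, init_prec=0.01, prec_loss=0.005):
    """Algorithm 2 with the IsChosen/IsReplaced tests of Algorithm 3 inlined."""
    cur = None
    for avp in half_intervals(max_vals):
        if avp in rule:                            # our addition: keeps the sketch finite
            continue
        cand = rule | {avp}
        if len(free(cand, covered_pos, data)) < init_free:
            continue                               # rejected by IsChosen / IsReplaced
        if cur is None:
            if prec(cand, data) >= init_prec:      # IsChosen
                cur = avp
            continue
        best = rule | {cur}                        # IsReplaced
        if prec(cand, data) > prec(best, data):
            cur = avp
        elif (prec(cand, data) >= prec(best, data) - prec_loss and
              len(free(cand, covered_pos, data)) > len(free(best, covered_pos, data))):
            cur = avp
    return cur

def rsc(data, max_vals, **params):                 # Algorithm 1 (RSC + FormRule)
    collection, covered_pos = [], set()
    while True:
        rule = frozenset()
        while True:                                # FormRule
            avp = choose_next(covered_pos, rule, data, max_vals, **params)
            if avp is None:
                break
            rule = rule | {avp}
            if prec(rule, data) == 1.0:            # our addition: stop once the rule is pure
                break
        if not rule:
            return collection
        collection.append(rule)
        covered_pos |= pos_neg(rule, data)[0]

# Toy usage: out = 1 iff attr0 lies in [2, 4]; attr1 is irrelevant.
data = [((v0, v1), int(2 <= v0 <= 4)) for v0 in range(6) for v1 in range(6)]
print(rsc(data, max_vals=[5, 5]))
```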
SeCo with the ChooseNext heuristic. Below and in the next section, we use the SeCo method running exactly the same ChooseNext heuristic (Algorithms 2 and 3) as RSC. The only modification needed is a different way of calculating precision: without taking into account the rows covered by the collection of existing rules. Let us state this formally.
Let C be the current collection of rules and let COV(C) = ⋃_{R ∈ C} (POS(R) ∪ NEG(R)). Let R be a new rule. Let FreePOS(R, C) = POS(R) ∖ COV(C) and FreeNEG(R, C) = NEG(R) ∖ COV(C). The quantity free_prec(R, C) is defined as follows. If FreePOS(R, C) ∪ FreeNEG(R, C) = ∅ then free_prec(R, C) = 0. Otherwise, free_prec(R, C) = |FreePOS(R, C)| / (|FreePOS(R, C)| + |FreeNEG(R, C)|). SeCo uses free_prec(R, C) instead of prec(R), at exactly the same places.
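A direct transcription of this modified precision, in the same illustrative row representation as the sketch above, might look as follows.

```python
# Sketch of the modified precision used by the SeCo baseline (same row format as above).
def covers(rule, values):
    return all(lo <= values[j] <= hi for j, (lo, hi) in rule)

def free_prec(rule, covered_rows, data):
    """Precision computed only over rows outside COV(C) (here: covered_rows)."""
    free_pos = sum(1 for i, (v, out) in enumerate(data)
                   if i not in covered_rows and out == 1 and covers(rule, v))
    free_neg = sum(1 for i, (v, out) in enumerate(data)
                   if i not in covered_rows and out == 0 and covers(rule, v))
    total = free_pos + free_neg
    return free_pos / total if total else 0.0
```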

2.1. Advantages of RSC Versus Methods of Separate-and-Conquer (SeCo) and Weighted Covering

Definition 7.
Let C be a collection of rules and let R be a new rule. We say that R is reasonable w.r.t. C if Free(R, C) ≠ ∅.
The only constraint of the RSC method is that each new rule is reasonable w.r.t. the collection of the previously formed rules. This condition is significantly weaker than that required for SeCo.
Indeed, let C be a collection of rules and recall that COV(C) = ⋃_{R ∈ C} (POS(R) ∪ NEG(R)). The SeCo method, having formed C, excludes the rows of COV(C) from the dataset. A new rule R must have positive coverage outside of COV(C); otherwise such a rule simply does not make sense. Clearly, such a rule R is reasonable.
Note also that the ChooseNext heuristic receives the current collection C of rules as an argument. Therefore, ChooseNext can compute COV(C) and hence implement any SeCo heuristic. We conclude that the RSC method generalizes SeCo.
Having access to COV(C) also makes it possible to implement any weight function within ChooseNext. We conclude that any rule growing heuristic for weighted covering that is guaranteed to return a reasonable rule w.r.t. C can be implemented within ChooseNext.
The above discussion is summarized by the following theorem.
Theorem 1.
The RSC method is a generalization of the SeCo method. The RSC method also generalizes weighted covering for any rule growing heuristic that is guaranteed to return a rule that is reasonable w.r.t. the currently formed collection of rules.
Remark 1.
1. It is unlikely that weighted covering can simulate RSC. Indeed, any assignment of weights discriminates against the rows covered by the existing collection of rules. This is a stronger constraint than the requirement of RSC that the new rule merely be reasonable.
2. Our implementation of the RSC method maintains COV(C) and Free(R, C), where C is the current collection of rules and R is the new rule being formed. Therefore, simulation of the weighted covering or SeCo methods does not involve any computational overhead.
3. Since the ChooseNext heuristic receives the current collection of rules as an argument, it can enforce tree-likeness of the collection of rules. Hence, any DT algorithm can easily be implemented within this framework.
Thus we have seen that RSC generalizes SeCo. We now need to show whether there is any advantage in this generalization. In the next section, we provide empirical evidence to that effect. In the rest of this section we argue that RSC is better than SeCo also from a theoretical perspective. In particular, we propose a conjecture that, in order to achieve comparable performance, SeCo needs a much larger training set.
This conjecture is stated for a broad domain called truth table learning, see Section 3.1.
We start by considering one particular scenario in which a rule learner has little choice but to reach a wrong conclusion. In particular, consider a set of rules R = (x_1 ∧ x_2) ∨ (x_3 ∧ x_4) over a binary domain (that is, two rules x_1 = 1 ∧ x_2 = 1 and x_3 = 1 ∧ x_4 = 1). Assume further that the rule learning algorithm runs on the following rather unfortunate training set: in all the rows covered by R the variable x_5 equals 1, and in all the rows not covered by R the variable x_5 equals 0. In this case, a rule learner seeking to learn a short rule would gladly report that the underlying rule is x_5 (that is, the outcome equals 1 whenever x_5 = 1).
The above anomaly can easily occur in small training sets, but the larger the training set becomes, the less likely anomalous patterns are to occur, because many random choices tend to concentrate around the expectation. In the particular example above, the values of x_5 can be considered as outcomes of independent coin tosses. If there are many such tosses then the percentages of 1 and 0 outcomes are likely to be close to 50%. Consequently, if there are many rows covered by rule R and many rows not covered by rule R, then the above anomaly is very unlikely to happen.
The discussion above suggests that a rule learner needs a sufficiently large training set in order to work properly. Let us formalise this intuition. Suppose that we have n variables and the domain of each variable has m values. Further, let f be a function on these variables induced by a set S of at most r random rules, each involving at most k variables. Let A be a rule learner. Let us denote by Q_A(n, m, r, k) the size of a training set such that, with high probability, A guesses the function correctly given a training set of this size. Denote by Q_SECO and Q_RSC the respective training set sizes for SeCo with the ChooseNext heuristic and for RSC. Then we make the following conjecture.
Conjecture 1.
Q_SECO(n, m, r, k) is exponentially (by a factor of about 2^r) larger than Q_RSC(n, m, r, k).
The intuition behind this conjecture is that SeCo in fact considers not one but many training sets, obtained from the original set by removal of the rows covered by the already discovered rules. Since we do not know in advance which rules will be discovered first, we must consider removal of rows covered by all possible 2^r subsets of rules. After those removals, the remaining training set must still be sufficiently large to derive the remaining rules. On the other hand, RSC is not subject to such a constraint. Thus we predict that the training set needed for good performance of SeCo is larger by a factor exponential in r than that needed for RSC. This exponential factor is the price paid for the distortion of the learning space carried out by SeCo during its run.
Conjecture 1 is closely related to the so-called Juntas Learning Problem [35], which is essentially a theoretical abstraction of the task of feature selection. The important difference is that we consider not the problem in general but rather specific algorithms for solving it.

3. Experiments

The purpose of this section is to empirically assess the potential of our Relaxed Separate-and-Conquer (RSC) approach. For this purpose, we compare RSC with Decision Tree (DT) and Separate-and-Conquer (SeCo) methods. We use the SeCo method equipped with the same heuristic as RSC (but computed over the part of the dataset not yet covered by the current collection of rules). Below we overview the DT method that we use for the experiments.
In the context of ML, a DT is a directed rooted tree whose non-leaf nodes correspond to conditions on attributes of a dataset and whose leaves correspond to outcomes. The outgoing edges of each non-leaf node are labeled with True and False, indicating whether or not the condition associated with that node is satisfied. Thus each edge is associated with a condition, which is either the condition associated with its tail or the negation of this condition.
The semantics of a DT is tied to its root-leaf paths. Each such path P is seen as the set of conditions Cond_1, …, Cond_q associated with the edges of P, plus the outcome out associated with the leaf. Thus each root-leaf path P of a DT can be seen as a rule of the form Cond_1 ∧ … ∧ Cond_q → out, where Cond_1 ∧ … ∧ Cond_q is the body of the rule, consisting of the conjunction of the individual conditions, and out is the outcome of the rule.
The procedure of turning a DT into a set of rules as described above is called linearization. For example, the rules corresponding to the DT in Figure 1 are the following: (A < 3) ∧ (B > 5) → 1; (A < 3) ∧ (B ≤ 5) → 0; (A ≥ 3) → 1.
We use the standard DT algorithm provided by the Python ML library Scikit-Learn [36], with the Gini index serving as the splitting criterion and the DT depth upper-bounded by 7. To obtain a set of rules, the resulting DT is linearized. For failure prediction we have only two types of outcome: out = 1, associated with a failure, and out = 0 otherwise. We record only those rules whose outcome is 1. In other words, we ignore the rules with outcome 0, which explain why a particular failure does not occur, because these rules are simply not relevant for our task.
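The linearization step is straightforward to reproduce. The sketch below (not the authors' code) fits a Scikit-Learn tree on toy data, walks it via the tree_ arrays, and prints one rule per root-leaf path whose predicted class is 1.

```python
# Sketch: linearize a fitted DecisionTreeClassifier into "condition AND ... -> 1" rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.randint(0, 50, size=(500, 5))          # toy data: 5 integer attributes
y = ((X[:, 0] >= 10) & (X[:, 0] <= 40) & (X[:, 2] >= 20) & (X[:, 2] <= 35)).astype(int)

clf = DecisionTreeClassifier(criterion="gini", max_depth=7).fit(X, y)
tree = clf.tree_

def paths(node=0, conds=()):
    """Yield (conditions, predicted_class) for every root-leaf path of the tree."""
    if tree.children_left[node] == tree.children_right[node]:      # leaf node
        yield conds, int(np.argmax(tree.value[node][0]))
        return
    f, t = tree.feature[node], tree.threshold[node]
    yield from paths(tree.children_left[node], conds + (f"attr{f} <= {t:.1f}",))
    yield from paths(tree.children_right[node], conds + (f"attr{f} > {t:.1f}",))

for conds, outcome in paths():
    if outcome == 1:                                  # keep only alarm-predicting rules
        print(" AND ".join(conds) if conds else "true", "-> 1")
```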
Choice of parameters. As specified in the previous section, the RSC algorithm requires setting three parameters: init_free, init_prec and prec_loss. In all of our experiments, we set these parameters to 1, 1% and 0.5%, respectively.
The rest of this section consists of two subsections. In each subsection we consider a particular domain and compare our RSC approach with the DT based and SeCo rule learners on that domain.

3.1. Learning the Truth Table of the Given Collection of Rules

Any function on finite-domain variables can be defined by a truth table. The truth table consists of all possible tuples of assignments of the variables to their domain values, and each tuple is associated with the respective value of the function.
In our first experiment we randomly generate a small collection of small rules and then randomly select a subset S of the rows of the truth table of this collection. Next, we run an RL algorithm (RSC, SeCo, or the DT based rule learner) on S with the goal of creating a collection of rules that matches the original one as closely as possible.
The rest of the subsection is organized as follows.
  • We define a truth table for a collection of rules.
  • We specify an algorithm for the generation of a random collection of rules and of a random subset of its truth table.
  • We describe the tests that we performed and their results.
Truth table for a collection of rules and the induced function.
A collection of rules can be associated with many truth tables. This is because, in addition to the variables of the rules, the truth table can also contain many variables that are not essential for the rules. However, since the RL algorithm does not 'know' that these extra variables are not essential, they make the RL task more difficult.
For example, consider a single rule consisting of a single AVP (x_1, [2, 4]). A truth table for this rule may contain 100 variables x_1, …, x_100. The domain of each variable can be, e.g., {1, …, 10}. However, the value of the respective function is determined only by the above AVP: it is 1 if the value of x_1 is between 2 and 4, and 0 otherwise.
With the above example in mind, we give below a formal definition of a truth table for a collection of rules. As an intermediate notion we also define the function induced by a collection of rules, a notion that we will use for the description of a training set.
Definition 8.
Let C be a collection of rules. Let X be the set of variables of C. For each x ∈ X, let val(x) be the set of values of x used in the rules of C. Let X* be a set of variables such that X ⊆ X*. For each x ∈ X*, the domain dom(x) of x is defined under the following constraint: if x ∈ X then val(x) ⊆ dom(x); otherwise, dom(x) is an arbitrary finite set. Then the function f induced by C is defined as follows. The variables of f are X*. Let X* = {x_1, …, x_n} and let val_1, …, val_n be a tuple of assignments to the respective variables. If this tuple is covered by at least one rule of C then the corresponding value of f is 1; otherwise it is 0.
Given X* and the domains as above, the truth table of C becomes the truth table of f. That is, the columns of the table correspond to x_1, …, x_n and the last column is the outcome. The rows of the table are all the tuples of assignments to X*, each with its corresponding value of f as described above.
Generation of a random collection of rules and a random subset of the related truth table.
  • Choose the following parameters.
    (a) num_attr, the number of attributes.
    (b) max_val, the largest value of each attribute, meaning that the attribute values lie in the interval [0, max_val].
    (c) num_rules, the number of rules to be generated.
    (d) len_rule, the length of the generated rules.
    (e) num_rows, the number of rows of the training set.
  • Randomly generate a collection C of num_rules rules. Each rule is a set of len_rule randomly generated AVPs, produced as follows.
    (a) Randomly choose len_rule attributes for the given rule.
    (b) For each chosen attribute attr, randomly generate an interval [a, b] such that 0 ≤ a ≤ b ≤ max_val; the resulting AVP is (attr, [a, b]).
  • Randomly generate num_rows rows of the 'truth' table for the above rules in order to create a training set. A row of the truth table is generated as follows.
    (a) Randomly generate a value between 0 and max_val for each attribute.
    (b) Let tp be the resulting tuple of attribute values.
    (c) If tp is covered by C then out = 1; otherwise out = 0.
    (d) Append out to tp as the last column.
To make the work of a rule learner more complicated, we also introduce random noise into the generated rows. In order to do this, we choose a small parameter noise_prob (e.g., 0.005) and, in the above algorithm, after the outcome has been computed, flip it with probability noise_prob.
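The generation procedure can be sketched as follows; the parameter names mirror the list above and the AVP representation (attr_index, (a, b)) is illustrative.

```python
import random

def random_rules(num_attr, max_val, num_rules, len_rule, rng=random):
    """Each rule is a list of AVPs (attr_index, (a, b)) with 0 <= a <= b <= max_val."""
    rules = []
    for _ in range(num_rules):
        rule = []
        for attr in rng.sample(range(num_attr), len_rule):
            a = rng.randint(0, max_val)
            b = rng.randint(a, max_val)
            rule.append((attr, (a, b)))
        rules.append(rule)
    return rules

def covered(rules, row):
    return any(all(a <= row[j] <= b for j, (a, b) in rule) for rule in rules)

def training_set(rules, num_attr, max_val, num_rows, noise_prob=0.0, rng=random):
    rows = []
    for _ in range(num_rows):
        tp = [rng.randint(0, max_val) for _ in range(num_attr)]
        out = int(covered(rules, tp))
        if rng.random() < noise_prob:                 # random label noise
            out = 1 - out
        rows.append((tp, out))
    return rows

rules = random_rules(num_attr=10, max_val=5, num_rules=2, len_rule=2)
train = training_set(rules, num_attr=10, max_val=5, num_rows=1000, noise_prob=0.005)
```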
Example 3 demonstrates the experiment.
Example 3.
Suppose max_val = 1, which means that all the attribute values are binary: 0 or 1. Moreover, this also means that the collection of rules becomes a Disjunctive Normal Form (DNF).
Then num_rules becomes the number of conjuncts and len_rule becomes the length of each conjunct. Suppose that both of them equal 2. Let the collection of rules be (x_1 ∧ x_2) ∨ (x_3 ∧ x_4), and let the number of attributes be 10. Thus we have defined the function f(x_1, …, x_10) = (x_1 ∧ x_2) ∨ (x_3 ∧ x_4). The whole dataset is just the truth table of this function. The num_rows parameter is the size of the training set (seen by the algorithm); these num_rows rows are randomly selected out of the whole dataset. An RL algorithm is supposed to guess the whole function from these rows.
Testing and analysis of the results.
Now, suppose an RL algorithm returns a collection of rules g. How can we determine the closeness of g to the function f induced by the original collection of rules? The truth table of g consists of the same tuples as the truth table of f (but the values of the function can, of course, differ). Therefore, we proceed as follows.
  • Calculate the numbers of rows satisfied by f, by g, and by f ∧ g (both f and g), and denote them by |f|, |g| and |f ∧ g|, respectively.
  • The number |f ∧ g| / |f| is the proportion of rows covered by f that are also covered by g. The larger this number, the better the quality of the learned model.
  • The number |f ∧ g| / |g| is the proportion of rows covered by g that are also covered by f. The larger this number, the smaller the number of rows of g that are not covered by f, and hence the better the quality of g.
Note that computing these numbers of rows is, in general, an intractable problem. However, since we consider small collections of small rules, this can be done by a brute-force algorithm.
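Since the rule collections are small, |f|, |g| and |f ∧ g| can be obtained by brute force over the full truth table, as in the following sketch (feasible only for small numbers of attributes and values; the rule representation matches the generation sketch above).

```python
from itertools import product

def covered(rules, row):
    return any(all(a <= row[j] <= b for j, (a, b) in rule) for rule in rules)

def compare(f_rules, g_rules, num_attr, max_val):
    """Brute-force |f|, |g| and |f AND g| over the full truth table (small cases only)."""
    nf = ng = nfg = 0
    for row in product(range(max_val + 1), repeat=num_attr):
        cf, cg = covered(f_rules, row), covered(g_rules, row)
        nf += cf
        ng += cg
        nfg += cf and cg
    return nf, ng, nfg

f = [[(0, (10, 20)), (1, (15, 40))]]                            # original rule(s)
g = [[(0, (10, 20)), (1, (15, 40))], [(2, (0, 5))]]             # hypothetical learned rules
nf, ng, nfg = compare(f, g, num_attr=3, max_val=50)
print(nfg / nf, nfg / ng)            # the two quality measures described above
```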
We test the algorithms (RSC, SeCo, and DT) in the modes specified by the following parameters.
  • The number of extra variables. Extra variables are those that do not take part in the rules of f. They are inessential; however, their presence can seriously hinder the performance of an RL algorithm. Getting rid of such variables is the main task of feature selection algorithms. We consider two extreme modes: few extra variables and many extra variables. We gradually increase the number of variables in order to find the point where the rule learner starts to perform much worse. In the case of many variables, the performance can be significantly improved by introducing feature selection algorithms; however, in this experiment we want to see how the algorithm copes with this matter on its own.
    Having many variables has another interesting feature: the size of the truth table becomes very large (100 variables with a domain of 2 values each result in a truth table of 2^100 rows). This means that a training set (in which rows are explicitly presented) becomes tiny compared to the whole truth table. It is interesting to see how an RL algorithm copes with this situation.
  • The number of extra values. Suppose x is a variable occurring in the rules of f. The domain of x may contain many values that do not take part in any interval of an AVP of x in f. When there are many such additional values, the event of the function f being equal to 1 becomes rare and hence it is more difficult for a rule learner to 'spot' the rule. We check truth tables with few and with many such background values.
Thus the options of few/many extra variables and few/many extra values give us 4 combined modes of testing. If we add the presence/absence of random noise, this makes 8 modes of testing in total.
We perform experiments according to the above classification. Our conclusions based on these experiments are summarized below.
  • Small number of extra variables. In this case the RSC algorithm correctly reconstructs the original function. However, if we increase the number of values, the algorithm splits the original rules, so the number of resulting rules is larger than the number of original rules. The effect of splitting can be demonstrated by the following example. Consider a rule (x_1, [1, 10]) ∧ (x_2, [1, 10]). As a result of splitting, this rule can be represented by the following collection of four rules: {(x_1, [1, 5]) ∧ (x_2, [1, 5]), (x_1, [1, 5]) ∧ (x_2, [6, 10]), (x_1, [6, 10]) ∧ (x_2, [1, 5]), (x_1, [6, 10]) ∧ (x_2, [6, 10])}.
    The larger the intervals, the more strongly the output of our algorithm is affected by splitting. This effect can be alleviated by using a non-zero prec_loss parameter, for instance about half a percent (prec_loss is defined for the function IsReplaced in Algorithm 3 in Section 2). As a result, the algorithm is 'encouraged' to move to a larger interval even if the resulting precision is slightly smaller. However, some fragmentation of the rules still remains. We believe this can be addressed by a post-learning algorithm that tries to simplify the already created rules [4]. This is an interesting topic for future research.
  • Many extra variables. In this case, the RSC algorithm has a tendency to include irrelevant variables in the rules. This inclusion has an interesting side effect: redundant variables in correct rules. For example, suppose we have a rule (x_1, [1, 5]) ∧ (x_2, [1, 5]). If the number of variables is, say, 100 and the size of the sample of the truth table considered by the algorithm is, say, 1000 (a tiny proportion of the 2^100 rows of the full truth table), then there might be some irrelevant variable with an interval whose precision is better than any interval of x_1 or x_2. In this case the algorithm picks something like (x_10, [1, 2]) first and only then a relevant variable. This effect makes the rules of the collection longer than needed. Still, in the vast majority of cases, the function of the collection of rules formed was exactly the function of the original rules.
    In about 1% of cases, we obtained rules with false positives. The reason for this is an effect of 'shadowing': when a training set is so tiny compared to the 'full' data, some statistical 'anomalies' are possible. For example, it may happen that an interval of an irrelevant variable perfectly correlates with the rows where the function is 1. Clearly, in this case, the algorithm will pick the correlating interval of the irrelevant variable.
    The above situation can be fixed if the algorithm considers several random training sets of the same size. This allows the ‘stray’ irrelevant variable to be ‘shaken off’.
    If, in addition to many extra variables, some relevant variables contain many extra values, the negative effects specified above are, of course, moderately aggravated. For instance, the RSC algorithm failed to correctly guess the function in only about 3% of cases.
  • The influence of noise. The noise does not significantly affect the behavior of the algorithm as specified above. In particular, the RSC algorithm is still able to recognize the main rules and does not try to ‘collate’ the ‘noisy’ rows with the main ones.
  • Comparison of RSC with DT based and Separate-and-Conquer (SeCo) rule learners. Finally, it is important to say that on this domain the RSC algorithm works much better than the DT based and SeCo (with the ChooseNext heuristic) rule learners.
    Indeed, in those rare cases where RSC returns an incorrect collection of rules, the difference between the output and the original collection of rules has never been more than 2%. On the other hand, the rules returned by the DT based rule learner, even in the case of few extra variables and small domains, are at least 20% different from the original collection of rules. In the case of many extra variables, the difference can be up to 40%. The SeCo learner (with the ChooseNext heuristic) is only marginally better than DT.
    A typical situation in which both DT and SeCo fail to discover the right set of rules can be described by the following simple example. Suppose that the dataset consists of 10 attributes attr_1, …, attr_10, each attribute can take values 1, …, 5, and the outcome is 1 only for rows covered by one of the following two rules.
    (a) attr_1 ∈ [1, 4] ∧ attr_4 ∈ [2, 5]
    (b) attr_2 ∈ [1, 3] ∧ attr_3 ∈ [3, 5]
    Both RSC and SeCo easily discover the first rule, and RSC then quickly discovers the second rule as well. For SeCo, however, many of the rows covered by the second rule have already been removed together with the rows covered by the first rule. It therefore picks an unrelated variable and then creates many irrelevant rules just to cover the remaining rows. Unsurprisingly, on the testing set such rules are far from accurate.

3.2. Failure Prediction Using a Real Industrial Dataset

For our experiments we use a real industrial dataset collected from a machine that manufactures plastic bottle caps. This dataset consists of the following two parts.
The first part is a collection of tuples of sensor readings, provided in CSV format, that have been collected from this machine over more than one year. Each tuple of sensor readings is associated with a timestamp. We create a table R with columns (attributes) corresponding to the sensors and rows being the tuples of corresponding readings. To make the connection with the second part of the data, we also keep the timestamps of the tuples in memory.
The second part is information about alarms. These data consist of tuples having three components: the start and end timestamps of an alarm and the alarm error code. The alarms are associated with failures of this industrial machine, in the sense that if an alarm occurs, the machine should be switched off to find the failure. The alarm error codes are organized into four groups: shutdown, stoppage, mandatory action and message. The first two groups (shutdown and stoppage) are the main errors that should be predicted to prevent failures of the machine. Five types of shutdown and stoppage alarms happen most often; in this section we refer to them by an index i ∈ {1, 2, 3, 4, 5} for the purpose of explanation.
The rest of this subsection is divided into the following four parts.
  • Testing the ability of the considered algorithms to predict the actual alarms occurring at the given moment of time.
  • Testing remaining useful life (RUL) prediction, that is, the ability to predict an alarm that will occur in the near future.
  • Testing the true and false positive rates.
  • Drawing conclusions based on the obtained empirical results.
Prediction of actual alarms.
For each of these alarms, we form the respective datasets D_1, …, D_5. Each D_i is formed as follows.
  • We take the table R created from the first part of the dataset and add to it one extra column out.
  • For each row of R, we check whether alarm i occurred at the moment of the timestamp associated with the row. If it did, the value of out in this row is set to 1; otherwise, the value of out is set to 0.
As a result, we obtain datasets where the sensor readings serve as attributes and the values of the last column serve as outcomes.
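Assuming the sensor readings and the alarm log are available as pandas DataFrames (readings indexed by timestamp; alarms with start, end and code columns, which are illustrative column names, not those of the actual dataset), the labeling step can be sketched as follows.

```python
import pandas as pd

def label_alarm(readings: pd.DataFrame, alarms: pd.DataFrame, code) -> pd.DataFrame:
    """Build D_i: a copy of the readings table R with out = 1 for rows whose
    timestamp falls inside an interval of alarm `code`, and out = 0 otherwise."""
    d = readings.copy()
    d["out"] = 0
    for _, alarm in alarms[alarms["code"] == code].iterrows():
        active = (d.index >= alarm["start"]) & (d.index <= alarm["end"])
        d.loc[active, "out"] = 1
    return d

# hypothetical usage: one dataset per frequent alarm code
# datasets = {i: label_alarm(R, alarm_log, code=i) for i in (1, 2, 3, 4, 5)}
```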
We perform the experiments for the RSC, DT based and SeCo (with the ChooseNext heuristic) algorithms as follows.
  • We run each algorithm on each D_i separately. For this, we randomly partition the rows of D_i into a training set (70% of the rows) and a testing set (30% of the rows), and record all the rules.
  • Each rule is tested on the testing set corresponding to the predicted alarm. That is, for the predicted alarm i, any rule obtained from the training set of D_i is tested on the testing set of D_i. For each rule, we record its precision with respect to the testing set (see Definition 5).
  • We record together the rules obtained from the exploration of all datasets D_1, …, D_5, replacing the outcome 1 with the respective real alarm code, and remove as insignificant those rules that cover fewer than 20 rows in the dataset.
Some rules and their precision are reported in Table 1 for RSC, in Table 2 for the DT based rule learner and in Table 3 for SeCo. Each row in the tables corresponds to a rule. The first column 'alarm' states the predicted alarm code. The second column 'rules' describes the body of the rule; we group together several rules predicting the same alarm. For example, the rule in the first row of Table 1 should be interpreted as follows: if the value of the attribute %ZP is 0 and the value of the attribute Ads.HmiVis.EXTR.PRESS_I is in the interval [8.63, 170.95], then alarm 1017 occurs. The rule in the last row of Table 2 is interpreted as follows: if the value of the attribute %PS is greater than 40.7 and the value of the attribute Ads.HmiVis.CENTR.POT_M1B is greater than 61.1, then alarm 3099 occurs. The last column 'precision' measures how precise the given rule is on the dataset, calculated as the percentage of rows of the testing set on which the alarm actually occurs among those covered by the respective rule (as in Definition 5).
Let us make one interesting remark. The rules generated by the above algorithms are overlapping, in the sense that a row of table R can be covered by more than one rule, meaning that the set of rules predicts that more than one alarm is taking place at the corresponding timestamp. This means that two or more alarms may occur simultaneously. In fact, classifying each alarm separately is a standard ML approach for classification tasks with multiple classes. It is called unordered rules, and there is evidence that this approach makes more accurate predictions than learning mutually exclusive rules; see, e.g., [37].
RUL prediction.
We also test the ability of the algorithms to predict the remaining time to failure (or RUL, remaining useful life). In particular, for a time t in seconds, we modify the tables D_1, …, D_5 to obtain the respective tables D_1^t, …, D_5^t as follows. Take table D_i and set the out column to 1 in those rows whose timestamp is at most t seconds before the timestamp of a row having out = 1 in D_i; the resulting table is D_i^t. We report experiments with two values of t, 60 and 120, chosen for the sake of demonstration. The resulting rules and the testing results of the respective RSC, DT based and SeCo algorithms are reported in Tables 4, 5 and 6 for RUL = 60 s and in Tables 7, 8 and 9 for RUL = 120 s.
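The construction of D_i^t amounts to widening each positive label backwards in time by t seconds. A sketch, reusing the timestamp-indexed D_i frame of the previous sketch, follows.

```python
import pandas as pd

def widen_labels(d_i: pd.DataFrame, t_seconds: int) -> pd.DataFrame:
    """Build D_i^t: set out = 1 for every row at most t seconds before a row with out = 1."""
    d = d_i.copy()
    horizon = pd.Timedelta(seconds=t_seconds)
    for ts in d.index[d["out"] == 1]:
        d.loc[(d.index >= ts - horizon) & (d.index <= ts), "out"] = 1
    return d

# e.g., d_i_60 = widen_labels(d_i, 60); d_i_120 = widen_labels(d_i, 120)
```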
TP and FP rates.
We also calculate the True Positive (TP) and False Positive (FP) rates for our experiments. For this task we form the dataset D = D_1 ∪ D_2 ∪ D_3 ∪ D_4 ∪ D_5. That is, out = 1 in a row of D if any alarm occurred at the moment of the timestamp associated with that row; otherwise, out = 0. We run the RSC, DT based and SeCo algorithms on the obtained dataset D for rule generation.
To calculate TP (correct prediction of alarms), we define A as the set of rows of the testing set of D associated with any alarm (having out = 1) and a as the number of rows in A. Then TP = t/a, where t is the number of rows of A covered by at least one rule. The results are TP = 97% for RSC, TP = 90% for the DT based rule learner and TP = 92% for SeCo.
To calculate FP (incorrect prediction of alarms), we define N as the set of rows of the testing set of D that are not associated with any alarm (with out = 0) and n as the number of rows in N. Then FP = f/n, where f is the number of rows of N covered by at least one rule. We obtain FP = 0.01% for RSC, FP = 0.2% for the DT based rule learner and FP = 0.1% for SeCo.
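The two rates can be written down compactly as follows; covered(row) is an assumed predicate returning True when at least one learned rule fires on the row.

```python
def tp_fp_rates(test_rows, covered):
    """TP = fraction of alarm rows (out = 1) covered by some rule;
    FP = fraction of alarm-free rows (out = 0) covered by some rule."""
    A = [r for r in test_rows if r["out"] == 1]
    N = [r for r in test_rows if r["out"] == 0]
    tp = sum(covered(r) for r in A) / len(A)
    fp = sum(covered(r) for r in N) / len(N)
    return tp, fp

# Toy usage: both alarm rows covered, one of two alarm-free rows covered.
rows = [{"out": 1, "hit": True}, {"out": 1, "hit": True},
        {"out": 0, "hit": False}, {"out": 0, "hit": True}]
print(tp_fp_rates(rows, covered=lambda r: r["hit"]))   # (1.0, 0.5)
```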
Also, we perform the TP and FP calculations on the datasets D^60 = D_1^60 ∪ … ∪ D_5^60 and D^120 = D_1^120 ∪ … ∪ D_5^120, and obtain similar results. The TP and FP calculations are provided in Table 10.
The proposed algorithm outputs rules predicting alarms (outcome 1); there are no rules making negative predictions (absence of an alarm). As a result, there are no false negative (and no true negative) predictions, which in turn means that measures involving these quantities (TN, FN) are not informative: for example, the accuracy coincides with the precision and the recall equals 1.
Conclusions of experiments.
Based on the experiments, the following conclusions are reached.
  • The levels of precision for individual rules produced by the RSC algorithm are higher than those of DT based and SeCo rule learners.
  • The rules produced by RSC are significantly shorter than those produced by DT based and SeCo rule learners.
  • The T P rate for RSC algorithm is much higher than that of DT based and SeCo rule learners: on average it is 97 % versus 90 % and 92 % , respectively. We attribute this improvement to the shortening of rules.
  • The F P rate for RSC algorithm is also much better than for DT based and SeCo rule learners: 0.01 % versus 0.2 % and 0.1 % , respectively.

4. Conclusions

In this paper we have considered a new approach to RL: Relaxed Separate-and-Conquer (RSC).
We have demonstrated that RSC equipped with a simple heuristic outperforms the DT based rule learner and the SeCo algorithm equipped with the same heuristic on two domains in the area of failure prediction. We have concluded that RSC is a promising approach deserving further investigation.
We identify two interesting directions of future research: combining the RSC algorithm with a meta-methodology to increase accuracy and using the RSC in an unsupervised environment.
We identify two methodologies for increasing accuracy: random forest (RF) and post-pruning. Both these methodologies are in fact meta-methodologies: they are applicable to many learning algorithms.
The RF algorithm [38] aims to improve the precision of the DT algorithm. The RF algorithm generates many (independent) random DTs. A separate prediction is made using each DT, and the prediction made by the whole model is the average of these predictions (suitably rounded if needed). The methodology of boosting a model by making multiple random choices is not inherently connected to DTs. For example, a well-known methodology in the area of AI search called randomized restarts [39] does exactly this to backtracking: the backtrack search stops at a random moment of time and starts again from a random point of the search space, and this process is repeated over and over again. This rather pervasive nature of the methodology, together with its serious theoretical justification based on the Law of Large Numbers [38], gives us a reason to expect that RSC can also be boosted by this approach.
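To make the averaging step concrete, the sketch below aggregates the binary predictions of several independently learned rule sets by majority vote. learn_randomized_rules is a hypothetical randomized variant of the learner and is faked here by a random threshold; it is not part of the method described in this paper.

```python
import random

def ensemble_predict(models, row):
    """Average the 0/1 predictions of the individual models and round,
    i.e. predict an alarm when a majority of the rule sets fire on the row."""
    votes = [model(row) for model in models]
    return round(sum(votes) / len(votes))

def learn_randomized_rules(train_data, seed):
    """Hypothetical randomized learner: each call would learn a rule set from a
    random perturbation of the data; faked here by a randomly shifted threshold."""
    rng = random.Random(seed)
    threshold = rng.uniform(36.5, 37.5)
    return lambda row: int(row["TEMP_PS"] > threshold)

models = [learn_randomized_rules(None, seed) for seed in range(11)]  # odd count avoids ties
print(ensemble_predict(models, {"TEMP_PS": 37.1}))
```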
The methodology of post-pruning is applicable to any rule learning algorithm. The input of a post-pruning algorithm is a set of rules already created with respect to the given dataset. The algorithm tries to make the given set of rules more compact (the rules are shortened and their number possibly reduced). Numerous studies of this approach [40,41,42] show the significant role of post-pruning in the reduction of overfitting. We plan to study post-pruning methods that reduce overfitting and thereby increase the accuracy of RSC.
Unsupervised failure prediction is very important from the practical perspective. Indeed, some companies have log records related to the past performance of their equipment, but these records contain just sensor readings without alarms or failure notifications. Looking at these records, it is impossible to know when the alarms or failures actually occurred. It is natural, however, to assume that around the times of failures the sensor readings exhibited some anomalies, which leads to the need for methods of anomaly detection [17,43].
We plan to use RL for unsupervised PdM as a two-stage process. In the first (preprocessing) stage, we will run an anomaly detection algorithm; as a result, the initially unsupervised data become supervised, since a column with the anomaly/no-anomaly outcome is added. In the second stage, a supervised RL algorithm will be applied, so the process will produce rules for anomalies. It will be interesting to compare the resulting method with methods of mining rare patterns in the area of association rules [44]. A possible shape of this two-stage pipeline is sketched below.
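A minimal sketch using scikit-learn components: an Isolation Forest stands in for the anomaly-labelling stage and a decision tree stands in for the supervised rule learner; both the choice of components and the toy data are illustrative assumptions, not the planned implementation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.tree import DecisionTreeClassifier, export_text

# Stage 0: unlabelled sensor readings (random toy data standing in for the log).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

# Stage 1: anomaly detection turns the unsupervised data into supervised data.
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
y = (iso.predict(X) == -1).astype(int)     # 1 = anomaly, 0 = normal

# Stage 2: a supervised learner (a decision tree here) produces readable rules
# describing the anomalous rows.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["s1", "s2", "s3"]))
```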

Author Contributions

Conceptualization, M.R. and A.M.; methodology, M.R. and A.M.; software, M.R.; validation, M.R. and A.M.; formal analysis, M.R.; investigation, M.R. and A.M.; resources, M.R. and A.M.; data curation, M.R. and A.M.; writing–original draft preparation, M.R.; writing–review and editing, M.R. and A.M.; visualization, M.R.; supervision, A.M.; project administration, A.M.; funding acquisition, A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No. 768869 (Z-BRE4K).

Acknowledgments

We thank the anonymous reviewers for their very helpful reviews for the initial version of our paper.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
attr: attribute
AVP: attribute-value pair
DNF: Disjunctive Normal Form
DT: Decision Tree
FP: False Positive
ML: Machine Learning
PdM: Predictive Maintenance
PvM: Preventive Maintenance
R2F: Run-to-Failure
RF: Random Forests
RL: Rule Learning
RSC: Relaxed Separate-and-Conquer
RUL: remaining useful life
SeCo: Separate-and-Conquer
TP: True Positive

References

  1. Kearns, M.J.; Vazirani, U.V. An Introduction to Computational Learning Theory; MIT Press: Cambridge, MA, USA, 1994. [Google Scholar]
  2. Fürnkranz, J.; Gamberger, D.; Lavrač, N. Foundations of Rule Learning; Cognitive Technologies; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  3. Fürnkranz, J. Separate-and-Conquer Rule Learning. Artif. Intell. Rev. 1999, 13, 3–54. [Google Scholar] [CrossRef]
  4. Cohen, W.W.; Singer, Y. A Simple, Fast, and Effective Rule Learner. In Proceedings of the Sixteenth National Conference on Artificial Intelligence and Eleventh Conference on Innovative Applications of Artificial Intelligence, Orlando, FL, USA, 18–22 July 1999; The MIT Press: Cambridge, MA, USA, 1999; pp. 335–342. [Google Scholar]
  5. Schapire, R.E. The Strength of Weak Learnability. Mach. Learn. 1990, 5, 197–227. [Google Scholar] [CrossRef] [Green Version]
  6. Susto, G.A.; Schirru, A.; Pampuri, S.; McLoone, S.F.; Beghi, A. Machine Learning for Predictive Maintenance: A Multiple Classifier Approach. IEEE Trans. Ind. Inform. 2015, 11, 812–820. [Google Scholar] [CrossRef] [Green Version]
  7. Qiao, W.; Lu, D. A Survey on Wind Turbine Condition Monitoring and Fault. IEEE Trans. Ind. Electron. 2015, 62, 6536–6545. [Google Scholar] [CrossRef]
  8. Kumar, A.; Chinnam, R.B.; Tseng, F. An HMM and polynomial regression based approach for remaining useful life and health state estimation of cutting tools. Comput. Ind. Eng. 2019, 128, 1008–1014. [Google Scholar] [CrossRef]
  9. Mobley, R.K. An Introduction to Predictive Maintenance; Butterworth-Heinemann: Oxford, UK, 2002. [Google Scholar]
  10. Carvalho, T.P.; Soares, F.A.; Vita, R.; da P. Francisco, R.; Basto, J.P.; Alcalá, S.G. A systematic literature review of Machine Learning methods applied to Predictive Maintenance. Comput. Ind. Eng. 2019, 137, 106024. [Google Scholar] [CrossRef]
  11. Wuest, T.; Weimer, D.; Irgens, C.; Thoben, K.D. Machine Learning in Manufacturing: Advantages, challenges, and applications. Prod. Manuf. Res. 2016, 4, 23–45. [Google Scholar] [CrossRef] [Green Version]
  12. Zhang, W.; Yang, D.; Wang, H. Data-Driven Methods for Predictive Maintenance of Industrial Equipment: A Survey. IEEE Syst. J. 2019, 13, 2213–2227. [Google Scholar] [CrossRef]
  13. Durbhaka, G.K.; Selvaraj, B. Predictive Maintenance for Wind Turbine Diagnostics using vibration signal analysis based on collaborative recommendation approach. In Proceedings of the 2016 International Conference on Advances in Computing, Communications and Informatics, Jaipur, India, 21–24 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1839–1842. [Google Scholar]
  14. Garcia Nieto, P.J.; García-Gonzalo, E.; Sánchez-Lasheras, F.; de Cos Juez, F. Hybrid PSO-SVM-based method for forecasting of the Remaining Useful Life for aircraft engines and evaluation of its reliability. Reliab. Eng. Syst. Saf. 2015, 138, 219–231. [Google Scholar] [CrossRef]
  15. Mathew, J.; Luo, M.; Pang, C.K. Regression kernel for prognostics with Support Vector Machines. In Proceedings of the 2017 22nd IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Limassol, Cyprus, 12–15 September 2017; pp. 1–5. [Google Scholar]
  16. Mathew, V.; Toby, T.; Singh, V.; Rao, B.M.; Kumar, M.G. Prediction of Remaining Useful Lifetime (RUL) of turbofan engine using machine learning. In Proceedings of the 2017 IEEE International Conference on Circuits and Systems (ICCS), Thiruvananthapuram, India, 20–21 December 2017; pp. 306–311. [Google Scholar]
  17. Sipos, R.; Fradkin, D.; Mörchen, F.; Wang, Z. Log-based Predictive Maintenance. In Proceedings of the The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 1867–1876. [Google Scholar]
  18. Zhang, X.; Liang, Y.; Zhou, J.; Zang, Y. A novel bearing fault diagnosis model integrated permutation entropy, ensemble empirical mode decomposition and optimized SVM. Measurement 2015, 69, 164–179. [Google Scholar] [CrossRef]
  19. Heng, A.; Tan, A.; Mathew, J.; Montgomery, N.; Banjevic, D.; Jardine, A. Intelligent Condition-Based Prediction of Machinery Reliability. Mech. Syst. Signal Process. 2009, 23, 1600–1614. [Google Scholar] [CrossRef]
  20. Kolokas, N.; Vafeiadis, T.; Ioannidis, D.; Tzovaras, D. Forecasting faults of industrial equipment using Machine Learning Classifiers. In Proceedings of the 2018 Innovations in Intelligent Systems and Applications (INISTA), Thessaloniki, Greece, 3–5 July 2018; pp. 1–6. [Google Scholar]
  21. Zhang, Z.; Si, X.; Hu, C.; Lei, Y. Degradation data analysis and Remaining Useful Life estimation: A review on Wiener-process-based methods. Eur. J. Oper. Res. 2018, 271, 775–796. [Google Scholar] [CrossRef]
  22. Uhlmann, E.; Pastl, R.; Geisert, C.; Hohwieler, E. Cluster identification of sensor data for Predictive Maintenance in a Selective Laser Melting machine tool. Procedia Manuf. 2018, 24, 60–65. [Google Scholar] [CrossRef]
  23. Lewis, A.D.; Groth, K.M. A Dynamic Bayesian Network Structure for Joint Diagnostics and Prognostics of Complex Engineering Systems. Algorithms 2020, 13, 64. [Google Scholar] [CrossRef] [Green Version]
  24. Hu, C.; Youn, B.D.; Wang, P.; Yoon, J.T. Ensemble of Data-Driven Prognostic Algorithms for Robust Prediction of Remaining Useful Life. Reliab. Eng. Syst. Saf. 2012, 103, 120–135. [Google Scholar] [CrossRef] [Green Version]
  25. Xiao, Y.; Hua, Z. Misalignment Fault Prediction of Wind Turbines Based on Combined Forecasting Model. Algorithms 2020, 13, 56. [Google Scholar] [CrossRef] [Green Version]
  26. Wang, B.; Lei, Y.; Li, N.; Li, N. A Hybrid Prognostics Approach for Estimating Remaining Useful Life of Rolling Element Bearings. IEEE Trans. Reliab. 2020, 69, 401–412. [Google Scholar] [CrossRef]
  27. Li, G.; Chen, H.; Hu, Y.; Wang, J.; Guo, Y.; Liu, J.; Li, H.; Huang, R.; Lv, H.; Li, J. An improved Decision Tree-based fault diagnosis method for practical variable refrigerant flow system using virtual sensor-based fault indicators. Appl. Therm. Eng. 2017, 129, 1292–1303. [Google Scholar] [CrossRef]
  28. Li, H.; Parikh, D.; He, Q.; Qian, B.; Li, Z.; Fang, D.; Hampapur, A. Improving Rail Network Velocity: A Machine Learning Approach to Predictive Maintenance. Transp. Res. Part C: Emerg. Technol. 2014, 45, 17–26. [Google Scholar] [CrossRef]
  29. Canizo, M.; Onieva, E.; Conde, A.; Charramendieta, S.; Trujillo, S. Real-time Predictive Maintenance for Wind Turbines using Big Data frameworks. In Proceedings of the 2017 IEEE International Conference on Prognostics and Health Management, Dallas, Texas, USA, 19–21 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 70–77. [Google Scholar]
  30. Santos, P.; Maudes, J.; Bustillo, A. Identifying maximum imbalance in datasets for fault diagnosis of gearboxes. J. Intell. Manuf. 2018, 29, 333–351. [Google Scholar] [CrossRef]
  31. Shrivastava, R.; Mahalingam, H.; Dutta, N.N. Application and Evaluation of Random Forest Classifier Technique for Fault Detection in Bioreactor Operation. Chem. Eng. Commun. 2017, 204, 591–598. [Google Scholar] [CrossRef]
  32. Kauschke, S.; Fürnkranz, J.; Janssen, F. Predicting Cargo Train Failures: A Machine Learning Approach for a Lightweight Prototype. In Discovery Science, Proceedings of the 19th International Conference, DS 2016, Bari, Italy, 19–21 October 2016; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016; Volume 9956, pp. 151–166. [Google Scholar]
  33. Fürnkranz, J.; Flach, P.A. An Analysis of Stopping and Filtering Criteria for Rule Learning. In Machine Learning: ECML 2004, Proceedings of the 15th European Conference on Machine Learning, Pisa, Italy, 20–24 September 2004; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2004; Volume 3201, pp. 123–133. [Google Scholar]
  34. Janssen, F.; Fürnkranz, J. An Empirical Investigation of the Trade-Off between Consistency and Coverage in Rule Learning Heuristics. In Discovery Science, Proceedings of the 11th International Conference, DS 2008, Budapest, Hungary, 13–16 October 2008; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2008; Volume 5255, pp. 40–51. [Google Scholar]
  35. Mossel, E.; O’Donnell, R.; Servedio, R.A. Learning juntas. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, San Diego, CA, USA, 9–11 June 2003; ACM: New York, NY, USA, 2003; pp. 206–212. [Google Scholar]
  36. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  37. Clark, P.; Boswell, R. Rule Induction with CN2: Some Recent Improvements. In Machine Learning-EWSL-91, European Working Session on Learning; Springer: Berlin/Heidelberg, Germany, 1991; pp. 151–163. [Google Scholar]
  38. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
  39. Gomes, C.P.; Selman, B.; Crato, N.; Kautz, H.A. Heavy-Tailed Phenomena in Satisfiability and Constraint Satisfaction Problems. J. Autom. Reason. 2000, 24, 67–100. [Google Scholar] [CrossRef]
  40. Cohen, W.W. Fast Effective Rule Induction. In Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA, USA, 9–12 July 1995; pp. 115–123. [Google Scholar]
  41. Fürnkranz, J.; Widmer, G. Incremental Reduced Error Pruning. In Machine Learning, Proceedings of the Eleventh International Conference, New Brunswick, NJ, USA, 10–13 July 1994; Morgan Kaufmann: Burlington, MA, USA, 1994; pp. 70–77. [Google Scholar]
  42. Fürnkranz, J. Pruning Algorithms for Rule Learning. Mach. Learn. 1997, 27, 139–172. [Google Scholar] [CrossRef] [Green Version]
  43. Benedetti, M.D.; Leonardi, F.; Messina, F.; Santoro, C.; Vasilakos, A.V. Anomaly Detection and Predictive Maintenance for photovoltaic systems. Neurocomputing 2018, 310, 59–68. [Google Scholar] [CrossRef]
  44. Koh, Y.S.; Ravana, S.D. Unsupervised Rare Pattern Mining: A Survey. ACM Trans. Knowl. Discov. Data 2016, 10, 1–29. [Google Scholar] [CrossRef]
Figure 1. DT example.
Table 1. Relaxed Separate-and-Conquer (RSC) (RUL = 0 s).
Alarm | Rules | Precision
1017 ( % Z P = 0 ) ( A d s . H m i V i s . E X T R . P R E S S _ I [ 8.63 , 170.95 ] ) 98 %
( % Z P = 0 ) ( T E M P _ E 1 [ 185.73 , 194.97 ] ) 95 %
3042 ( T E M P _ C I [ 49.6 , 50.1 ] ) ( T E M P _ E 10 [ 164.76 , 199.1 ] ) ( T E M P _ Z P [ 21.4 , 184.8 ] ) 96 %
( T E M P _ C I [ 49.6 , 50.1 ] ) ( P R E S _ P I [ 0 , 0.8 ] ) ( % E 1 [ 33.34 , 15.1 ] ) 97 %
( A d s . H m i V i s . C C M . C O R S A _ A P [ 113.3 , 114.5 ] ) ( T E M P _ C I [ 21.7 , 42 ] ) ( T E M P _ P I [ 26 , 31 ] ) 96 %
3167 ( % E 2 [ 5 , 4.07 ] ) ( A d s . H m i V i s . C E N T R . P R E S S _ B [ 0.2 , 69.8 ] ) ( T E M P _ P I [ 12.9 , 20.6 ] ) 97 %
( T E M P _ E 1 [ 170.53 , 195.02 ] ) ( T E M P _ P I [ 26.1 , 31 ] ) ( T E M P _ E 4 [ 180 , 197.1 ] ) 94 %
3197 ( T E M P _ P S [ 36.88 , 37.3 ] ) ( A d s . H m i V i s . E X T R . P R E S S _ I [ 7.83 , 170.8 ] ) 96 %
( T E M P _ P S [ 36.9 , 37.3 ] ) ( % C U [ 24.77 , 100 ] ) 98 %
( % P S [ 40.55 , 100 ] ) ( A d s . H m i V i s . E X T R . P R E S S _ I [ 17.58 , 161.93 ] ) 97 %
3099 ( A d s . H m i V i s . C E N T R . P O T _ M 1 B [ 75.7 , 78.3 ] ) ( % E 5 [ 8.1 , 6.4 ] ) ( % E 1 [ 33.4 , 29.4 ] ) 93 %
( A d s . H m i V i s . C E N T R . P O T _ M 1 B [ 58.8 , 78.3 ] ) ( T E M P _ C I [ 21.7 , 28.5 ] ) ( % T B [ 0 , 15.4 ] ) 96 %
Table 2. Decision Tree (DT) based rule learner (RUL = 0 s).
Alarm | Rules | Precision
1017 ( T E M P _ E 2 > 178.79 ) ( % Z P < = 2.46 ) ( % T B < = 4.3 ) 89 %
3042 ( T E M P _ C I > 47.55 ) ( T E M P _ S T < = 49.15 ) ( P O R T _ P S > 0.8 ) ( T E M P _ E 5 > 158.35 ) ( T E M P _ F U < = 167.2 ) 90 %
( T E M P _ C I > 47.55 ) ( T E M P _ S T < = 49.15 ) ( P O R T _ P S < = 0.8 ) ( % E 2 < = 11.125 )
( A d s . H m i V i s . E X T R . P R E S S _ O < = 15.76 ) ( A d s . H m i V i s . C E N T R . P R E S S _ B < = 15.03 )
89 %
3167 ( A d s . H m i V i s . C C M . C O R S A _ A P > 114.22 ) ( T E M P _ P S < = 25.92 ) ( T E M P _ C I < = 44.95 )
( A d s . H m i V i s . C E N T R . P R E S S _ B > 0.17 ) ( % T B < = 17.01 )
94 %
( A d s . H m i V i s . C C M . C O R S A _ A P > 114.34 ) ( T E M P _ P S < = 25.92 ) ( T E M P _ C I < = 44.95 )
( A d s . H m i V i s . C E N T R . P R E S S _ B < = 0.17 ) ( A d s . H m i V i s . E X T R . P R E S S _ O < = 14.36 ) ( T E M P _ P I > 18.26 )
( A d s . H m i V i s . E X T R . P R E S S _ I > 7.4 )
88 %
3197 ( T E M P _ P S > 36.89 ) ( A d s . H m i V i s . E X T R . P R E S S _ I > 5.57 ) % P I > 4.69 ) 92 %
( T E M P _ P S > 36.89 ) ( A d s . H m i V i s . E X T R . P R E S S _ I < = 5.57 ) ( A d s . H m i V i s . E X T R . P R E S S _ O > 5.62 ) 91 %
3099 ( % P S > 40.7 ) ( A d s . H m i V i s . C E N T R . P O T _ M 1 B > 61.1 ) 94 %
Table 3. Separate-and-Conquer (SeCo) (RUL = 0 s).
Alarm | Rules | Precision
1017 ( % U G [ 0.87 , 6.19 ] ) ( A d s . H m i V i s . C E N T R . P R E S S _ B [ 4.11 , 65.78 ] ) ( T E M P _ E 2 [ 175.8 , 192.32 ] ) 93 %
( % Z P = 0 ) ( T E M P _ E 1 [ 185.73 , 194.97 ] ) ( % E 6 [ 12.3 , 25.71 ] ) ( A d s . H m i V i s . C E N T R . P O T _ M 1 B [ 54.62 , 81.78 ] ) 90 %
3042 ( P R E S _ P I [ 0.3 , 4.2 ] ) ( % E 1 [ 20.17 , 5.75 ] ) ( T E M P _ E 10 [ 170.19 , 187.16 ] ) ( T E M P _ Z P [ 27.46 , 189.15 ] ) 91 %
( P R E S _ P I [ 0 , 3.89 ] ) ( % E 1 [ 24.8 , 8.3 ] ) ( % E 2 [ 37.89 , 12.21 ] ) % E 4 [ 20.12 , 15.33 ] ) 93 %
( T E M P _ C I [ 49.6 , 50.1 ] ) ( P R E S _ P I [ 0.67 , 4.07 ] ) ( % E 3 [ 32.64 , 17.43 ] ) ( T E M P _ F L [ 175.4 , 199.4 ] ) 92 %
( A d s . H m i V i s . C C M . C O R S A _ A P [ 113.78 , 114.85 ] ) ( T E M P _ C I [ 22.42 , 44.86 ] ) ( T E M P _ P I [ 27.41 , 37.16 ] )
( A d s . H m i V i s . C E N T R . P R E S S _ B [ 15.46 , 65.85 ] )
90 %
3167 ( % E 1 [ 173 , 184.3 ] ) ( % E 2 [ 15.8 , 24.52 ] ) ( A d s . H m i V i s . C E N T R . P R E S S _ B [ 0.14 , 25.7 ] )
( T E M P _ P I [ 24.49 , 32.6 ] ) ( T E M P _ E 1 [ 178.5 , 193.02 ] ) ( T E M P _ E 4 [ 180.55 , 199.51 ] )
94 %
3197 ( T E M P _ P S [ 35.79 , 38.13 ] ) ( A d s . H m i V i s . E X T R . P R E S S _ I [ 4.63 , 169.18 ] ) ( % C U [ 28.23 , 96.74 ] )
( A d s . H m i V i s . E X T R . P R E S S _ 0 [ 17.25 , 148.32 ] )
91 %
( T E M P _ P S [ 37.35 , 38.69 ] ) ( A d s . H m i V i s . E X T R . P R E S S _ I [ 5.92 , 172.5 ] ) ( % C U [ 24.77 , 100 ] )
( A d s . H m i V i s . C E N T R . P O T _ M 1 B [ 71.48 , 77.84 ] ) ( T E M P _ C I [ 20.43 , 29.94 ] )
93 %
( % P S [ 40.55 , 100 ] ) ( A d s . H m i V i s . C E N T R . P O T _ M 1 B [ 74.75 , 79.18 ] ) ( A d s . H m i V i s . E X T R . P R E S S _ O [ 3.74 , 12.5 ] )
( A d s . H m i V i s . E X T R . P R E S S _ I [ 17.58 , 161.93 ] )
90 %
3099 ( A d s . H m i V i s . C E N T R . P O T _ M 1 B [ 75.7 , 78.3 ] ) ( T E M P _ C I [ 22.25 , 26.29 ] ) ( % T B [ 1.75 , 13.16 ] )
( % E 5 [ 8.1 , 6.4 ] ) ( % E 1 [ 33.4 , 29.4 ] )
90 %
( T E M P _ P S [ 32.41 , 39.79 ] ) ( A d s . H m i V i s . E X T R . P R E S S _ I [ 12.97 , 167.17 ] ) ( T E M P _ C I [ 21.7 , 28.5 ] )
( % T B [ 0 , 15.4 ] ) ( % P S [ 44.17 , 96.51 ] )
93 %
Table 4. Relaxed Separate-and-Conquer (RSC) (RUL = 60 s).
Alarm | Rules | Precision
( T E M P _ E 4 [ 182.5 , 197.1 ] ) ( % Z P = 0 ) 96 %
1017 ( % U G [ 0 , 7.59 ] ) ( T E M P _ P I [ 12.9 , 19.6 ] ) 95 %
( T E M P _ C I [ 49.6 , 50.1 ] ) ( % E 5 [ 8.08 , 12.72 ] ) 97 %
3042 ( T E M P _ C I [ 47.9 , 50.1 ] ) ( % U G [ 0 , 13.25 ] ) ( % F L [ 0 , 5.95 ] ) 93 %
( A d s . H m i V i s . C C M . C O R S A _ A P [ 114.09 , 114.52 ] ) ( T E M P _ C I [ 40.5 , 42 ] ) 95 %
3167 ( T E M P _ E 1 [ 170.6 , 194.9 ] ) ( T E M P _ P I [ 26.1 , 31 ] ) ( T E M P _ E 2 [ 174 , 196.6 ] ) 94 %
( T E M P _ P S [ 36.82 , 37.3 ] ) ( A d s . H m i V i s . E X T R . P R E S S _ I [ 7.8 , 172.47 ] ) 93 %
3197 ( T E M P _ P S [ 36.82 , 37.3 ] ) ( % E 2 [ 4.58 , 100 ] ) 95 %
( T E M P _ C I [ 21.7 , 37.2 ] ) ( T E M P _ S T [ 54.4 , 61.2 ] ) 90 %
3099 ( % P S [ 45.03 , 100 ] ) ( % P I [ 100 , 62.01 ] ) 91 %
( T E M P _ C I [ 21.7 , 28.5 ] ) ( A d s . H m i V i s . E X T R . P R E S S _ O [ 4.8 , 133 ] ) ( % Z F [ 10.2 , 100 ] ) 89 %
Table 5. Decision Tree (DT) based rule learner (RUL = 60 s).
Alarm | Rules | Precision
1017 ( T E M P _ E 4 > 182.45 ) ( % U G > 8.84 ) ( T E M P _ E 10 < = 193.86 ) 88 %
( T E M P _ E 4 > 182.45 ) ( % U G < = 8.84 ) ( % Z P < = 10.08 ) 90 %
( T E M P _ C I > 47.65 ) ( T E M P _ S T > 49.25 ) ( T E M P _ P S < = 23.85 ) ( % E 5 < = 12.89 ) ( T E M P _ F U < = 180 ) 88 %
3042 ( T E M P _ C I < = 47.65 ) ( % C U > 29 ) ( % E 2 < = 5.64 ) ( % E 3 > 2.17 ) 85 %
( T E M P _ C I > 47.65 ) ( T E M P _ S T < = 49.25 ) ( T E M P _ E 4 < = 166.05 ) ( % E 4 < = 13.54 ) ( % F L < = 21.73 ) 91 %
3167 ( T E M P _ P S < = 26.15 ) ( A d s . H m i V i s . C C M . C O R S A _ A P < = 114 ) ( T E M P _ E 1 > 170.54 ) ( T E M P _ P I > 26.08 ) 93 %
( T E M P _ P S < = 26.15 ) ( A d s . H m i V i s . C C M . C O R S A _ A P > 114 ) ( T E M P _ C I < = 44.95 ) ( % E 1 < = 43.57 )
( A d s . H m i V i s . C E N T R . P R E S S _ B > 0.18 )
92 %
3197 ( T E M P _ P S > 36.85 ) ( T E M P _ S T > 57.95 ) ( A d s . H m i V i s . E X T R . P R E S S _ I > 5.57 ) ( % E 1 > 8.37 ) ( % F L > 6.98 ) 89 %
( T E M P _ P S > 36.85 ) ( T E M P _ S T > 57.95 ) ( A d s . H m i V i s . E X T R . P R E S S _ I < = 5.57 ) ( % E 4 < = 5.78 ) 85 %
3099 ( % P S > 41.15 ) ( A d s . H m i V i s . C E N T R . P O T _ M 1 B > 58.78 ) ( T E M P _ F U > 182.1 ) 90 %
Table 6. Separate-and-Conquer (SeCo) (RUL = 60 s).
Alarm | Rules | Precision
( T E M P _ E 1 [ 167.75 , 193.43 ] ) ( T E M P _ E 4 [ 180.43 , 194.98 ] ) ( T E M P _ P I [ 10.15 , 17.34 ] ) 93 %
1017 ( % U G [ 0.9 , 9.13 ] ) ( T E M P _ E 1 [ 166.12 , 196.57 ] ) ( T E M P _ C I [ 44.76 , 52.21 ] ) ( T E M P _ P I [ 11.69 , 18.51 ] ) 89 %
( T E M P _ C I [ 45.6 , 53.72 ] ) ( T E M P _ E 2 [ 174 , 196.6 ] ) ( % F L [ 0.8 , 7.98 ] ) ( % E 5 [ 12.61 , 17.54 ] ) 90 %
3042 ( T E M P _ C I [ 48.21 , 54.66 ] ) ( % E 2 [ 10.62 , 18.04 ] ) ( % U G [ 0.98 , 10.15 ] ) ( % F L [ 1.13 , 10.65 ] ) 92 %
( A d s . H m i V i s . C C M . C O R S A _ A P [ 116.76 , 121.39 ] ) ( T E M P _ C I [ 37.18 , 43.76 ] )
( A d s . H m i V i s . C E N T R . P R E S S _ B [ 3.32 , 20.23 ] ) ( T E M P _ E 6 [ 173.74 , 199.32 ] )
91 %
3167 ( T E M P _ E 1 [ 174.57 , 195.28 ] ) ( % E 3 [ 10.72 , 18.65 ] ) ( T E M P _ P I [ 22.18 , 33.78 ] ) ( T E M P _ E 2 [ 168.76 , 189.25 ] ) 88 %
( A d s . H m i V i s . C E N T R . P R E S S _ A [ 10.74 , 22.86 ] ) ( A d s . H m i V i s . E X T R . P R E S S _ I [ 14.97 , 162.83 ] )
( T E M P _ P S [ 30.35 , 39.71 ] )
90 %
3197 ( T E M P _ P S [ 31.73 , 35.12 ] ) ( T E M P _ C I [ 40.81 , 58.54 ] ) ( % E 4 [ 31.72 , 73.18 ] )
( A d s . H m i V i s . E X T R . P R E S S _ I [ 15.14 , 178.52 ] ) ( T E M P _ P I [ 10.86 , 22.86 ] )
91 %
( T E M P _ C I [ 19.83 , 41.45 ] ) ( A d s . H m i V i s . E X T R . P R E S S _ I [ 17.89 , 65.71 ] ) ( T E M P _ S T [ 50.07 , 82.13 ] ) 92 %
3099 ( % P S [ 78.13 , 80.54 ] ) ( % P I [ 75.34 , 12.53 ] ) ( A d s . H m i V i s . E X T R . P R E S S _ O [ 21.75 , 87.35 ] ) ( % Z F [ 13.65 , 94.5 ] ) 93 %
Table 7. Relaxed Separate-and-Conquer (RSC) (RUL = 120 s).
Alarm | Rules | Precision
1017 ( T E M P _ E 4 [ 182.47 , 197.1 ] ) ( % Z P = 0 ) 91 %
( % U G [ 0 , 7.61 ] ) ( T E M P _ F L [ 187.4 , 199.3 ] ) ( % E 1 [ 32.79 , 0 ] ) 95 %
3042 ( T E M P _ C I [ 48.5 , 50.1 ] ) ( T E M P _ E 5 [ 21.2 , 165.5 ] ) ( % E 1 [ 33.34 , 12.81 ] ) 96 %
( T E M P _ C I [ 49.4 , 50.1 ] ) ( % E 4 [ 6.65 , 12.14 ] ) 94 %
3167 ( A d s . H m i V i s . C C M . C O R S A _ A P [ 113.8 , 114.5 ] ) ( T E M P _ C I [ 21.7 , 42.1 ] ) 91 %
( T E M P _ E 2 [ 175.4 , 196.9 ] ) ( T E M P _ E 1 [ 169.9 , 195 ] ) ( T E M P _ P I [ 26.1 , 31 ] ) 95 %
3197 ( T E M P _ P S [ 36.8 , 37.3 ] ) ( % E 2 [ 4.57 , 100 ] ) 94 %
( T E M P _ E 1 [ 21.3 , 164.9 ] ) ( A d s . H m i V i s . C C M . C O R S A _ A P [ 113.47 , 114.52 ] ) 96 %
3099 ( T E M P _ C I [ 21.7 , 40.5 ] ) ( T E M P _ P S [ 17.8 , 25.3 ] ) 92 %
( % U G [ 19.73 , 100 ] ) ( T E M P _ C U [ 184.4 , 188.8 ] ) ( % E 3 [ 5.49 , 100 ] ) 96 %
( % F L [ 7.99 , 8.17 ] ) ( T E M P _ C I [ 21.7 , 28.5 ] ) 94 %
Table 8. Decision Tree (DT) based rule learner (RUL = 120 s).
Alarm | Rules | Precision
1017 ( T E M P _ E 4 > 182.56 ) ( % U G < = 9.67 ) ( T E M P _ E 5 > 181.58 ) 89 %
( T E M P _ C I > 47.55 ) ( T E M P _ S T < = 49.25 ) ( % E 4 < = 13.59 ) ( T E M P _ P I > 21.09 ) 90 %
3042 ( T E M P _ C I < = 47.55 ) ( % C U < = 30.14 ) ( % Z P > 15.55 ) ( T E M P _ E 9 < = 164.66 ) 88 %
( T E M P _ C I < = 47.55 ) ( % C U > 30.14 ) ( % E 2 < = 4.74 ) ( A d s . H m i V i s . E X T R . P R E S S _ I < = 10.38 ) 89 %
( T E M P _ P I > 26.08 ) ( A d s . H m i V i s . C C M . C O R S A _ A P < = 114.48 ) ( T E M P _ E 1 > 170.59 ) ( T E M P _ E 4 > 179.15 ) 91 %
3167 ( T E M P _ P I > 26.08 ) ( A d s . H m i V i s . C C M . C O R S A _ A P > 114.48 ) ( T E M P _ P S < = 25.9 )
( T E M P _ C I < = 44.95 ) ( T E M P _ E 7 > 184.75 )
90 %
( A d s . H m i V i s . E X T R . P R E S S _ I > 5.56 ) ( A d s . H m i V i s . E X T R . P R E S S _ O < = 5.79 ) ( T E M P _ P S > 36.75 )
( % C U > 22.87 ) ( A d s . H m i V i s . C E N T R . P R E S S _ B > 0.22 ) ( % Z F < = 8.38 )
92 %
3197 ( A d s . H m i V i s . E X T R . P R E S S _ I > 5.56 ) ( A d s . H m i V i s . E X T R . P R E S S _ O < = 5.79 ) ( T E M P _ P S > 36.75 )
( % C U < = 22.87 ) ( T E M P _ F U < = 183.96 )
91 %
3099 ( % P S > 40.55 ) ( A d s . H m i V i s . C E N T R . P O T _ M 1 B > 58.68 ) ( % Z P < = 11.93 ) 93 %
Table 9. Separate-and-Conquer (SeCo) (RUL = 120 s).
Alarm | Rules | Precision
1017 ( T E M P _ E 1 [ 173.52 , 195.67 ] ) ( T E M P _ E 2 [ 167.13 , 183.73 ] ) ( % E 5 [ 26.15 , 3.75 ] ) ( % Z P [ 0 , 74 , 2.69 ] ) 89 %
( % U G [ 0.3 , 9.62 ] ) ( T E M P _ F L [ 176.86 , 197.81 ] ) ( % E 1 [ 35.7 , 0.76 ] ) ( % E 2 [ 40.35 , 1.6 ] ) 93 %
3042 ( T E M P _ C I [ 42.19 , 53.9 ] ) ( T E M P _ E 1 [ 36.81 , 112.52 ] ) ( % E 3 [ 38.7 , 17.51 ] ) ( T E M P _ P S [ 22.5 , 36.51 ] ) 92 %
( T E M P _ C I [ 47.4 , 58.42 ] ) ( % U G [ 40.19 , 88.61 ] ) ( % E 3 [ 15.8 , 47.5 ] )
( T E M P _ P S [ 16.5 , 42.62 ] ) ( % E 4 [ 18.75 , 35.9 ] )
91 %
3167 ( A d s . H m i V i s . C C M . C O R S A _ A P [ 109.54 , 119.83 ] ) ( T E M P _ E 7 [ 168.9 , 194.61 ] )
( A d s . H m i V i s . C E N T R . P R E S S _ B [ 6.28 , 23.75 ] ) ( T E M P _ C I [ 19.35 , 38.6 ] )
90 %
( T E M P _ E 6 [ 187.4 , 192.84 ] ) ( A d s . H m i V i s . C E N T R . P R E S S _ A [ 8.86 , 27.71 ] )
( T E M P _ P I [ 29.82 , 33.78 ] ) ( T E M P _ E 8 [ 125.7 , 186.6 ] )
94 %
3197 ( T E M P _ P S [ 33.56 , 43.68 ] ) ( A d s . H m i V i s . E X T R . P R E S S _ I [ 18.76 , 146.67 ] )
( % E 2 [ 7.8 , 87.4 ] ) ( % E 6 [ 3.78 , 87.92 ] )
93 %
( A d s . H m i V i s . C E N T R . P R E S S _ B [ 4.81 , 84.65 ] ) ( % E 1 [ 16.85 , 86.9 ] ) ( T E M P _ C U [ 163.7 , 193.89 ] )
( % E 6 [ 5.25 , 76.92 ] ) ( A d s . H m i V i s . C C M . C O R S A _ A P [ 103.87 , 87.3 ] ) ( T E M P _ E 3 [ 27.2 , 136.85 ] )
91 %
3099 ( T E M P _ C I [ 22.87 , 67.85 ] ) ( T E M P _ P S [ 7.89 , 18.82 ] ) ( % U G [ 29.15 , 74.63 ] ) ( % E 1 [ 8.14 , 77.34 ] ) 89 %
( % F L [ 2.96 , 18.7 ] ) ( T E M P _ C I [ 24.12 , 35.9 ] ) ( T E M P _ C U [ 167.8 , 191.3 ] ) 88 %
Table 10. TP and FP calculation.
 | RSC | DT Based | SeCo
TP | 97% | 90% | 92%
FP | 0.01% | 0.2% | 0.1%
