Article

Combining Data Envelopment Analysis and Machine Learning

by Nadia M. Guerrero, Juan Aparicio * and Daniel Valero-Carreras

Center of Operations Research (CIO), Miguel Hernandez University of Elche (UMH), 03202 Elche, Spain

* Author to whom correspondence should be addressed.
Submission received: 20 February 2022 / Revised: 5 March 2022 / Accepted: 9 March 2022 / Published: 11 March 2022
(This article belongs to the Topic Multi-Criteria Decision Making)

Abstract

Data Envelopment Analysis (DEA) is one of the most widely used non-parametric techniques for assessing technical efficiency. DEA is exclusively concerned with minimizing the empirical error while satisfying certain shape constraints (convexity and free disposability). Unfortunately, by construction, DEA is a descriptive methodology that does not prevent overfitting. In this paper, we introduce a new methodology for estimating polyhedral technologies that follows the Structural Risk Minimization (SRM) principle. This technique is called Data Envelopment Analysis-based Machines (DEAM). Because the new method controls the generalization error of the model, the corresponding estimate of the technology does not suffer from overfitting. Moreover, the notion of ε-insensitivity is introduced, yielding a new and more robust definition of technical efficiency. Additionally, we show that DEAM can be seen as a machine learning-type extension of DEA, satisfying the same microeconomic postulates except for minimal extrapolation. Finally, the performance of DEAM is evaluated through simulations. We conclude that the frontier estimator derived from DEAM is better than that associated with DEA: the bias and mean squared error obtained for DEAM are smaller in all the scenarios analyzed, regardless of the number of variables and DMUs.

1. Introduction

One of the most important issues in the field of statistical learning is the reliability of statistical inference methods. In this framework, a sophisticated theory, the so-called Generalization Theory, explains which factors must be controlled to achieve good generalization. Optimal generalization is achieved when the error generated on evaluating new data through an inference learning method is minimized. The Generalization Theory copes with those factors that allow for the minimization of the prediction or generalization error.
In terms of pattern classifiers, the generalization error is the probability of misclassifying a randomly chosen example; bounds on this error hold with high probability over randomly chosen training sets, and good generalization is achieved when this error is minimized. This aim is possible if an upper bound of the generalization error is found and the parameters on which it depends are controlled in order to reduce it. These bounds are understood as Probably Approximately Correct (PAC) bounds, which means that the probability of the bound failing is small (Probably) when the bound is achieved through a classifier that has a low error rate (Approximately Correct). The standard PAC learning model implements the idea of finding this classifier: it considers a fixed hypothesis (classifier) class together with a required accuracy and confidence, and takes into account the theory that characterizes when a function from this class can be learned from examples (a training sample) in terms of a measure called the Vapnik–Chervonenkis dimension (VC dimension).
However, statistical learning theory (Vapnik [1]) reveals that it is much more interesting not to preselect the class that will contain the target function to be learned. Instead, a set of hypothesis classes organized as a hierarchy is defined, and the target function to be learned lies in one of them. Structural Risk Minimization (SRM) copes with the problem of minimizing an upper bound on the expected risk over each of these hypothesis classes (Vapnik [2]). To implement SRM in Support Vector Machines (SVM), one must consider structures (classes) that control the two factors appearing in the bound of the expected risk: the value of the empirical risk and the complexity (the appropriate bound for the generalization error). Thus, under this principle, to select a learning algorithm, it is necessary to have a theoretical bound on the generalization error (a PAC bound) and to minimize this bound together with the empirical risk.
In contrast, standard regression methods are only concerned with the minimization of the empirical risk, which is based on the error produced by the regressors with respect to the observed dataset. This error is defined as the distance from the data to the approximation function; thus, it is a measure of the deviation of the data with respect to the regressors, and it is characterized as a residual. The vertical distance is the most common way to measure the regression error, although it is not induced by a mathematical norm. In Support Vector Regression (SVR) (Vapnik [1]), for example, the residuals that enter the empirical risk are measured through the vertical distance. Other distances, in this case based on a norm, have been used to define these residuals, such as the $\ell_1$-distance, the $\ell_2$-distance and the $\ell_\infty$-distance (Blanco et al. [3,4]).
The estimation of production functions and measures of efficiency and productivity have been the focus of a relatively large body of articles in the literature in both the economic and engineering contexts, as well as in operations research and statistics. In particular, Data Envelopment Analysis (DEA) (Charnes et al. [5] and Banker et al. [6]) is one of the existing techniques for estimating production functions and measuring efficiency. DEA relies on the construction of a polyhedral technology in the space of inputs and outputs that satisfies certain classical axioms of production theory (e.g., monotonicity and convexity). It is a non-parametric data-driven approach with many advantages from a benchmarking point of view. Additionally, the treatment of the multi-output multi-input framework is relatively straightforward with DEA, in comparison with other methods available. However, Data Envelopment Analysis has been criticized for its non-statistical nature, even being labeled as a pure descriptive tool of the data sample at a frontier level with little inferential power (its inferential power is exclusively based on the property of consistency and the increase in sample size instead of on the fundamentals of the method) (Esteve et al. [7]). DEA suffers from an overfitting problem because of the application of the minimal extrapolation principle, which places the estimator of the production function as close to the dataset as possible. This principle is also related to exclusively minimizing the empirical error (at a frontier level).
Regarding the literature related to this topic, some previous authors have tried to modify the standard DEA technique such that the new approaches work as inferential methods (with the focus on the DGP) rather than as mere descriptive tools. For example, Banker and Maindiratta [8] and Banker [9] associated DEA with maximum likelihood. Simar and Wilson [10,11,12] adapted bootstrapping to DEA. Kuosmanen and Johnson [13,14] introduced the Corrected Concave Nonparametric Least Squares. Unfortunately, despite the importance of machine learning techniques in the current literature, there have been few attempts to adapt DEA to the field of machine learning (see, for example, Esteve et al. [7], or Olesen and Ruggiero [15]). In this sense, our contribution could be seen as a new bridge between these two worlds: machine learning and efficiency measurement.
In this paper, our main objective is to propose, for the first time in the literature, a PAC bound in the context of the estimation of polyhedral technologies in microeconomics and engineering, enabling the possibility of controlling the generalization error of the estimation of the production frontier. Accordingly, we construct a model that controls the empirical error, together with the generalization error, through a PAC bound implementing the philosophy of Structural Risk Minimization by analogy with SVM. Our modeling has several implications:
(a)
For the first time, a bound of the generalization error is implemented to determine the degree of technical inefficiency of a set of Decision Making Units (DMUs).
(b)
We implement the minimization of the balance between the generalization error and the empirical error through a quadratic optimization model that will be called Data Envelopment Analysis-based Machines (DEAM), which has DEA as a particular case.
(c)
Through a computational simulation experience, we show that DEAM outperforms DEA regarding bias and mean squared error.
(d)
We estimate production technologies using robust regression models that rely on the concept of margin. As a result, the problem of efficiency measurement becomes a classification problem: to be efficient (being located within the margin) or not to be efficient (being located outside the margin).
Finally, we mention that the expected new insights gained by applying our approach (DEAM) are related to the determination of better estimates of production functions in engineering and microeconomics, in terms of bias and mean squared error. Additionally, these gains will also benefit the technical efficiency measures that can be derived from calculating the distance from a given observation to the production function estimate.
The rest of the paper is organized as follows. The following section provides the basic background. Next, in the third section, we introduce a new PAC bound for the class of piece-wise linear functions. In Section 4, a new approach called Data Envelopment Analysis-based Machines (DEAM) is defined and analyzed. Section 5 shows the main results of a computational experience comparing the new approach with DEA. Section 6 contains a discussion of the main results. Finally, the article ends with the conclusions section.

2. Background

In this section, we briefly introduce elemental notions of Support Vector Regression, Statistical Learning and Data Envelopment Analysis.

2.1. Support Vector Regression (SVR)

Machine learning (ML) studies computer processes that learn from experience and improve automatically. ML works with computer algorithms based on a learning sample (training data) and can make predictions about the behavior of future data. This behavior is studied in two different settings: supervised learning, in which the training data are vectors of predictors and responses, and unsupervised learning, where no responses are considered in the data sample. In the first setting, the objective of learning techniques is to determine the functional relationship between the predictors and the responses. In this case, the nature of the responses, whether they come from a binary variable or are real values, determines the kind of problem to solve: a classification problem or a regression problem, respectively. In the second setting, since there are no responses, the objective is to gain knowledge about the processes lying behind data generation, such as density estimation or clustering. Our paper largely focuses on the regression problem within supervised learning, bearing in mind that our data comprise inputs utilized by firms to produce outputs (real values).
Support Vector Machines (SVM) (Vapnik [1,16]) is one of the most prominent supervised learning techniques in ML. SVM is an algorithm constructed on the foundations of statistical learning theory and is in line with the Structural Risk Minimization (SRM) method. SRM is implemented to construct support vector machines, where the objective is to control the value of the empirical risk and the value of the VC dimension, the latter giving rise to the regularization term that appears when the generalization error, rather than only the empirical error, must be minimized (Vapnik [1,16]). In particular, the notion of the VC dimension is defined as follows:
Definition 1 (VC dimension).
Let $H$ be a set of binary-valued functions. A set of points $X$ is shattered by $H$ if, for all binary vectors $\mathbf{b}$ indexed by $X$, there is a function $f_{\mathbf{b}} \in H$ performing $\mathbf{b}$ on $X$. The VC dimension, $VC\dim(H)$, of the set $H$ is the size of the largest shattered set.
With regard to the classification problem in SVM (for the regression problem, the generalization error is defined in the same way, because the regression problem can be turned into a classification problem, as we explain below), minimizing the generalization error consists of minimizing the probability of incorrectly classifying any new data drawn from the unknown distribution that generated the learning sample. This aim is possible if a bound of the generalization error is found and the parameters on which it depends are controlled to reduce the bound. These bounds are understood as Probably Approximately Correct (PAC) bounds, which were first proposed by Valiant [17]. Standard PAC learning implements the idea of finding this classifier: it considers a fixed hypothesis (classifier) class together with a required accuracy and confidence, and takes into account the theory that characterizes when a function from this class can be learned from examples (data). In the case of regression, the exercise involves converting the regression problem (function estimation) into a classification problem, because the bounds on the generalization error are based precisely on the VC dimension or, when a margin is considered, on the fat-shattering dimension (effective VC dimension).
Next, we show the definition of the fat-shattering dimension. Notice that bold will be utilized for denoting vectors, and non-bold for scalars.
Definition 2 (fat-shattering dimension).
Let $F$ be a set of real-valued functions. A set of points $X$ is $\gamma$-shattered by $F$ if there are real numbers $r_x$ indexed by $x \in X$ such that, for all binary vectors $\mathbf{b}$ indexed by $X$, there is a function $f_{\mathbf{b}} \in F$ such that
$$ f_{\mathbf{b}}(x) \begin{cases} \geq r_x + \gamma & \text{if } b_x = 1, \\ \leq r_x - \gamma & \text{otherwise.} \end{cases} $$
The fat-shattering dimension of the set $F$, $\mathrm{fat}_F$, is a function from the positive real numbers to the integers that maps a value $\gamma$ to the size of the largest $\gamma$-shattered set. The VC dimension corresponds to the largest shattered set with $\gamma = 0$, which is the concept first used by Vapnik to state a bound for the generalization error. This is the reason why the fat-shattering dimension is also known as the effective VC dimension.
To convert the regression problem into a classification problem, a threshold $\theta > 0$ marking the admissible error needs to be set, such that a mistake is considered to have been made whenever the loss function exceeds it when testing the model on new data. The loss function is the function that measures the distance between the real value of the output and the value of said output estimated through the model. Given a margin $\gamma > 0$, a training point is considered a mistake if the loss function exceeds the value $\theta - \gamma$. Then, $\gamma$ measures the discrepancy between the two losses: those measured on test data and those measured on training data. Under this re-interpretation of the regression problem, it is possible to use the dimension-free bounds already developed for classification. In our case, we focus on the bound obtained by Shawe-Taylor and Cristianini [18], based on the fat-shattering dimension.
Theorem 1 (Shawe-Taylor and Cristianini [18]).
Let $F$ be a sturdy class of real-valued functions with range $[-a, a]$ and fat-shattering dimension bounded by $\mathrm{fat}_F(\gamma)$. Fix $\theta \in \mathbb{R}$ with $\theta > 0$ and a scaling of the output range $\kappa \in \mathbb{R}^+$. Consider a fixed but unknown probability distribution on the space $X \times \mathbb{R}$. Then, with probability $1 - \rho$ over randomly drawn training sets $S$ of size $m$, for all $\gamma$ with $\theta \geq \gamma > 0$, the probability that a training set filtered function $f \in F$ has an error larger than $\theta$ on a randomly chosen input is bounded by
$$ \varepsilon(m, d, \rho) = \frac{2}{m}\left( d \log_2\!\left(\frac{256\,m\,c}{\gamma^2}\right)\log_2\!\left(\frac{16\,e\,m\,c}{\gamma}\right) + \log_2\!\left(\frac{16\,m^{1.5}\,a}{\rho\,\kappa}\right)\right), \qquad (1) $$
where $c = \max\{a,\ D(S, f, \gamma) + \kappa\}$ and
$$ d = \mathrm{fat}_F(\gamma/16) + \frac{16\,(D(S, f, \gamma) + \kappa)}{\gamma^2}, \qquad (2) $$
provided $m \geq 2/\varepsilon$.
In the statement of Theorem 1, $D(S, f, \gamma) = \sqrt{\sum_{(\mathbf{x}, y) \in S} \xi(\mathbf{x}, y, f, \gamma)^2} = \|\xi\|_2$ and $\xi(\mathbf{x}, y, f, \gamma) = \max\{0,\ e_f(\mathbf{x}, y) - (\theta - \gamma)\}$, where $e_f$ is the loss function that the analyst selects in order to measure how much $f$ exceeds the error margin $\theta - \gamma$. In addition, the theorem introduces the concept of the fat-shattering dimension, $\mathrm{fat}_F(\gamma)$, that is, the generalization of the VC dimension, which is sensitive to the size of the margin $\gamma$.
Theorem 1 is a general result that must be particularized for each function class $F$: for each function class, the fat-shattering dimension is bounded in a different way, and consequently, the same happens with the expected error proposed in (1). In the case of linear function classes, the fat-shattering dimension was bounded by Bartlett and Shawe-Taylor [19].
Theorem 2 (Bartlett and Shawe-Taylor [19]).
Suppose that $X$ is a ball of radius $r$ centered at $\mathbf{0}_m$ in $\mathbb{R}^m$, i.e., $X = \{\mathbf{x} \in \mathbb{R}^m : \|\mathbf{x}\| \leq r\}$, and consider the set
$$ F = \{\mathbf{x} \mapsto \mathbf{w} \cdot \mathbf{x} : \|\mathbf{w}\| \leq 1,\ \mathbf{x} \in X\}. $$
Then
$$ \mathrm{fat}_F(\gamma) \leq \left(\frac{r}{\gamma}\right)^2. $$
The most general version of this theorem, in which $\|\mathbf{w}\|$ is not restricted to be at most $1$, bounds the fat-shattering dimension of linear classifiers as $\mathrm{fat}_F(\gamma) \leq \left(\frac{\|\mathbf{w}\|\, r}{\gamma}\right)^2$.
The following two previously published lemmas are significant for our purposes throughout this paper (see Bartlett and Shawe-Taylor [19]).
Lemma 1.
For every input set $S$ $\gamma$-shattered by $F = \{\mathbf{x} \mapsto \mathbf{w} \cdot \mathbf{x} : \mathbf{x} \in X\}$ (the linear hypothesis class) and for every subset $S_0 \subseteq S$, the inequality $\left\|\sum S_0 - \sum (S \setminus S_0)\right\| \geq \frac{|S|\,\gamma}{\|\mathbf{w}\|}$ holds.
Lemma 2.
For all $S \subseteq \mathbb{R}_+^m$ with $\|\mathbf{x}\| \leq r$ for $\mathbf{x} \in S$, a certain $S_0 \subseteq S$ satisfies $\left\|\sum S_0 - \sum (S \setminus S_0)\right\| \leq \sqrt{|S|}\, r$.
Then, $\frac{|S|\,\gamma}{\|\mathbf{w}\|} \leq \left\|\sum S_0 - \sum (S \setminus S_0)\right\| \leq \sqrt{|S|}\, r$. In particular, $\frac{|S|\,\gamma}{\|\mathbf{w}\|} \leq \sqrt{|S|}\, r$, and it is possible to conclude that all sets of inputs $S$ $\gamma$-shattered by $F$ are bounded. Therefore, the $\gamma$-shattered set with the highest cardinality is also bounded, which yields the fat-shattering dimension bound $\mathrm{fat}_F(\gamma) \leq \left(\frac{\|\mathbf{w}\|\, r}{\gamma}\right)^2$.
Now, if $\gamma$ is fixed in such a way that $\theta \geq \gamma > 0$, and the logarithmic factors in (1) are disregarded, the only term left to reduce in the expected error is (2). This can be achieved by minimizing its bound which, in the case of linear functions, reduces to minimizing
$$ \|\mathbf{w}\|^2 + C\, D^2. $$
This expected error bound meets the SRM objective: the minimization process goes beyond minimizing the empirical risk, i.e., $D^2 = \sum_{(\mathbf{x}, y) \in S} \xi(\mathbf{x}, y, f, \gamma)^2$. It also controls the capacity of the estimation function to provide a suitable prediction when a new (out-of-sample) observation is introduced, which is achieved through the regularization term $\|\mathbf{w}\|^2$ that bounds the fat-shattering dimension (PAC bound). The minimization of this bound corresponds to the objective of the regression problem associated with Support Vector Regression (SVR).
Support Vector Regression (SVR), as with any regression approach, attempts to construct a function capable of predicting the behavior of the response variable under study. SVR sets out to predict the value of a continuous response variable $y \in \mathbb{R}_+$ given a vector of covariables $\mathbf{x} \in \mathbb{R}_+^m$. Hence, SVR establishes a function $\hat{f}: \mathbb{R}_+^m \rightarrow \mathbb{R}$ such that, given $\mathbf{x}$, $\hat{f}(\mathbf{x})$ yields the prediction of the response variable. Under the SVR principle, the linear predictor $\hat{f}$ can be defined as $\hat{f}(\mathbf{x}) = \mathbf{w}^* \cdot \mathbf{x} + b^*$, where $\mathbf{w}^* \in \mathbb{R}^m$ and $b^* \in \mathbb{R}$ are optimal solutions of the optimization model below:
$$
\begin{array}{ll}
\min\limits_{\mathbf{w},\, b,\, \xi_i,\, \xi_i^*} & \|\mathbf{w}\|^2 + C \sum\limits_{i=1}^{n} \left(\xi_i^2 + \xi_i^{*2}\right) \\
\text{s.t.} & y_i - (\mathbf{w} \cdot \mathbf{x}_i + b) \leq \varepsilon + \xi_i, \quad i = 1, \dots, n \\
& (\mathbf{w} \cdot \mathbf{x}_i + b) - y_i \leq \varepsilon + \xi_i^*, \quad i = 1, \dots, n \\
& \xi_i, \xi_i^* \geq 0, \quad i = 1, \dots, n
\end{array}
$$
In performing this methodology, the values of $C \in \mathbb{R}_+$ and $\varepsilon \in \mathbb{R}_+$ are obtained by a cross-validation process. SVR yields an estimator $\hat{f}(\mathbf{x})$ of the response variable given $\mathbf{x}$, as well as lower and upper 'correcting' surfaces, defined as $\hat{f}(\mathbf{x}) - \varepsilon$ and $\hat{f}(\mathbf{x}) + \varepsilon$, where $\varepsilon$ is a margin that endows the estimator linked to SVR with robustness (see Figure 1). Additionally, observations below the surface $\hat{f}(\mathbf{x}) - \varepsilon$ reveal an associated (empirical) error of $\xi_i^* > 0$ (with $\xi_i = 0$), while observations above the surface $\hat{f}(\mathbf{x}) + \varepsilon$ present an (empirical) error of $\xi_i > 0$ (with $\xi_i^* = 0$). Observations between the surfaces $\hat{f}(\mathbf{x}) - \varepsilon$ and $\hat{f}(\mathbf{x}) + \varepsilon$ reveal an error of zero (with $\xi_i = \xi_i^* = 0$). The objective function represents the combination of regression and regularization involved in SVR, combining the empirical error term $\sum_{i=1}^{n}(\xi_i^2 + \xi_i^{*2})$ and the regularization term $\|\mathbf{w}\|^2$ through a weight $C$, thus balancing both components (Vazquez and Walter [20]). Moreover, although hyperplanes are linear in shape, it must be highlighted that SVR is able to generate estimation functions that are not necessarily linear in the original $(\mathbf{x}, y)$ space; this can be achieved by using a transformation function $\phi$ from the covariable space, $\phi: \mathbb{R}_+^m \rightarrow Z$. Figure 1 shows the solution of the linear estimator achieved by an SVR model, as well as the graphical representation of the residuals (empirical error) for two points and the hyperplanes that define the margins (dashed lines).
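To make the formulation above concrete, the following is a minimal sketch that solves this quadratic program with the cvxpy modeling library; the library choice, function name, and toy data are our own illustrative assumptions and are not part of the original methodology.

```python
import numpy as np
import cvxpy as cp

def fit_linear_svr(X, y, C=10.0, eps=0.1):
    """Solve the SVR quadratic program above: min ||w||^2 + C * sum(xi^2 + xi*^2)."""
    n, m = X.shape
    w = cp.Variable(m)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)        # slack for observations above f(x) + eps
    xi_star = cp.Variable(n, nonneg=True)   # slack for observations below f(x) - eps
    objective = cp.Minimize(cp.sum_squares(w)
                            + C * (cp.sum_squares(xi) + cp.sum_squares(xi_star)))
    constraints = [y - (X @ w + b) <= eps + xi,
                   (X @ w + b) - y <= eps + xi_star]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# Toy usage with simulated data (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(30, 2))
y = X @ np.array([0.5, 0.3]) + rng.normal(0, 0.2, size=30)
w_hat, b_hat = fit_linear_svr(X, y)
```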
The next subsection explains how Data Envelopment Analysis (DEA) works.

2.2. Data Envelopment Analysis (DEA)

Let us consider the observation of $n$ Decision Making Units (DMUs). $DMU_i$ takes up $\mathbf{x}_i = (x_i^{(1)}, \dots, x_i^{(m)}) \in \mathbb{R}_+^m$ amounts of inputs to generate $\mathbf{y}_i = (y_i^{(1)}, \dots, y_i^{(s)}) \in \mathbb{R}_+^s$ amounts of outputs. The relative efficiency of each unit in the sample is evaluated by referring to the so-called production possibility set or technology, which is essentially the set of producible bundles $(\mathbf{x}, \mathbf{y})$. It is generally defined as:
$$ T = \{(\mathbf{x}, \mathbf{y}) \in \mathbb{R}_+^{m+s} : \mathbf{x}\ \text{can produce}\ \mathbf{y}\}. $$
Under Data Envelopment Analysis (DEA) (Charnes et al. [5] and Banker et al. [6] and, more recently, Villa et al. [21], Sahoo et al. [22], and Amirteimoori [23]), $T$ is usually assumed to satisfy free disposability with regard to inputs and outputs; that is, if $(\mathbf{x}, \mathbf{y}) \in T$, then $(\mathbf{x}', \mathbf{y}') \in T$ for all $\mathbf{x}' \geq \mathbf{x}$ and $\mathbf{y}' \leq \mathbf{y}$. Convexity of $T$ is also generally assumed (see, e.g., Färe and Primont [24]).
As far as the measurement of technical efficiency is concerned, a certain subset of $T$ is of interest. We allude to the weakly efficient set of $T$, defined as $W(T) := \{(\mathbf{x}, \mathbf{y}) \in T : \hat{\mathbf{x}} < \mathbf{x},\ \hat{\mathbf{y}} > \mathbf{y} \Rightarrow (\hat{\mathbf{x}}, \hat{\mathbf{y}}) \notin T\}$. (Let $\mathbf{z} = (z_1, \dots, z_q)$ and $\mathbf{t} = (t_1, \dots, t_q)$. Then, $\mathbf{z} < \mathbf{t}$ means $z_j < t_j$ for all $j = 1, \dots, q$.) Some authors (see, for example, Briec and Lesourd [25]) define technical efficiency as the distance from a point in $T$ to the weakly efficient set.
When $s = 1$, this context is confined to the central concept of the production function $f$. Accordingly, $m$ input variables are used to yield a univariate output, and hence, we can define the technology as:
$$ T = \{(\mathbf{x}, y) \in \mathbb{R}_+^{m+1} : y \leq f(\mathbf{x})\}. $$
According to the distance selected for measuring technical inefficiency, different DEA models emerge (Cooper et al. [26]). The directional distance function (DDF) is a relevant example. For $m$ inputs and one output, resorting to the directional vector $\mathbf{g} = (\mathbf{g}^-, g^+)$, where $\mathbf{g}^- = \mathbf{1}_m$ and $g^+ = 1$, the DDF problem has the following structure when the efficiency level of $DMU_i$, $i = 1, \dots, n$, is assessed:
$$
\begin{array}{ll}
\max\limits_{\beta_i,\, \lambda_1, \dots, \lambda_n} & \beta_i \\
\text{s.t.} & \sum\limits_{k=1}^{n} \lambda_k x_k^{(j)} \leq x_i^{(j)} - \beta_i, \quad j = 1, \dots, m \\
& \sum\limits_{k=1}^{n} \lambda_k y_k \geq y_i + \beta_i, \\
& \sum\limits_{k=1}^{n} \lambda_k = 1, \\
& \lambda_k \geq 0, \quad k = 1, \dots, n
\end{array}
\qquad (6)
$$
Given that (6) is a linear program, we can equivalently solve its corresponding dual formulation:
$$
\begin{array}{ll}
\min\limits_{\mathbf{c}_i,\, p_i,\, \alpha_i} & -p_i y_i + \mathbf{c}_i \cdot \mathbf{x}_i + \alpha_i \\
\text{s.t.} & p_i y_k - \mathbf{c}_i \cdot \mathbf{x}_k - \alpha_i \leq 0, \quad k = 1, \dots, n \\
& \|(\mathbf{c}_i, p_i)\|_1 = 1, \\
& p_i \geq 0,\ c_i^{(j)} \geq 0, \quad j = 1, \dots, m
\end{array}
\qquad (7)
$$
DEA models must be solved for each $DMU_i$, $i = 1, \dots, n$, in the sample.
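For illustration, the sketch below solves the DDF linear program (6) with $\mathbf{g} = (\mathbf{1}_m, 1)$ using the cvxpy library; the library, function name, and toy data are our own assumptions rather than part of the original exposition.

```python
import numpy as np
import cvxpy as cp

def ddf_score(X, Y, i):
    """DDF model (6) for DMU i with directional vector g = (1_m, 1)."""
    n, m = X.shape
    beta = cp.Variable()
    lam = cp.Variable(n, nonneg=True)
    constraints = [cp.sum(cp.multiply(lam, X[:, j])) <= X[i, j] - beta
                   for j in range(m)]                        # input constraints
    constraints += [cp.sum(cp.multiply(lam, Y)) >= Y[i] + beta,  # output constraint
                    cp.sum(lam) == 1]                            # convexity (VRS)
    cp.Problem(cp.Maximize(beta), constraints).solve()
    return beta.value  # beta = 0 means DMU i lies on the weakly efficient frontier

# Toy usage with simulated data (assumed): 10 DMUs, 2 inputs, 1 output
rng = np.random.default_rng(1)
X = rng.uniform(1, 10, size=(10, 2))
Y = (X[:, 0] * X[:, 1]) ** 0.3
scores = [ddf_score(X, Y, i) for i in range(len(Y))]
```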
Figure 2 shows an example of the DDF model with a directional vector $\mathbf{g} = (\mathbf{g}^-, g^+) = (\mathbf{1}_m, 1)$. Note that DEA generates a piece-wise linear technology (the region below the line), satisfying free disposability in inputs and outputs and convexity. Note also that the DEA estimate envelops all the observations from above. In this case, with $\mathbf{g} = (\mathbf{1}_m, 1)$, the DDF coincides with a particular distance between the data and $W(T)$: the $\ell_\infty$-distance (Briec [27] and Briec and Lesourd [25]).
In this paper, our purpose is to construct a method that generates piece-wise linear frontiers as in Figure 2, by implementing the minimization of the generalization error of the model.

3. New PAC Learning with Piece-Wise Linear Hypothesis

This section revolves around two stages in the search for the generalization error bound. The first stage is the construction of the class of piece-wise linear hypotheses, whose elements are built from hyperplanes located as close as possible to the data sample through the $\ell_\infty$-distance. The second stage is the construction of a bound on the fat-shattering dimension of the hypothesis class constructed in the first stage. The minimization of the bound of the expected error, using the fat-shattering dimension bound obtained, gives rise to the Data Envelopment Analysis-based Machines (DEAM) model as a method for estimating piece-wise linear production functions, which minimizes the generalization error as well as the empirical error.
To obtain this bound for the class of functions of interest, we must derive the fat-shattering dimension bound for a hypothesis class with the desired piece-wise structure. Then, minimizing the generalization error will be implemented through the minimization of the fat-shattering dimension bound. For this task, a previous step must be taken: a class of piece-wise linear hypotheses must be defined. A piece-wise linear hypothesis is defined by a combination of $n$ hyperplanes $\{H_p\}_{p=1,\dots,n}$ that are selected to evaluate the data depending on their input values. The hyperplanes are defined for each input value $\mathbf{x} \in \mathbb{R}_+^m$ as follows:
$$ H_{p(\mathbf{x})} = \left\{(\mathbf{x}, y) \in \mathbb{R}^{m+1} : \mathbf{w}_{p(\mathbf{x})} \cdot \mathbf{x} + \beta_{p(\mathbf{x})} - \delta_{p(\mathbf{x})}\, y = 0\right\}. $$
Then, if we suppose $\delta_{p(\mathbf{x})} > 0$, $p(\mathbf{x}) \in \{1, \dots, n\}$, each output value estimated through the set of $n$ hyperplanes $\{H_p\}_{p=1,\dots,n}$ can be written as a function of the input vector $\mathbf{x}$:
$$ h(\mathbf{x}) = \frac{\mathbf{w}_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \cdot \mathbf{x} + \frac{\beta_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}}, \qquad p(\mathbf{x}) \in \{1, \dots, n\}, $$
with $\mathbf{w}_{p(\mathbf{x})} \in \mathbb{R}_+^m$ and $\beta_{p(\mathbf{x})} \in \mathbb{R}$. The value of $p(\mathbf{x}) \in \{1, \dots, n\}$, in our case, is chosen by considering two desired conditions inherited from production theory:
$$ h(\mathbf{x}) = \frac{\mathbf{w}_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \cdot \mathbf{x} + \frac{\beta_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \geq 0, \qquad (8) $$
and
$$ \frac{\mathbf{w}_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \cdot \mathbf{x} + \frac{\beta_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \;\leq\; \frac{\mathbf{w}_{p}}{\delta_{p}} \cdot \mathbf{x} + \frac{\beta_{p}}{\delta_{p}}, \qquad \forall p \in \{1, \dots, n\}. \qquad (9) $$
Condition (8) ensures that the estimation of the output value associated with an input $\mathbf{x} \in \mathbb{R}_+^m$ will always be non-negative. Additionally, condition (9) guarantees that the estimation $h(\mathbf{x})$ through the hyperplane $H_{p(\mathbf{x})}$ is less than or equal to the estimation through any other hyperplane $H_p$. Condition (9) is the one that imposes concavity on the model. This type of condition was the key for stating concavity in general multiple-regressor modeling in microeconomics (Afriat [28]; Kuosmanen and Johnson [13]). In particular, if the production function is concave, then the technology defined from this production function is convex.
The function class of piece-wise linear hypotheses can be constructed as follows:
$$ F = \left\{ \mathbf{x} \mapsto \frac{\mathbf{w}_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \cdot \mathbf{x} + \frac{\beta_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} :\ \mathbf{x} \in \mathbb{R}_+^m,\ \frac{\mathbf{w}_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \cdot \mathbf{x} + \frac{\beta_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \geq 0,\ \frac{\mathbf{w}_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \cdot \mathbf{x} + \frac{\beta_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \leq \frac{\mathbf{w}_{p}}{\delta_{p}} \cdot \mathbf{x} + \frac{\beta_{p}}{\delta_{p}},\ p(\mathbf{x}), p \in \{1, \dots, n\} \right\}. \qquad (10) $$
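As a simple illustration (with names and a toy interface of our own choosing), conditions (8) and (9) mean that evaluating a hypothesis in $F$ amounts to taking the lower envelope of the $n$ hyperplanes:

```python
import numpy as np

def piecewise_hypothesis(x, W, beta, delta):
    """Evaluate h(x) for a piece-wise linear hypothesis built from n hyperplanes.

    W is an (n, m) array of slopes w_p, beta an (n,) array of intercepts and delta an
    (n,) array of positive scalars. Condition (9) selects the hyperplane giving the
    lowest estimate (the concave lower envelope); condition (8) additionally requires
    the value to be non-negative on the input domain, which we assume holds for the
    fitted coefficients.
    """
    estimates = (W @ x + beta) / delta   # estimate of the output through each H_p
    return float(estimates.min())        # p(x) is the argmin: the lower envelope
```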
Now, we can proceed with the second step: establishing a bound for the fat-shattering dimension of this function class in order to control the generalization error. Before proving the main theorem of this section, we need to state a technical lemma. In the results, $r \in \mathbb{R}_+$ is the radius of the ball centered at $\mathbf{0}_m$ that bounds the input data in the data sample.
Lemma 3. 
If an input learning sample $S = \{\mathbf{x}_1, \dots, \mathbf{x}_d\}$ is $\gamma$-shattered through $F$ defined in (10), then every subset $S_0 \subseteq S$ satisfies
$$ \left\| \sum S_0 - \sum (S \setminus S_0) \right\| \;\geq\; |S| \left( \frac{\min_{p \in \{1,\dots,n\}} \left(\gamma - \frac{\beta_p}{\delta_p}\right)}{\max_{p \in \{1,\dots,n\}} \left\| \frac{\mathbf{w}_p}{\delta_p} \right\|} - 2r \right), $$
denoting by $\sum S_0$ and $\sum (S \setminus S_0)$ the sums of the elements in $S_0$ and $S \setminus S_0$, respectively, and by $|S|$ the cardinality of the set $S$.
Proof. 
See Appendix A. ☐
Next, we prove the main theorem of this section. In particular, we state the bound for the fat-shattering dimension for piece-wise linear hypothesis classes.
Theorem 3.
Let $X$ be the ball of radius $r$ centered at $\mathbf{0}_m$ in $\mathbb{R}^m$, i.e., $X = \{\mathbf{x} \in \mathbb{R}^m : \|\mathbf{x}\| \leq r\}$, and let the hypothesis class be as follows:
$$ F = \left\{ \mathbf{x} \mapsto \frac{\mathbf{w}_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \cdot \mathbf{x} + \frac{\beta_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} :\ \mathbf{x} \in \mathbb{R}_+^m,\ \frac{\mathbf{w}_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \cdot \mathbf{x} + \frac{\beta_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \geq 0,\ \frac{\mathbf{w}_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \cdot \mathbf{x} + \frac{\beta_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \leq \frac{\mathbf{w}_{p}}{\delta_{p}} \cdot \mathbf{x} + \frac{\beta_{p}}{\delta_{p}},\ p(\mathbf{x}), p \in \{1, \dots, n\} \right\}. \qquad (12) $$
Then
$$ \mathrm{fat}_F(\gamma) \;\leq\; \left( \frac{r}{\dfrac{\min_{p \in \{1,\dots,n\}} \left(\gamma - \frac{\beta_p}{\delta_p}\right)}{\max_{p \in \{1,\dots,n\}} \left\| \frac{\mathbf{w}_p}{\delta_p} \right\|} - 2r} \right)^{2}. $$
Proof. 
See Appendix A. ☐
The next section involves the task of achieving a model that minimizes the established generalization error through the $\ell_\infty$-distance.

4. Data Envelopment Analysis-Based Machines (DEAM)

Data Envelopment Analysis-based Machines (DEAM) can be defined from the idea of minimizing the expected error proposed in (1). If we do not consider the logarithmic factors, we can directly focus on minimizing d in this expression, for which a bound on the generalization error has been found in the case of the piece-wise linear hypothesis class F defined in (12):
$$ d = \mathrm{fat}_F(\gamma/16) + \frac{16\,(D(S, f, \gamma) + \kappa)}{\gamma^2} \;\leq\; \underbrace{\left( \frac{r}{\dfrac{\min_{p \in \{1,\dots,n\}} \left(\frac{\gamma}{16} - \frac{\beta_p}{\delta_p}\right)}{\max_{p \in \{1,\dots,n\}} \left\| \frac{\mathbf{w}_p}{\delta_p} \right\|} - 2r} \right)^{2}}_{A} + \underbrace{\frac{16\,(D(S, f, \gamma) + \kappa)}{\gamma^2}}_{B}. \qquad (14) $$
Because of the complexity of implementing an optimization model whose objective function directly minimizes the above bound, we break the minimization of the whole bound into different objectives, collected in an aggregation function that conforms the objective function of the final optimization program associated with DEAM, shown later in this section.
Once the number of different hyperplanes in each hypothesis is set equal to the number of elements in the learning sample, $|S| = n$, minimizing the bound of the fat-shattering dimension requires minimizing part $A$ in (14). This is equivalent to maximizing $\frac{\min_{p \in \{1,\dots,n\}} \left(\frac{\gamma}{16} - \frac{\beta_p}{\delta_p}\right)}{\max_{p \in \{1,\dots,n\}} \left\| \frac{\mathbf{w}_p}{\delta_p} \right\|}$. Regarding this last expression, we must maximize the numerator and minimize the denominator, as follows:
(i)
The vector of coefficients (slopes) corresponding to the hyperplane $H_p$ is $(\mathbf{w}_p, \delta_p)$. We can consider, without loss of generality, that $\|(\mathbf{w}_p, \delta_p)\|_1 = 1$. Then, minimizing $\max_{p \in \{1,\dots,n\}} \left\| \frac{\mathbf{w}_p}{\delta_p} \right\|$ is equivalent to minimizing $\max_{p \in \{1,\dots,n\}} \frac{1}{1 - \|\mathbf{w}_p\|_1}$, since $\delta_p = 1 - \|\mathbf{w}_p\|_1 \geq 0$, $p \in \{1, \dots, n\}$. Focusing on this last equivalence, the objective can be directly translated into minimizing $\max_{p \in \{1,\dots,n\}} \|\mathbf{w}_p\|$.
(ii)
Maximizing $\min_{p \in \{1,\dots,n\}} \left(\frac{\gamma}{16} - \frac{\beta_p}{\delta_p}\right)$ with a fixed value of the margin $\gamma$ is equivalent to minimizing $\max_{p \in \{1,\dots,n\}} \frac{\beta_p}{\delta_p}$. Because $\|(\mathbf{w}_p, \delta_p)\|_1 = 1$, minimizing $\max_{p \in \{1,\dots,n\}} \|\mathbf{w}_p\|$ in (i) simultaneously maximizes the elements $\delta_p$, $p \in \{1, \dots, n\}$. In this way, it is only necessary to minimize $\max_{p \in \{1,\dots,n\}} \beta_p$ in order to minimize $\max_{p \in \{1,\dots,n\}} \frac{\beta_p}{\delta_p}$.
Finally, a way of implementing (i) and (ii) is minimizing $u + v$, where $\|\mathbf{w}_p\| \leq u$ and $|\beta_p| \leq v$, $p \in \{1, \dots, n\}$. Accordingly, minimizing the bound of the fat-shattering dimension, $A + B$, leads to minimizing $u + v + C D^2$, where $D^2 = D(S, f, \gamma)^2 = \|\xi\|_2^2$ and $C$ is a parameter to be tuned by, for example, a cross-validation process. As a loss function we use $\xi(\mathbf{x}, y, f, \gamma) = \max\{0,\ D((\mathbf{x}, y), f) - (\theta - \gamma)\}$. Finally, the objective function has the following structure:
$$ z(u, v, \xi_1, \dots, \xi_n) = u + v + C\, \|\xi\|_2^2. $$
Accordingly, we introduce the optimization model that defines DEAM:
$$
\begin{array}{lll}
\min\limits_{\mathbf{w},\, \beta,\, \delta,\, \xi,\, \xi',\, u,\, v} & u + v + C \left( \sum\limits_{i=1}^{n} \xi_i^2 + \sum\limits_{i=1}^{n} \xi_i'^2 \right) & \\
\text{s.t.} & \|\mathbf{w}_i\|_1 \leq u, & i = 1, \dots, n \quad (16.1) \\
& |\beta_i| \leq v, & i = 1, \dots, n \quad (16.2) \\
& \delta_p\, y_i \leq \mathbf{w}_p \cdot \mathbf{x}_i + \beta_p, & i, p = 1, \dots, n \quad (16.3) \\
& \mathbf{w}_i, \delta_i \geq 0, & i = 1, \dots, n \quad (16.4) \\
& \mathbf{w}_i \cdot \mathbf{x}_i + \beta_i - \delta_i\, y_i \leq \varepsilon + \xi_i, & i = 1, \dots, n \quad (16.5) \\
& \delta_i\, y_i - \mathbf{w}_i \cdot \mathbf{x}_i - \beta_i \leq \varepsilon + \xi_i', & i = 1, \dots, n \quad (16.6) \\
& \xi_i, \xi_i' \geq 0, & i = 1, \dots, n \quad (16.7) \\
& \|(\mathbf{w}_i, \delta_i)\|_1 = 1, & i = 1, \dots, n \quad (16.8) \\
& \mathbf{w}_i \cdot \mathbf{x}_i + \beta_i - \delta_i\, y_i \leq \mathbf{w}_p \cdot \mathbf{x}_i + \beta_p - \delta_p\, y_i, & i, p = 1, \dots, n \quad (16.9)
\end{array}
\qquad (16)
$$
Model (16) determines a maximum of $n$ different hyperplanes. The intersection of the half-spaces defined by these hyperplanes gives rise to the estimator of the underlying (convex) production technology. The number of hyperplanes considered in the implementation of the DEAM model can be seen as a key parameter of our approach, since the results could differ depending on it. However, we suggest using $n$ hyperplanes, coinciding with the number of DMUs. This is due to the experimental evidence found in the simulation study carried out in Section 5: we analyzed 2000 databases and, in all these cases, the number of hyperplanes at the optimum was less than the number of DMUs in the corresponding data sample. This situation can be identified because some hyperplanes are repeated at the optimal solution of each problem.
Let us now explain each constraint of model (16) in detail. Constraints (16.1) and (16.2) come from $\|\mathbf{w}_p\| \leq u$ and $|\beta_p| \leq v$, $p = 1, \dots, n$, respectively. The $\ell_1$ norm is used to be consistent with constraint (16.8); additionally, this type of norm leads to linear constraints, which are easier to solve from a computational point of view. Constraint (16.3) is equivalent to $y_i \leq \frac{\mathbf{w}_p}{\delta_p} \cdot \mathbf{x}_i + \frac{\beta_p}{\delta_p}$, $i, p = 1, \dots, n$, i.e., it ensures that the hyperplanes envelop the data sample from above. Condition (16.4) forces the $n$ hyperplanes to be monotonic non-decreasing and is responsible for the satisfaction of the property of free disposability, as we will show later in the text (see Proposition 2 below). Constraints (16.5), (16.6), (16.7) and (16.8) allow for characterizing $\xi(\mathbf{x}, y, f, \gamma)$ as $\max\{0,\ D((\mathbf{x}, y), f) - (\theta - \gamma)\}$. The parameter $\varepsilon = \theta - \gamma \geq 0$ will be chosen by cross validation. Let us now interpret the value at the optimum of the decision variable $\xi_i$, paying attention to constraint (16.5). If $\mathbf{w}_i \cdot \mathbf{x}_i + \beta_i - \delta_i\, y_i - \varepsilon \geq 0$, then $\xi_i = \mathbf{w}_i \cdot \mathbf{x}_i + \beta_i - \delta_i\, y_i - \varepsilon$, since $\sum_{i=1}^{n} \xi_i^2$ is minimized in the objective function. In this way, considering (16.8), (16.3) and $\varepsilon \geq 0$, $\xi_i$ can be interpreted as the $\ell_\infty$-distance from the observation $(\mathbf{x}_i, y_i)$ to the hyperplane $H_i^\varepsilon$:
$$ \xi_i = \frac{\mathbf{w}_i \cdot \mathbf{x}_i + \beta_i - \delta_i\, y_i - \varepsilon}{\|(\mathbf{w}_i, \delta_i)\|_1} = D_{\ell_\infty}\left((\mathbf{x}_i, y_i), H_i^\varepsilon\right), $$
where $H_i^\varepsilon = \{(\mathbf{x}, y) \in \mathbb{R}^{m+1} : \mathbf{w}_i \cdot \mathbf{x} + \beta_i - \delta_i\, y - \varepsilon = 0\}$ (Mangasarian [29]). If $\mathbf{w}_i \cdot \mathbf{x}_i + \beta_i - \delta_i\, y_i - \varepsilon < 0$, then $\xi_i = 0$ by (16.7) and the minimization of $\sum_{i=1}^{n} \xi_i^2$. Additionally, regarding the value of $\xi_i'$, $i = 1, \dots, n$, by constraints (16.3), (16.6), $\varepsilon \geq 0$ and the minimization of $\sum_{i=1}^{n} \xi_i'^2$, we obtain $\xi_i' = 0$ for all $i = 1, \dots, n$ at the optimum. This has computational implications for the model, since constraint (16.6) can be removed because (16.3) holds. Finally, constraint (16.9) guarantees that, for each $(\mathbf{x}_i, y_i)$ in the data sample, the hyperplane of the piece-wise linear production function associated with that point is the closest one to $(\mathbf{x}_i, y_i)$. Note that constraint (16.9), by (16.3) and (16.8), is equivalent to writing $D_{\ell_\infty}\left((\mathbf{x}_i, y_i), H_i\right) \leq D_{\ell_\infty}\left((\mathbf{x}_i, y_i), H_p\right)$, $i, p = 1, \dots, n$ (see Mangasarian [29]).
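As an illustration of how model (16) could be implemented in practice, the sketch below writes it as a convex quadratic program with cvxpy, exploiting that $\mathbf{w}_i, \delta_i \geq 0$ turns the $\ell_1$-norm conditions in (16.1) and (16.8) into linear sums; the library, function name, and default parameter values are our own assumptions, and the redundant constraint (16.6) is kept for completeness.

```python
import numpy as np
import cvxpy as cp

def fit_deam(X, Y, C=10.0, eps=0.1):
    """Sketch of the DEAM quadratic program (16): n hyperplanes, one output."""
    n, m = X.shape
    dot = lambda a, b: cp.sum(cp.multiply(a, b))   # inner product of expression and data
    W = cp.Variable((n, m), nonneg=True)           # w_i  (16.4)
    beta = cp.Variable(n)
    delta = cp.Variable(n, nonneg=True)            # delta_i  (16.4)
    xi = cp.Variable(n, nonneg=True)               # xi_i  (16.7)
    xi_p = cp.Variable(n, nonneg=True)             # xi_i' (16.7)
    u, v = cp.Variable(), cp.Variable()

    cons = []
    for i in range(n):
        cons += [cp.sum(W[i]) <= u,                                    # (16.1), l1 norm of w_i
                 cp.abs(beta[i]) <= v,                                 # (16.2)
                 dot(W[i], X[i]) + beta[i] - delta[i] * Y[i] <= eps + xi[i],    # (16.5)
                 delta[i] * Y[i] - dot(W[i], X[i]) - beta[i] <= eps + xi_p[i],  # (16.6)
                 cp.sum(W[i]) + delta[i] == 1]                         # (16.8)
        for p in range(n):
            cons += [delta[p] * Y[i] <= dot(W[p], X[i]) + beta[p],     # (16.3): envelop from above
                     dot(W[i], X[i]) + beta[i] - delta[i] * Y[i]
                     <= dot(W[p], X[i]) + beta[p] - delta[p] * Y[i]]   # (16.9): closest hyperplane
    objective = cp.Minimize(u + v + C * (cp.sum_squares(xi) + cp.sum_squares(xi_p)))
    cp.Problem(objective, cons).solve()
    return W.value, beta.value, delta.value, xi.value
```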
Figure 3 shows the shape of the function that will be generated by the model as an estimate of the underlying production function. Note that the estimate satisfies monotonicity and concavity, as happens with the DEA estimator. However, the DEAM estimator does not satisfy minimal extrapolation, and it implements a certain idea of robustness because of the margin notion inherited from SVR. Additionally, Figure 3 shows the possible interpretation of $\xi_i$ as $D_{\ell_\infty}\left((\mathbf{x}_i, y_i), H_i^\varepsilon\right)$. In particular, $\xi_i$ is the 'radius' of the squared ball in the figure.
As with the technology generated by DEA, DEAM provides a piece-wise linear technology, which can be defined as $T_{DEAM} := \{(\mathbf{x}, y) \in \mathbb{R}_+^{m+1} : \mathbf{w}_p^* \cdot \mathbf{x} + \beta_p^* - \delta_p^*\, y \geq 0,\ p \in \{1, \dots, n\}\}$, given an optimal solution $\left(\left(\mathbf{w}_p^*, \beta_p^*, \delta_p^*, \xi_p^*, \xi_p'^*\right)_{p=1,\dots,n}, u^*, v^*\right)$ of model (16).
The next propositions state that the derived technology from model (16) satisfies convexity and free disposability.
Proposition 1. 
T D E A M is a convex set.
Proof. 
The intersection of half-spaces is a convex set. ☐
Proposition 2. 
T D E A M satisfies free disposability in inputs and outputs.
Proof. 
The result holds because $(\mathbf{w}_p, \delta_p) \geq 0$ for all $p \in \{1, \dots, n\}$ (see Kuosmanen and Johnson [13]). ☐
Additionally, by constraint (16.3), we have that $\mathbf{w}_p^* \cdot \mathbf{x}_i + \beta_p^* - \delta_p^*\, y_i \geq 0$, $i, p = 1, \dots, n$. Therefore, for any observation $(\mathbf{x}_i, y_i)$, we have that $\mathbf{w}_p^* \cdot \mathbf{x}_i + \beta_p^* - \delta_p^*\, y_i \geq 0$, $p = 1, \dots, n$, which implies that $(\mathbf{x}_i, y_i) \in T_{DEAM}$ since $T_{DEAM} = \{(\mathbf{x}, y) \in \mathbb{R}_+^{m+1} : \mathbf{w}_p^* \cdot \mathbf{x} + \beta_p^* - \delta_p^*\, y \geq 0,\ p \in \{1, \dots, n\}\}$. In this way, we can establish the following corollary.
Corollary 1. 
The production possibility set generated by DEA is a subset of the production possibility set generated by DEAM.
Proof. 
The result holds because the production possibility set generated by DEA and the production possibility set yielded by DEAM satisfy convexity, free disposability, and contain all observations, but only the technology related to DEA meets minimal extrapolation. ☐
In this way, we have that DEAM does not satisfy the minimal extrapolation principle, but its associated estimation of the technology always contains the observations.
As for the measurement of the technical inefficiency of the observations, due to the nature of the technique used and based on the original ideas derived from Support Vector Regression, any $(\mathbf{x}_i, y_i)$ located within the margin will be identified as technically efficient (with $\xi_i^* = 0$). Otherwise, i.e., if $(\mathbf{x}_i, y_i)$ is located below the margin (see Figure 3), $\xi_i^*$ is the $\ell_\infty$-distance from the observation to the (efficient) frontier of a 'robust' technology. This robust technology is defined by translating the original technology $T_{DEAM}$ downward by the value of the margin $\varepsilon$. If we define this translated technology as $T_{DEAM}^\varepsilon = \{(\mathbf{x}, y) \in \mathbb{R}_+^{m+1} : \mathbf{w}_p^* \cdot \mathbf{x} + \beta_p^* - \delta_p^*\, y - \varepsilon \geq 0,\ p \in \{1, \dots, n\}\}$, then $\xi_i^* = D_{\ell_\infty}\left((\mathbf{x}_i, y_i), W(T_{DEAM}^\varepsilon)\right)$ (this result can be derived from Aparicio and Pastor [30]).
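Given an optimal solution of model (16), this ε-insensitive efficiency classification can be computed directly, as in the small sketch below (the helper name and array layout are our own; it uses the fact that $\|(\mathbf{w}_i, \delta_i)\|_1 = 1$, so the $\ell_\infty$-distance reduces to the positive part of the slack):

```python
import numpy as np

def deam_inefficiency(X, Y, W, beta, delta, eps):
    """Optimal slack xi_i* = max(0, w_i.x_i + beta_i - delta_i*y_i - eps).

    Since ||(w_i, delta_i)||_1 = 1, this equals the l_inf distance from (x_i, y_i)
    to the eps-translated frontier; a zero value means the DMU lies within the
    margin and is labeled technically efficient.
    """
    slack = np.einsum('ij,ij->i', W, X) + beta - delta * Y - eps
    return np.maximum(slack, 0.0)
```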
Now, we show the relationship between the Directional Distance Function (DDF) model in DEA and the DEAM model (16): the DDF model always yields a feasible solution of the model associated with Data Envelopment Analysis-based Machines.
Theorem 4. 
Let $(\mathbf{c}_i^*, \alpha_i^*, p_i^*)_{i=1,\dots,n}$ be a set of optimal solutions of model (7) for each $DMU_i$, $i = 1, \dots, n$. Then, $\left((\mathbf{c}_i^*, \alpha_i^*, p_i^*, \vartheta_i^*, \vartheta_i'^*)_{i=1,\dots,n}, a^*, b^*\right)$, with $\vartheta_i^* = -p_i^* y_i + \mathbf{c}_i^* \cdot \mathbf{x}_i + \alpha_i^*$, $\vartheta_i'^* = 0$, $i = 1, \dots, n$, $a^* = \max_{i=1,\dots,n} \|\mathbf{c}_i^*\|_1$ and $b^* = \max_{i=1,\dots,n} |\alpha_i^*|$, is a feasible solution of model (16).
Proof. 
Let $(\mathbf{c}_i^*, \alpha_i^*, p_i^*)_{i=1,\dots,n}$ be a set of optimal solutions of model (7) for each $DMU_i$, $i = 1, \dots, n$. By the characterization of $a^*$ and $b^*$ as $a^* = \max_{i=1,\dots,n} \|\mathbf{c}_i^*\|_1$ and $b^* = \max_{i=1,\dots,n} |\alpha_i^*|$, the following inequalities are true:
$$ \|\mathbf{c}_i^*\|_1 \leq a^*, \qquad (17) $$
$$ |\alpha_i^*| \leq b^*. \qquad (18) $$
Then, $\left((\mathbf{c}_i^*, \alpha_i^*, p_i^*, \vartheta_i^*, \vartheta_i'^*)_{i=1,\dots,n}, a^*, b^*\right)$ satisfies (16.1) and (16.2) in the DEAM model. Because $(\mathbf{c}_i^*, \alpha_i^*, p_i^*)_{i=1,\dots,n}$ is a set of optimal solutions of model (7) for each $DMU_i$, $i = 1, \dots, n$, the constraints of this model are satisfied for this solution:
$$
\begin{array}{ll}
p_i^* y_k - \mathbf{c}_i^* \cdot \mathbf{x}_k - \alpha_i^* \leq 0, & k = 1, \dots, n;\ i = 1, \dots, n \quad (19.1) \\
\|(\mathbf{c}_i^*, p_i^*)\|_1 = 1, & i = 1, \dots, n \quad (19.2) \\
\mathbf{c}_i^*, p_i^* \geq 0, & i = 1, \dots, n \quad (19.3)
\end{array}
$$
Then, $\left((\mathbf{c}_i^*, \alpha_i^*, p_i^*, \vartheta_i^*, \vartheta_i'^*)_{i=1,\dots,n}, a^*, b^*\right)$ trivially satisfies (16.3), (16.4) and (16.8) in the DEAM model. Because of the definition of the variables $\vartheta_i^*$ and $\vartheta_i'^*$ as $\vartheta_i^* = -p_i^* y_i + \mathbf{c}_i^* \cdot \mathbf{x}_i + \alpha_i^*$ and $\vartheta_i'^* = 0$, $i = 1, \dots, n$, we have that:
$$ -p_i^* y_i + \mathbf{c}_i^* \cdot \mathbf{x}_i + \alpha_i^* \leq \vartheta_i^* + \varepsilon \qquad (20) $$
and
$$ p_i^* y_i - \mathbf{c}_i^* \cdot \mathbf{x}_i - \alpha_i^* \leq \vartheta_i'^* + \varepsilon, \qquad (21) $$
for $i = 1, \dots, n$ and $\varepsilon \geq 0$. Then, $\left((\mathbf{c}_i^*, \alpha_i^*, p_i^*, \vartheta_i^*, \vartheta_i'^*)_{i=1,\dots,n}, a^*, b^*\right)$ satisfies (16.5) and (16.6) in the DEAM model. Additionally, we have
$$ 0 \leq -p_i^* y_i + \mathbf{c}_i^* \cdot \mathbf{x}_i + \alpha_i^* = \vartheta_i^*, \quad i = 1, \dots, n, \qquad (22) $$
and
$$ 0 \leq \vartheta_i'^*, \quad i = 1, \dots, n. \qquad (23) $$
Constraint (22) is satisfied by (19.1), and (23) is trivially satisfied. Then, $\left((\mathbf{c}_i^*, \alpha_i^*, p_i^*, \vartheta_i^*, \vartheta_i'^*)_{i=1,\dots,n}, a^*, b^*\right)$ satisfies (16.7) in the DEAM model. Finally, the objective in (7) is to minimize $\vartheta_i = -p_i y_i + \mathbf{c}_i \cdot \mathbf{x}_i + \alpha_i$, $i = 1, \dots, n$, which implies
$$ \vartheta_i^* = -p_i^* y_i + \mathbf{c}_i^* \cdot \mathbf{x}_i + \alpha_i^* \leq -p_k^* y_i + \mathbf{c}_k^* \cdot \mathbf{x}_i + \alpha_k^*, \quad k = 1, \dots, n. \qquad (24) $$
Then, $\left((\mathbf{c}_i^*, \alpha_i^*, p_i^*, \vartheta_i^*, \vartheta_i'^*)_{i=1,\dots,n}, a^*, b^*\right)$ satisfies (16.9) in the DEAM model. Consequently, $\left((\mathbf{c}_i^*, \alpha_i^*, p_i^*, \vartheta_i^*, \vartheta_i'^*)_{i=1,\dots,n}, a^*, b^*\right)$ is a feasible solution of (16). ☐
However, it can be shown that the DDF model (7) does not always yield an optimal solution of model (16).

5. Computational Experience

This section compares the performance of DEA and DEAM for estimating production functions. For this task, we designed the five typical production scenarios shown in Table 1.
The simulations implement Cobb–Douglas production functions, which are frequently used in econometrics to establish the relation between a set of inputs and the maximum amount of output that can be produced from them. Thereby, scenario I implements a single-input single-output case, while the other scenarios represent multi-input single-output cases. For each scenario, we ran 100 trials, $t = 1, \dots, 100$, with sample sizes $n \in \{25, 50, 75, 100\}$. The inputs were drawn randomly from $Uni(1, 10)$. For simulating inefficiencies, we selected an exponential distribution, $\exp(1/3)$, for $u$. Mean squared error (MSE) and bias were the two measures employed to assess the performance of each method.
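For readers who wish to reproduce a trial of this kind, the following sketch generates data in the spirit of the description above; the Cobb–Douglas exponents, the reading of $\exp(1/3)$ as an exponential distribution with mean $1/3$, and the multiplicative way the inefficiency term enters the output are our own assumptions, since Table 1 is not reproduced in this excerpt.

```python
import numpy as np

def simulate_scenario(n, m, alphas, seed=0):
    """Generate one trial of a Cobb-Douglas DGP in the spirit of Section 5.

    Inputs are drawn from Uni(1, 10); the inefficiency term u follows an exponential
    distribution (we read the paper's exp(1/3) as mean 1/3, an assumption), and we
    assume it shifts output multiplicatively as y = f(x) * exp(-u). The exponents
    `alphas` are placeholders, since Table 1 is not reproduced here.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(1, 10, size=(n, m))
    frontier = np.prod(X ** np.asarray(alphas), axis=1)   # Cobb-Douglas frontier f(x)
    u = rng.exponential(scale=1 / 3, size=n)              # inefficiency term
    Y = frontier * np.exp(-u)
    return X, Y, frontier

# e.g., a scenario-I-like trial: one input, exponent 0.5 (assumed), n = 50
X, Y, f = simulate_scenario(n=50, m=1, alphas=[0.5])
```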
The DEAM model (16), as with other machine learning techniques, needs to find the best model through a cross-validation process. For this task, and exclusively for the DEAM model, we implemented a five-fold cross validation over a certain grid of hyperparameters. This grid was arbitrarily set as $C \in \{1, 10, 50, 100, 10^6\}$ and $\varepsilon \in \{0, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1\}$. Note that DEA does not need to apply a cross-validation process; instead, DEA uses the whole dataset to evaluate efficiency scores.
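A minimal sketch of such a grid search follows; the `fit` and `predict` callables are placeholders standing for "solve model (16) on the training folds" and "evaluate the fitted frontier", and the out-of-fold squared-error criterion is our own assumption, since the excerpt does not state which validation measure was used.

```python
from itertools import product
import numpy as np
from sklearn.model_selection import KFold

# Hyperparameter grid from Section 5
C_GRID = [1, 10, 50, 100, 10**6]
EPS_GRID = [0, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1]

def five_fold_cv(X, Y, fit, predict, seed=0):
    """Pick (C, eps) by five-fold cross validation over the grid above."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    best_pair, best_err = None, np.inf
    for C, eps in product(C_GRID, EPS_GRID):
        fold_errs = []
        for train_idx, test_idx in kf.split(X):
            model = fit(X[train_idx], Y[train_idx], C, eps)    # e.g., solve model (16)
            fold_errs.append(np.mean((predict(model, X[test_idx]) - Y[test_idx]) ** 2))
        if np.mean(fold_errs) < best_err:
            best_pair, best_err = (C, eps), float(np.mean(fold_errs))
    return best_pair, best_err
```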
Table 2 summarizes the results obtained for each scenario when DEA (without cross validation) and DEAM (with cross validation) are applied. The first two columns present the type of scenario and the sample size. The following columns show the mean and standard deviation (in brackets) of the MSE obtained by DEA and DEAM. The 'fraction of trials' column reports the proportion of trials in which DEAM either improves upon or equals the MSE given by DEA, while the next column shows the percentage improvement of DEAM with respect to DEA. The four subsequent columns are analogous, but refer to bias.
Regarding the results, the DEAM method performed better than DEA, with improvements ranging from 5% to 45% on average in MSE and from 2% to 28% in bias. This advantage increased when the number of inputs was higher. In addition, the results illustrate that the model worked better when the number of DMUs was around 50–75. Scenario I, i.e., the single-input single-output framework, shows small differences between the two methods; nevertheless, DEAM outperformed DEA in more than 95% of the trials. In contrast, the best situation analyzed was scenario V (one output and five inputs) with $n = 25$, showing a 45% reduction in MSE and a 28% reduction in bias, on average. This last result could be interpreted in favor of the DEAM approach as an indication that DEAM also seems to outperform DEA with respect to the curse of dimensionality (Charles et al. [31]).

6. Discussion

In this section, we briefly discuss the main results of this paper and how they can be interpreted from the perspective of previous studies, mainly those based on Data Envelopment Analysis. Our findings and their implications are also discussed. Some limitations of our approach are highlighted.
In this paper, we have introduced a new way of estimating production frontiers in engineering and microeconomics, which is based upon the same fundamentals of Support Vector Machines (SVM), which is a well-known machine learning technique. Our numerical results have demonstrated that the frontier estimator derived from the new methodology (DEAM) is better than that associated with Data Envelopment Analysis (DEA), which represents the standard non-parametric technique for determining technical efficiency in the literature. The bias and mean squared error obtained for DEAM are smaller in all the scenarios analyzed, regardless of the number of variables and DMUs.
In comparison with the standard literature, the new methodology is more flexible. It generates production possibility sets that satisfy convexity, free disposability in inputs and outputs, and contain all the observations, but they do not meet the postulate of minimal extrapolation. In contrast, DEA satisfies all the above properties. In particular, minimal extrapolation is the reason why DEA can be seen as an overfitted model to estimate the underlying Data Generating Process (DGP) that is behind the generation of the data sample. DEAM does not suffer from this overfitting problem. However, it is not evident where the production possibility set, estimated by a non-overfitted model, should be located in the input–output space to correctly approximate the underlying technology, which, by definition, is unknown to us. In this regard, in this paper, we have implemented for the first time a strategy based on the idea of Structural Risk Minimization (Vapnik [1]) and cross validation, introducing a new PAC (Probably Approximately Correct) bound in production theory with the aim of solving the overfitting problem linked to DEA.
Some other authors have tried to modify the standard DEA technique such that the new approaches work as inferential methods (with the focus on the DGP) rather than as mere descriptive tools. For example, Banker and Maindiratta [8] and Banker [9] associated DEA with maximum likelihood. Simar and Wilson [10,11,12] adapted bootstrapping to DEA. Kuosmanen and Johnson [13,14] introduced the Corrected Concave Nonparametric Least Squares. Unfortunately, despite the importance of machine learning techniques in the current literature, there have been few attempts to adapt DEA to the field of machine learning (see, for example, Esteve et al. [7], or Olesen and Ruggiero [15]). In this sense, DEAM has allowed us to build a new bridge between these two worlds: machine learning and efficiency measurement.
Finally, we would like to highlight a clear limitation associated with the new approach. DEAM is linked to an intensive computational procedure based on cross validation. This feature contrasts sharply with the simplicity of Data Envelopment Analysis.

7. Conclusions and Future Work

In this paper, for the first time, a bound on the generalization error for a piece-wise linear hypothesis has been established in the context of Support Vector Regression (SVR), by also considering typical axioms from production theory: convexity and free disposability. It shapes a new nexus between non-parametric frontier analysis and machine learning in the line recently followed by Esteve et al. [7], Valero-Carreras et al. [32], and Olesen and Ruggiero [15]. The new formulation on the bound of the generalization error of this kind of hypothesis gives rise to a new way of bounding the whole expected error when we approximate a target function through a piece-wise linear function, also controlling the empirical error. Minimizing this bound led to the definition of a new model, called Data Envelopment Analysis-based Machines (DEAM), which generates production function estimations that seek a balance between the empirical error and the generalization error.
Classical non-parametric techniques, such as DEA, suffer from the overfitting problem because they assume the axiom of minimal extrapolation (Banker et al. [6], Afriat [28], and Farrell [33]). The DEAM model, however, is more flexible when it comes to estimating production frontiers through a cross-validation process, disregarding the minimal extrapolation axiom, as was shown by a computational experience in this paper.
Finally, we mention several lines that pose interesting avenues for further research. The first is the possibility of extending the method to model multi-output situations. This could be interesting for dealing with more realistic production settings, considering information on the correlation among several outputs. Second, we could use other transformation functions (kernel methods) for the input space, in the same way as standard Support Vector Regression.

Author Contributions

Conceptualization, N.M.G. and J.A.; methodology, N.M.G. and J.A.; software, D.V.-C.; validation, N.M.G. and D.V.-C.; formal analysis, N.M.G., J.A. and D.V.-C.; investigation, N.M.G., J.A. and D.V.-C.; resources, N.M.G., J.A. and D.V.-C.; data curation, N.M.G. and D.V.-C.; writing—original draft preparation, N.M.G., J.A. and D.V.-C.; writing—review and editing, N.M.G., J.A. and D.V.-C.; visualization, N.M.G. and D.V.-C.; supervision, J.A.; project administration, J.A.; funding acquisition, J.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Ministerio de Ciencia e Innovación/Agencia Estatal de Investigación/10.13039/501100011033 grant number PID2019-105952GB-I00, by Generalitat Valenciana grant number ACIF/2020/155, and by Miguel Hernández University of Elche grant number 01623/2020.

Acknowledgments

The authors are grateful to the two anonymous reviewers for providing constructive comments and helping in improving the contents and presentation of this paper. Additionally, the authors are thankful for grant PID2019-105952GB-I00 funded by Ministerio de Ciencia e Innovación/Agencia Estatal de Investigación/10.13039/501100011033. D. Valero-Carreras is thankful for the financial support from the Generalitat Valenciana under grant ACIF/2020/155. Finally, N. Guerrero is thankful for the financial support from the Miguel Hernández University of Elche under grant 01623/2020.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proof of Lemma 3. 
Let $S = \{\mathbf{x}_1, \dots, \mathbf{x}_d\}$ be $\gamma$-shattered by
$$ F = \left\{ \mathbf{x} \mapsto \frac{\mathbf{w}_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \cdot \mathbf{x} + \frac{\beta_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} :\ \mathbf{x} \in \mathbb{R}_+^m,\ \frac{\mathbf{w}_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \cdot \mathbf{x} + \frac{\beta_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \geq 0,\ \frac{\mathbf{w}_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \cdot \mathbf{x} + \frac{\beta_{p(\mathbf{x})}}{\delta_{p(\mathbf{x})}} \leq \frac{\mathbf{w}_{p}}{\delta_{p}} \cdot \mathbf{x} + \frac{\beta_{p}}{\delta_{p}},\ p(\mathbf{x}), p \in \{1, \dots, n\} \right\}, $$
witnessed by $r_1, \dots, r_d \in \mathbb{R}$. Then, for all $\mathbf{b} = (b_1, \dots, b_d) \in \{-1, 1\}^d$, there are $\{\mathbf{w}_p^{\mathbf{b}}\}_{p \in \{1,\dots,n\}}$, $\{\delta_p^{\mathbf{b}}\}_{p \in \{1,\dots,n\}}$ and $\{\beta_p^{\mathbf{b}}\}_{p \in \{1,\dots,n\}}$ satisfying, for all $i \in \{1, \dots, d\}$, the following inequality:
$$ b_i \left( \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} - r_i \right) \geq \gamma. $$
Let us fix a subset $S_0 \subseteq S$ and consider two cases:
  • Case 1: If $\sum\{r_i : \mathbf{x}_i \in S_0\} \geq \sum\{r_i : \mathbf{x}_i \in S \setminus S_0\}$, then $b_i = 1$ if and only if $\mathbf{x}_i \in S_0$.
  • Case 2: If $\sum\{r_i : \mathbf{x}_i \in S_0\} < \sum\{r_i : \mathbf{x}_i \in S \setminus S_0\}$, then $b_i = 1$ if and only if $\mathbf{x}_i \in S \setminus S_0$.
Let us suppose that $\sum\{r_i : \mathbf{x}_i \in S_0\} \geq \sum\{r_i : \mathbf{x}_i \in S \setminus S_0\}$, with $b_i = 1$ if and only if $\mathbf{x}_i \in S_0$ (Case 1). For all $\mathbf{x}_i \in S_0$, we have
$$ b_i \left( \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} - r_i \right) = 1 \cdot \left( \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} - r_i \right) = \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} - r_i \geq \gamma, $$
that is,
$$ \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \geq r_i + \gamma. $$
Then, taking the sum over the elements of $S_0$, we obtain $\sum_{i / \mathbf{x}_i \in S_0} \left( \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \right) \geq \sum_{i / \mathbf{x}_i \in S_0} (r_i + \gamma)$, which yields the following inequality:
$$ \sum_{i / \mathbf{x}_i \in S_0} \left( \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \right) \geq \sum\{r_i : \mathbf{x}_i \in S_0\} + |S_0|\, \gamma. $$
From the definition of $F$, we have $\frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \leq \frac{\mathbf{w}_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}}$ for all $p \in \{1, \dots, n\}$. Thereby, the inequality
$$ \frac{\mathbf{w}_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}} \cdot \sum_{i / \mathbf{x}_i \in S_0} \mathbf{x}_i + |S_0|\, \frac{\beta_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}} = \sum_{i / \mathbf{x}_i \in S_0} \left( \frac{\mathbf{w}_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}} \right) \geq \sum_{i / \mathbf{x}_i \in S_0} \left( \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \right) $$
is satisfied for every $p \in \{1, \dots, n\}$. Finally,
$$ \frac{\mathbf{w}_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}} \cdot \sum S_0 + |S_0|\, \frac{\beta_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}} \geq \sum\{r_i : \mathbf{x}_i \in S_0\} + |S_0|\, \gamma. $$
Now, let $\mathbf{x}_i \in S \setminus S_0$ (so that $b_i = -1$); then
$$ b_i \left( \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} - r_i \right) = -1 \cdot \left( \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} - r_i \right) = -\left( \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \right) + r_i \geq \gamma. $$
Then,
$$ \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \leq r_i - \gamma. $$
Summing now over the elements of $S \setminus S_0$, the inequality $\sum_{i / \mathbf{x}_i \in S \setminus S_0} \left( \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \right) \leq \sum_{i / \mathbf{x}_i \in S \setminus S_0} (r_i - \gamma)$ holds. It can be rewritten as
$$ \sum_{i / \mathbf{x}_i \in S \setminus S_0} \left( \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \right) \leq \sum\{r_i : \mathbf{x}_i \in S \setminus S_0\} - |S \setminus S_0|\, \gamma. $$
Now, there is $\mathbf{x}_{i'} \in S \setminus S_0$ such that $\frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \mathbf{x}_{i'} + \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \leq \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}}$ for all $\mathbf{x}_i \in S \setminus S_0$. Consequently, $|S \setminus S_0| \left( \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \mathbf{x}_{i'} + \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \right) \leq \sum_{i / \mathbf{x}_i \in S \setminus S_0} \left( \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \right)$. Moreover, $|S \setminus S_0| \geq 0$ because $|S \setminus S_0|$ is the cardinality of the set $S \setminus S_0$, and $\frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \mathbf{x}_{i'} + \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \geq 0$ by the definition of $F$. Then, $-|S \setminus S_0| \left( \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \mathbf{x}_{i'} + \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \right) \leq |S \setminus S_0| \left( \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \mathbf{x}_{i'} + \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \right)$. Now, we can guarantee that
$$ -|S \setminus S_0| \left( \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \mathbf{x}_{i'} + \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \right) \leq \sum_{i / \mathbf{x}_i \in S \setminus S_0} \left( \frac{\mathbf{w}_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \cdot \mathbf{x}_i + \frac{\beta_{p(\mathbf{x}_i)}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_i)}^{\mathbf{b}}} \right), $$
and, by the summed inequality just obtained for $S \setminus S_0$, the following holds:
$$ -|S \setminus S_0| \left( \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \mathbf{x}_{i'} + \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \right) \leq \sum\{r_i : \mathbf{x}_i \in S \setminus S_0\} - |S \setminus S_0|\, \gamma. $$
Considering this last inequality together with the inequality derived for $S_0$, instantiated at $p = p(\mathbf{x}_{i'})$, and using that $A \geq B$ and $C \leq D$ imply $A - C \geq B - D$, we have that
$$ \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \sum S_0 + |S_0|\, \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} + |S \setminus S_0| \left( \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \mathbf{x}_{i'} + \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \right) \geq \sum\{r_i : \mathbf{x}_i \in S_0\} + |S_0|\, \gamma - \sum\{r_i : \mathbf{x}_i \in S \setminus S_0\} + |S \setminus S_0|\, \gamma, $$
and then,
$$ \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \left( \sum S_0 + |S \setminus S_0|\, \mathbf{x}_{i'} \right) + |S|\, \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \geq \sum\{r_i : \mathbf{x}_i \in S_0\} - \sum\{r_i : \mathbf{x}_i \in S \setminus S_0\} + |S|\, \gamma. $$
Under the supposition in Case 1 that $\sum\{r_i : \mathbf{x}_i \in S_0\} \geq \sum\{r_i : \mathbf{x}_i \in S \setminus S_0\}$, we have that
$$ \sum\{r_i : \mathbf{x}_i \in S_0\} - \sum\{r_i : \mathbf{x}_i \in S \setminus S_0\} \geq 0, $$
which implies that
$$ \sum\{r_i : \mathbf{x}_i \in S_0\} - \sum\{r_i : \mathbf{x}_i \in S \setminus S_0\} + |S|\, \gamma \geq |S|\, \gamma. $$
Therefore, for $p(\mathbf{x}_{i'}) \in \{1, \dots, n\}$, we have
$$ \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \left( \sum S_0 + |S \setminus S_0|\, \mathbf{x}_{i'} \right) + |S|\, \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \geq |S|\, \gamma, $$
that is,
$$ \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \left( \sum S_0 + |S \setminus S_0|\, \mathbf{x}_{i'} \right) \geq |S| \left( \gamma - \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \right). $$
By the Cauchy–Schwarz inequality, we have
$$ \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \left( \sum S_0 + |S \setminus S_0|\, \mathbf{x}_{i'} \right) \leq \left\| \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \right\| \left\| \sum S_0 + |S \setminus S_0|\, \mathbf{x}_{i'} \right\|, $$
and, moreover,
$$ \begin{aligned} \left\| \sum S_0 + |S \setminus S_0|\, \mathbf{x}_{i'} \right\| &= \left\| \left( \sum S_0 - \sum (S \setminus S_0) \right) + \left( \sum (S \setminus S_0) + |S \setminus S_0|\, \mathbf{x}_{i'} \right) \right\| \\ &\leq \left\| \sum S_0 - \sum (S \setminus S_0) \right\| + \left\| \sum (S \setminus S_0) + |S \setminus S_0|\, \mathbf{x}_{i'} \right\| \quad \text{(triangle inequality)} \\ &\leq \left\| \sum S_0 - \sum (S \setminus S_0) \right\| + \sum_{i / \mathbf{x}_i \in S \setminus S_0} \|\mathbf{x}_i\| + |S \setminus S_0|\, \|\mathbf{x}_{i'}\| \quad \text{(triangle inequality)} \\ &\leq \left\| \sum S_0 - \sum (S \setminus S_0) \right\| + |S \setminus S_0|\, r + |S \setminus S_0|\, r = \left\| \sum S_0 - \sum (S \setminus S_0) \right\| + 2\, |S \setminus S_0|\, r \quad (\|\mathbf{x}_i\| \leq r) \\ &\leq \left\| \sum S_0 - \sum (S \setminus S_0) \right\| + 2\, |S|\, r \quad (|S \setminus S_0| \leq |S|). \end{aligned} $$
In this way, it is possible to obtain
$$ \left\| \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \right\| \left( \left\| \sum S_0 - \sum (S \setminus S_0) \right\| + 2\, |S|\, r \right) \;\geq\; \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \cdot \left( \sum S_0 + |S \setminus S_0|\, \mathbf{x}_{i'} \right) \;\geq\; |S| \left( \gamma - \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \right). $$
Because $\left\| \frac{\mathbf{w}_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \right\| \leq \max_{p \in \{1,\dots,n\}} \left\| \frac{\mathbf{w}_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}} \right\|$ and $\gamma - \frac{\beta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}}{\delta_{p(\mathbf{x}_{i'})}^{\mathbf{b}}} \geq \min_{p \in \{1,\dots,n\}} \left( \gamma - \frac{\beta_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}} \right)$, then
$$ \max_{p \in \{1,\dots,n\}} \left\| \frac{\mathbf{w}_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}} \right\| \left( \left\| \sum S_0 - \sum (S \setminus S_0) \right\| + 2\, |S|\, r \right) \geq |S|\, \min_{p \in \{1,\dots,n\}} \left( \gamma - \frac{\beta_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}} \right). $$
Finally,
$$ \left\| \sum S_0 - \sum (S \setminus S_0) \right\| \;\geq\; |S| \left( \frac{\min_{p \in \{1,\dots,n\}} \left( \gamma - \frac{\beta_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}} \right)}{\max_{p \in \{1,\dots,n\}} \left\| \frac{\mathbf{w}_{p}^{\mathbf{b}}}{\delta_{p}^{\mathbf{b}}} \right\|} - 2r \right), \qquad \forall S_0 \subseteq S. $$
The proof for case 2 is analogous. ☐
Proof of Theorem 3. 
By Lemma 3, we have
$$ \left\| \sum S_0 - \sum (S \setminus S_0) \right\| \geq |S| \left( \frac{\min_{p \in \{1,\dots,n\}} \left( \gamma - \frac{\beta_p}{\delta_p} \right)}{\max_{p \in \{1,\dots,n\}} \left\| \frac{\mathbf{w}_p}{\delta_p} \right\|} - 2r \right) $$
for every subset $S_0 \subseteq S$, with $S = \{\mathbf{x}_1, \dots, \mathbf{x}_d\}$ being an input learning sample $\gamma$-shattered through $F$ defined in (10). Additionally, by Lemma 2, for all $S \subseteq \mathbb{R}_+^m$ with $\|\mathbf{x}\| \leq r$ for $\mathbf{x} \in S$, some $S_0 \subseteq S$ satisfies the following condition:
$$ \left\| \sum S_0 - \sum (S \setminus S_0) \right\| \leq \sqrt{|S|}\, r. $$
Then, for a certain $S_0 \subseteq S$, we have
$$ \left\| \sum S_0 - \sum (S \setminus S_0) \right\| \geq |S| \left( \frac{\min_{p \in \{1,\dots,n\}} \left( \gamma - \frac{\beta_p}{\delta_p} \right)}{\max_{p \in \{1,\dots,n\}} \left\| \frac{\mathbf{w}_p}{\delta_p} \right\|} - 2r \right) \quad \text{and} \quad \left\| \sum S_0 - \sum (S \setminus S_0) \right\| \leq \sqrt{|S|}\, r. $$
Therefore,
$$ |S| \left( \frac{\min_{p \in \{1,\dots,n\}} \left( \gamma - \frac{\beta_p}{\delta_p} \right)}{\max_{p \in \{1,\dots,n\}} \left\| \frac{\mathbf{w}_p}{\delta_p} \right\|} - 2r \right) \leq \sqrt{|S|}\, r. $$
Finally,
$$ |S| \leq \left( \frac{r}{\dfrac{\min_{p \in \{1,\dots,n\}} \left( \gamma - \frac{\beta_p}{\delta_p} \right)}{\max_{p \in \{1,\dots,n\}} \left\| \frac{\mathbf{w}_p}{\delta_p} \right\|} - 2r} \right)^{2}. $$
Because this is true for every set $S$ $\gamma$-shattered by $F$, it also holds for the largest $\gamma$-shattered set, which means that $\mathrm{fat}_F(\gamma)$ is bounded as follows:
$$ \mathrm{fat}_F(\gamma) \leq \left( \frac{r}{\dfrac{\min_{p \in \{1,\dots,n\}} \left( \gamma - \frac{\beta_p}{\delta_p} \right)}{\max_{p \in \{1,\dots,n\}} \left\| \frac{\mathbf{w}_p}{\delta_p} \right\|} - 2r} \right)^{2}. \;\; \square $$

Figure 1. Support Vector Regression.
Figure 2. Illustration of the Directional Distance Function in Data Envelopment Analysis.
Figure 3. Illustration of the DEAM estimation of a production function.
Table 1. Simulated scenarios.

Scenario | Inputs | Production Function
I | $x_1$ | $y = x_1^{0.5}$
II | $x_1, x_2$ | $y = x_1^{0.35} x_2^{0.15}$
III | $x_1, x_2, x_3$ | $y = x_1^{0.30} x_2^{0.15} x_3^{0.05}$
IV | $x_1, x_2, x_3, x_4$ | $y = x_1^{0.25} x_2^{0.15} x_3^{0.05} x_4^{0.05}$
V | $x_1, x_2, x_3, x_4, x_5$ | $y = x_1^{0.25} x_2^{0.10} x_3^{0.05} x_4^{0.05} x_5^{0.05}$
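Table 1 lists only the true production frontiers of each simulated scenario. As a minimal, purely illustrative sketch of how a sample for Scenario I could be generated, the following Python snippet draws inputs and contracts the frontier output by a multiplicative inefficiency term; the input and inefficiency distributions are assumptions of this sketch and are not taken from the paper.
```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_scenario_i(n_dmus):
    """Generate one hypothetical sample for Scenario I (frontier y = x1^0.5).

    Assumptions (not taken from the paper): inputs follow Uniform(1, 10) and
    the observed output is the frontier value contracted by exp(-u),
    with u ~ |N(0, 0.3)| playing the role of technical inefficiency.
    """
    x1 = rng.uniform(1.0, 10.0, size=n_dmus)
    frontier = np.sqrt(x1)                     # true production frontier y = x1^0.5
    u = np.abs(rng.normal(0.0, 0.3, size=n_dmus))
    y = frontier * np.exp(-u)                  # observed (technically inefficient) output
    return x1, y, frontier

# Example: a sample of 25 DMUs, the smallest size considered in Table 2.
x1, y, frontier = simulate_scenario_i(25)
```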
Table 2. Performance of DEA and DEAM. For both MSE and BIAS, the columns report the DEA value, the DEAM value, the fraction of trials in which DEAM <= DEA, and the improvement (%) of DEAM vs. DEA.

Scenario | Number of Obs. | DEA (MSE) | DEAM (MSE) | DEAM<=DEA (MSE) | Improvement (MSE) | DEA (BIAS) | DEAM (BIAS) | DEAM<=DEA (BIAS) | Improvement (BIAS)
I | 25 | 0.027 (0.020) | 0.024 (0.019) | 1.000 | 11.609% | 0.125 (0.046) | 0.119 (0.046) | 1.000 | 4.873%
I | 50 | 0.011 (0.007) | 0.010 (0.007) | 0.990 | 8.005% | 0.076 (0.026) | 0.075 (0.026) | 0.990 | 2.822%
I | 75 | 0.007 (0.005) | 0.007 (0.005) | 0.990 | 7.622% | 0.060 (0.019) | 0.059 (0.019) | 0.980 | 2.194%
I | 100 | 0.005 (0.004) | 0.005 (0.004) | 0.990 | 5.231% | 0.051 (0.019) | 0.050 (0.019) | 0.950 | 1.936%
II | 25 | 0.151 (0.084) | 0.108 (0.067) | 1.000 | 27.109% | 0.276 (0.071) | 0.240 (0.072) | 1.000 | 13.460%
II | 50 | 0.091 (0.043) | 0.067 (0.037) | 0.980 | 24.012% | 0.206 (0.045) | 0.184 (0.045) | 0.990 | 10.587%
II | 75 | 0.060 (0.029) | 0.040 (0.024) | 1.000 | 32.846% | 0.160 (0.032) | 0.138 (0.035) | 1.000 | 14.252%
II | 100 | 0.049 (0.022) | 0.033 (0.019) | 1.000 | 32.636% | 0.140 (0.031) | 0.122 (0.030) | 1.000 | 13.285%
III | 25 | 0.451 (0.236) | 0.287 (0.199) | 0.960 | 35.967% | 0.470 (0.126) | 0.380 (0.125) | 0.960 | 19.215%
III | 50 | 0.270 (0.121) | 0.165 (0.090) | 0.990 | 36.812% | 0.347 (0.077) | 0.280 (0.072) | 0.980 | 19.075%
III | 75 | 0.211 (0.091) | 0.119 (0.050) | 0.990 | 39.786% | 0.291 (0.056) | 0.229 (0.050) | 0.980 | 20.996%
III | 100 | 0.171 (0.076) | 0.112 (0.053) | 1.000 | 32.405% | 0.257 (0.047) | 0.213 (0.043) | 1.000 | 16.971%
IV | 25 | 1.046 (0.457) | 0.804 (1.070) | 0.880 | 14.949% | 0.727 (0.177) | 0.623 (0.264) | 0.860 | 12.086%
IV | 50 | 0.728 (0.246) | 0.471 (0.265) | 0.960 | 35.859% | 0.571 (0.113) | 0.469 (0.146) | 0.880 | 18.295%
IV | 75 | 0.605 (0.191) | 0.384 (0.154) | 0.990 | 35.084% | 0.497 (0.079) | 0.403 (0.079) | 0.960 | 18.539%
IV | 100 | 0.462 (0.162) | 0.308 (0.114) | 1.000 | 30.776% | 0.418 (0.068) | 0.342 (0.064) | 0.990 | 17.912%
V | 25 | 1.896 (0.766) | 1.009 (0.563) | 0.980 | 44.803% | 0.984 (0.224) | 0.703 (0.211) | 0.980 | 28.043%
V | 50 | 1.396 (0.478) | 0.922 (0.566) | 0.900 | 32.353% | 0.801 (0.140) | 0.648 (0.218) | 0.870 | 18.492%
V | 75 | 1.057 (0.303) | 0.750 (0.315) | 0.950 | 28.473% | 0.673 (0.107) | 0.567 (0.152) | 0.880 | 15.677%
V | 100 | 0.914 (0.261) | 0.624 (0.211) | 0.980 | 29.296% | 0.613 (0.090) | 0.502 (0.087) | 0.970 | 17.543%
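The summary statistics of Table 2 can be computed, in spirit, from arrays of frontier estimates. The following Python sketch shows one hedged way to obtain an MSE, a bias (taken here as the mean absolute deviation, which may differ from the paper's exact definition), and the improvement percentage of DEAM over DEA (assumed to be 100·(DEA − DEAM)/DEA); the evaluation grid and trial protocol used in the paper are not reproduced.
```python
import numpy as np

def mse_and_bias(estimated, true_frontier):
    """Return (MSE, bias) of a frontier estimator on a common evaluation set.

    Both arguments are arrays of frontier values evaluated at the same points.
    Bias is computed as the mean absolute deviation in this sketch.
    """
    estimated = np.asarray(estimated, dtype=float)
    true_frontier = np.asarray(true_frontier, dtype=float)
    err = estimated - true_frontier
    return float(np.mean(err ** 2)), float(np.mean(np.abs(err)))

def improvement_pct(metric_dea, metric_deam):
    """Relative improvement (%) of DEAM over DEA for a given error metric,
    assuming the formula 100 * (DEA - DEAM) / DEA."""
    return 100.0 * (metric_dea - metric_deam) / metric_dea

# Example with the rounded Scenario I, 25-observation averages from Table 2:
print(improvement_pct(0.027, 0.024))  # about 11.1%; the table's 11.609% presumably uses unrounded averages
```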