Article

Polyhedral DC Decomposition and DCA Optimization of Piecewise Linear Functions

Institut für Mathematik, Humboldt-Universität zu Berlin, 10099 Berlin, Germany
* Author to whom correspondence should be addressed.
Submission received: 28 May 2020 / Revised: 5 July 2020 / Accepted: 8 July 2020 / Published: 11 July 2020

Abstract: For piecewise linear functions $f:\mathbb{R}^n\to\mathbb{R}$ we show how their abs-linear representation can be extended to yield simultaneously their decomposition into a convex part $\check f$ and a concave part $\hat f$, including a pair of generalized gradients $\check g\in\mathbb{R}^n\ni\hat g$. The latter satisfy strict chain rules and can be computed in the reverse mode of algorithmic differentiation, at a small multiple of the cost of evaluating f itself. It is shown how $\check f$ and $\hat f$ can be expressed as a single maximum and a single minimum of affine functions, respectively. The two subgradients $\check g$ and $\hat g$ are then used to drive DCA algorithms, where the (convex) inner problem can be solved in finitely many steps, e.g., by a Simplex variant or the true steepest descent method. Using a reflection technique to update the gradients of the concave part, one can ensure finite convergence to a local minimizer of f, provided the Linear Independence Kink Qualification holds. For piecewise smooth objectives the approach can be used as an inner method for successive piecewise linearization.

1. Introduction and Notation

There is a large class of functions $f:\mathbb{R}^n\to\mathbb{R}$ that are called DC because they can be represented as the difference of two convex functions, see for example [1,2]. This property can be exploited in various ways, especially for (hopefully global) optimization. We find it notationally and conceptually more convenient to express these functions as averages of a convex and a concave function such that
$$f(x) = \tfrac12\big(\check f(x) + \hat f(x)\big)\qquad\text{with}\quad \check f(x)\ \text{convex and}\ \hat f(x)\ \text{concave}.$$
Throughout we will annotate the convex part by the superscript $\check{}$ and the concave part by the superscript $\hat{}$, which seems rather intuitive since they remind us of the absolute value function and its negative. Since we are mainly interested in piecewise linear functions we assume without much loss of generality that the function f and its convex and concave components are well defined and finite on all of the Euclidean space $\mathbb{R}^n$. Allowing both components to be infinite outside their proper domain would obviously generate serious indeterminacies, i.e., NaNs in the numerical sense. As we will see later we can in fact ensure in our setting that pointwise
$$\hat f(x) \ \le\ f(x) \ \le\ \check f(x) \qquad\text{for all}\quad x\in\mathbb{R}^n,$$
which means that we actually obtain an inclusion in the sense of interval mathematics [3]. This is one of the attractions of the averaging notation. We will therefore also refer to f ^ and f ˇ as the concave and convex bounds of f.

Conditioning of the Decomposition

In parts of the literature the two convex functions $\check f$ and $-\hat f$ are assumed to be nonnegative, which has some theoretical advantages. In particular, see, e.g., [4], one obtains for the square $h = f^2$ of a DC function f the decomposition
$$h = \tfrac14\big(\check f + \hat f\big)^2 = \tfrac12\Big(\underbrace{\check f^2 + \hat f^2}_{\textstyle\check h}\ \underbrace{-\ \tfrac12\big(\check f - \hat f\big)^2}_{\textstyle\hat h}\Big).$$
The sign conditions on $\check f$ and $\hat f$ are necessary to ensure that the three squares on the right hand side are convex functions. Using the Apollonius identity $f\cdot h = \tfrac12\big[(f+h)^2 - f^2 - h^2\big]$ one may then deduce in a constructive way that not only sums but also products of DC functions inherit this property. In general, since the convex functions $\check f$ and $-\hat f$ both have supporting hyperplanes one can at least theoretically always find positive coefficients α and β such that
$$\check f(x) + \alpha + \beta\,\|x\|^2 \ \ge\ 0 \ \ge\ \hat f(x) - \alpha - \beta\,\|x\|^2 \qquad\text{for}\quad x\in\mathbb{R}^n.$$
Then the average of these modified functions is still f and their respective convexity/concavity properties are maintained. In fact, this kind of proximal shift can be used to show that any twice Lipschitz continuously differentiable function is DC, which raises the suspicion that the property by itself does not provide all that much exploitable structure from a numerical point of view. We believe that for its use in practical algorithms one has to make sure or simply assume that the condition number
$$\kappa(\check f, \hat f) \ \equiv\ \sup_{x\in\mathbb{R}^n}\ \frac{|\check f(x)| + |\hat f(x)|}{|\check f(x) + \hat f(x)|} \ \in\ [1, \infty]$$
is not too large. Otherwise, there is the danger that the value of f is effectively lost in the rounding error of evaluating $\check f + \hat f$. For sufficiently large quadratic shifts of the nature specified above one has $\kappa\approx\beta$. The danger of an excessive growth in κ seems akin to the successive widening in interval calculations and similarly stems also from the lack of strict arithmetic rules. For example, doubling f and then subtracting it yields the successive decompositions
$$(2f) - f \ =\ (\check f + \hat f) - \tfrac12(\check f + \hat f) \ =\ \big(\check f - \tfrac12\hat f\big) + \big(\hat f - \tfrac12\check f\big) \ =\ \tfrac12\big[(2\check f - \hat f) + (2\hat f - \check f)\big].$$
If in Equation (3) by chance we had originally $\hat f = -\tfrac12\check f$ so that $f = \tfrac14\check f$ with the condition number $\kappa(\check f, -0.5\check f) = 3$ we would get after the doubling and subtraction the condition number $\kappa(2.5\check f, -2\check f) = 9$. So it is obviously important that the original algorithm avoids as much as possible calculations that are ill-conditioned in that they even just partly compensate each other.
Throughout the paper we assume that the functions in question are evaluated by a computational procedure that generates a sequence of intermediate scalars, which we denote generically by u , v and w. The last one of these scalar variables is the dependent, which is usually denoted by f. All of them are continuous functions u = u ( x ) of the vector x R n of independent variables. As customary in mathematics we will often use the same symbol to identify a function and its dependent variable. For the overall objective we will sometimes distinguish them and write y = f ( x ) . For most of the paper we assume that the intermediates are obtained from each other by affine operations or the absolute value function so that the resulting u ( x ) are all piecewise linear functions.
The paper is organized as follows. In the following Section 2 we develop rules for propagating the convex/concave decomposition through a sequence of abs-linear operations applied to intermediate quantities u. This can be done either directly on the pair of bounds ( u ˇ , u ^ ) or on their average u and their halved distance δ u = 1 2 ( u ˇ u ^ ) . In Section 3 we organize such sequences into an abs-linear form for f and then extend it to simultaneously yield the convex/concave decomposition. As a consequence of this analysis we get a strengthened version of the classical max min representation of piecewise linear functions, which reduces to the difference of two polyhedral parts in max- and min-form. In Section 4 we develop strict rules for propagating certain generalized gradient pairs ( g ˇ , g ^ ) of ( u ˇ , u ^ ) exploiting convexity and the cheap gradient principle [5]. In Section 5 we discuss the consequences for the DCA when using limiting gradients ( g ˇ , g ^ ) , solving the inner, linear optimization problem (LOP) exactly, and ensuring optimality via polyhedral reflection. In Section 6 we demonstrate the new results on the nonconvex and piecewise linear chained Rosenbrock version of Nesterov [6]. Section 7 contains a summary and preliminary conclusion with outlook. In the Appendix A we give the details of the necessary and sufficient optimality test from [7] in the present DC context.

2. Propagating Bounds and/or Radii

In Equation (3) we already assumed that doubling is done componentwise and that for a difference v = w u of DC functions w and u, one defines the convex and concave parts by
$$\widecheck{(w-u)} = \check w - \hat u \qquad\text{and}\qquad \widehat{(w-u)} = \hat w - \check u,$$
respectively. This yields in particular for the negation
$$\widecheck{(-u)} = -\hat u \qquad\text{and}\qquad \widehat{(-u)} = -\check u.$$
For piecewise linear functions we need neither the square formula Equation (2) nor the more general decompositions for products. Therefore we will not insist on the sign conditions even though they would be also maintained automatically by Equation (4) as well as the natural linear rules for the convex and concave parts of the sum and the multiple of a DC function, namely
$$\begin{aligned}
\widecheck{(w+u)} &= \check w + \check u &&\text{and} & \widehat{(w+u)} &= \hat w + \hat u\,,\\
\widecheck{(c\,u)} &= c\,\check u &&\text{and} & \widehat{(c\,u)} &= c\,\hat u &&\text{if } c\ge 0\,,\\
\widecheck{(c\,u)} &= c\,\hat u &&\text{and} & \widehat{(c\,u)} &= c\,\check u &&\text{if } c\le 0.
\end{aligned}$$
However, the sign conditions would force one to decompose simple affine functions $u(x) = a^\top x + \beta$ as
$$u(x) = \max\big(0,\ a^\top x + \beta\big) + \min\big(0,\ a^\top x + \beta\big) \ \equiv\ \tfrac12\big(\check u(x) + \hat u(x)\big),$$
which does not seem such a good idea from a computational point of view.
The key observation for this paper is that as is well known (see e.g., [8]), one can propagate the absolute value operation according to the identity
$$|u| = \max(u, -u) = \tfrac12\max\big(\check u + \hat u,\ -\check u - \hat u\big) = \max\big(\check u,\ -\hat u\big) + \tfrac12\big(\hat u - \check u\big)$$
$$\Longrightarrow\qquad \widecheck{|u|} = 2\max\big(\check u,\ -\hat u\big) \qquad\text{and}\qquad \widehat{|u|} = \hat u - \check u.$$
Here the equality in the second line can be verified by shifting the difference $\tfrac12(\hat u - \check u)$ into the two arguments of the max. Again we see that when applying the absolute value operation to an already positive convex function $u = \tfrac12\check u\ge 0$ we get $\widecheck{|u|} = 2\check u$ and $\widehat{|u|} = -\check u$ so that the condition number grows from $\kappa(\check u, 0) = 1$ to $\kappa(2\check u, -\check u) = 3$. In other words, we observe once more the danger that both component functions drift apart. This looks a bit like simultaneous growth of numerator and denominator in rational arithmetic, which can sometimes be limited through cancelations by common integer factors. It is currently not clear when and how a similar compactification of a given convex/concave decomposition can be achieved. The corresponding rule for the maximum is similarly easily derived, namely
$$\max(u, w) = \tfrac12\max\big(\check u + \hat u,\ \check w + \hat w\big) = \tfrac12\Big[\max\big(\check u - \hat w,\ \check w - \hat u\big) + \big(\hat u + \hat w\big)\Big].$$
When u and w as well as their decomposition are identical we arrive at the new decomposition $u = \max(u,u) = \tfrac12\big((\check u - \hat u) + 2\hat u\big)$, which obviously represents again some deterioration in the conditioning.
While it was pointed out in [4] that the DC functions u = 1 2 ( u ˇ + u ^ ) themselves form an algebra, their decomposition pairs ( u ˇ , u ^ ) are not even an additive group, as only the zero ( 0 , 0 ) has a negative partner, i.e., an additive inverse. Naturally, the pairs ( u ˇ , u ^ ) form the Cartesian product between the convex cone of convex functions and its negative, i.e., the cone of concave functions. The DC functions are then the linear envelope of the two cones in some suitable space of locally Lipschitz continuous functions. It is not clear whether this interpretation helps in some way, and in any case we are here mainly concerned with piecewise linear functions.
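To make the propagation rules of Equations (4)–(6) concrete, the following minimal Python sketch carries the pair of bound values $(\check u(x), \hat u(x))$ through the abs-linear operations at a fixed argument x. The data layout and helper names are only illustrative and not part of the development above.

```python
# Illustrative sketch: propagate the pair (u_check(x), u_hat(x)) through the
# operations of Equations (4)-(6) at a fixed argument x.
from dataclasses import dataclass

@dataclass
class CC:
    check: float          # value of the convex upper bound  u_check(x)
    hat: float            # value of the concave lower bound u_hat(x)

    @property
    def value(self):      # u(x) = (u_check(x) + u_hat(x)) / 2
        return 0.5 * (self.check + self.hat)

def indep(v):             # independent variables and constants have zero radius
    return CC(v, v)

def add(u, w):            # (w+u)ˇ = wˇ + uˇ,  (w+u)^ = w^ + u^
    return CC(u.check + w.check, u.hat + w.hat)

def sub(w, u):            # (w-u)ˇ = wˇ - u^,  (w-u)^ = w^ - uˇ
    return CC(w.check - u.hat, w.hat - u.check)

def scale(c, u):          # the two parts swap their roles for c < 0
    return CC(c * u.check, c * u.hat) if c >= 0 else CC(c * u.hat, c * u.check)

def abs_(u):              # |u|ˇ = 2 max(uˇ, -u^),  |u|^ = u^ - uˇ
    return CC(2.0 * max(u.check, -u.hat), u.hat - u.check)

def max_(u, w):           # max(u,w)ˇ = max(uˇ-w^, wˇ-u^),  max(u,w)^ = u^ + w^
    return CC(max(u.check - w.hat, w.check - u.hat), u.hat + w.hat)

# example: f(x1, x2) = |x1| - 2|x2| at x = (1.5, -0.7)
x1, x2 = indep(1.5), indep(-0.7)
f = sub(abs_(x1), scale(2.0, abs_(x2)))
print(f.hat, f.value, f.check)      # -2.8 <= 0.1 <= 3.0, the inclusion (1)
```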

Propagating the Center and Radius

Rather than propagating the pairs $(\check u, \hat u)$ through an evaluation procedure as defined in [5] to calculate the function value $f(x)$ at a given point x, it might be simpler and better for numerical stability to propagate the pair
$$u = \tfrac12\big(\check u + \hat u\big)\,,\quad \delta u = \tfrac12\big(\check u - \hat u\big)\qquad\Longleftrightarrow\qquad \check u = u + \delta u\,,\quad \hat u = u - \delta u.$$
This representation resembles the so-called central form in interval arithmetic [9] and we will therefore call u the central value and $\delta u$ the radius. In other words, u is just the normal piecewise affine intermediate function and $\delta u$ is a convex distance function to the hopefully close convex and concave parts. Should the potential blow-up discussed above actually occur, this will only affect $\delta u$ but not the central value u itself. Moreover, at least theoretically one might decide to reduce $\delta u$ from time to time, making sure of course that the corresponding $\check u$ and $\hat u$ as defined in Equation (7) stay convex and concave, respectively. The condition number now satisfies the bound
$$\kappa\big(u + \delta u,\ u - \delta u\big) = \sup_x \frac{|u + \delta u| + |u - \delta u|}{2|u|} = \sup_x \tfrac12\left(\Big|1 + \frac{\delta u}{u}\Big| + \Big|1 - \frac{\delta u}{u}\Big|\right) \ \le\ 1 + \sup_x\Big|\frac{\delta u}{u}\Big|.$$
Recall here that all intermediate quantities u = u ( x ) are functions of the independent variable vector x R n . Naturally, we will normally only evaluate the intermediate pairs u and δ u at a few iterates of whatever numerical calculation one performs involving f so that we can only sample the ratio
$$\rho_u(x) \ \equiv\ \big|\delta u(x)/u(x)\big|$$
pointwise, where the denominator is hopefully nonzero. We will also refer to this ratio as the relative gap of the convex/concave decomposition at a certain evaluation point x. The arithmetic rules for propagating radii of the central form in central convex/concave arithmetic are quite simple.
Lemma 1 (Propagation rules for central form).
With $c, d, x\in\mathbb{R}$ two constants and an independent variable we have
$$\begin{aligned}
v &= c + d\,x &&\Longrightarrow & \delta v &= 0\,, & \rho_v &= 0 \ \text{ if } v\not\equiv 0\\
v &= u \pm w &&\Longrightarrow & \delta v &= \delta u + \delta w\,, & \rho_v &\le \frac{|u|+|w|}{|u\pm w|}\,\max(\rho_u, \rho_w)\\
v &= c\,u &&\Longrightarrow & \delta v &= |c|\,\delta u\,, & \rho_v &= \rho_u \ \text{ if } c\neq 0\\
v &= |u| &&\Longrightarrow & \delta v &= |u| + 2\,\delta u\,, & \rho_v &\in [1,\ 1 + 2\rho_u].
\end{aligned}$$
Proof. 
The last rule follows from Equation (6) by
$$\delta(|u|) = \tfrac12\big(\widecheck{|u|} - \widehat{|u|}\big) = \max\big(\check u, -\hat u\big) - \tfrac12\big(\hat u - \check u\big) = \max\big(\check u - \delta u,\ -\hat u - \delta u\big) + 2\,\delta u = \max(u, -u) + 2\,\delta u = |u| + 2\,\delta u.$$
 ☐
The first equation in Equation (8) means that for all quantities u that are affine functions of the independent variables x the corresponding radius $\delta u$ is zero so that $\check u = u = \hat u$ until we reach the first absolute value. Notice that $\delta v$ does indeed grow additively for the subtraction just like for the addition. By induction it follows from the rules above for an inner product that
$$\delta\Big(\sum_{j=1}^m c_j\,u_j\Big) \ =\ \sum_{j=1}^m |c_j|\,\delta u_j,$$
where the $c_j\in\mathbb{R}$ are assumed to be constants. As we can see from the bounds in Lemma 1 the relative gap can grow substantially whenever one performs an addition of values with opposite sign or applies the absolute value operation. In contrast to interval arithmetic on smooth functions one sees that the relative gap, though it may be zero or small initially, immediately jumps to at least 1 when one hits the first absolute value operation. This is not really surprising since the best concave lower bound on $u(x) = |x|$ itself is $\hat u(x) = 0$ so that $\delta u = |x|$, $\check u(x) = 2|x|$ and thus $\rho_u(x) = 1$ constantly. On the positive side one should notice that throughout we do not lose sight of the actual central values $u(x)$, which can be evaluated with full arithmetic precision. In any case we can think of neither ρ nor $\kappa\le 1+\rho$ as small numbers, but we must be content if they do not actually explode too rapidly. Therefore they will be monitored throughout our numerical experiments.
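The corresponding centered propagation of Lemma 1 can be sketched in the same illustrative style; again the helper names are ours and the snippet only mirrors the rules stated above.

```python
# Illustrative sketch of Lemma 1: propagate the centered pair (u, delta_u) at a
# fixed argument x.  The radius delta_u stays nonnegative throughout.
def c_abs(u_pair):                 # v = |u|:  delta_v = |u| + 2 delta_u
    v, dv = u_pair
    return abs(v), abs(v) + 2.0 * dv

def c_lin(coeffs, args):           # v = sum_j c_j u_j:  delta_v = sum_j |c_j| delta_u_j
    v  = sum(c * u        for c, (u, du) in zip(coeffs, args))
    dv = sum(abs(c) * du  for c, (u, du) in zip(coeffs, args))
    return v, dv

# again f(x1, x2) = |x1| - 2|x2| at x = (1.5, -0.7); independents have radius 0
u1 = c_abs((1.5, 0.0))             # (1.5, 1.5)
u2 = c_abs((-0.7, 0.0))            # (0.7, 0.7)
f, df = c_lin([1.0, -2.0], [u1, u2])
print(f - df, f, f + df)           # -2.8  0.1  3.0, matching the pair propagation
```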
Again we see that the computational effort is almost exactly doubled. The radii can be treated as additional variables that occur only in linear operations and stay nonnegative throughout. Notice that in contrast to the (nonlinear) interval case we do not lose any accuracy by propagating the central form. It follows immediately by induction from Lemma 1 that any function evaluated by an evaluation procedure that comprises a finite sequence of
  • initializations to independent variables
  • multiplications by constants
  • additions or subtractions
  • absolute value applications
is piecewise affine and continuous. We will call these operations and the resulting evaluation procedure abs-linear. It is also easy to see that the absolute values | · | can be replaced by the maximum max ( · , · ) or the minimum min ( · , · ) or the positive part function max ( 0 , · ) or any combination of them, since they can all be mutually expressed in terms of each other and some affine operations. Conversely, it follows from the min-max representation established in [10] (Proposition  2.2.2) that any piecewise affine function f can be evaluated by such an evaluation procedure. Consequently, by applying the formulas Equations (4)–(6) one can propagate at the same time the convex and concave components for all intermediate quantities. Alternatively, one can propagate the centered form according to the rules given in Lemma 1. These rules are also piecewise affine so that we have a finite procedure for simultaneously evaluating u ˇ and u ^ or u and δ u as piecewise linear functions. The combined computation requires about 2–3 times as many arithmetic operations and twice as many memory accesses. Of course due to the interdependence of the two components it is not possible to evaluate just one of them without the other. As we will see the same is true for the generalized gradients to be discussed later in Section 4.

3. Forming and Extending the Abs-Linear Form

In practice all piecewise linear objectives can be evaluated by a sequence of abs-linear operations, possibly after min and max have been rewritten as
$$\min(u, w) = \tfrac12\big(u + w - |u - w|\big)\qquad\text{and}\qquad \max(u, w) = \tfrac12\big(u + w + |u - w|\big).$$
Our only restriction is that the number s of intermediate scalar quantities, say z i , is fixed, which is true for example in the max min representation. Then we can immediately cast the procedure in matrix-vector notation as follows:
Lemma 2 (Abs-Linear Form).
Any continuous piecewise affine function $f: x\in\mathbb{R}^n\mapsto y\in\mathbb{R}$ can be represented by
$$z = c + Z\,x + M\,z + L\,|z|\,,\qquad y = d + a^\top x + b^\top z,$$
where $z\in\mathbb{R}^s$, $Z\in\mathbb{R}^{s\times n}$, $M, L\in\mathbb{R}^{s\times s}$ strictly lower triangular, $d\in\mathbb{R}$, $a\in\mathbb{R}^n$, $b\in\mathbb{R}^s$, and $|z|$ denotes the componentwise modulus of the vector z.
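The direct evaluation of such an abs-linear form by forward substitution can be sketched as follows; the routine is only illustrative, and the strict lower triangularity of M and L is what makes the loop well defined.

```python
# Illustrative sketch: evaluate z = c + Z x + M z + L|z| and y = d + a^T x + b^T z
# by forward substitution; z_i only needs the already computed z_1, ..., z_{i-1}.
import numpy as np

def eval_abs_linear(c, Z, M, L, d, a, b, x):
    s = len(c)
    z = np.zeros(s)
    for i in range(s):
        z[i] = c[i] + Z[i] @ x + M[i] @ z + L[i] @ np.abs(z)
    y = d + a @ x + b @ z
    return y, z
```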
It should be noted that the construction of this general abs-linear form requires no analysis or computation whatsoever. However, especially for our purpose of generating a reasonably tight DC decomposition, it is advantageous to reduce the size of the abs-normal form by eliminating all intermediates $z_j$ with $j < s$ for which $|z_j|$ never occurs on the right hand side. To this end we may simply substitute the expression of $z_j$ given in the j-th row in all places where $z_j$ itself occurs on the right hand side. The result is what we will call a reduced abs-normal form, where after renumbering, all remaining $z_j$ with $j < s$ are switching variables in that $|z_j|$ occurs somewhere on the right hand side. In other words, all but the last column of the reduced, strictly lower triangular matrix L are nontrivial. Again, this reduction process is completely mechanical and does not require any nontrivial analysis, other than looking up which columns of the original L were zero. The resulting reduced system is smaller and probably denser, which might increase the computational effort for evaluating f itself. However, in view of Equation (9) we must expect that for the reduced form the radii will grow more slowly if we first accumulate linear coefficients and then take their absolute values. Hence we will assume in the remainder of this paper that the abs-normal form for our objective f of interest is reduced.
Based on the concept of abs-linearization introduced in [11], a slightly different version of a (reduced) abs-normal form was already proposed in [12]. Now in the present paper, both z and y depend directly on z via the matrix M and the vector b, but y no longer depends directly on $|z|$. All forms can be easily transformed into each other by elementary modifications. The intermediate variables $z_i$ can be calculated successively for $1\le i\le s$ by
$$z_i = c_i + Z_i\,x + M_i\,z + L_i\,|z|,$$
where Z i , M i and L i denote the ith rows of the corresponding matrix. By induction on i one sees immediately that they are piecewise affine functions z i = z i ( x ) , and we may define for each x the signature vector
$$\sigma(x) = \big(\operatorname{sgn}(z_i(x))\big)_{i=1\ldots s}\ \in\ \{-1, 0, 1\}^s.$$
Consequently we get the inverse images
$$P_\sigma \ \equiv\ \{x\in\mathbb{R}^n : \operatorname{sgn}(z(x)) = \sigma\}\qquad\text{for}\quad \sigma\in\{-1, 0, 1\}^s,$$
which are relatively open polyhedra that form collectively a disjoint decomposition of $\mathbb{R}^n$. The situation for the second example of Nesterov is depicted in Figure 3 in the penultimate section. There are six polyhedra of full dimension, seven polyhedra of codimension 1 drawn in blue and two points, which are polyhedra of dimension 0. The point $(0,-1)$ with signature $(0,-1,0)$ is stationary and the point $(1,1)$ with signature $(1,0,0)$ is the minimizer as shown in [7]. The arrows indicate the path of our reflection version of the DCA method as described in Section 5.
When σ is definite, i.e., has no zero components, which we will denote by $0\prec\sigma$, it follows from the continuity of $z(x)$ that $P_\sigma$ has full dimension n unless it is empty. In degenerate situations this may also be true for indefinite σ but then the closure of $P_\sigma$ is equal to the extended closure
$$\overline P_{\tilde\sigma} \ \equiv\ \{x\in\mathbb{R}^n : \sigma(x)\preceq\tilde\sigma\} \ \supseteq\ \operatorname{close}(P_{\tilde\sigma})$$
for some definite $0\prec\tilde\sigma\succeq\sigma$. Here the (reflexive) partial ordering $\preceq$ between the signature vectors satisfies the equivalence
$$\mathring\sigma\preceq\sigma \quad\Longleftrightarrow\quad \mathring\sigma_i\,\sigma_i \ \ge\ \mathring\sigma_i^2 \ \ \text{for } i = 1\ldots s \quad\Longleftrightarrow\quad P_{\mathring\sigma}\ \subseteq\ \overline P_\sigma
$$
as shown in [13]. One can easily check that for any $\sigma\succeq\mathring\sigma$ there exists a unique signature
$$(\sigma\circ\mathring\sigma)_i = \begin{cases}\ \ \sigma_i & \text{if } \mathring\sigma_i\neq 0\\ -\sigma_i & \text{if } \mathring\sigma_i = 0\end{cases}\qquad\text{for } i = 1\ldots s.$$
We call $\tilde\sigma \equiv \sigma\circ\mathring\sigma$ the reflection of σ at $\mathring\sigma$, which satisfies also $\tilde\sigma\succeq\mathring\sigma$, and we have in fact $\overline P_{\tilde\sigma}\cap\overline P_\sigma = \overline P_{\mathring\sigma}$. Hence the relation between σ and $\tilde\sigma$ is symmetric in that also $\sigma = \tilde\sigma\circ\mathring\sigma$. Therefore we will call $(\sigma, \tilde\sigma)$ a complementary pair with respect to $\mathring\sigma$. In the very special case $z_i = x_i$ for $i = 1\ldots n = s-1$ the $P_\sigma$ are orthants and their reflections at the origin $\{0\} = P_0\subset\mathbb{R}^n$ are their geometric opposites $P_{\tilde\sigma}$ with $\tilde\sigma = -\sigma$. Here one can see immediately that all edges, i.e., one-dimensional polyhedra, have Cartesian signatures $\pm e_i$ for $i = 1\ldots n$ and belong to $\overline P_\sigma$ or $\overline P_{\tilde\sigma}$ for any given σ. Notice that $\mathring x$ is a local minimizer of a piecewise linear function if and only if it is a local minimizer along all edges of nonsmoothness emanating from it. Consequently, optimality of f restricted to a complementary pair is equivalent to local optimality on $\mathbb{R}^n$, not only in this special case, but whenever the Linear Independence Kink Qualification (LIKQ) holds as introduced in [13] and defined in the Appendix A. This observation is the basis of the implicit optimality condition verified by our DCA variant Algorithm 1 through the use of reflections. The situation is depicted in Figure 3 where the signatures $(-1,-1,1)$ and $(1,-1,-1)$ as well as $(1,-1,-1)$ and $(1,1,1)$ form complementary pairs at $(0,-1)$ and $(1,1)$, respectively. At both reflection points there are four emanating edges, which all belong to one of the three polyhedra mentioned.
Applying the propagation rules from Lemma 1, one obtains with $\delta x = 0\in\mathbb{R}^n$ the recursion
$$\delta z_1 = \delta(c_1 + Z_1 x) = 0\,,\qquad \delta z_i = \big(|M_i| + 2|L_i|\big)\,\delta z + |L_i|\,|z|\quad\text{for } i = 2\ldots s,$$
where the modulus is once more applied componentwise for vectors and matrices. Hence, we have again in matrix vector notation
$$\delta z = \big(|M| + 2|L|\big)\,\delta z + |L|\,|z|,$$
which yields for δ z the explicit expression
$$\delta z = \big(I - |M| - 2|L|\big)^{-1}|L|\,|z| \ =\ \sum_{j=0}^{\nu}\big(|M| + 2|L|\big)^j\,|L|\,|z| \ \ge\ 0.$$
Here, ν is the so-called switching depth of the abs-linear form of f, namely the largest $\nu\in\mathbb{N}$ such that $(|M| + |L|)^\nu\neq 0$, which is always less than s due to the strict lower triangularity of M and L. The unit lower triangular matrix $(I - |M| - 2|L|)$ is an M-matrix [14], and interestingly enough does not even depend on x but directly maps $|z| = |z(x)|$ to $\delta z = \delta z(x)$. For the radius of the function itself, the propagation rules from Lemma 1 then yield
$$\delta f(x) = \delta y = |b|^\top\delta z \ \ge\ 0.$$
This nonnegativity implies the inclusion Equation (1) already mentioned in Section 1, i.e.:
Theorem 1 (Inclusion by convex/concave decomposition).
For any piecewise affine function f in abs-linear form, the construction defined in Section 2 yields a convex/concave inclusion
$$\hat f(x) \ \le\ f(x) \equiv \tfrac12\big(\check f(x) + \hat f(x)\big) \ \le\ \check f(x).$$
Moreover, the convex and the concave parts f ˇ ( x ) and f ^ ( x ) have exactly the same switching structure as f ( x ) in that they are affine on the same polyhedra P σ defined in (13).
Proof. 
Equations (16) and (17) ensure that $\delta f(x)$ is nonnegative at all $x\in\mathbb{R}^n$ such that
$$\hat f(x) \ \equiv\ f(x) - \delta f(x) \ \le\ f(x) \ \le\ f(x) + \delta f(x) \ \equiv\ \check f(x).$$
It follows from Equation (17) that the radii δ z i ( x ) are like the | z i ( x ) | piecewise linear with the only nonsmoothness arising through the switching variables z ( x ) themselves. Obviously this property is inherited by δ f ( x ) and the linear combinations f ˇ ( x ) = f ( x ) + δ f ( x ) and f ^ ( x ) = f ( x ) δ f ( x ) , which completes the proof. ☐
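Theorem 1 can also be checked numerically by computing $\delta z$ from Equation (17); the following sketch reuses the illustrative evaluation routine from Section 3.

```python
# Illustrative check of Theorem 1: compute delta_z from Equation (17) and the
# resulting bounds f_hat = f - delta_f and f_check = f + delta_f.
import numpy as np

def bounds_abs_linear(c, Z, M, L, d, a, b, x):
    y, z = eval_abs_linear(c, Z, M, L, d, a, b, x)      # sketch from Section 3
    s = len(c)
    dz = np.linalg.solve(np.eye(s) - np.abs(M) - 2.0 * np.abs(L),
                         np.abs(L) @ np.abs(z))
    df = np.abs(b) @ dz
    return y - df, y, y + df          # f_hat(x) <= f(x) <= f_check(x)
```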
Combining Equations (16) and (18) with the abs-linear form of the piecewise affine function f and defining $\tilde z = (z, \delta z)\in\mathbb{R}^{2s}$, one obtains for the calculation of $\tilde f(x) \equiv \tilde y \equiv (y, \delta y)$ the following abs-linear form
$$\tilde z = \tilde c + \tilde Z\,x + \tilde M\,\tilde z + \tilde L\,|\tilde z|\,,$$
$$\tilde y = \tilde d + \tilde a^\top x + \tilde b^\top\tilde z$$
with the vectors and matrices defined by
$$\tilde c = \begin{pmatrix} c\\ 0\end{pmatrix}\in\mathbb{R}^{2s},\quad
\tilde Z = \begin{pmatrix} Z\\ 0\end{pmatrix}\in\mathbb{R}^{2s\times n},\quad
\tilde M = \begin{pmatrix} M & 0\\ 0 & |M| + 2|L|\end{pmatrix}\in\mathbb{R}^{2s\times 2s},\quad
\tilde L = \begin{pmatrix} L & 0\\ |L| & 0\end{pmatrix}\in\mathbb{R}^{2s\times 2s},$$
$$\tilde d = \begin{pmatrix} d\\ 0\end{pmatrix}\in\mathbb{R}^{2},\quad
\tilde a = \begin{pmatrix} a & 0\end{pmatrix}\in\mathbb{R}^{n\times 2},\quad
\tilde b = \begin{pmatrix} b & 0\\ 0 & |b|\end{pmatrix}\in\mathbb{R}^{2s\times 2}.$$
Then, Equations (19) and (20) yield
$$\begin{pmatrix} z\\ \delta z\end{pmatrix} = \begin{pmatrix} c\\ 0\end{pmatrix} + \begin{pmatrix} Z\\ 0\end{pmatrix}x + \begin{pmatrix} M & 0\\ 0 & |M| + 2|L|\end{pmatrix}\begin{pmatrix} z\\ \delta z\end{pmatrix} + \begin{pmatrix} L & 0\\ |L| & 0\end{pmatrix}\begin{pmatrix} |z|\\ |\delta z|\end{pmatrix} = \begin{pmatrix} c + Zx + Mz + L|z|\\ (|M| + 2|L|)\delta z + |L|\,|z|\end{pmatrix}$$
$$\begin{pmatrix} y\\ \delta y\end{pmatrix} = \tilde d + \tilde a^\top x + \tilde b^\top\tilde z = \begin{pmatrix} d\\ 0\end{pmatrix} + \begin{pmatrix} a^\top x\\ 0\end{pmatrix} + \begin{pmatrix} b^\top & 0\\ 0 & |b|^\top\end{pmatrix}\begin{pmatrix} z\\ \delta z\end{pmatrix} = \begin{pmatrix} d + a^\top x + b^\top z\\ |b|^\top\delta z\end{pmatrix},$$
i.e., Equations (16) and (18). As can be seen, the matrices $\tilde M$ and $\tilde L$ have the required strictly lower triangular form. Furthermore, it is easy to check that the switching depth of the abs-linear form of f carries over to the abs-linear form for $\tilde f$ in that also $(|\tilde M| + |\tilde L|)^\nu\neq 0 = (|\tilde M| + |\tilde L|)^{\nu+1}$. However, notice that this system is not reduced since the s radii are not switching variables, but globally nonnegative anyhow. We can now obtain explicit expressions for the central values, radii, and bounds for a given signature σ.
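In code, this block assembly can be sketched as follows; the routine is only illustrative and mirrors the matrices displayed above.

```python
# Illustrative block assembly of the extended abs-linear form (19)-(20) for
# (y, delta_y) from the data (c, Z, M, L, d, a, b) of f.
import numpy as np

def extend_abs_linear(c, Z, M, L, d, a, b):
    s, n = Z.shape
    O = np.zeros((s, s))
    c_t = np.concatenate([c, np.zeros(s)])
    Z_t = np.vstack([Z, np.zeros((s, n))])
    M_t = np.block([[M, O], [O, np.abs(M) + 2.0 * np.abs(L)]])
    L_t = np.block([[L, O], [np.abs(L), O]])
    d_t = np.array([d, 0.0])
    a_t = np.column_stack([a, np.zeros(n)])                       # n x 2
    b_t = np.block([[b[:, None], np.zeros((s, 1))],
                    [np.zeros((s, 1)), np.abs(b)[:, None]]])      # 2s x 2
    return c_t, Z_t, M_t, L_t, d_t, a_t, b_t
```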
Corollary 1 (Explicit representation of the centered form).
For any definite signature $\sigma\succ 0$ and all $x\in P_\sigma$ we have with $\Sigma = \operatorname{diag}(\sigma)$
$$z_\sigma(x) = (I - M - L\Sigma)^{-1}(c + Zx)\qquad\text{and}\qquad |z_\sigma(x)| = \Sigma\,z_\sigma(x) \ \ge\ 0$$
$$\delta z_\sigma(x) = \big(I - |M| - 2|L|\big)^{-1}|L|\,\Sigma\,(I - M - L\Sigma)^{-1}(c + Zx) \ \ge\ 0$$
$$\nabla z_\sigma = (I - M - L\Sigma)^{-1}Z\,,\qquad \nabla f_\sigma = a^\top + b^\top(I - M - L\Sigma)^{-1}Z$$
$$\nabla\check f_\sigma = a^\top + \Big[\,b^\top + |b|^\top\big(I - |M| - 2|L|\big)^{-1}|L|\,\Sigma\,\Big](I - M - L\Sigma)^{-1}Z$$
$$\nabla\hat f_\sigma = a^\top + \Big[\,b^\top - |b|^\top\big(I - |M| - 2|L|\big)^{-1}|L|\,\Sigma\,\Big](I - M - L\Sigma)^{-1}Z,$$
where the restrictions of the functions and their gradients to P σ are denoted by subscript σ. Notice that the gradients are constant on these open polyhedra.
Proof. 
Equations (21) and (23) follow directly from Equation (12), the abs-linear form (11) and the properties of Σ . Combining Equation (16) with  (21) yields Equation (22). Since f ˇ ( x ) = f ( x ) + δ f ( x ) and f ^ ( x ) = f ( x ) δ f ( x ) , Equations (24) and (25) follow from the representation in abs-linear form and Equation (23). ☐
As one can see the computation of the gradient $\nabla f_\sigma$ requires the solution of one unit upper triangular linear system and that of both $\nabla\check f_\sigma$ and $\nabla\hat f_\sigma$ one more. Naturally, upper triangular systems are solved by back substitution, which corresponds to the reverse mode of algorithmic differentiation as described in the following section. Hence, the complexity for calculating the gradients is exactly the same as that for calculating the functions, which can be obtained by one forward substitution for $f_\sigma$ and an extra one for $\delta f_\sigma$ and thus $\check f_\sigma$ and $\hat f_\sigma$. The given $\nabla f_\sigma$, $\nabla\check f_\sigma$ and $\nabla\hat f_\sigma$ are proper gradients in the interior of the full dimensional domains $P_\sigma$. For some or even many σ the inverse image $P_\sigma$ of the map $x\mapsto\operatorname{sgn}(z(x))$ may be empty, in which case the formulas in the corollary do not apply. Checking the nonemptiness of $P_\sigma$ for a given signature σ amounts to checking the consistency of a set of linear inequalities, which costs the same as solving an LOP and is thus nontrivial. Expressions for the generalized gradients at points in lower dimensional polyhedra are given in the following Section 4. There it is also not required that the abs-linear normal form has been reduced, but one may consider any given sequence of abs-linear operations.
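The formulas of Corollary 1 translate into two triangular solves per gradient; an illustrative sketch (using generic solves for brevity, with names of our choosing) reads:

```python
# Illustrative sketch of Corollary 1: constant gradients on the open polyhedron
# P_sigma, obtained from the two unit lower triangular linear systems.
import numpy as np

def gradients_on_polyhedron(Z, M, L, a, b, sigma):
    s = M.shape[0]
    S = np.diag(sigma.astype(float))
    Jz  = np.linalg.solve(np.eye(s) - M - L @ S, Z)               # nabla z_sigma
    Jdz = np.linalg.solve(np.eye(s) - np.abs(M) - 2.0 * np.abs(L),
                          np.abs(L) @ S @ Jz)                     # nabla delta z_sigma
    g       = a + b @ Jz                                          # nabla f_sigma
    g_check = g + np.abs(b) @ Jdz                                 # nabla f_check_sigma
    g_hat   = g - np.abs(b) @ Jdz                                 # nabla f_hat_sigma
    return g, g_check, g_hat
```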

The Two-Term Polyhedral Decomposition

It is well known ([15], Theorem 2.49) that all piecewise linear and globally convex or concave functions can be represented as the maximum or the minimum of a finite collection of affine functions, respectively. Hence, from the convex/concave decomposition we get the following drastic simplification of the classical min-max representation given, e.g., in [10].
Corollary 2 (Additive max/min decomposition of PL functions).
For every piecewise affine function $f:\mathbb{R}^n\to\mathbb{R}$ there exist $k\ge 0$ affine functions $\alpha_i + a_i^\top x$ for $i = 1\ldots k$ and $l\ge 0$ affine functions $\beta_j + b_j^\top x$ for $j = 1\ldots l$ such that at all $x\in\mathbb{R}^n$
$$f(x) = \underbrace{\max_{i=1\ldots k}\big(\alpha_i + a_i^\top x\big)}_{\textstyle\frac12\check f(x)} + \underbrace{\min_{j=1\ldots l}\big(\beta_j + b_j^\top x\big)}_{\textstyle\frac12\hat f(x)}$$
where furthermore $\hat f(x)\le f(x)\le\check f(x)$.
The max-part of this representation is what is called a polyhedral function in the literature [15]. Since the min-part is correspondingly the negative of a polyhedral function we may also refer to Equation (26) as a DP decomposition, i.e., the difference of two polyhedral functions.
We are not aware of a publication that gives a practical procedure for computing such a collection of affine functions α i + a i x , i = 1 k , and β j + b j x , j = 1 l , for a given piecewise linear function f. Of course the critical question is in which form the function f is specified. Here as throughout our work we assume that it is given by a sequence of abs-linear operations. Then we can quite easily compute for each intermediate variable v representations of the form
$$v = \sum_{i=1}^{\bar m}\ \max_{1\le j\le k_i}\big(\alpha_{ij} + a_{ij}^\top x\big) \ +\ \sum_{i=1}^{\bar n}\ \min_{1\le j\le l_i}\big(\beta_{ij} + b_{ij}^\top x\big)$$
$$= \max_{\substack{j_i\in I_i\\ 1\le i\le\bar m}}\ \sum_{i=1}^{\bar m}\big(\alpha_{i j_i} + a_{i j_i}^\top x\big) \ +\ \min_{\substack{j_i\in J_i\\ 1\le i\le\bar n}}\ \sum_{i=1}^{\bar n}\big(\beta_{i j_i} + b_{i j_i}^\top x\big),$$
with index sets $I_i = \{1,\ldots,k_i\}$, $1\le i\le\bar m$, and $J_i = \{1,\ldots,l_i\}$, $1\le i\le\bar n$, since one has to consider all possibilities of selecting one affine function each from one of the $\bar m$ max and $\bar n$ min groups, respectively. Obviously, (28) involves $\prod_{i=1}^{\bar m}k_i$ and $\prod_{i=1}^{\bar n}l_i$ affine function terms in contrast to the first representation (27), which contains just $\sum_{i=1}^{\bar m}k_i$ and $\sum_{i=1}^{\bar n}l_i$ of them. Still the second version conforms to the classical representation of convex and concave piecewise linear functions, which yields the following result:
Corollary 3 (Explicit computation of the DP representation).
For any piecewise linear function given as abs-linear procedure one can explicitly compute the representation (26) by implementing the rules of Lemma 1.
Proof. 
We will consider the representations (27) from which (26) can be directly obtained in the form (28). Firstly, the independent variables x j are linear functions of themselves with gradient a = e j and inhomogeneity α = 0 . Then for multiplications by a constant c > 0 we have to scale all affine functions by c. Secondly, addition requires appending the expansions of the two summands to each other without any computation. Taking the negative requires switching the sign of all affine functions and interchanging the max and min group. Finally, to propagate through the absolute values we have to apply the rule (6), which means switching the signs in the min group, expressing it in terms of max and merging it with the existing max group. Here merging means pairwise joining each polyhedral term of the old max-group with each term in the switched min-group. Then the new min-group is the old one plus the old max-group with its sign switched. ☐
We see that taking the absolute value or, alternatively, maxima or minima generates the strongest growth in the number of polyhedral terms and their size. It seems clear that this representation is generally not very useful because the number of terms will likely blow up exponentially. This is not surprising because we will need one affine function for each element of the polyhedral decompositions of the domain of the max and min term. Typically, many of the affine terms will be redundant, i.e., could be removed without changing the values of the polyhedral terms. Unfortunately, identifying those already requires solving primal or dual linear programming problems, see, e.g., [16]. It seems highly doubtful that this would ever be worthwhile. Therefore, we will continue to advocate dealing with piecewise linear functions in a convenient procedural abs-linear representation.
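For very small instances the construction behind Corollary 3 can nevertheless be sketched directly. The following illustrative code (the data layout is ours) stores every intermediate as one max-group and one min-group of affine pieces, representing $\tfrac12\check v$ and $\tfrac12\hat v$, and propagates them through the abs-linear operations; as discussed above, the group sizes may grow combinatorially.

```python
# Illustrative sketch behind Corollary 3: an intermediate is a pair (A, B) with
#   v(x) = max_{(al,a) in A}(al + a@x) + min_{(be,b) in B}(be + b@x),
# where the max-group A represents v_check/2 and the min-group B represents v_hat/2.
import numpy as np

def var(j, n):                                  # independent variable x_j
    e = np.zeros(n); e[j] = 0.5
    return [(0.0, e.copy())], [(0.0, e.copy())]

def neg(group):
    return [(-al, -a) for al, a in group]

def merge(g1, g2):                              # all pairwise sums of affine pieces
    return [(al1 + al2, a1 + a2) for al1, a1 in g1 for al2, a2 in g2]

def add(v, w):
    return merge(v[0], w[0]), merge(v[1], w[1])

def scale(c, v):
    A, B = v
    if c >= 0:
        return [(c * al, c * a) for al, a in A], [(c * be, c * b) for be, b in B]
    return [(c * be, c * b) for be, b in B], [(c * al, c * a) for al, a in A]

def max_(v, w):                                 # follows the rule for max(u, w)
    A, B = v
    C, D = w
    return merge(A, neg(D)) + merge(C, neg(B)), merge(B, D)

def abs_(v):                                    # |v| = max(v, -v)
    return max_(v, scale(-1.0, v))

def value(v, x):
    A, B = v
    return max(al + a @ x for al, a in A) + min(be + b @ x for be, b in B)

# f(x1, x2) = |x1| - 2|x2|
x1, x2 = var(0, 2), var(1, 2)
f = add(abs_(x1), scale(-2.0, abs_(x2)))
print(value(f, np.array([1.5, -0.7])))          # 0.1
```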

4. Computation of Generalized Gradients and Constructive Oracle Paradigm

For optimization by variants of the DCA algorithm [17] one needs generalized gradients of the convex and the concave component. Normally, there are no strict rules for propagating generalized gradients through nonsmooth evaluation procedures. However, exactly this is simply assumed in the frequently invoked oracle paradigm, which states that at any point x R n the function value f ( x ) and an element g f ( x ) can be evaluated. We have argued in [18] that this is not at all a reasonable assumption.
On the other hand, it is well understood that for the convex operations of positive scaling, addition, and taking the maximum the rules are strict and simple. Moreover, then the generalized gradient in the sense of Clarke $\partial\check f(x)\subset\mathbb{R}^n$ is actually a subdifferential in that all its elements define supporting hyperplanes. Similarly $\partial\hat f(x)$ might be called a superdifferential in that the tangent planes bound the concave part from above.
In other words, we have at all $x\in\mathbb{R}^n$ and for all increments $\Delta x$
$$\check f(x + \Delta x) \ \ge\ \check f(x) + \check g^\top\Delta x \qquad\text{if}\quad \check g\in\partial\check f(x)$$
and
$$\hat f(x + \Delta x) \ \le\ \hat f(x) + \hat g^\top\Delta x \qquad\text{if}\quad \hat g\in\partial\hat f(x),$$
which imply for $\check g\in\partial\check f(x)$ and $\hat g\in\partial\hat f(x)$ that
$$\hat f(x + \Delta x) + \check f(x) + \check g^\top\Delta x \ \le\ 2 f(x + \Delta x) \ \le\ \check f(x + \Delta x) + \hat f(x) + \hat g^\top\Delta x,$$
where the lower bound on the left is a concave function and the upper bound is convex, both with respect to $\Delta x$. Notice that the generalized superdifferential $\partial\hat f$, being the negative of the subdifferential of $-\hat f$, is also a convex set.
Now the key question is how we can calculate a suitable pair of generalized gradients $(\check g, \hat g)\in\partial\check f(x)\times\partial\hat f(x)$. As we noted above the convex part and the negative of the concave part only undergo convex operations so that for $v = c\,u$
$$\partial\check v = \begin{cases} c\,\partial\check u & \text{if } c > 0\\ \{0\} & \text{if } c = 0\\ c\,\partial\hat u & \text{if } c < 0\end{cases}\qquad\text{and}\qquad \partial\hat v = \begin{cases} c\,\partial\hat u & \text{if } c > 0\\ \{0\} & \text{if } c = 0\\ c\,\partial\check u & \text{if } c < 0\end{cases}$$
and for v = u + w
$$\partial\check v = \partial\check u + \partial\check w \qquad\text{and}\qquad \partial\hat v = \partial\hat u + \partial\hat w.$$
Finally, for $v = |u|$ we find by Equation (6) that $\partial\hat v = \partial\hat u - \partial\check u$ as well as
$$\tfrac12\,\partial\check v = \partial\max\big(\check u, -\hat u\big) = \begin{cases}\partial\check u & \text{if } u > 0\\ \operatorname{conv}\big\{\partial\check u\cup(-\partial\hat u)\big\} & \text{if } u = 0\\ -\partial\hat u & \text{if } u < 0,\end{cases}$$
where we have used that $u = \tfrac12(\check u + \hat u)$ in Equation (32). The signs of the arguments u of the absolute value function are of great importance, because they determine the switching structure. For this reason, we formulated the cases in terms of u rather than in the convex/concave components. The operator $\operatorname{conv}\{\cdot\}$ denotes taking the convex hull or envelope of a given, usually closed, set. It is important to state that within an abs-linear representation the multipliers c will stay constant independent of the argument x, even if they were originally computed as partial derivatives by an abs-linearization process and thus subject to round-off error. In particular their sign will remain fixed throughout whatever algorithmic calculation we perform involving the piecewise linear function f. So, actually the case $c = 0$ could be eliminated by dropping this term completely and just initializing the left hand side v to zero.
Because we have set identities we can propagate generalized gradient pairs $(\check g_u, \hat g_u)\in\partial\check u\times\partial\hat u$ and perform the indicated algebraic operations on them, starting with the Cartesian basis vectors
$$\nabla\check x_j = \nabla\hat x_j = \nabla x_j = e_j \qquad\text{since}\quad \check x_j = \hat x_j = x_j \ \text{ for } j = 1\ldots n.$$
The result of this propagation is guaranteed to be an element of $\partial\check f\times\partial\hat f$. Recall that in the merely Lipschitz continuous case generalized gradients cannot be propagated with certainty since for example the difference $v = w - u$ generates a proper inclusion $\partial v\subset\partial w - \partial u$. In that vein we must emphasize that the average $\tfrac12(\check g + \hat g)$ need not be a generalized gradient of $f = \tfrac12(\check f + \hat f)$ as demonstrated by the possibility that $\hat f = -\check f$ algebraically but we happen to calculate different generalized gradients of $\check f$ and $\hat f$ at a particular point x. In fact, if one could show that $\tfrac12(\check g + \hat g)\in\partial f$ one would have verified the oracle paradigm, whose use we consider unjustified in practice. Instead, we can formulate another corollary for sufficiently piecewise smooth functions.
Definition 1.
For any $d\in\mathbb{N}$, the set of functions $f:\mathbb{R}^n\to\mathbb{R}$, $y = f(x)$, defined by an abs-normal form
$$z = F(x, z, |z|)\,,\qquad y = \varphi(x, z),$$
with $F\in C^d(\mathbb{R}^{n+s+s})$ and $\varphi\in C^d(\mathbb{R}^{n+s})$, is denoted by $C^d_{\mathrm{abs}}(\mathbb{R}^n)$.
Once more, this definition differs slightly from the one given in [7] in that y depends only on z and not on |z| in order to match the abs-linear form used here. Then one can show the following result:
Corollary 4 (Constructive Oracle Paradigm).
For any function $f\in C^2_{\mathrm{abs}}(\mathbb{R}^n)$ and a given point x there exist a convex polyhedral function $\Delta\check f(x;\Delta x)$ and a concave polyhedral function $\Delta\hat f(x;\Delta x)$ such that
$$f(x + \Delta x) - f(x) = \tfrac12\big[\Delta\check f(x;\Delta x) + \Delta\hat f(x;\Delta x)\big] + O\big(\|\Delta x\|^2\big).$$
Moreover, both terms and their generalized gradients at Δ x = 0 or anywhere else can be computed with the same order of complexity as f itself.
Proof. 
In [11], we show that
$$f(x + \Delta x) - f(x) = \Delta f(x;\Delta x) + O\big(\|\Delta x\|^2\big),$$
where $\Delta f(x;\Delta x)$ is a piecewise linearization of f developed at x and evaluated at $\Delta x$. Applying the convex/concave decomposition of Theorem 1, one obtains immediately the assertion with a convex polyhedral function $\Delta\check f(x;\Delta x)$ and a concave polyhedral function $\Delta\hat f(x;\Delta x)$ evaluated at $\Delta x$. The complexity results follow from the propagation rules derived so far. ☐
We had hoped that it would be possible to use this approximate decomposition into polyhedral parts to construct at least locally an exact decomposition of a general function $f\in C^d_{\mathrm{abs}}(\mathbb{R}^n)$ into a convex and a concave part. The natural idea seems to add a sufficiently large quadratic term $\beta\|\Delta x\|^2$ to
$$f(x + \Delta x) - f(x) - \tfrac12\Delta\hat f(x;\Delta x) = \tfrac12\Delta\check f(x;\Delta x) + O\big(\|\Delta x\|^2\big)$$
such that it would become convex. Then the same term could be subtracted from Δ f ^ ( x ; Δ x ) maintaining its concavity. Unfortunately, the following simple example shows that this is not possible.
Example 1 (Half pipe).
The function
$$f:\mathbb{R}^2\to\mathbb{R}\,,\qquad f(x_1, x_2) = \max\big(x_2^2 - \max(x_1, 0),\ 0\big) = \begin{cases} x_2^2 & \text{if } x_1\le 0\\ x_2^2 - x_1 & \text{if } 0\le x_1\le x_2^2\\ 0 & \text{if } 0\le x_2^2\le x_1,\end{cases}$$
in the class $C^\infty_{\mathrm{abs}}(\mathbb{R}^2)$ is certainly nonconvex as shown in Figure 1. As already observed in [19] this generally nonsmooth function is actually Fréchet differentiable at the origin $x = 0$ with a vanishing gradient $\nabla f(0) = 0$. Hence, we have $f(\Delta x) = O(\|\Delta x\|^2)$ and may simply choose constantly $\Delta\check f(0;\Delta x) \equiv 0 \equiv \Delta\hat f(0;\Delta x)$. However, neither by adding $\beta\|\Delta x\|^2$ nor any other smooth function to $f(\Delta x)$ can we eliminate the downward facing kink along the vertical axis $\Delta x_1 = 0$. In fact, it is not clear whether this example has any DC decomposition at all.

Applying the Reverse Mode for Accumulating Generalized Gradients

Whenever gradients are propagated forward through a smooth evaluation procedure, i.e., for functions in C 2 ( R n ) , they are uniquely defined as affine combinations of each other, starting from Cartesian basis vectors for the components of x. Given only the coefficients of the affine combinations one can propagate corresponding adjoint values, or impact factors backwards, to obtain the gradient of a single dependent with respect to all independents at a small multiple of the operations needed to evaluate the dependent variable by itself. This cheap gradient result is a fundamental principle of computational mathematics, which is widely applied under various names, for example discrete adjoints, back propagation, and reverse mode differentiation. For a historical review see [20] and for a detailed description using similar notation to the current paper see our book [5]. For good reasons, there has been little attention to the reverse mode in the context of nonsmooth analysis, where one can at best obtain subgradients. The main obstacle is again that the forward propagation rules are only sharp when all elementary operations maintain convexity, which is by the way the only constructive way of verifying convexity for a given evaluation procedure. While general affine combinations and the absolute value are themselves convex functions, they do not maintain convexity when applied to a convex argument.
The last equation of Lemma 1 shows that one cannot directly propagate a subgradient of the convex radius functions δ u because there is a reference to v = | u | itself, which does not maintain convexity except when it is redundant due to its argument having a constant sign. However, it follows from the identity δ u = 1 2 ( u ˇ u ^ ) that for all intermediates u
$$\check g\in\partial\check u \ \wedge\ \hat g\in\partial\hat u \quad\Longrightarrow\quad \tfrac12\big(\check g - \hat g\big)\in\partial(\delta u).$$
Hence one can get affine lower bounds of the radii, although one would probably prefer upper bounds to limit the discrepancy between the convex and concave parts. When v = | u | and u = 0 we may choose according to Equation (32) any convex combination
$$\tfrac12\,\check g_v = (1-\mu)\,\check g_u - \mu\,\hat g_u \qquad\text{for}\quad 0\le\mu\le 1.$$
It is tempting but not necessarily a good idea to always choose the weight μ equal to 1 2 for simplicity.
Before discussing the reasons for this at the end of this subsection, let us note that from the values of the constants c, the intermediate values u, and the chosen weights μ it is clear how the next generalized gradient pair ( v ˇ , v ^ ) is computed as a linear combination of the generalized gradients of the inputs for each operation, possibly with a switch in their roles. That means after only evaluating the function f itself, not even the bounds f ˇ and f ^ , we can compute a pair of generalized gradients in f ˇ × f ^ using the reverse mode of algorithmic differentiation, which goes back to at least [21] though not under that name. The complexity of this computation will be independent of the number of variables and relative to the complexity of the function f itself. All the operations are relatively benign, namely scaling by constants, interchanges and additions and subtractions. After all the reverse mode is just a reorganization of the linear algebra in the forward propagation of gradients. Hence, it appears that we can be comparatively optimistic regarding the numerical stability of this process.
To be specific we will indicate the (scalar) adjoint values of all intermediates $\check u$ and $\hat u$ as usual by $\bar{\check u}\in\mathbb{R}$ and $\bar{\hat u}\in\mathbb{R}$. They are all initialized to zero except for either $\bar{\check y} = 1$ or $\bar{\hat y} = 1$. Then at the end of the reverse sweep, the vectors $(\bar{\check x}_j + \bar{\hat x}_j)_{j=1\ldots n}$ represent either $\check g$ or $\hat g$, respectively. For computational efficiency one may propagate both adjoint components simultaneously, so that one computes with sextuplets consisting of $\check u$, $\hat u$ and their adjoints with respect to $\check y$ and $\hat y$. In any case we have the following adjoint operations. For $v = u + w$
$$(\bar{\check w}, \bar{\hat w}) \mathrel{+}= (\bar{\check v}, \bar{\hat v}) \qquad\text{and}\qquad (\bar{\check u}, \bar{\hat u}) \mathrel{+}= (\bar{\check v}, \bar{\hat v}),$$
for $v = c\,u$
$$(\bar{\check u}, \bar{\hat u}) \mathrel{+}= \begin{cases} c\,(\bar{\check v}, \bar{\hat v}) & \text{if } c > 0\\ (0, 0) & \text{if } c = 0\\ c\,(\bar{\hat v}, \bar{\check v}) & \text{if } c < 0,\end{cases}$$
and finally for $v = |u|$
$$(\bar{\check u}, \bar{\hat u}) \mathrel{+}= \begin{cases} \big(2\bar{\check v} - \bar{\hat v},\ \bar{\hat v}\big) & \text{if } u > 0\\ \big({-\bar{\hat v}} + 2(1-\mu)\bar{\check v},\ \bar{\hat v} - 2\mu\bar{\check v}\big) & \text{if } u = 0\\ \big({-\bar{\hat v}},\ \bar{\hat v} - 2\bar{\check v}\big) & \text{if } u < 0.\end{cases}$$
Of course, the update for the critical case $u = 0$ of the absolute value is just the convex combination of the two cases $u > 0$ and $u < 0$ weighted by μ. Due to round-off errors it is very unlikely that the critical case $u = 0$ ever occurs in floating point arithmetic. Once more, the signs of the arguments u of the absolute value function are of great importance, because they determine on which faces of the polyhedral functions $\check f$ and $\hat f$ the current argument x is located. In some situations one prefers a gradient that is limiting in that it actually occurs as a proper gradient on one of the adjacent smooth pieces. For example, if we had simply $f(x) = v = |x|$ for $x\in\mathbb{R}$ and chose $\mu = \tfrac12$ we would get $\check v = 2|x|$, $\hat v = 0$ and find by Equation (34) that $\check g = 2\big(\tfrac12 - \tfrac12\big) = 0$ at $x = \check x = \hat x = 0$. This is not a limiting gradient of $\check v$ since $\partial\check v(0) = [-2, 2]$, whose interior contains the particular generalized gradient 0.
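The three adjoint updates can be sketched as follows; the code is only illustrative, and the seeding and the final accumulation follow the description above.

```python
# Illustrative sketch of the adjoint updates (33)-(35).  Each function returns the
# contribution (bar_u_check, bar_u_hat) generated by the adjoint pair of v.
def rev_add(vb_check, vb_hat):                  # v = u + w: same pair for u and for w
    return vb_check, vb_hat

def rev_scale(c, vb_check, vb_hat):             # v = c*u
    if c > 0:
        return c * vb_check, c * vb_hat
    if c < 0:
        return c * vb_hat, c * vb_check
    return 0.0, 0.0

def rev_abs(u, vb_check, vb_hat, mu=0.5):       # v = |u|, u is the forward value
    if u > 0:
        return 2 * vb_check - vb_hat, vb_hat
    if u < 0:
        return -vb_hat, vb_hat - 2 * vb_check
    return -vb_hat + 2 * (1 - mu) * vb_check, vb_hat - 2 * mu * vb_check

# demo: f = |x| at x = 2.  Since x_check = x_hat = x, the two adjoint entries add up.
print(sum(rev_abs(2.0, 1.0, 0.0)))              # 2.0 = gradient of f_check = 2|x|
print(sum(rev_abs(2.0, 0.0, 1.0)))              # 0.0 = gradient of f_hat   = 0
```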

5. Exploiting the Convex/Concave Decomposition for the DC Algorithm

In order to minimize the decomposed objective function f we may use the DCA algorithm [17] which is given in its basic form using our notation by
$$\begin{aligned}
&\text{Choose } x_0\in\mathbb{R}^n\\
&\text{For } k = 0, 1, 2, \ldots\\
&\qquad \text{Calculate } g_k \in -\tfrac12\,\partial\hat f(x_k)\\
&\qquad \text{Calculate } x_{k+1} \in \partial\big(\tfrac12\check f\big)^{*}(g_k)
\end{aligned}$$
where $\big(\tfrac12\check f\big)^{*}$ denotes the Fenchel conjugate of $\tfrac12\check f$. For a convex function $h:\mathbb{R}^n\to\mathbb{R}$ one has
$$w\in\partial h^{*}(y) \quad\Longleftrightarrow\quad w\in\operatorname*{argmin}_{x\in\mathbb{R}^n}\ \big\{h(x) - y^\top x\big\},$$
see [15], Chapter 11. Hence, the classic DCA reduces in our Euclidean scenario to a simple recurrence
$$x_{k+1} \in \operatorname*{argmin}_{x\in\mathbb{R}^n}\ \big\{\check f(x) + \hat g_k^\top x\big\}\qquad\text{for some}\quad \hat g_k\in\partial\hat f(x_k).$$
The objective function on the left hand side is a constantly shifted convex polyhedral upper bound on 2 f ( x ) since
$$\check f(x) + \hat g_k^\top x \ =\ 2f(x) - \hat f(x) + \hat g_k^\top x \ \ge\ 2 f(x) - \hat f(x_k) + \hat g_k^\top x_k.$$
It follows from Equation (29) and $x_{k+1}$ being a minimizer that
$$f(x_{k+1}) \ \le\ \tfrac12\big[\check f(x_{k+1}) + \hat f(x_k) + \hat g_k^\top(x_{k+1} - x_k)\big] \ \le\ \tfrac12\big[\check f(x_k) + \hat f(x_k)\big] \ =\ f(x_k).$$
Now, since (36) is an LOP, an exact solution $x_{k+1}$ can be found in finitely many steps, for example by a variant of the Simplex method. Moreover, we can then assume that $x_{k+1}$ is one of finitely many vertex points of the epigraph of $\check f$. At these vertex points, f itself attains a finite number of bounded values. Provided f itself is bounded below, we can conclude that for any choice of the $\hat g_k = \nabla\hat f_{\sigma(k)}$ the resulting function values $f(x_k)$ can only be reduced finitely often so that $f(x_k) = f(x_{k-1})$ and w.l.o.g. $x_k = x_{k-1}$ eventually. We then choose the next $\hat g_k = \nabla\hat f_{\sigma(k)}$ with $\sigma(k) = \sigma(k-1)\circ\sigma(x_k)$ as the reflection of $\sigma(k-1)$ at $\sigma(x_k)$ as defined in (15). If then again $f(x_{k+1}) = f(x_k)$ it follows from Corollary A2 that $x_k$ is a local minimizer of f and we may terminate the optimization run. Hence we obtain the DCA variant listed in Algorithm 1, which is guaranteed to reach local optimality under LIKQ. It is well defined even without this property and we conjecture that otherwise the final iterate is still a stationary point of f. The path of the algorithm on the example discussed in Section 6 is sketched in Figure 3. It reaches the stationary point $(0,-1)$ where $\sigma = (0,-1,0)$ from within the polyhedron with the signature $(-1,-1,1)$ and then continues after the reflection $(1,-1,-1) = (-1,-1,1)\circ(0,-1,0)$. From within that polyhedron the inner loop reaches the point $(1,1)$ with signature $(1,0,0)$, whose minimality is established after a search in the polyhedron $P_{(1,1,1)}$.
If the function f ( x ) is unbounded below, so will be one of the inner convex problems and the convex minimizer should produce a ray of infinite descent instead of the next iterate x k + 1 . This exceptional scenario will not be explicitly considered in the remainder of the paper. The reflection operation is designed to facilitate further descent or establish local optimality. It is discussed in the context of general optimality conditions in the following subsection.
Algorithm 1 Reflection DCA
Require: $x_0\in\mathbb{R}^n$
1: Set $f_{-1} = \infty$ and evaluate $f_0 = f(x_0)$
2: for $k = 0, 1, \ldots$ do
3:   if $f_k < f_{k-1}$ then                    ▹ Normal iteration with function reduction
4:     Choose definite $\sigma\succeq\sigma(x_k)$          ▹ Here different heuristics may be applied
5:     Compute $\hat g_k = \nabla\hat f_\sigma$            ▹ Apply formula of Corollary 1
6:   else                                       ▹ The starting point was already optimal
7:     Reflect $\tilde\sigma = \sigma\circ\sigma(x_k)$     ▹ The reflection $\circ$ is defined in Equation (15)
8:     Update $\hat g_k = \nabla\hat f_{\tilde\sigma}$
9:   end if
10:  Calculate $x_{k+1}\in\operatorname{argmin}_{x\in\mathbb{R}^n}\big\{\check f(x) + \hat g_k^\top x\big\}$          ▹ Apply any finite LOP solver
11:  Set $f_{k+1} = f(x_{k+1})$
12:  if $f_{k+1} = f_k = f_{k-1}$ then           ▹ Local optimality established
13:    Stop
14:  end if
15: end for
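The algorithm can be sketched structurally as follows; the callbacks are placeholders that have to be supplied for a concrete abs-linear form: signature(x) returns $\sigma(x)$, choose_definite picks a definite refinement as in step 4, grad_hat applies the formula of Corollary 1, reflect implements Equation (15), and argmin_inner solves the inner problem $\min_x \check f(x) + \hat g^\top x$ by any finite LOP or convex solver.

```python
# Structural sketch of Algorithm 1 (Reflection DCA); all callbacks are placeholders.
import numpy as np

def reflection_dca(x0, f, signature, choose_definite, grad_hat, reflect,
                   argmin_inner, max_iter=100):
    x = x0
    f_prev, f_cur = np.inf, f(x)                  # f_{k-1}, f_k
    sigma = None
    for _ in range(max_iter):
        if f_cur < f_prev:                        # normal iteration with reduction
            sigma = choose_definite(signature(x)) # any definite sigma >= sigma(x_k)
        else:                                     # no reduction: reflect at sigma(x_k)
            sigma = reflect(sigma, signature(x))
        g_hat = grad_hat(sigma)                   # formula of Corollary 1
        x_new = argmin_inner(g_hat)               # min_x f_check(x) + g_hat @ x
        f_new = f(x_new)
        if f_new == f_cur == f_prev:              # local optimality established
            return x
        x, f_prev, f_cur = x_new, f_cur, f_new
    return x
```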

5.1. Checking Optimality Conditions

Stationarity of $x_k$ happens when the convex function $\check f(x) + \hat g_k^\top x$ is minimal at $x_k$ so that for all large k
$$0 \ \in\ \partial\check f(x_k) + \hat g_k \quad\Longleftrightarrow\quad \hat g_k \ \in\ \partial\hat f(x_k)\cap\big(-\partial\check f(x_k)\big).$$
The nonemptiness condition on the right hand side is known as criticality of the DC decomposition at $x_k$, which is necessary but not sufficient even for local optimality of $f(x)$ at $x_k$. To ensure the latter one has to verify that all $\hat g_k\in\partial\hat f(x_k)$ satisfy the criticality condition (38) so that
$$\partial\hat f(x_k) \ \subseteq\ -\partial\check f(x_k) \quad\Longleftrightarrow\quad \partial^L\hat f(x_k) \ \subseteq\ -\partial\check f(x_k).$$
The left inclusion is a well known local minimality condition [22], which is already sufficient in the piecewise linear case. The right inclusion is equivalent to the left one due to the convexity of f ˇ ( x k ) .
If $\check f$ and $\hat f$ were unrelated convex and concave polyhedral functions, one would normally consider it extremely unlikely that $\hat f$ were nonsmooth at any one of the finitely many vertices of the polyhedral domain decomposition of $\check f$. For instance when $\hat f$ is smooth at $x_k$ we find that $\partial\hat f(x_k) = \{\hat g_k\}$ is a singleton so that criticality according to Equation (38) is already sufficient for local minimality according to Equation (39). As we have seen in Theorem 1 the two parts have exactly the same switching structure. That means they are nonsmooth on the same skeleton of lower dimensional polyhedra. Hence, neither $\partial^L\check f(x_k)$ nor $\partial^L\hat f(x_k)$ will be singletons at minimizing vertices of the upper bound so that checking the validity of Equation (39) appears to be a combinatorial task at first sight.
However, provided the Linear Independence Kink Qualification (LIKQ) defined in [7] is satisfied at the candidate minimizer $x_k$, the minimality can be tested with cubic complexity even in case of a dense abs-linear form. Moreover, if the test fails one can easily calculate a descent direction d. The details of the optimality test in our context including the calculation of a descent direction are given in the Appendix A. They differ slightly from the ones in [7]. Rather than applying the optimality test Proposition A1 explicitly, one can use its Corollary A2 stating that if $\mathring x$ with $\mathring\sigma = \sigma(\mathring x)$ is a local minimizer of the restriction of f to a polyhedron $\overline P_\sigma$ with definite $\sigma\succeq\mathring\sigma$ then it is a local minimizer of the unrestricted f if and only if it also minimizes the restriction of f to $\overline P_{\tilde\sigma}$ with the reflection $\tilde\sigma = \sigma\circ\mathring\sigma$. The latter condition must be true if $\mathring x$ also minimizes $\check f(x) + \nabla\hat f_{\tilde\sigma}\,x$, which can be checked by solving that convex problem. If that test fails the optimization can continue.

5.2. Proximal Rather Than Global

By some authors the DCA algorithm has been credited with being able to reach global minimizers with a higher probability than other algorithms. There is really no justification for this optimism in the light of the following observation. Suppose the objective $f(x) = \tfrac12(\check f(x) + \hat f(x))$ has an isolated local minimizer $x^*$. Then there exists an $\varepsilon > 0$ such that the level set $\{x\in\mathbb{R}^n : f(x)\le f(x^*) + \varepsilon\}$ has a bounded connected component containing $x^*$, say $L_\varepsilon$. Now suppose DCA is started from any point $x_0\in L_\varepsilon$. Since $f_0(x) \equiv \tfrac12\big(\check f(x) + \hat f(x_0) + \hat g(x_0)^\top(x - x_0)\big)$ is by Equation (37) a convex upper bound on $f(x)$ its level set $\{f_0(x)\le f(x_0)\}$ will be contained in $L_\varepsilon$. Hence any step from $x_0$ that reduces the upper bound $f_0(x)$ must stay in the same component, so there is absolutely no chance to move away from the catchment $L_\varepsilon$ of $x_0$ towards another local minimizer of f, whether global or not. In fact, by adding the convex term
$$\tfrac12\big[\hat f(x_0) + \hat g(x_0)^\top(x - x_0) - \hat f(x)\big] \ \ge\ 0,$$
which vanishes at x 0 , to the actual objective f ( x ) one performs a kind of regularization, like in the proximal point method. This means the step is actually held back compared to a larger step that might be taken by a method that only requires the reduction of f ( x ) itself.
Hence we may interpret DCA as a proximal point method where the proximal term is defined as an affinely shifted negative of the concave part. Since in general the norm and the coefficient defining the proximal term may be quite hard to select, this way of defining it may make a lot of sense. However, it is certainly not global optimization. Notice that in this argument we have used neither the polyhedrality nor the inclusion property. So it applies to a general DC decomposition on Euclidean space. Another conclusion from the "holding back" observation is that it is probably not worthwhile to minimize the upper bound very carefully. One might rather readjust the shift g ^ x after a few or even just one iteration.

6. Nesterov’s Piecewise Linear Example

According to [6], Nesterov suggested three Rosenbrock-like test functions for nonsmooth optimization. One of them given by
$$f(x) = \tfrac14\,|x_1 - 1| + \sum_{i=1}^{n-1}\big|\,x_{i+1} - 2|x_i| + 1\,\big|$$
is nonconvex and piecewise linear. It is shown in [6] that this function has $2^{n-1}$ Clarke stationary points, only one of which is a local and thus the global minimizer. Numerical studies showed that optimization algorithms tend to be trapped at one of the stationary points making it an interesting test problem. We have demonstrated in [23] that using an active signature strategy one can guarantee convergence to the unique minimizer from any starting point, albeit using in the worst case $2^{n-1}$ iterations as all stationary points are visited. Let us first write the problem in the new abs-linear form.
Defining the s = 2 n switching variables
$$z_i = F_i(x, |z|) = x_i \ \text{ for } 1\le i < n\,,\qquad z_n = F_n(x, |z|) = x_1 - 1,$$
and
$$z_{n+i} = F_{n+i}(x, |z|) = x_{i+1} - 2|z_i| + 1 \ \text{ for } 1\le i < n\,,\qquad z_s = \tfrac14|z_n| + \sum_{i=1}^{n-1}|z_{n+i}|$$
the resulting objective function is then simply identical to y = f ( x ) = z s . With the vectors and matrices
$$c = \begin{pmatrix}0\\ -1\\ e_{n-1}\\ 0\end{pmatrix}\in\mathbb{R}^{(n-1)+1+(n-1)+1},\qquad
Z = \begin{pmatrix}I_{n-1} & 0\\ e_1^\top & 0\\ 0 & I_{n-1}\\ 0 & 0\end{pmatrix}\in\mathbb{R}^{s\times n},\qquad M = 0,$$
$$L = \begin{pmatrix}0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ -2\,I_{n-1} & 0 & 0 & 0\\ 0 & \tfrac14 & e_{n-1}^\top & 0\end{pmatrix}\in\mathbb{R}^{s\times\big((n-1)+1+(n-1)+1\big)},\qquad d = 0\in\mathbb{R},\qquad a = 0,\qquad b = (0,\ldots,0,1)^\top\in\mathbb{R}^{(2n-1)+1},$$
where Z and L have different row partitions, one obtains an abs-linear form (11) of f. Here, $I_k$ denotes the identity matrix of dimension k, $e = (1,\ldots,1)^\top\in\mathbb{R}^k$ the vector containing only ones and the symbol 0 pads with zeros to achieve the specified dimensions. One can easily check that $|L|^2\neq 0 = |L|^3$, hence this example has switching depth $\nu = 2$. The geometry of the situation is depicted in Figure 3, which was already briefly discussed in Section 3 and Section 5.
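The abs-linear data of this example can be assembled and checked as follows; the code is only illustrative and reuses the evaluation sketch from Section 3 for the comparison with the direct formula.

```python
# Illustrative assembly of the abs-linear data (41) of the example, a check of the
# switching depth |L|^2 != 0 = |L|^3, and a comparison with the direct formula.
import numpy as np

def nesterov_abs_linear(n):
    s = 2 * n
    c = np.concatenate([np.zeros(n - 1), [-1.0], np.ones(n - 1), [0.0]])
    Z = np.zeros((s, n))
    Z[:n - 1, :n - 1] = np.eye(n - 1)       # z_i = x_i
    Z[n - 1, 0] = 1.0                       # z_n = x_1 - 1
    Z[n:s - 1, 1:] = np.eye(n - 1)          # z_{n+i} = x_{i+1} - 2|z_i| + 1
    M, L = np.zeros((s, s)), np.zeros((s, s))
    L[n:s - 1, :n - 1] = -2.0 * np.eye(n - 1)
    L[s - 1, n - 1] = 0.25                  # z_s = |z_n|/4 + sum_i |z_{n+i}|
    L[s - 1, n:s - 1] = 1.0
    b = np.zeros(s); b[-1] = 1.0
    return c, Z, M, L, 0.0, np.zeros(n), b

c, Z, M, L, d, a, b = nesterov_abs_linear(4)
A = np.abs(M) + np.abs(L)
print(np.any(A @ A), np.any(A @ A @ A))     # True False: switching depth nu = 2
x = np.random.randn(4)
y, _ = eval_abs_linear(c, Z, M, L, d, a, b, x)    # sketch from Section 3
f_direct = 0.25 * abs(x[0] - 1) + sum(abs(x[i + 1] - 2 * abs(x[i]) + 1) for i in range(3))
print(abs(y - f_direct) < 1e-12)            # True
```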
Since the corresponding extended abs-linear form for f ˜ = ( y , δ y ) does not provide any new insight we do not state it here. Directly in terms of the original equations we obtain for the radii
$$\delta z_i = 0 \ \text{ for } 1\le i\le n\,,\qquad \delta z_{n+i} = 2\,|z_i| = 2\,|x_i| \ \text{ for } 1\le i < n$$
and
$$\delta f = \delta z_s = \tfrac14|z_n| + \sum_{i=1}^{n-1}\big(|z_{n+i}| + 2\,\delta z_{n+i}\big) = \tfrac14|x_1 - 1| + \sum_{i=1}^{n-1}\Big(\big|x_{i+1} - 2|x_i| + 1\big| + 4\,|x_i|\Big).$$
Thus, from Equation (7) we get the convex and concave part explicitly as
$$\check z_i = z_i = \hat z_i \ \text{ for } 1\le i\le n\,,\qquad \check z_{n+i} = x_{i+1} + 1\,,\quad \hat z_{n+i} = x_{i+1} - 4|z_i| + 1 = x_{i+1} - 4|x_i| + 1 \ \text{ for } 1\le i < n$$
and most importantly
$$\check f = z_s + \delta z_s = \tfrac12|x_1 - 1| + 2\sum_{i=1}^{n-1}\Big(\big|x_{i+1} - 2|x_i| + 1\big| + 2|x_i|\Big)\,,\qquad \hat f = z_s - \delta z_s = -4\sum_{i=1}^{n-1}|x_i|.$$
Clearly f ^ is a concave function and to check the convexity of f ˇ we note that
$$\big|x_{i+1} - 2|x_i| + 1\big| + 2|x_i| \ =\ \max\big(2|x_i| - 1 - x_{i+1},\ x_{i+1} + 1 - 2|x_i|\big) + 2|x_i| \ =\ 1 + x_{i+1} + 2\max\big(0,\ 2|x_i| - x_{i+1} - 1\big).$$
The last expression is the sum of an affine function and the positive part of the sum of the absolute value and an affine function, which must therefore also be convex. The corresponding term in Equation (42) is the same with the convex function $2|x_i|$ added, so that $\delta f$ is also convex in agreement with the general theory. Finally, one verifies easily that
$$\hat f \ \le\ f = \tfrac12\big(\check f + \hat f\big) \ \le\ \check f,$$
which is the whole idea of the decomposition. It would seem that the automatic decomposition by propagation through the abs-linear procedure yields a rather tight result. The function f as well as the lower and upper bound given by the convex/concave decomposition are illustrated on the left hand side of Figure 2. Notice that the switching structure is indeed identical for all three as stated in Theorem 1. On the right hand side of Figure 2, the difference 2 δ f between the upper, convex and lower, concave bound is shown, which is indeed convex.
It is worthwhile to look at the condition number of the decomposition, namely we get the following trivial bound
\[
\kappa(\check f,\hat f) \;=\; \sup_{x\in\mathbb{R}^n}
\frac{\tfrac12|x_1-1| + 2\sum_{i=1}^{n-1}\bigl(\bigl|x_{i+1}-2|x_i|+1\bigr| + 4|x_i|\bigr)}
     {\tfrac12|x_1-1| + 2\sum_{i=1}^{n-1}\bigl|x_{i+1}-2|x_i|+1\bigr|}
\;=\; 1 + \sup_{x\in\mathbb{R}^n}
\frac{8\sum_{i=1}^{n-1}|x_i|}
     {\tfrac12|x_1-1| + 2\sum_{i=1}^{n-1}\bigl|x_{i+1}-2|x_i|+1\bigr|}
\;=\; \infty.
\]
The disappointing right hand side value follows from the fact that at the well known unique global optimizer x* = (1, 1, …, 1) ∈ R^n the denominator vanishes while the numerator remains positive. However, elsewhere, we can bound the conditioning as follows.
Lemma 3.
In case of the example (40) there is a constant c ∈ R such that
\[
\kappa\bigl(\check f(x),\hat f(x)\bigr) \;\le\; 1 + \frac{c}{\min\bigl(\|x-x^*\|,\,3\bigr)}.
\]
Proof. 
Since the denominator is piecewise linear and vanishes only at the minimizer x* there must be a constant c_0 > 0 such that for ‖x − x*‖ ≤ 3
\[
\frac{8\sum_{i=1}^{n-1}|x_i|}{\tfrac12|x_1-1| + 2\sum_{i=1}^{n-1}\bigl|x_{i+1}-2|x_i|+1\bigr|}
\;\le\; \frac{8\sum_{i=1}^{n-1}|x_i|}{c_0\,\|x-x^*\|}
\;\le\; \frac{8(n-1)\,\|x\|}{c_0\,\|x-x^*\|}
\;\le\; \frac{32(n-1)}{c_0\,\|x-x^*\|},
\]
which takes the value 32(n−1)/(3c_0) on the boundary. On the other hand we get for ‖x‖ ≥ 2, and thus in particular for ‖x − x*‖ ≥ 3,
\[
\frac{8\sum_{i=1}^{n-1}|x_i|}{\tfrac12|x_1-1| + 2\sum_{i=1}^{n-1}\bigl|x_{i+1}-2|x_i|+1\bigr|}
\;\le\; \frac{4(n-1)\,\|x\|}{\max_{1\le i<n}\bigl|\,2|x_i|-x_{i+1}-1\,\bigr|}
\;\le\; \frac{4(n-1)}{1-1/2} \;=\; 8(n-1).
\]
Assuming without loss of generality that c_0 ≤ 4/3 we can combine the two bounds to obtain the assertion with c = 32(n−1)/c_0. ☐
Hence, we see that the condition number κ(f̌(x), f̂(x)) is nicely bounded and the decomposition should work as long as our optimization algorithm has not yet reached its goal x*. It is verified in the companion article [24] that the DCA exploiting the observations made in this paper reaches the global minimizer in finitely many steps. It was already shown in [7] that the LIKQ condition is satisfied everywhere and that the optimality test singles out the unique minimizer correctly. In Figure 3, the arrows indicate the path of our reflection version of the DCA method as described in Section 5.
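The blow-up of the pointwise ratio near x* and its boundedness elsewhere are easy to observe numerically. The following sketch is our own illustration and assumes that the pointwise condition number is the quotient appearing in the display above, namely f̌(x) − f̂(x) divided by f̌(x) + f̂(x).

```python
import numpy as np

f_check = lambda x: 0.5 * abs(x[0] - 1.0) + 2.0 * np.sum(
    np.abs(x[1:] - 2.0 * np.abs(x[:-1]) + 1.0) + 2.0 * np.abs(x[:-1]))
f_hat = lambda x: -4.0 * np.sum(np.abs(x[:-1]))

def kappa(x):
    # pointwise ratio (f_check - f_hat) / (f_check + f_hat); the denominator equals 2 f(x)
    return (f_check(x) - f_hat(x)) / (f_check(x) + f_hat(x))

n = 4
x_star = np.ones(n)
d = np.array([-1.0] + [1.0] * (n - 1))      # an arbitrary fixed direction
for t in [1.0, 0.1, 0.01, 0.001]:
    print(t, kappa(x_star + t * d))         # grows roughly like 1 + c/t as t -> 0
```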

7. Summary, Conclusions and Outlook

In this paper the following new results were achieved:
  • For every piecewise linear function f given as an abs-linear evaluation procedure, rules for simultaneously evaluating its representation as the average of a concave lower bound f ^ and a convex upper bound f ˇ are derived.
  • The two bounds can be constructively expressed as a single maximum and minimum of affine functions, which drastically simplifies the classical min max representation. Due to its likely combinatorial complexity we do not recommend this form for practical calculations.
  • For the two bounds f ˇ and f ^ , generalized gradients g ˇ and g ^ can be propagated forward or reverse through the convex or concave operations that define them. The gradients are not unique but guaranteed to yield supporting hyperplanes and thus provide a verified version of the oracle paradigm.
  • The DCA algorithm can be implemented such that a local minimizer is reached in finitely many iterations, provided the Linear Independence Kink Qualification (LIKQ) is satisfied. It is conjectured that without this assumption the algorithm still converges in finitely many steps to a Clarke stationary point. Details on this can be found in the companion paper [24].
These results are illustrated on the piecewise linear Rosenbrock variant of Nesterov.
On a theoretical level it would be gratifying, and possibly provide additional insight, to prove the result of Corollary A3 directly using the explicit representations of the generalized differentials of the convex and concave part given in Corollary 1. Moreover, it remains to be explored what happens when LIKQ is not satisfied. We have conjectured in [25] that just verifying the weaker Mangasarian-Fromovitz Kink Qualification (MFKQ) represents an NP-hard task. Possibly, there are other weaker conditions that can be cheaply verified and facilitate the testing for at least local optimality.
Global optimality can be characterized theoretically in terms of ε-subgradients, albeit with ε arbitrarily large [26]. There is the possibility that the alternative definition of ε-gradients given in [18] might allow one to check constructively for global optimality. However, it is not yet clear how these global optimality conditions can be used to derive corresponding algorithms.
The implementation of the DCA algorithm can be optimized in various ways. Notice that when applying the Simplex method in standard form, one could assemble the constraint matrix from the max-part of the more economical representation in Equation (27), which introduces m̄ additional variables, rather than from the potentially combinatorial Equation (28). In any case it seems doubtful that solving each subproblem to completion is a good idea, especially as the resulting step in the outer iteration is probably much too small anyhow. Therefore, the generalized gradient of the concave part, which defines the inner problem, should probably be updated much more frequently. Moreover, the inner solver might be an SQOP type active signature method or a matrix-free gradient method with momentum term, as is used in machine learning, notwithstanding the nonsmoothness of the objective. Various options in that range will be discussed and tested in the companion article [24].
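For orientation, a schematic outer DCA loop of the kind discussed here might look as follows; this is our own sketch, not the implementation of [24], and grad_concave and solve_inner are placeholders for a generalized-gradient oracle of the concave part and for whichever finitely terminating convex solver (Simplex variant, active signature method, or gradient scheme) is preferred.

```python
import numpy as np

def dca(x0, grad_concave, solve_inner, max_iter=100, tol=1e-10):
    """Schematic DCA iteration for f = (f_check + f_hat)/2: linearize the
    concave part at the current iterate and minimize the convex model."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g_hat = grad_concave(x)           # generalized gradient of f_hat at x_k
        # minimizing (f_check(x) + f_hat(x_k) + g_hat^T (x - x_k))/2 over x
        # is equivalent to minimizing f_check(x) + g_hat^T x
        x_new = solve_inner(g_hat, x)
        if np.linalg.norm(x_new - x) <= tol:
            break
        x = x_new
    return x
```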
Finally, one should always keep in mind that the task of minimizing a piecewise linear function will most likely occur as an inner problem in the optimization of a piecewise smooth and nonlinear function. As we have shown in [27], the local piecewise linear model problem can be obtained easily by a slight generalization of automatic or algorithmic differentiation tools, e.g., ADOL-C [28] and Tapenade [29].

Author Contributions

Conceptualization, A.G. and A.W.; methodology, A.G. and A.W.; writing—original draft preparation; writing—review and editing, A.G. and A.W. All authors have read and agreed to the published version of the manuscript.

Funding

We acknowledge support by the German Research Foundation (DFG) and the Open Access Publication Fund of Humboldt-Universität zu Berlin.

Acknowledgments

We thank Napsu Karmitsa and Sona Taheri for inviting us to participate in this special issue in honor of Adil M. Bagirov. We also thank the three anonymous referees, who asked for various corrections and clarifications, which made the paper much more self-contained and readable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Polynomial Optimality Test Based on Abs-Linear Form

As illustrated for the Nesterov test function, it may be advantageous to use intermediate variables z_i that are not arguments of the absolute value themselves. For simplicity, we assume that these switching variables that do not impose nonsmoothness are located in the last components of z and that only the s̃ ≤ s components z_1, …, z_{s̃} are arguments of the absolute value. Let us abbreviate the current iterate x_k by x̊ ≡ x_k and denote the corresponding switching vector by z̊ = z(x̊), the signature vector by σ̊ = sgn(z̊) and the active index set by α ≡ {i ≤ s̃ : σ̊_i = 0} with cardinality m ≡ |α| ≤ s̃. Consequently, there are exactly 2^m definite signatures σ ⪰ σ̊ and the same number of limiting gradients for the three generalized differentials ∂f̌, ∂f̂, and ∂f.
For all x P σ ˚ , the signature σ ˚ is constant and we can use Corollary 1 to define the smooth function
\[
z_{\mathring\sigma}(x) \;=\; (I - M - L\mathring\Sigma)^{-1}(c + Zx) \;=\; \mathring c + \mathring Z x,
\]
where we have pulled out the unit lower triangular factor (I − M − LΣ̊) such that
\[
\mathring Z = (I - M - L\mathring\Sigma)^{-1} Z \qquad\text{and}\qquad \mathring c = (I - M - L\mathring\Sigma)^{-1} c.
\]
For x near x̊ to be contained in the extended closure P_{σ̊} as defined in Equation (14), it must satisfy the m linear equations
\[
P_\alpha\, z(x) \;=\; 0 \in \mathbb{R}^m \qquad\text{for}\qquad P_\alpha \;=\; (e_i^{\top})_{i\in\alpha} \in \mathbb{R}^{m\times \tilde s},
\]
with e_i denoting the ith unit vector in R^{s̃}. Thus it is necessary and sufficient for P_{σ̊} to be a polyhedron of dimension n − m that the Jacobian P_α Z̊ ∈ R^{m×n} has full row rank m. This rank condition was introduced as LIKQ in [7] and obviously requires that no more than n switches are active at x̊. As discussed in [7], for the point x̊ to be a local minimizer of f it is necessary that it solves the trunk problem
\[
\min\; a^{\top}x + b^{\top}z \qquad \text{s.t.}\qquad |\mathring\Sigma|\,z - \mathring c - \mathring Z x = 0.
\]
Here |Σ̊| ∈ R^{s̃×s̃} is the projection onto the s̃ − m vector components whose indices do not belong to α, so the equality constraint combines (A1) and the constraint P_α z = 0. Now we get from KKT theory, or equivalently LOP duality, that x̊ is a minimizer of the trunk problem if and only if for some Lagrange multiplier vector λ ∈ R^{s̃}
\[
a^{\top} = -\,\lambda^{\top}\mathring Z \qquad\text{and}\qquad b^{\top} = \lambda^{\top}|\mathring\Sigma|.
\]
Since I = |Σ̊| + P_α^⊤ P_α we derive that
\[
\lambda^{\top}\bigl(I - |\mathring\Sigma|\bigr)\,\mathring Z \;=\; \lambda_\alpha^{\top} P_\alpha \mathring Z \;=\; -\,a^{\top} - b^{\top}\mathring Z,
\]
where λ_α ≡ P_α λ. This is a generally overdetermined system of n equations in the m components of λ_α. If it is solvable the full multiplier vector λ = P_α^⊤ λ_α + |Σ̊| b is immediately available. Because of the assumed full rank of the Jacobian P_α Z̊ we have m ≤ n, and if x̊ is a vertex in that m = n the tangential stationarity condition (A3) is automatically satisfied.
Now it is necessary and sufficient for local minimality that x̊ is also a minimizer of f on all polyhedra P_σ with definite σ ⪰ σ̊. Any such σ ⪰ σ̊ can be written as σ = σ̊ + γ with γ ∈ {−1, 0, 1}^{s̃} structurally orthogonal to σ̊ such that for Γ = diag(γ) we have the matrix equations
\[
\Sigma = \mathring\Sigma + \Gamma \qquad\text{and}\qquad \mathring\Sigma\,\Gamma = 0 = |\mathring\Sigma|\,\Gamma.
\]
Then we can express z(x) = z_σ(x) for x ∈ P_σ as
\[
z_\sigma(x) \;=\; z_{\mathring\sigma+\gamma}(x) \;=\; (I - M - L\mathring\Sigma - L\Gamma)^{-1}(c + Zx)
\;=\; (I - \mathring L\,\Gamma)^{-1}(\mathring c + \mathring Z x),
\]
with L̊ ≡ (I − M − LΣ̊)^{−1} L. Now x̊ must be the minimizer of f on P_σ, i.e., solve the problem
\[
\min\; a^{\top}x + b^{\top}z \qquad \text{s.t.}\qquad (I - \mathring L\,\Gamma)\,z = \mathring c + \mathring Z x, \qquad P_\alpha\,\Gamma\, z \;\ge\; 0 \in \mathbb{R}^m.
\]
Notice that the inequalities are only imposed on the sign constraints that are active at x̊ since the strict inequalities are maintained in a neighborhood of x̊ due to the continuity of z(x). Then we get again from KKT theory or equivalently LOP duality that still a^⊤ = −λ^⊤ Z̊ and, for a second multiplier vector 0 ≤ μ ∈ R^m, the equalities
\[
a^{\top} = -\,\lambda^{\top}\mathring Z \qquad\text{and}\qquad
b^{\top} = \lambda^{\top}(I - \mathring L\,\Gamma) + \mu^{\top} P_\alpha \Gamma.
\]
Multiplying from the right by the projection |Σ̊| we find that the conditions (A2) and (A3) must still hold so that λ remains exactly the same. Moreover, multiplying from the right by Γ P_α^⊤ we get with P_α P_α^⊤ = I_m and Γ Γ = P_α^⊤ P_α after some rearrangement the inequality
\[
(\lambda - b)^{\top}\,\Gamma\, P_\alpha^{\top} \;=\; \lambda^{\top}\mathring L\, P_\alpha^{\top} - \mu^{\top} \;\le\; \lambda^{\top}\mathring L\, P_\alpha^{\top}.
\]
Now the key observation is that this condition is linear in Γ and is strongest for the choice γ_i = sgn(λ_i − b_i) for i ∈ α, yielding the inequalities
\[
|\lambda_i - b_i| \;\le\; e_i^{\top} \mathring L^{\top} \lambda \qquad\text{for } i \in \alpha.
\]
In other words, x̊ is a solution of the branch problems (A4) if and only if it is for the worst case where γ_i = sgn(λ_i − b_i) for i ∈ α. When coincidentally λ_i = b_i we can define γ_i arbitrarily. Note that the complementarity condition μ^⊤ P_α z(x̊) = 0 associated with Equation (A4) is automatically satisfied at x̊ for any μ, since P_α z̊ = 0 by definition of the active index set α. These observations yield immediately:
Proposition A1 (Necessary and sufficient minimality condition).
Assume LIKQ holds in that P_α Z̊ has full row rank m = |α|. Then the point x̊ is a local minimizer of f if and only if we have tangential stationarity in that a + Z̊^⊤ b belongs to the range of Z̊^⊤ P_α^⊤ and normal growth holds in that |P_α(λ − b)| ≤ P_α L̊^⊤ λ.
The verification that LIKQ holds and subsequently the test whether tangential stationarity is satisfied can be based on a QR decomposition of the active Jacobian P_α Z̊ ∈ R^{m×n}. The main expense here is the calculation of Z̊ itself, which requires one forward substitution on (I − M − LΣ̊) for each of the n columns of Z and hence at most n s²/2 fused multiply-adds. Very likely this effort will already be made by any kind of active set method for reaching the candidate point x̊. Once the multiplier vector λ is obtained, the remaining test (A7) for normal growth is almost for free, so that we have a polynomial minimality criterion provided LIKQ holds. Otherwise one may assume a weaker generalization of the Mangasarian-Fromovitz constraint qualification called MFKQ in [25]. However, we have conjectured in [19] that verifying MFKQ is probably already NP-hard.
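A dense linear-algebra sketch of this test is given below; it is our own illustration (not the authors' code), it ignores the distinction between s and s̃, and it uses generic solves in place of the structured forward substitutions.

```python
import numpy as np

def optimality_test(a, b, Z, M, L, sigma, alpha, tol=1e-10):
    """Check LIKQ, tangential stationarity and normal growth at a candidate
    point with signature vector sigma and active index set alpha (sketch)."""
    a, b, Z, M, L, sigma = map(np.asarray, (a, b, Z, M, L, sigma))
    s = sigma.size
    Sigma = np.diag(sigma.astype(float))
    W = np.eye(s) - M - L @ Sigma                 # the unit lower triangular factor
    Z_ring = np.linalg.solve(W, Z)
    L_ring = np.linalg.solve(W, L)
    P = np.eye(s)[alpha, :]                       # projection onto the active switches
    J = P @ Z_ring                                # active Jacobian P_alpha Z_ring
    if np.linalg.matrix_rank(J) < len(alpha):
        return "LIKQ violated"
    # tangential stationarity: a + Z_ring^T b must lie in range(Z_ring^T P_alpha^T)
    rhs = -(a + Z_ring.T @ b)
    lam_alpha, *_ = np.linalg.lstsq(J.T, rhs, rcond=None)
    if np.linalg.norm(J.T @ lam_alpha - rhs) > tol:
        return "tangential stationarity violated"
    lam = P.T @ lam_alpha + np.abs(Sigma) @ b     # full multiplier vector
    # normal growth: |lam_i - b_i| <= (L_ring^T lam)_i for all active i
    if np.all(np.abs(lam[alpha] - b[alpha]) <= (L_ring.T @ lam)[alpha] + tol):
        return "local minimizer"
    return "normal growth violated"
```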
Corollary A1 (Descent direction in the nonoptimal case).
Suppose that LIKQ holds. If tangential stationarity is violated there exists some direction d ∈ R^n such that P_α Z̊ d = 0 but (a^⊤ + b^⊤ Z̊) d < 0, which implies descent in that f(x̊ + τd) < f(x̊) for all sufficiently small τ > 0. If tangential stationarity holds but normal growth fails there exists at least one i ∈ α with |λ_i − b_i| > e_i^⊤ L̊^⊤ λ. Defining γ = sgn(λ_i − b_i) e_i ∈ R^{s̃}, any d satisfying P_α (I − L̊Γ)^{−1} Z̊ d = P_α γ is a descent direction.
Proof. 
In the first case it is clear that x̊ + τd ∈ P_{σ̊} for small τ > 0 since the components of z(x̊ + τd) with indices in α stay zero and the others vary only slightly. Then the directional derivative of f(·) at x̊ in direction τd is given by
\[
\tau\, a^{\top} d + \tau\, b^{\top}\mathring Z d \;=\; \tau\,\bigl(a^{\top} d + b^{\top}\mathring Z d\bigr) \;<\; 0,
\]
which proves the first assertion. Otherwise, λ is well defined and we can choose i ∈ α with |λ_i − b_i| > e_i^⊤ L̊^⊤ λ. Setting γ = γ_i e_i with γ_i = sgn(λ_i − b_i), one obtains for d with P_α (I − L̊Γ)^{−1} Z̊ d = P_α γ that x̊ + τd ∈ P_{σ̊+γ} for small τ > 0. On that polyhedron the Lagrange multiplier vector μ is also well defined by Equation (A6) but we have
\[
\mu_i \;=\; e_i^{\top}\mathring L^{\top}\lambda - (\lambda_i - b_i)\,\gamma_i \;=\; e_i^{\top}\mathring L^{\top}\lambda - |\lambda_i - b_i| \;<\; 0.
\]
Then we get the directional derivative of f ( . ) at x ˚ in direction τ d
\[
\tau\, a^{\top} d + \tau\, b^{\top}(I - \mathring L\,\Gamma)^{-1}\mathring Z d
\;=\; \tau\,\bigl(-\lambda^{\top}\mathring Z d + \lambda^{\top}\mathring Z d + \mu^{\top} P_\alpha \Gamma (I - \mathring L\,\Gamma)^{-1}\mathring Z d\bigr)
\;=\; \tau\,\mu_i\,\gamma_i^2 \;<\; 0,
\]
where we have used identity (A5). Hence we have again descent, which completes the proof.  ☐
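In the second case the descent direction can be computed from one additional linear solve; the sketch below is our own illustration of this construction, using a least-squares solve to pick one particular d (under LIKQ the system has full row rank, so the residual vanishes).

```python
import numpy as np

def descent_direction(b, Z_ring, L_ring, P, lam, i):
    """Return a direction d with P (I - L_ring Gamma)^(-1) Z_ring d = P gamma,
    where gamma reflects the single active index i violating normal growth."""
    s = L_ring.shape[0]
    gamma = np.zeros(s)
    gamma[i] = np.sign(lam[i] - b[i])             # worst-case sign for index i
    Gamma = np.diag(gamma)
    A = P @ np.linalg.solve(np.eye(s) - L_ring @ Gamma, Z_ring)
    d, *_ = np.linalg.lstsq(A, P @ gamma, rcond=None)   # one solution of A d = P gamma
    return d
```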
Corollary A2 (Optimality via Reflection).
Suppose an x̊ where LIKQ holds has been reached by minimizing f̌(x) + ĝ^⊤ x with ĝ = ∇f̂_σ for some definite signature σ ⪰ σ̊. Then x̊ is a local minimizer of f on R^n if and only if it is also a minimizer of f̌(x) + ∇f̂_{σ̃}^⊤ x with the reflected signature σ̃ = 2σ̊ − σ as defined in (15).
Proof. 
By assumption x̊ solves one of the branch problems of f itself. Hence we must have tangential stationarity (A5) with the corresponding Γ = diag(γ) for γ = σ − σ̊. Since σ̃ − σ̊ = −γ we conclude from (A6) that
\[
(\lambda - b)^{\top}\,\Gamma\, P_\alpha^{\top} \;\le\; \lambda^{\top}\mathring L\, P_\alpha^{\top} \;\ge\; (\lambda - b)^{\top}(-\Gamma)\, P_\alpha^{\top} \;=\; -\,(\lambda - b)^{\top}\,\Gamma\, P_\alpha^{\top}
\]
which implies that
\[
\bigl|(\lambda - b)^{\top} P_\alpha^{\top}\bigr| \;=\; \bigl|(\lambda - b)^{\top}\,\Gamma\, P_\alpha^{\top}\bigr| \;\le\; \lambda^{\top}\mathring L\, P_\alpha^{\top}.
\]
Hence both tangential stationarity and normal growth are satisfied, which completes the proof by Proposition A1 as the converse implication is trivial. ☐
The key conclusion is that if an x ˚ is the solution of two complementary convex problems it must be locally optimal in the full dimensional space R n . Hence one can establish local optimality just using the preferred convex solver. If this test fails one naturally obtains descent to function values below f ( x ˚ ) until eventually a local minimizer is found.

Equivalence to DC Optimality Condition

Using the explicit expressions given in Lemma 1 we find that (see [18])
\[
\partial^{L} f(\mathring x) \;=\; \Bigl\{\, a^{\top} + b^{\top}(I - \mathring L\,\Gamma)^{-1}\mathring Z \;:\; \gamma^{\top}\mathring\sigma = 0 \,\Bigr\},
\]
where γ ranges over all complements of σ̊ such that σ̊ + γ ∈ {−1, 1}^s is definite. Similarly we obtain with
\[
\tilde b^{\top} \;\equiv\; |b^{\top}|\,\bigl(I - |M| - 2|L|\bigr)^{-1}|L| \;\ge\; 0 \ \in\ \mathbb{R}^{s}
\]
the limiting differentials of the convex and the concave part as
\[
\partial^{L}\check f(\mathring x) \;=\; \Bigl\{\, a^{\top} + \bigl(b^{\top} + \tilde b^{\top}\mathring\Sigma + \tilde b^{\top}\Gamma\bigr)(I - \mathring L\,\Gamma)^{-1}\mathring Z \;:\; \gamma^{\top}\mathring\sigma = 0 \,\Bigr\},
\]
\[
\partial^{L}\hat f(\mathring x) \;=\; \Bigl\{\, a^{\top} + \bigl(b^{\top} - \tilde b^{\top}\mathring\Sigma - \tilde b^{\top}\Gamma\bigr)(I - \mathring L\,\Gamma)^{-1}\mathring Z \;:\; \gamma^{\top}\mathring\sigma = 0 \,\Bigr\}.
\]
Hence we have an explicit representation for the limiting gradients of f as well as of its convex and concave parts f̌ and f̂ at x̊. It is easy to see that the minimality condition (A5) requires a to be in the range of Z̊^⊤ so that we have again a^⊤ = −λ^⊤ Z̊, yielding
\[
\partial^{L}\check f(\mathring x) \;=\; \Bigl\{\, \bigl(b^{\top} - \lambda^{\top} + \lambda^{\top}\mathring L\,\Gamma + \tilde b^{\top}\mathring\Sigma + \tilde b^{\top}\Gamma\bigr)(I - \mathring L\,\Gamma)^{-1}\mathring Z \;:\; \gamma^{\top}\mathring\sigma = 0 \,\Bigr\},
\]
\[
\partial^{L}\hat f(\mathring x) \;=\; \Bigl\{\, \bigl(b^{\top} - \lambda^{\top} + \lambda^{\top}\mathring L\,\Gamma - \tilde b^{\top}\mathring\Sigma - \tilde b^{\top}\Gamma\bigr)(I - \mathring L\,\Gamma)^{-1}\mathring Z \;:\; \gamma^{\top}\mathring\sigma = 0 \,\Bigr\}.
\]
We had hoped to be able to derive directly from these expressions that normal growth implies the condition (39), but we have so far not been able to do so. However, we can indirectly derive the following equivalence.
Corollary A3 (First order minimality condition).
Under LIKQ the limiting differential ∂^L f̂(x̊) is contained in the convex hull of ∂^L f̌(x̊) if and only if tangential stationarity and the normal growth condition hold according to Proposition A1.

References

  1. Joki, K.; Bagirov, A.; Karmitsa, N.; Mäkelä, M. A proximal bundle method for nonsmooth DC optimization utilizing nonconvex cutting planes. J. Glob. Optim. 2017, 68, 501–535.
  2. Tuy, H. DC optimization: Theory, methods and algorithms. In Handbook of Global Optimization; Springer: Boston, MA, USA, 1995; pp. 149–216.
  3. Rump, S. Fast and parallel interval arithmetic. BIT 1999, 39, 534–554.
  4. Bačák, M.; Borwein, J. On difference convexity of locally Lipschitz functions. Optimization 2011, 60, 961–978.
  5. Griewank, A.; Walther, A. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2008.
  6. Gürbüzbalaban, M.; Overton, M. On Nesterov’s nonsmooth Chebyshev-Rosenbrock functions. Nonlinear Anal. Theory Methods Appl. 2012, 75, 1282–1289.
  7. Griewank, A.; Walther, A. First and second order optimality conditions for piecewise smooth objective functions. Optim. Methods Softw. 2016, 31, 904–930.
  8. Strekalovsky, A. Local Search for Nonsmooth DC Optimization with DC Equality and Inequality Constraints. In Numerical Nonsmooth Optimization. State of the Art Algorithms; Springer Nature Switzerland AG: Cham, Switzerland, 2020; pp. 229–261.
  9. Hansen, E. (Ed.) The centred form. In Topics in Interval Analysis; Oxford University Press: Oxford, UK, 1969; pp. 102–105.
  10. Scholtes, S. Introduction to Piecewise Differentiable Functions; Springer: New York, NY, USA, 2012.
  11. Griewank, A. On Stable Piecewise Linearization and Generalized Algorithmic Differentiation. Optim. Methods Softw. 2013, 28, 1139–1178.
  12. Griewank, A.; Bernt, J.U.; Radons, M.; Streubel, T. Solving piecewise linear equations in abs-normal form. Linear Algebra Appl. 2015, 471, 500–530.
  13. Griewank, A.; Walther, A.; Fiege, S.; Bosse, T. On Lipschitz optimization based on gray-box piecewise linearization. Math. Program. Ser. A 2016, 158, 383–415.
  14. Golub, G.; Van Loan, C. Matrix Computations, 4th ed.; Johns Hopkins University Press: Baltimore, MD, USA, 2013.
  15. Rockafellar, R.; Wets, R.B. Variational Analysis; Springer: Berlin/Heidelberg, Germany, 1998.
  16. Fukuda, K.; Gärtner, B.; Szedlák, M. Combinatorial redundancy detection. Ann. Oper. Res. 2018, 265, 47–65.
  17. Le Thi, H.; Pham Dinh, T. DC programming and DCA: Thirty years of developments. Math. Program. Ser. B 2018, 169, 5–68.
  18. Griewank, A.; Walther, A. Beyond the Oracle: Opportunities of Piecewise Differentiation. In Numerical Nonsmooth Optimization. State of the Art Algorithms; Springer: Cham, Switzerland, 2020; pp. 331–361.
  19. Walther, A.; Griewank, A. Characterizing and testing subdifferential regularity in piecewise smooth optimization. SIAM J. Optim. 2019, 29, 1473–1501.
  20. Griewank, A. Who invented the reverse mode of differentiation? Doc. Math. 2012, 389–400.
  21. Linnainmaa, S. Taylor expansion of the accumulated rounding error. BIT 1976, 16, 146–160.
  22. Sun, W.; Sampaio, R.; Candido, M. Proximal point algorithm for minimization of DC function. J. Comput. Math. 2003, 21, 451–462.
  23. Griewank, A.; Walther, A. Finite convergence of an active signature method to local minima of piecewise linear functions. Optim. Methods Softw. 2019, 34, 1035–1055.
  24. Griewank, A.; Walther, A. The True Steepest Descent Methods Revisited; Technical Report; Humboldt-Universität zu Berlin: Berlin, Germany, 2020.
  25. Griewank, A.; Walther, A. Relaxing kink qualifications and proving convergence rates in piecewise smooth optimization. SIAM J. Optim. 2019, 29, 262–289.
  26. Niu, Y. Programmation DC & DCA en Optimisation Combinatoire et Optimisation Polynomiale via les Techniques de SDP. Ph.D. Thesis, INSA Rouen, Rouen, France, 2010.
  27. Fiege, S.; Walther, A.; Kulshreshtha, K.; Griewank, A. Algorithmic differentiation for piecewise smooth functions: A case study for robust optimization. Optim. Methods Softw. 2018, 33, 1073–1088.
  28. Walther, A.; Griewank, A. Getting Started with ADOL-C. In Combinatorial Scientific Computing; Chapman & Hall/CRC Computational Science Series; CRC Press: Boca Raton, FL, USA, 2012; pp. 181–202.
  29. Hascoët, L.; Pascual, V. The Tapenade Automatic Differentiation tool: Principles, Model, and Specification. ACM Trans. Math. Softw. 2013, 39, 20:1–20:43.
Figure 1. Half pipe example as defined in Equation (33).
Figure 2. Nesterov–Rosenbrock test function polyhedral inclusion for n = 2.
Figure 3. Signatures and reflection-based DCA for Nesterov–Rosenbrock variant (40) with n = 2.
