Minimization on bounded convex polyhedra

16.3.2 Minimization on bounded convex polyhedra

The frank_wolfe command finds the minimum of a (twice) differentiable convex function f:ℝⁿ→ℝ on a bounded convex polyhedron defined by the set A⊂ℝⁿ of its vertices by using the method of Frank and Wolfe with backtracking. The problem can be written as:

min

x∈convA

f(x), (2)

where convA denotes the convex hull of A which coincides with the given polyhedron.

Note that (2) can be solved efficiently for n=1 by using the find_minimum command (see Section 16.3.1). Therefore, frank_wolfe requires that n≥ 2.

frank_wolfe takes two mandatory arguments and a sequence of optional arguments:
- obj, which is either an expression f(x) (with at least two variables) or a vector containing two or three functions with arguments x₁,x₂,…,x_n, the first one returning f(x) and the second one returning ∇ f(x). If the rand option is used (see below), then the third function should compute the Hessian matrix ∇²f(x), otherwise it can be omitted.
- A, a finite set of m≥ 2 points in ℝⁿ given as a m× n matrix in which the ith row corresponds to the ith point.
- Optionally, opts, a sequence of optional arguments each of which is one of:
  - ε, a positive floating-point number specifying the tolerance (by default, ε=0.001).
  - N, a positive integer specifying the maximal number of iterations (by default, N=10⁵).
  - initp, a point in ℝⁿ given as a vector of n real numbers, specifying the initial point x₀ (by default, x₀ is the mean of points in A, which is always feasible).
  - exact or exact=niter, a symbol specifying that the exact line search using Brent’s method should be performed. By default, the step_size subroutine by Pedregosa et al. (2020) is used, which is faster. The parameter niter specifies the maximal number of iterations for the exact line-search and its default value is 100. If niter is not specified, then the globally optimal step size in [0,1] is searched for, otherwise a locally optimal one is returned.
  - rand or rand=npts, a symbol specifying that the minimization should be initiated from several random points instead of a single point to avoid being stuck in a local minimum. Initial points are found by generating npts (by default 1000) random points in convA at which the objective function is convex. The generated points are clustered and the Frank and Wolfe algorithm is run from the center of each cluster.
frank_wolfe(obj,A, ⟨opts ⟩) returns a feasible point x such that |f(x^∗)−f^∗|≤ ε, where f^∗ is the actual optimal objective function value in (2) (note that local convexity of f is required to guarantee this). If in some iteration, say kth, the step-size is zero to working precision, or the maximal number of iterations is exceeded, the return value will be a vector of two elements: a string containing the warning message and the last point x_k.
If obj is given as an expression, then the gradient and Hessian information are computed automatically.
If the algorithm fails to minimize to the given precision, it is a good idea to call frank_wolfe again with the last point as the initial point and watch for a possible improvement of the objective function value.
frank_wolfe behaves well with optimal solution(s) lying at the boundary of the feasible set, since the Frank and Wolfe method is automatically feasible.
Exact line search is several times slower than the default method but usually makes the total number of the main-loop iterations smaller.
The rand option helps handling problems with non-convex objective functions. The algorithm finds an appropriate number of initial points in the set {x∈convA : ∇²f(x) is positive definite} and (re)starts from each of them, thereby finding the global minimum in most cases. If this approach fails, a warning message is printed an the provided initial point is used instead, or one is generated as described above.

Examples

The Lasso Problem.

Solve the following ℓ¹-constrained problem in n dimensions:

min

||Ax−b||², ||x||≤ 1,

where A is a m× n matrix with real coefficients, x∈ℝⁿ and b∈ℝ^m. It can be shown that vertices of the polyhedron defined by ||x||₁≤ 1 are defined by the orthonormal basis vectors and their counterparts (2n vertices in total). Hence the set of vertices V can be represented by a matrix with 2n rows and n columns which is obtained by gluing the identity matrix I_n and its negative −I_n together:

⎡
⎢
⎣

I_n

−I_n

⎤
⎥
⎦

Since the objective function is convex, the problem can be solved by the Frank-Wolfe method. Let n=10 and A be a randomly generated square matrix with coefficients in [−1,1].

n:=10:; A:=ranm(n,n,uniformd(-1,1)):;

Now let y∈ℝⁿ be a random point on the sphere with radius r=1.1. Then define b=Ay. Thus the global minimum of f is located near the polyhedron, but outside it.

y:=ranv(n,randnorm):; y:=1.1*y/l2norm(y):; b:=y*tran(A):;

Define V, x, f and call frank_wolfe. You get a vector which depends on the choices of A and y.

V:=tran(augment(idn(n),-idn(n))):; x:=symbol_array("x",n,purge):; f:=simplify(1/2*l2norm(x*tran(A)-b)^2):; fw:=frank_wolfe(f,V)

⎡
⎣

0.0,0.0,0.08905128279,0.0853981396774,0.0,0.0,0.469342995518,0.0,−0.355655381935,0.0

⎤
⎦

The optimal value of the objective function is:

subs(f,x,fw)

1.0435553106

COBYLA, for instance, returns a weaker solution (see Section 16.3.3), which is expected since it does not benefit from gradient information:

cb:=fMin(f,[l1norm(x)<=1],x,[0.01$n],1e-4):; subs(f,x,cb)

1.16374646004

The strength of the Frank-Wolfe method, besides automatic feasibility, is its ability to detect whether the objective function value is sufficiently close to the (unknown) otimal value.

Non-convex objectives.

Suppose that you want to minimize the function f(x,y)=xcos(y)−ysin(x²) inside the quadrilateral Q with vertices (1,1), (1,3), (3,2) and (3,1). The function f has two local minima in Q, one of which is global. To run frank_wolfe from the points (3/2,3/2) and (5/2,3/2), enter:

purge(x,y):; f,Q:=x*cos(y)-y*sin(x^2),[[1,1],[1,3],[3,2],[3,1]]:; ip1,ip2:=[1.5,1.5],[2.5,1.5]:; min1,min2:=frank_wolfe(f,Q,ip1),frank_wolfe(f,Q,ip2)

⎡
⎣

1.26907401345,2.86473876147

⎤
⎦

⎡
⎣

2.78442772106,2.10749634066

⎤
⎦

To plot the function f, quadrilateral Q, initial points and obtained minima, enter:

plotdensity(f,[x=0.8..3.2,y=0.0..3.5]); polygon(op(B),color=line_width_3+quadrant4,legend="boundary"); point(min1,color=yellow+point_width_2,legend="min1"); point(min2,color=yellow+point_width_2,legend="min2"); point(ip1,legend="ip1"); point(ip2,legend="ip2")

Points min1 and min2 are the global and a local minimum, respectively:

subs(f,[x,y],min1),subs(f,[x,y],min2)

−4.08322337291,−3.52045371621

The algorithm gets stuck in the local minimum min2 when starting from ip₂ because it is not allowed to go uphill. To find the global minimum one must apply multi-start, which you can select by using the rand option. Indeed:

frank_wolfe(f,Q,rand)

⎡
⎣

1.26907407408,2.86474054616

⎤
⎦

As another example, frank_wolfe is used to find the global minimum of the six-hump camel function:

f(x,y)=

⎛
⎜
⎜
⎝

4−2.1x²+

x⁴

⎞
⎟
⎟
⎠

x²+xy+(−4+4y²)y²

in the rectangle [−3,3]×[−2,2] (see here). The function has six local minima, two of which are global: (0.0898,−0.7126) and (−0.0898,0.7126). Indeed:

f,R:=(4-2.1*x^2+x^4/3)*x^2+x*y+(-4+4*y^2)*y^2,[[-3,-2],[-3,2],[3,2],[3,-2]]:; frank_wolfe(f,R,rand=10000)

[0.0898388758644,−0.712673226867]

By evaluating the above commandline several times in a row, you always get one of the two global minima.

Boundary discretization.

When minimizing on a convex domain, you can simplify the problem by discretizing the domain boundary. For example, minimize

f(x,y)=4e^x+5cosy−x²y+4x+5y−25e^{−(x−1)²−(y−2)²}

in the convex set D⊂ℝ² bounded by the closed parametric curve C defined by

x(t)=

⎛
⎜
⎜
⎝

2+cos

⎛
⎜
⎜
⎝

t−

⎞
⎟
⎟
⎠

sin(sint), y(t)=2(2+cost)cos

⎛
⎜
⎜
⎝

⎞
⎟
⎟
⎠

, 0≤ t≤ 2π.

Let A={(x(t_k),y(t_k)):t_k=2kπ/n,k=1,2,…,n} for some n≥ 2. Now the convex hull of A can be used as an approximation of D. To define f and the C (with u and v instead of x and y), enter:

f:=4*exp(x)+5*cos(y)-x^2*y+4x+5y-25*exp(-(x-1)^2-(y-2)^2):; u(t):=(2+cos(t-pi/6))*sin(sin(t)):; v(t):=2*(2+cos(t))*cos(t+pi/6):;

Now generate the hull vertices with, say, n=1000. Since linspace includes both 0 and 2π in s, the former is discarded by using tail.

s:=evalf(tail(linspace(0,2*pi,1000))):; A:=tran(apply(u,s),apply(v,s)):;

Use the rand option when calling frank_wolfe because the function f is not convex has several local minima in D.

p:=frank_wolfe(f,A,rand)

⎡
⎣

0.0565029231227,−1.78083520026

⎤
⎦

To plot the function f, curve C and point p, enter:

densityplot(f,[x=-2..3,y=-3..6]); paramplot([u(t),v(t)],t=0..2pi,display=line_width_3); point(p,display=point_width_2+red)

Input with functions.

In some problems, computing the objective function value and/or its derivative may be too complex for automatic handling. In such cases you should provide giac functions which compute these values. For example, suppose you wish to minimize the function

f(x,y)=−Ai(1+x)cos

⎛
⎜
⎜
⎝

y−

⎞
⎟
⎟
⎠

, x≥ −1,

where Ai is the Airy function of the first kind, on the domain enclosed by the pentagon with vertices corresponding to complex solutions of the equation z⁵=1. To define the boundary, enter:

boundary:=polygon(op(cZeros(z^5=1,z)),display=quadrant4)

To define the objective function, enter:

f(x,y):=-Airy_Ai(1+x)*cos(y-pi/8):;

Computing the gradient of f automatically is difficult because the grad command does not know how to compute the derivative of Ai function, so you will have to provide the gradient function yourself. By the relation between Airy functions and modified Bessel functions, you get:

Ai′(z)=−

√

K_2/3(ζ),

where ζ=2/3z^3/2 and

K_ν(z)=

√

2^νΓ(ν+1/2)

e^−z

√

∫

∞

⎛
⎜
⎜
⎝

⎞
⎟
⎟
⎠

ν−1/2

t^ν−1/2e^−tdt.

The above integral is easily approximated with Gaussian quadrature for z>0 real. Therefore, the derivative of Ai can be computed for positive arguments by using the following giac function:

Ai_diff:=proc(x) local zeta,K,K0,t; zeta:=2/3*x^(3/2); purge(t); K0:=exp(-zeta)/(4^(1/3)*Gamma(7/6)*sqrt(zeta)); K:=K0*int((2t+t^2/zeta)^(1/6)*exp(-t),t=0..inf); return evalf(-x/sqrt(3*pi)*K); end

Now, with some elementary differentiation by hand or by the help of grad, we get the gradient of f:

∇ f(x,y)=

⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎣

−Ai′(1+x)cos

⎛
⎜
⎜
⎝

y−

⎞
⎟
⎟
⎠

Ai(1+x)sin

⎛
⎜
⎜
⎝

y−

⎞
⎟
⎟
⎠

⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎦

which is computed by the function

g:=proc(x,y) return evalf([-Ai_diff(x+1)*cos(y-pi/8),Airy_Ai(x+1)*sin(y-pi/8)]); end

To get the coordinates of pentagon vertices, enter (note the conversion to floating-point values):

A:=evalf(coordinates(vertices(boundary)))

⎡
⎢
⎢
⎢
⎢
⎢
⎣

0.309016994375	−0.951056516295
−0.809016994375	−0.587785252292
−0.809016994375	0.587785252292
0.309016994375	0.951056516295
1.0	0.0

⎤
⎥
⎥
⎥
⎥
⎥
⎦

Now find the minimum of f inside the pentagon:

frank_wolfe([f,g],evalf(coordinates(vertices(p))))

⎡
⎣

−0.809016994375,0.393184912674

⎤
⎦

Indeed:

plotdensity(f(x,y),[x=-1..1,y=-1..1]); display(boundary,line_width_2); point(p,display=point_width_2+red)

Note that using the rand option in this case would also require a function which computes the Hessian of f at a given point.