Scaling, Stopping, and Testing¶
Scaling¶
In real-world problems, the variables may have different scales.
The preferred solution is to choose units of the variable space so that each component of \(x\) will have roughly the same magnitude.
Rescaling the independent variables (e.g. \(\hat{x} = Dx\)) or the dependent variables (e.g. \(D_F F(x)\)) does not affect the Newton step, but it does affect the steepest-descent direction and hence the trust region step.
The intuition behind this is that the Newton step goes to the lowest point of a quadratic model, which is unaffected by a change in units, while steepest-descent makes a unit step without regard to the relative lengths of each variable direction.
Dynamic rescaling of the variables may be needed for some problems.
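As a minimal sketch of this point (not from the text), the snippet below compares the Newton and steepest-descent directions computed in the original variables against the same directions computed in rescaled variables \(\hat{x} = Dx\) and mapped back; the values of \(H\), \(g\), and \(D\) are arbitrary illustrative choices.

```python
import numpy

H = numpy.array([[2.0, 0.3], [0.3, 1.0]])   # model Hessian at the current point (illustrative)
g = numpy.array([1.0, -2.0])                # gradient at the current point (illustrative)
D = numpy.diag([1e3, 1e-3])                 # change of units: x_hat = D x

# directions in the original variables
newton = -numpy.linalg.solve(H, g)
steepest = -g

# the same directions computed in the scaled variables, then mapped back with D^{-1}
D_inv = numpy.linalg.inv(D)
H_hat = D_inv @ H @ D_inv                   # Hessian in the scaled variables
g_hat = D_inv @ g                           # gradient in the scaled variables
print(numpy.allclose(newton, D_inv @ -numpy.linalg.solve(H_hat, g_hat)))  # True: Newton step unchanged
print(numpy.allclose(steepest, D_inv @ -g_hat))                           # False: steepest descent changes
```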
Stopping Criteria¶
When picking a stopping criterion, use the relative rate of change with a user-specified maximum limit so that the test is scale-independent, as in (7.2.5) and (7.2.7).
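Below is a sketch of such scale-independent tests in the spirit of (7.2.5) and (7.2.7); the typical magnitudes `typx` and `typf`, the tolerances, and the function names are assumptions made for illustration rather than quotes of the book's formulas.

```python
import numpy

def relative_gradient(x, fx, grad, typx, typf):
    # largest relative change in f per relative change in each component of x
    return numpy.max(numpy.abs(grad) * numpy.maximum(numpy.abs(x), typx)
                     / max(abs(fx), typf))

def relative_step(x_new, x_old, typx):
    # largest relative change in any component of x over the last step
    return numpy.max(numpy.abs(x_new - x_old) / numpy.maximum(numpy.abs(x_new), typx))

def should_stop(x_new, x_old, fx, grad, typx, typf, gradtol=1e-6, steptol=1e-9):
    # stop when the relative gradient or the relative step falls below its tolerance
    return (relative_gradient(x_new, fx, grad, typx, typf) <= gradtol
            or relative_step(x_new, x_old, typx) <= steptol)
```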
Exercise 1¶
Let \(a = 10^6\) and rewrite the original minimization problem as
\[\min_{x \in \mathbb{R}^2} f(x) =
    \left( x_1 - a \right)^4 + \left( x_1 - a \right)^2 +
    \left( x_2 - a^{-1} \right)^4 + \left( x_2 - a^{-1} \right)^2.\]
By inspection, \(\left\Vert x_+ - x_c \right\Vert\) will be predominantly influenced by \(x_1\), and \(f(x)\) is minimized when \(x_1 = a\) and \(x_2 = a^{-1}\).
Let \(\typ x_1 = a\) and \(\typ x_2 = a^{-1}\). The following experiments illustrate that both the Newton step and the steepest-descent step (6.2.5) converge within 40 iterations without any scaling; furthermore, this was done without any globally convergent modifications (e.g. backtracking line search, locally constrained optimal hook step, double dogleg step).
As shown in Exercise 3 and Exercise 4, no scaling is needed for the Newton step, and in this case steepest descent does not need it either.
A simpler way to handle the minimization is to solve it for \(x_1\) and \(x_2\) individually, i.e. split the problem into two independent one-dimensional problems, as sketched below.
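For instance, because the objective is separable, each coordinate can be driven to its own minimizer with a one-dimensional Newton iteration; the helper `newton_1d` below is a hypothetical sketch of this splitting, not part of the experiments that follow.

```python
import numpy

def newton_1d(gp, gpp, t, iters=50):
    # drive the scalar derivative gp to zero with Newton's method
    for _ in range(iters):
        t -= gp(t) / gpp(t)
    return t

a = 10**6
x1 = newton_1d(lambda t: 4 * (t - a)**3 + 2 * (t - a),
               lambda t: 12 * (t - a)**2 + 2, 0.0)
x2 = newton_1d(lambda t: 4 * (t - 1 / a)**3 + 2 * (t - 1 / a),
               lambda t: 12 * (t - 1 / a)**2 + 2, 0.0)
print(x1, x2)   # x1 approaches a, x2 approaches 1 / a
```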
[1]:
import numpy

def f(x):
    x = x.flat
    return (x[0] - a)**4 + (x[0] - a)**2 + (x[1] - 1 / a)**4 + (x[1] - 1 / a)**2

def fp(x):
    # gradient of f
    x = x.flat
    return numpy.asmatrix([[4 * (x[0] - a)**3 + 2 * (x[0] - a)],
                           [4 * (x[1] - 1 / a)**3 + 2 * (x[1] - 1 / a)]])

def fpp(x):
    # Hessian of f
    x = x.flat
    return numpy.asmatrix([[12 * (x[0] - a)**2 + 2, 0],
                           [0, 12 * (x[1] - 1 / a)**2 + 2]])

a = 10**6
x_c = numpy.asmatrix([[0], [0]])
for i in range(40):
    H = fpp(x_c)
    g = fp(x_c)
    #s = -numpy.linalg.inv(H) * g                          # Newton step
    s = -(g.T * g).flat[0] / (g.T * H * g).flat[0] * g     # steepest descent with exact line search
    x_k = x_c + s
    x_c = x_k
    if i % 5 == 0:
        print('Iteration {}: {}'.format(i, numpy.asarray(x_c).squeeze()))
Iteration 0: [3.33333333e+05 1.66666667e-19]
Iteration 5: [9.12208505e+05 1.71661784e-17]
Iteration 10: [9.88438980e+05 9.97443686e-16]
Iteration 15: [9.98477561e+05 5.75251842e-14]
Iteration 20: [9.99799515e+05 3.31719872e-12]
Iteration 25: [9.99973604e+05 1.91283634e-10]
Iteration 30: [9.99996562e+05 1.10209765e-08]
Iteration 35: [9.99999801e+05 5.64265636e-07]
[2]:
def f(x):
    # Rosenbrock function
    x = x.flat
    return 100 * (x[0]**2 - x[1])**2 + (1 - x[0])**2

def fp(x):
    # gradient of the Rosenbrock function
    x = x.flat
    a = 2 * 100 * (x[0]**2 - x[1])
    return numpy.asmatrix([[a * 2 * x[0] + 2 * (1 - x[0]) * (-1)],
                           [a * (-1)]])

def fpp(x):
    # Hessian of the Rosenbrock function
    x = x.flat
    return numpy.asmatrix([[1200 * x[0]**2 - 400 * x[1] + 2, -400 * x[0]],
                           [-400 * x[0], 200]])

alpha = 1.0
x_c = numpy.asmatrix([[-1.2 / alpha], [alpha]])
T = numpy.asmatrix([[1 / alpha, 0], [0, alpha]])           # scaling matrix for experimenting with alpha
for i in range(128):
    H = fpp(x_c)
    g = fp(x_c)
    s = -numpy.linalg.inv(H) * g                           # Newton step
    #s = -(g) / numpy.linalg.norm(g)                       # normalized steepest descent
    #s = -(g.T * g).flat[0] / (g.T * H * g).flat[0] * g    # steepest descent with exact line search
    x_k = x_c + s
    x_c = x_k
    if i % 12 == 0:
        print('Iteration {}: {}'.format(i, numpy.asarray(x_c).squeeze()))
Iteration 0: [-1.0280419 0.97269944]
Iteration 12: [-1.01039923 1.03084902]
Iteration 24: [-1.03936528 1.09008519]
Iteration 36: [-1.06717182 1.14853578]
Iteration 48: [-1.09393971 1.2062699 ]
Iteration 60: [-1.11977053 1.26334675]
Iteration 72: [-1.14475062 1.31981753]
Iteration 84: [-1.16895404 1.37572691]
Iteration 96: [-1.19244488 1.43111411]
Iteration 108: [-1.21527903 1.48601386]
Iteration 120: [-1.23750557 1.54045707]
Exercise 2¶
Let \(f \colon \mathbb{R}^n \rightarrow \mathbb{R}\) and \(T \in \mathbb{R}^{n \times n}\) be nonsingular. For any \(x \in \mathbb{R}^n\), define

\[\hat{x} = Tx
\qquad \text{and} \qquad
\hat{f}(\hat{x}) \equiv f(x) = f\left( T^{-1} \hat{x} \right).\]
(a)¶
Applying definition 4.1.1 with the chain rule,

\[\frac{\partial \hat{f}(\hat{x})}{\partial \hat{x}_j} =
\sum_{i = 1}^n \frac{\partial f(x)}{\partial x_i}
    \frac{\partial x_i}{\partial \hat{x}_j} =
\sum_{i = 1}^n \left( T^{-1} \right)_{ij} \frac{\partial f(x)}{\partial x_i}\]

yields

\[\nabla \hat{f}(\hat{x}) = T^{-\top} \nabla f(x).\]
(b)¶
Applying definition 4.1.4 to the gradient from (a),

\[\frac{\partial^2 \hat{f}(\hat{x})}{\partial \hat{x}_j \partial \hat{x}_k} =
\sum_{i = 1}^n \sum_{l = 1}^n
    \left( T^{-1} \right)_{ij}
    \frac{\partial^2 f(x)}{\partial x_i \partial x_l}
    \left( T^{-1} \right)_{lk}\]

yields

\[\nabla^2 \hat{f}(\hat{x}) = T^{-\top} \nabla^2 f(x) T^{-1}.\]
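The identity in (a) can also be checked numerically with finite differences; the test function, the point, and the matrix \(T\) below are arbitrary illustrative choices, not taken from the exercise.

```python
import numpy

def f(x):
    # an arbitrary smooth test function (illustrative)
    return (x[0] - 2.0)**4 + (x[0] - 2.0) * x[1] + (x[1] + 1.0)**2

def grad(x):
    # analytic gradient of the test function
    return numpy.array([4 * (x[0] - 2.0)**3 + x[1],
                        (x[0] - 2.0) + 2 * (x[1] + 1.0)])

T = numpy.array([[3.0, 1.0], [0.0, 0.5]])       # any nonsingular T
x = numpy.array([0.7, -0.3])
x_hat = T @ x

def f_hat(x_hat):
    return f(numpy.linalg.solve(T, x_hat))      # f_hat(x_hat) = f(T^{-1} x_hat)

# central finite-difference gradient of f_hat at x_hat
eps = 1e-6
fd = numpy.array([(f_hat(x_hat + eps * e) - f_hat(x_hat - eps * e)) / (2 * eps)
                  for e in numpy.eye(2)])
print(numpy.allclose(fd, numpy.linalg.solve(T.T, grad(x)), atol=1e-5))   # True: T^{-T} grad f(x)
```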
Exercise 3¶
Let \(D \in \mathbb{R}^{n \times n}\) denote a positive diagonal matrix, and \(m_c(x_c + s) = f(x_c) + g^\top s + \frac{1}{2} s^\top H s\) where \(g = \nabla f(x_c) \in \mathbb{R}^n\) and \(H \equiv \nabla^2 f(x_c) \in S^n_{++}\).
Recall that Lemma 6.4.1 is solving

\[\min_{s \in \mathbb{R}^n} m_c(x_c + s) =
    f(x_c) + g^\top s + \frac{1}{2} s^\top H s
\quad \text{subject to} \quad
\left\Vert s \right\Vert_2 \leq \delta_c\]

where \(s(\mu) = -\left( H + \mu I \right)^{-1} g\) is the solution for some \(\mu \geq 0\). In order to utilize Lemma 6.4.1 to minimize the same function but constrained to \(\left\Vert Ds \right\Vert_2 \leq \delta_c\), a change of variables is needed.
Define \(\hat{x}_c = D x_c\); applying the results from Exercise 2 with \(T = D\) yields

\[\hat{m}_c(\hat{x}_c + \hat{s}) =
    f(x_c) + \left( D^{-1} g \right)^\top \hat{s} +
    \frac{1}{2} \hat{s}^\top D^{-1} H D^{-1} \hat{s}.\]

By inspection, \(\hat{s} = Ds\) because the model is being optimized in the scaled space, so the constraint becomes \(\left\Vert \hat{s} \right\Vert_2 \leq \delta_c\) and Lemma 6.4.1 gives

\[\hat{s}(\mu) = -\left( D^{-1} H D^{-1} + \mu I \right)^{-1} D^{-1} g.\]

Transforming the solution back gives

\[s(\mu) = D^{-1} \hat{s}(\mu) = -\left( H + \mu D^2 \right)^{-1} g.\]
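As a quick numerical sanity check (with arbitrary illustrative values of \(H\), \(g\), \(D\), and \(\mu\), not from the exercise), applying Lemma 6.4.1 in the scaled space and mapping the step back with \(D^{-1}\) agrees with \(-\left( H + \mu D^2 \right)^{-1} g\):

```python
import numpy

rng = numpy.random.default_rng(0)
A = rng.standard_normal((3, 3))
H = A @ A.T + 3 * numpy.eye(3)              # a symmetric positive definite Hessian (illustrative)
g = rng.standard_normal(3)
D = numpy.diag([100.0, 1.0, 0.01])          # positive diagonal scaling (illustrative)
D_inv = numpy.linalg.inv(D)
mu = 0.7

# Lemma 6.4.1 applied in the scaled space, then mapped back with D^{-1}
s_hat = -numpy.linalg.solve(D_inv @ H @ D_inv + mu * numpy.eye(3), D_inv @ g)
s_from_scaled = D_inv @ s_hat

# the transformed-back formula
s_direct = -numpy.linalg.solve(H + mu * D @ D, g)
print(numpy.allclose(s_from_scaled, s_direct))   # True
```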
Exercise 4¶
Let \(F \colon \mathbb{R}^n \rightarrow \mathbb{R}^n\) and \(T_1, T_2 \in \mathbb{R}^{n \times n}\) be nonsingular. For any \(x \in \mathbb{R}^n\), define

\[\hat{x} = T_2 x
\qquad \text{and} \qquad
\hat{F}(\hat{x}) \equiv T_1 F\left( T_2^{-1} \hat{x} \right) = T_1 F(x).\]
Applying definition 4.1.7 with the chain rule,

\[\frac{\partial \hat{F}_i(\hat{x})}{\partial \hat{x}_j} =
\sum_{k = 1}^n \sum_{l = 1}^n
    \left( T_1 \right)_{ik}
    \frac{\partial F_k(x)}{\partial x_l}
    \left( T_2^{-1} \right)_{lj}\]

yields

\[J_{\hat{F}}(\hat{x}) = T_1 J_F(x) T_2^{-1}\]

where \(J_F\) denotes the Jacobian of \(F\).
Using (6.5.2) to transform the Newton step for \(\hat{F}\) back to the original variable space gives

\[s = T_2^{-1} \hat{s}
  = -T_2^{-1} J_{\hat{F}}(\hat{x})^{-1} \hat{F}(\hat{x})
  = -T_2^{-1} \left( T_1 J_F(x) T_2^{-1} \right)^{-1} T_1 F(x)
  = -J_F(x)^{-1} F(x),\]

which shows that scaling does not affect the Newton step. However, if a norm is used (e.g. in a stopping test or a global strategy), scaling should still be applied to avoid ignoring small components.
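A quick numerical check, using an arbitrary illustrative system \(F\) and scaling matrices rather than anything from the exercise, confirms the invariance:

```python
import numpy

def F(x):
    # an arbitrary small nonlinear system (illustrative)
    return numpy.array([x[0]**2 + x[1] - 3.0, numpy.exp(x[0]) - x[1]])

def J(x):
    # its Jacobian
    return numpy.array([[2 * x[0], 1.0], [numpy.exp(x[0]), -1.0]])

T1 = numpy.diag([100.0, 0.1])           # scaling of the dependent variables (illustrative)
T2 = numpy.diag([0.1, 10.0])            # scaling of the independent variables (illustrative)
x = numpy.array([0.5, 2.0])

s = -numpy.linalg.solve(J(x), F(x))     # Newton step in the original variables

# Newton step in the scaled variables, mapped back with T2^{-1}
J_hat = T1 @ J(x) @ numpy.linalg.inv(T2)
s_hat = -numpy.linalg.solve(J_hat, T1 @ F(x))
print(numpy.allclose(s, numpy.linalg.solve(T2, s_hat)))   # True
```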
Exercise 5¶
The only case I can come up with is when the optimization problem involves some kind of direct dependency between the variables, e.g. a term like \((x - y)\). Any dynamic scaling strategy would be specific to a problem domain and hence not robust.