Scaling, Stopping, and Testing

Scaling

  • In real-world problems, the variables may have different scales.

    • The preferred solution is to choose units of the variable space so that each component of \(x\) will have roughly the same magnitude.

    • Rescaling the independent and dependent variables (e.g. \(Dx\) and \(D F(x)\)) does not affect the Newton step, but it does affect the steepest-descent direction (and hence trust region steps); see the sketch after this list.

      • The intuition behind this is that the Newton step goes to the lowest point of a quadratic model, which is unaffected by a change in units, while steepest-descent makes a unit step without regard to the relative lengths of each variable direction.

    • Dynamic rescaling of the variables may be needed for some problems.
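
To make this concrete, here is a minimal sketch (the quadratic model, \(g\), \(H\), and the diagonal scaling \(D\) below are made up for illustration): mapping steps computed in the scaled variables back to the original ones leaves the Newton step unchanged but changes the steepest-descent direction.

import numpy

# Quadratic model m(s) = g^T s + 0.5 * s^T H s with illustrative g, H and a diagonal rescaling D.
g = numpy.array([100.0, 0.01])
H = numpy.array([[2.0, 0.3],
                 [0.3, 4.0]])
D = numpy.diag([0.01, 100.0])
D_inv = numpy.linalg.inv(D)

# Newton step computed in the original variables and in the scaled variables x_hat = D x (mapped back).
newton = -numpy.linalg.solve(H, g)
newton_scaled = D_inv @ -numpy.linalg.solve(D_inv @ H @ D_inv, D_inv @ g)
print(numpy.allclose(newton, newton_scaled))    # True: the Newton step is unaffected

# Normalized steepest-descent directions do not agree after mapping back.
sd = -g / numpy.linalg.norm(g)
sd_scaled = D_inv @ -(D_inv @ g)
sd_scaled /= numpy.linalg.norm(sd_scaled)
print(numpy.allclose(sd, sd_scaled))            # False: the direction depends on the units

Setting \(D = I\) makes both comparisons print True, which is the sense in which only the steepest-descent direction is scale dependent.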

Stopping Criteria

  • When picking a stopping criterion, use a scale-independent measure such as the relative rate of change with a user-specified maximum limit, as in (7.2.5) and (7.2.7) and as sketched below.
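
A minimal sketch of such scale-free tests, roughly along the lines of (7.2.5) and (7.2.7); the function names, default tolerances, and the typ_x/typ_f arguments are illustrative assumptions, not from the text:

import numpy

def relgrad_stop(x, f_x, grad, typ_x, typ_f, gradtol=1e-5):
    # relative gradient test in the spirit of (7.2.5)
    x, grad, typ_x = map(numpy.asarray, (x, grad, typ_x))
    rel = numpy.abs(grad) * numpy.maximum(numpy.abs(x), typ_x) / max(abs(f_x), typ_f)
    return rel.max() <= gradtol

def relstep_stop(x_new, x_old, typ_x, steptol=1e-8):
    # relative step test in the spirit of (7.2.7)
    x_new, x_old, typ_x = map(numpy.asarray, (x_new, x_old, typ_x))
    rel = numpy.abs(x_new - x_old) / numpy.maximum(numpy.abs(x_new), typ_x)
    return rel.max() <= steptol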

Exercise 1

Let \(a = 10^6\) and rewrite the original minimization problem as

\[\min_{x \in \mathbb{R}^2} f(x) = \min_{x \in \mathbb{R}^2} \left( x_1 - a \right)^4 + \left( x_1 - a \right)^2 + \left( x_2 - a^{-1} \right)^4 + \left( x_2 - a^{-1} \right)^2.\]

The algorithms discussed so far need to compute the gradient

\[\begin{split}\DeclareMathOperator{\typ}{typ} \nabla f(x) = \begin{bmatrix} 4 \left( x_1 - a \right)^3 + 2 \left( x_1 - a \right)\\ 4 \left( x_2 - a^{-1} \right)^3 + 2 \left( x_2 - a^{-1} \right)\\ \end{bmatrix}.\end{split}\]

By inspection, \(\left\Vert x_+ - x_c \right\Vert\) will be predominantly influenced by \(x_1\), and \(f(x)\) is minimized when \(x_1 = a\) and \(x_2 = a^{-1}\).

Let \(\typ x_1 = a\) and \(\typ x_2 = a^{-1}\). The following experiments illustrate that both the Newton step and the steepest-descent step (6.2.5) converge within 40 iterations without any scaling; furthermore, they do so without any globally convergent modifications (e.g. backtracking line search, locally constrained optimal hook step, double dogleg step).

As shown in Exercise 3 and Exercise 4, no scaling is needed for the Newton step, and in this case steepest descent does not need it.

Since the objective is separable, a simpler way to handle the minimization is to solve for \(x_1\) and \(x_2\) individually, i.e. split up the problem, as sketched below.
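
For instance, a rough sketch (the helper newton_1d and its tolerances are made up for illustration) that runs a scalar Newton iteration on each coordinate with a relative step test:

a = 10**6

def newton_1d(x, typ, tol=1e-9, max_iter=50):
    # scalar Newton iteration for h(x) = (x - typ)**4 + (x - typ)**2
    for _ in range(max_iter):
        h1 = 4 * (x - typ)**3 + 2 * (x - typ)    # first derivative
        h2 = 12 * (x - typ)**2 + 2               # second derivative, always >= 2
        step = -h1 / h2
        x += step
        if abs(step) <= tol * max(abs(x), typ):  # relative step test
            break
    return x

x1 = newton_1d(0.0, a)        # converges to a
x2 = newton_1d(0.0, 1 / a)    # converges to 1 / a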

[1]:
import numpy

def f(x):
    x = x.flat
    return (x[0] - a)**4 + (x[0] - a)**2 + (x[1] - 1 / a)**4 + (x[1] - 1 / a)**2

def fp(x):
    # gradient of f
    x = x.flat
    return numpy.asmatrix([[4 * (x[0] - a)**3 + 2 * (x[0] - a)],
                           [4 * (x[1] - 1 / a)**3 + 2 * (x[1] - 1 / a)]])

def fpp(x):
    # Hessian of f (diagonal because the problem is separable)
    x = x.flat
    return numpy.asmatrix([[12 * (x[0] - a)**2 + 2, 0],
                           [0, 12 * (x[1] - 1 / a)**2 + 2]])

a = 10**6
x_c = numpy.asmatrix([[0],[0]])
for i in range(40):
    H = fpp(x_c)
    g = fp(x_c)
    #s = -numpy.linalg.inv(H) * g                        # Newton step
    s = -(g.T * g).flat[0] / (g.T * H * g).flat[0] * g   # steepest descent with exact line search on the quadratic model
    x_k = x_c + s
    x_c = x_k
    if i % 5 == 0:
        print('Iteration {}: {}'.format(i, numpy.asarray(x_c).squeeze()))
Iteration 0: [3.33333333e+05 1.66666667e-19]
Iteration 5: [9.12208505e+05 1.71661784e-17]
Iteration 10: [9.88438980e+05 9.97443686e-16]
Iteration 15: [9.98477561e+05 5.75251842e-14]
Iteration 20: [9.99799515e+05 3.31719872e-12]
Iteration 25: [9.99973604e+05 1.91283634e-10]
Iteration 30: [9.99996562e+05 1.10209765e-08]
Iteration 35: [9.99999801e+05 5.64265636e-07]
[2]:
def f(x):
    x = x.flat
    #Rosenbrock
    return 100 * (x[0]**2 - x[1])**2 + (1 - x[0])**2

def fp(x):
    x = x.flat
    a = 2 * 100 * (x[0]**2 - x[1])
    return numpy.asmatrix([[a * 2 * x[0] + 2 * (1 - x[0]) * (-1)],
                           [a * (-1)]])

def fpp(x):
    # Hessian of the Rosenbrock function
    x = x.flat
    return numpy.asmatrix([[1200 * x[0]**2 - 400 * x[1] + 2, -400 * x[0]],
                           [-400 * x[0], 200]])

alpha = 1.0
x_c = numpy.asmatrix([[-1.2 / alpha],[alpha]])   # starting point (-1.2, 1) expressed in the scaled variables x_hat = T x
T = numpy.asmatrix([[1/alpha, 0],[0, alpha]])    # diagonal scaling matrix (identity when alpha = 1)
for i in range(128):
    H = fpp(x_c)
    g = fp(x_c)
    s = -numpy.linalg.inv(H) * g                          # Newton step
    #s = -(g) / numpy.linalg.norm(g)                      # unit steepest-descent step
    #s = -(g.T * g).flat[0] / (g.T * H * g).flat[0] * g   # steepest descent with exact line search
    x_k = x_c + s
    x_c = x_k
    if i % 12 == 0:
        print('Iteration {}: {}'.format(i, numpy.asarray(x_c).squeeze()))
Iteration 0: [-1.0280419   0.97269944]
Iteration 12: [-1.01039923  1.03084902]
Iteration 24: [-1.03936528  1.09008519]
Iteration 36: [-1.06717182  1.14853578]
Iteration 48: [-1.09393971  1.2062699 ]
Iteration 60: [-1.11977053  1.26334675]
Iteration 72: [-1.14475062  1.31981753]
Iteration 84: [-1.16895404  1.37572691]
Iteration 96: [-1.19244488  1.43111411]
Iteration 108: [-1.21527903  1.48601386]
Iteration 120: [-1.23750557  1.54045707]

Exercise 2

Let \(f \colon \mathbb{R}^n \rightarrow \mathbb{R}\) and \(T \in \mathbb{R}^{n \times n}\) be nonsingular. For any \(x \in \mathbb{R}^n\), define

\[\hat{x} = Tx \quad \text{and} \quad \hat{f}(\hat{x}) = f(T^{-1} \hat{x}) = f(x).\]

(a)

Applying definition 4.1.1,

\[\begin{split}\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x)\\ \vdots\\ \frac{\partial f}{\partial x_n}(x)\\ \end{bmatrix} = \frac{\partial f(x)}{\partial x},\end{split}\]

yields

\[\begin{split}\nabla \hat{f}(\hat{x}) &= \frac{\partial \hat{f}(\hat{x})}{\partial \hat{x}}\\ &= \frac{\partial f(x)}{\partial \hat{x}} & \quad & \hat{f}(\hat{x}) = f(x)\\ &= \frac{\partial x}{\partial \hat{x}} \frac{\partial f(x)}{\partial x} & \quad & \text{denominator layout chain rule}\\ &= \frac{\partial T^{-1} \hat{x}}{\partial \hat{x}} \nabla f(x) & \quad & \hat{x} = Tx\\ &= T^{-\top} \nabla f(x) & \quad & \text{denominator layout vector-by-vector differentiation.}\end{split}\]

(b)

Applying definition 4.1.4,

\[\begin{split}\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2}\\ \end{bmatrix} = \nabla \left[ \nabla f(x) \right]^\top = \frac{\partial^2 f(x)}{\partial x \partial x^\top},\end{split}\]

yields

\[\begin{split}\nabla^2 \hat{f}(\hat{x}) &= \frac{\partial}{\partial \hat{x}} \left( T^{-\top} \frac{\partial f(x)}{\partial x} \right)^\top & \quad & \text{(a)}\\ &= \left( \frac{\partial}{\partial \hat{x}} \frac{\partial f(x)}{\partial x^\top} \right) T^{-1}\\ &= \left( \frac{\partial x}{\partial \hat{x}} \frac{\partial^2 f(x)}{\partial x \partial x^\top} \right) T^{-1} & \quad & \text{denominator layout chain rule}\\ &= \frac{\partial T^{-1} \hat{x}}{\partial \hat{x}} \nabla^2 f(x) T^{-1} & \quad & \hat{x} = Tx\\ &= T^{-\top} \nabla^2 f(x) T^{-1} & \quad & \text{denominator layout vector-by-vector differentiation.}\end{split}\]
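
Both identities can be checked numerically with finite differences; the test function, the random well-conditioned \(T\), and the tolerances below are arbitrary choices for illustration:

import numpy

numpy.random.seed(0)
n = 3
T = numpy.random.rand(n, n) + n * numpy.eye(n)    # random but well-conditioned nonsingular T
T_inv = numpy.linalg.inv(T)

def f_test(x):
    # arbitrary smooth test function
    return numpy.sum(x**4) + x[0] * x[1]

def grad_test(x):
    g = 4 * x**3
    g[0] += x[1]
    g[1] += x[0]
    return g

def hess_test(x):
    H = numpy.diag(12 * x**2)
    H[0, 1] += 1.0
    H[1, 0] += 1.0
    return H

x = numpy.random.rand(n)
x_hat = T @ x

# Transformed gradient and Hessian according to (a) and (b).
grad_hat = T_inv.T @ grad_test(x)
hess_hat = T_inv.T @ hess_test(x) @ T_inv

# Central finite differences of f_hat and of the transformed gradient agree with the identities.
def f_hat(z):
    return f_test(T_inv @ z)

def grad_hat_fn(z):
    return T_inv.T @ grad_test(T_inv @ z)

eps = 1e-6
fd_grad = numpy.array([(f_hat(x_hat + eps * e) - f_hat(x_hat - eps * e)) / (2 * eps)
                       for e in numpy.eye(n)])
fd_hess = numpy.column_stack([(grad_hat_fn(x_hat + eps * e) - grad_hat_fn(x_hat - eps * e)) / (2 * eps)
                              for e in numpy.eye(n)])
print(numpy.allclose(fd_grad, grad_hat, atol=1e-4))    # True
print(numpy.allclose(fd_hess, hess_hat, atol=1e-4))    # True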

Exercise 3

Let \(D \in \mathbb{R}^{n \times n}\) denote a positive diagonal matrix, and \(m_c(x_c + s) = f(x_c) + g^\top s + \frac{1}{2} s^\top H s\) where \(g = \nabla f(x_c) \in \mathbb{R}^n\) and \(H \equiv \nabla^2 f(x_c) \in S^n_{++}\).

Recall that Lemma 6.4.1 is solving

\[\begin{split}\underset{s \in \mathbb{R}^n}{\text{minimize}} \quad m_c(x_c + s) &\\ \text{subject to} \quad \left\Vert s \right\Vert_2 &\leq \delta_c\end{split}\]

where \(s(\mu) = -\left( H + \mu I \right)^{-1} g\) is the solution. In order to utilize Lemma 6.4.1 to minimize the same function but constrained to \(\left\Vert Ds \right\Vert_2 \leq \delta_c\), a change of variable is needed.

Define \(\hat{x}_c = Dx_c\); applying the results from Exercise 2 with \(T = D\) yields

\[\begin{split}\underset{\hat{s} \in \mathbb{R}^n}{\text{minimize}} \quad \hat{m}_c(\hat{x}_c + \hat{s}) &= f(x_c) + g^\top D^{-1} \hat{s} + \frac{1}{2} \hat{s}^\top D^{-1} H D^{-1} \hat{s}\\ \text{subject to} \quad \left\Vert \hat{s} \right\Vert_2 &\leq \delta_c.\end{split}\]

By inspection, \(\hat{s} = Ds\), since this substitution turns the constraint \(\left\Vert Ds \right\Vert_2 \leq \delta_c\) into \(\left\Vert \hat{s} \right\Vert_2 \leq \delta_c\). Transforming the solution back gives

\[\begin{split}\hat{s}(\mu) &= -\left( D^{-1} H D^{-1} + \mu I \right)^{-1} D^{-1} g\\ H D^{-1} \hat{s}(\mu) + \mu D \hat{s}(\mu) &= -g\\ \left( H + \mu D^2 \right) s(\mu) &= -g\\ s(\mu) &= -\left( H + \mu D^2 \right)^{-1} g.\end{split}\]
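
A quick numerical check, with arbitrary illustrative choices of \(H\), \(g\), \(D\), and \(\mu\), that solving the transformed problem and mapping \(\hat{s}\) back through \(D^{-1}\) reproduces \(s(\mu) = -\left( H + \mu D^2 \right)^{-1} g\):

import numpy

H = numpy.array([[4.0, 1.0],
                 [1.0, 3.0]])          # symmetric positive definite, stands in for the model Hessian
g = numpy.array([1.0, -2.0])
D = numpy.diag([10.0, 0.1])            # positive diagonal scaling
mu = 0.5                               # any mu >= 0
D_inv = numpy.linalg.inv(D)

# Hook step of the transformed problem, mapped back to the original variables.
s_hat = -numpy.linalg.solve(D_inv @ H @ D_inv + mu * numpy.eye(2), D_inv @ g)
s_from_hat = D_inv @ s_hat

# Derived formula s(mu) = -(H + mu * D^2)^{-1} g.
s_direct = -numpy.linalg.solve(H + mu * D @ D, g)

print(numpy.allclose(s_from_hat, s_direct))    # True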

Exercise 4

Let \(F \colon \mathbb{R}^n \rightarrow \mathbb{R}^n\) and \(T_1, T_2 \in \mathbb{R}^{n \times n}\) be nonsingular. For any \(x \in \mathbb{R}^n\), define

\[\hat{x} = T_1 x \quad \text{and} \quad \hat{F}(\hat{x}) = T_2 F(T_1^{-1} \hat{x}) = T_2 F(x).\]

Applying definition 4.1.7,

\[\begin{split}F'(x) = J(x) = \begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial f_n(x)}{\partial x_1} & \cdots & \frac{\partial f_n(x)}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \nabla f_1(x)^\top\\ \vdots\\ \nabla f_n(x)^\top \end{bmatrix} = F(x) \nabla^\top = \frac{\partial F(x)}{\partial x^\top},\end{split}\]

yields

\[\begin{split}\hat{J}(\hat{x}) &= \frac{\partial \hat{F}(\hat{x})}{\partial \hat{x}^\top}\\ &= \frac{\partial T_2 F(x)}{\partial \hat{x}^\top} & \quad & \hat{F}(\hat{x}) = T_2 F(x)\\ &= \frac{\partial T_2 F(x)}{\partial x^\top} \frac{\partial x^\top}{\partial \hat{x}^\top} & \quad & \text{numerator layout chain rule}\\ &= T_2 J(x) \frac{\partial \hat{x}^\top T_1^{-\top}}{\partial \hat{x}^\top} & \quad & \hat{x} = T_1 x\\ &= T_2 J(x) T_1^{-1} & \quad & \text{numerator layout vector-by-vector differentiation.}\end{split}\]

Using (6.5.2) to transform the Newton step for \(\hat{F}\) back to the original variable space gives

\[\begin{split}\hat{x}_+ &= \hat{x}_c - \hat{J}(\hat{x}_c)^{-1} \hat{F}(\hat{x}_c)\\ T_1 x_+ &= T_1 x_c - \left( T_2 J(x_c) T_1^{-1} \right)^{-1} T_2 F(x_c)\\ T_1 x_+ &= T_1 x_c - T_1 J(x_c)^{-1} F(x_c)\\ x_+ &= x_c - J(x_c)^{-1} F(x_c),\end{split}\]

which shows that scaling does not affect the Newton step. However, whenever a norm is used (e.g. in stopping tests or a trust region), scaling should still be applied so that small components are not ignored.
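
The invariance can also be checked numerically; the test system \(F\) and the random nonsingular \(T_1\), \(T_2\) below are arbitrary illustrative choices:

import numpy

numpy.random.seed(1)
T1 = numpy.random.rand(2, 2) + 2 * numpy.eye(2)    # random but well-conditioned nonsingular scalings
T2 = numpy.random.rand(2, 2) + 2 * numpy.eye(2)

def F(x):
    # arbitrary test system
    return numpy.array([x[0]**2 + x[1] - 3.0,
                        x[0] * x[1] - 1.0])

def J(x):
    return numpy.array([[2 * x[0], 1.0],
                        [x[1], x[0]]])

x_c = numpy.array([2.0, 1.0])

# Newton step in the original variables.
x_plus = x_c - numpy.linalg.solve(J(x_c), F(x_c))

# Newton step in the scaled variables, mapped back with T1^{-1}.
x_hat_c = T1 @ x_c
J_hat = T2 @ J(x_c) @ numpy.linalg.inv(T1)
F_hat = T2 @ F(x_c)
x_hat_plus = x_hat_c - numpy.linalg.solve(J_hat, F_hat)
x_plus_from_hat = numpy.linalg.solve(T1, x_hat_plus)

print(numpy.allclose(x_plus, x_plus_from_hat))     # True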

Exercise 5

The only situation I can come up with is when the optimization problem involves some kind of direct dependency between the variables, e.g. a term like \((x - y)\). Any dynamic scaling strategy would be specific to a problem domain and hence not robust.

Exercise 6

According to Theorem 4.3.2, the critical point is a saddle point when the Hessian \(H\) is indefinite and a maximum if \(H \preceq 0\).

See [BGL05] and Saddle Point Theorems for an overview on how to proceed in minimization algorithms.
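
As a small illustrative sketch (not from the text), the eigenvalues of the Hessian at a critical point give the classification, and an eigenvector of a negative eigenvalue provides a descent direction a minimization algorithm can follow away from a saddle point:

import numpy

# Hessian at a critical point; this example is indefinite, so the point is a saddle.
H = numpy.array([[2.0, 0.0],
                 [0.0, -3.0]])
eigvals, eigvecs = numpy.linalg.eigh(H)

if (eigvals > 0).any() and (eigvals < 0).any():
    kind = 'saddle point'                     # indefinite Hessian
elif (eigvals < 0).all():
    kind = 'local maximum'                    # negative definite Hessian
elif (eigvals > 0).all():
    kind = 'local minimum'                    # positive definite Hessian
else:
    kind = 'semidefinite: higher-order terms decide'

# Direction of most negative curvature; moving along it decreases f from the critical point.
d = eigvecs[:, numpy.argmin(eigvals)]
print(kind, d)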

Exercise 7

There is no significant gain in knowledge from implementing this solution.

References

BGL05

Michele Benzi, Gene H. Golub, and Jörg Liesen. Numerical solution of saddle point problems. Acta Numerica, 14:1–137, 2005.