1 — The Simplest Optimizer

The problem

Suppose you want to minimise a smooth function $f : \mathbb{R}^n \to \mathbb{R}$. The most obvious strategy is gradient descent: move in the direction that makes $f$ decrease fastest.

\[x_{k+1} = x_k - \alpha \nabla f(x_k)\]

where $\alpha > 0$ is a step size.

Why it breaks

Gradient descent is simple, but it has well-known failure modes:

Step-size dilemma — too large and you overshoot; too small and you converge glacially. The optimal $\alpha$ depends on local curvature, which changes at every point.
Zig-zagging — in narrow valleys (like the Rosenbrock function) the gradient is nearly perpendicular to the valley floor, so the iterates bounce back and forth making almost no progress.
No curvature information — the gradient tells you direction but not how far the linear approximation is valid.

Gradient descent zig-zagging in a narrow valley

The idea

What if we used second-order information (curvature) to build a better local model of $f$? That is exactly what Newton's method does:

\[f(x_k + p) \approx m_k(p) = f(x_k) + \nabla f_k^\top p + \tfrac12 p^\top H_k\, p\]

where $H_k$ is the Hessian (or an approximation). Minimising $m_k$ gives the Newton step $p_N = -H_k^{-1}\, \nabla f_k$.

Newton converges quadratically near a solution — but it can diverge wildly if $H_k$ is indefinite or the starting point is far from the minimum.

Pitfall

Pure Newton's method has no global convergence guarantee. A bad Hessian or a large step can send you to infinity.

What Retro does

Retro never takes a raw Newton step. Instead, it wraps the quadratic model in a trust region — a ball of radius $\Delta$ around the current point where we trust the model to be accurate.

That is the subject of the next chapter.

Next → Adding Trust Regions