Method | Problem | Search direction \(d_k\) |
---|---|---|
Gauss-Newton | \(\min_{x} \frac{1}{2}\|r(x)\|^2\) | \(-(J_k^TJ_k)^{-1}(J_k^Tr_k)\) |
Pure Newton | General unconstrained | \(-H_k^{-1}g_k\) |
Scaled Descent | General unconstrained | \(-SS^Tg_k\) |
Reduced Gradient Method | Linearly constrained | \(-(Z^TH_kZ)^{-1}Z^Tg_k\) |
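For concreteness, here is a small NumPy sketch of how each direction in the table might be computed; the arrays \(J_k\), \(r_k\), \(H_k\), \(g_k\), \(S\), and \(Z\) are assumed given, and linear systems are solved rather than forming explicit inverses.

```python
import numpy as np

def gauss_newton_direction(J, r):
    # d = -(J^T J)^{-1} J^T r, computed as a least-squares solve for stability
    return -np.linalg.lstsq(J, r, rcond=None)[0]

def newton_direction(H, g):
    # d = -H^{-1} g, solving the linear system instead of inverting H
    return -np.linalg.solve(H, g)

def scaled_descent_direction(S, g):
    # d = -S S^T g for a chosen scaling matrix S
    return -S @ (S.T @ g)

def reduced_gradient_direction(Z, H, g):
    # -(Z^T H Z)^{-1} Z^T g: the Newton step in the reduced (null-space) coordinates;
    # the corresponding step in x-space is Z times this vector
    return -np.linalg.solve(Z.T @ H @ Z, Z.T @ g)
```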
Exact line search: choose \(\alpha_k = \underset{\alpha\geq 0}{\mbox{argmin}}\; f(x_k + \alpha d_k)\)
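For example, for a strictly convex quadratic \(f(x) = \frac{1}{2}x^TQx - b^Tx\) (with \(Q\succ 0\)), the exact step has a closed form: setting \(\frac{d}{d\alpha}f(x_k + \alpha d_k) = 0\) gives
\[\alpha_k = -\frac{\nabla f(x_k)^Td_k}{d_k^TQd_k},\]
which for steepest descent (\(d_k = -g_k\)) becomes \(\alpha_k = \frac{g_k^Tg_k}{g_k^TQg_k}\). For general \(f\) there is usually no closed form, which motivates backtracking below.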
Backtracking line search: fix \(\bar\alpha > 0\), \(\rho\in(0,1)\), \(c\in(0,1)\); start with \(\alpha = \bar\alpha\) and repeatedly set \(\alpha \leftarrow \rho\alpha\) until the sufficient-decrease (Armijo) condition \(f(x_k + \alpha d_k) \leq f(x_k) + c\,\alpha\,\nabla f(x_k)^Td_k\) holds, then take \(\alpha_k = \alpha\).
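A minimal sketch of this procedure (function names and default parameters are illustrative, not from the notes):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, d, alpha_bar=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    alpha = alpha_bar
    fx, slope = f(x), grad_f(x) @ d   # slope < 0 for a descent direction
    while f(x + alpha * d) > fx + c * alpha * slope:
        alpha *= rho
    return alpha

# Example on f(x) = ||x||^2 with the steepest-descent direction d = -grad f(x)
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x0 = np.array([1.0, -2.0])
alpha = backtracking_line_search(f, grad_f, x0, -grad_f(x0))
```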
definition. \(f\) is \(L\)-Lipschitz continuously differentiable if
\[\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|,\quad \forall x, y\in\mathbb{R}^n.\]
definition. \(f\) is \(\mu\)-strongly convex (\(\mu > 0\)) if
\[f(z) \geq f(x) + \nabla f(x)^T(z - x) + \frac{\mu}{2}\|z - x\|^2,\quad \forall x,z\in\mathbb{R}^n.\]
If \(f\) is twice continuously differentiable, this is equivalent to
\[\lambda_{\min}(\nabla^2 f(x))\geq \mu,\quad \forall x\in\mathbb{R}^n.\]
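For example, for the quadratic \(f(x) = \frac{1}{2}x^TQx - b^Tx\) with \(Q\succ 0\), we have \(\nabla^2 f(x) = Q\), so \(f\) is \(\mu\)-strongly convex with \(\mu = \lambda_{\min}(Q)\) and \(L\)-Lipschitz continuously differentiable with \(L = \lambda_{\max}(Q)\).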
proposition. Let \(f\) be \(L\)-Lipschitz continuously differentiable and bounded below by \(f^*\). Let \(\{x_k\}\) be the iterates of gradient descent with constant stepsize \(\alpha\in (0, 2/L)\). Then \(\min_{k=0,\dots, T} \|\nabla f(x_k)\|^2 \leq \frac{f(x_0) - f^*}{\alpha(1 - \alpha L/2)(T+1)} = O(1/T)\); in particular, for \(\alpha \leq 1/L\) this bound is at most \(\frac{2(f(x_0) - f^*)}{\alpha (T+1)}\).
Moreover, if \(f\) is \(\mu\)-strongly convex, then the number of iterations needed to reach \(f(x_k) - f^* \leq \epsilon\) is at most \(\frac{L}{\mu}\log\left( \frac{f(x_0) - f^*}{\epsilon} \right)\), where \(\epsilon > 0\) is the target tolerance.
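As a sanity check, here is a minimal constant-stepsize gradient descent sketch on a strongly convex quadratic; the test problem and stopping rule are illustrative assumptions, not from the notes.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, tol=1e-8, max_iter=100_000):
    """Constant-stepsize gradient descent; stops when the gradient norm is small."""
    x = x0
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:
            break
        x = x - alpha * g
    return x, k

# Strongly convex quadratic f(x) = 1/2 x^T Q x, so L = lambda_max(Q), mu = lambda_min(Q)
Q = np.diag([1.0, 10.0])
grad_f = lambda x: Q @ x
L = 10.0
x_star, iters = gradient_descent(grad_f, np.array([5.0, 5.0]), alpha=1.0 / L)
```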
Let \(f(x) = \frac{1}{N}(f_{1}(x) + \cdots + f_{N}(x))\). In stochastic gradient descent, at each iteration \(k\) we sample a batch \(B_k\) of indices drawn uniformly from \(\{1,\dots, N\}\) and approximate the gradient by \(g_k = \frac{1}{|B_k|}\sum_{i\in B_k} \nabla f_i(x_k)\).
Compared to full gradient descent, stochastic gradient descent typically requires more iterations, but each iteration is much cheaper computationally, since it evaluates only \(|B_k|\) of the \(N\) component gradients.
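A minimal sketch of the minibatch update above; the least-squares components \(f_i\), batch size, and stepsize are illustrative assumptions, not from the notes.

```python
import numpy as np

def sgd(grad_fi, N, x0, alpha, batch_size=32, num_iters=1000, rng=None):
    """Minibatch SGD: average the gradients of a uniformly sampled batch of components."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    for _ in range(num_iters):
        batch = rng.integers(0, N, size=batch_size)           # sample B_k uniformly (with replacement)
        g = np.mean([grad_fi(i, x) for i in batch], axis=0)   # g_k = (1/|B_k|) sum_i grad f_i(x_k)
        x = x - alpha * g
    return x

# Example: f_i(x) = 1/2 (a_i^T x - b_i)^2, a least-squares component loss
rng = np.random.default_rng(0)
A, b = rng.normal(size=(1000, 5)), rng.normal(size=1000)
grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i]
x_hat = sgd(grad_fi, N=1000, x0=np.zeros(5), alpha=0.01, rng=rng)
```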