In unconstrained optimization, quasi-Newton methods are famous for the excellent balance of accuracy and efficiency. They require only the first-order derivative information, i.e., gradient, like what we used in steepest descent; however, by measuring the changes in gradient through some so-called secant equation, it allows us to construct a quadratic model that has approximate second-order properties and thus the improvement over traditional first-order methods is dramatic. Moreover, since explicit computations of second-order derivatives are not required or even not applicable, a quasi-Newton step can be a great alternative to a Newton step.

Among various quasi-Newton methods, the BFGS method is no doubt to be the most popular one. At $k$th iteration, the quadratic model of the objective function is formulated as:

\[m_k(p)=f(x_k) + \nabla f(x_k)^T p + \tfrac{1}{2}p^TB_kp \nonumber\]

where the Hessian approximation $B_k\succ 0$ will be updated at every iteration. This positive definite restriction makes above subproblem a well-defined convex quadratic model with a minimizer

\[p_k:=\arg\min m_k(p)=-B_k^{-1}\nabla f(x_k).\label{pk}\]

Because of the positive definiteness of $B_k$, $p_k$ is indeed a descent direction and the iterate can be updated by:

\[x_{k+1}=x_k+\alpha_k p_k \label{upd}\]

where $\alpha_k$ is the step size and may be chosen to satisfy the (strong) Wolfe conditions, which can guarantee the BFGS update is well-defined (a necessary condition of $B_{k+1}\succ 0$ will hold).

In this short article, I won’t go into much detail about the derivation of the algorithm or its analytical properties. Instead, I will focus on one particular point of the BFGS method: the update formula for $B_k^{-1}$ and its equivalence to the regular BFGS formula that updates $B_k$.

As is well-known, the BFGS update formula for $B_k$ is

\[B_{k+1}=B_k-\frac{(B_ks_k)(B_ks_k)^T}{s_k^TB_ks_k}+\frac{y_ky_k^T}{s_k^Ty_k} \label{bfgs}\]

where $s_k:=x_{k+1}-x_k$ and $y_k:=\nabla f(x_{k+1})-\nabla f(x_k)$. Notice that to obtain the $k$th search direction, we may need to compute $-B_k^{-1}\nabla f(x_k)$ according to \eqref{pk}, which requires to solve a linear system or compute the inverse (not recommended though). Either way results in an expensive operation. The rescue is to directly update the inverse of Hessian approximation $H_k:=B_k^{-1}$, and then the search direction is just a matrix-vector product $-H_k\nabla f(x_k)$. And it is also not hard to find the update formula for $H_k$ is

\[H_{k+1}=\left(I-\frac{s_ky_k^T}{s_k^Ty_k}\right)H_k\left(I-\frac{y_ks_k^T}{s_k^Ty_k}\right)+\frac{s_ks_k^T}{s_k^Ty_k}. \label{hh}\]

Although the equivalence claim between \eqref{bfgs} and \eqref{hh} can be seen everywhere, it is quite unclear to most people why this is true and hard to find a formal proof of that. Now I will give a rigorous proof as following based on the Sherman-Morrison-Woodbury formula:

\[(A+UV^{T})^{-1}=A^{-1}-A^{-1}U(I+V^{T}A^{-1}U)^{-1}V^{T}A^{-1}. \label{fff}\]

First of all, let’s set

\[U:=\left[\frac{y_k}{\sqrt{s_k^{T} y_k}} \quad \frac{B_k s_k}{\sqrt{s_k^{T}B_k s_k}}\right] \quad \text{and}\quad V:=\left[\frac{y_k}{\sqrt{s_k^{T} y_k}} \quad -\frac{B_k s_k}{\sqrt{s_k^{T}B_k s_k}}\right]. \nonumber\]

Then in \eqref{bfgs} we have $B_{k+1}=B_k+UV^T,$ and make use of \eqref{fff}, we obtain


Since $H_k=B_k^{-1}$, we can rewrite above equation as

\[H_{k+1}=H_k-H_k U(I+V^{T}H_k U)^{-1}V^{T}H_k. \label{hhh}\]

Next, we have

\[I+V^{T}H_k U = I+ \begin{bmatrix} \frac{y_k^T}{\sqrt{s_k^{T} y_k}} \newline -\frac{(B_k s_k)^T}{\sqrt{s_k^{T}B_k s_k}} \end{bmatrix}H_k \begin{bmatrix} \frac{y_k}{\sqrt{s_k^{T} y_k}} & \frac{B_k s_k}{\sqrt{s_k^{T}B_k s_k}} \end{bmatrix} = \begin{bmatrix} 1+\frac{y_k^{T}H_k y_k}{s_k^{T}y_k} & \frac{\sqrt{y_k^T s_k}}{\sqrt{s_k^T B_k s_k}} \newline -\frac{\sqrt{y_k^T s_k}}{\sqrt{s_k^T B_k s_k}} & 0 \end{bmatrix}. \nonumber\]


\[\displaystyle{\alpha := 1+\frac{y_k^{T}H_k y_k}{s_k^{T}y_k}\quad\text{and}\quad\beta:=\frac{\sqrt{y_k^T s_k}}{\sqrt{s_k^T B_k s_k}}}\nonumber\]


\[\displaystyle{(I+V^{T}H_k U)^{-1}=\frac{1}{\beta^2} \begin{bmatrix} 0 & -\beta \newline \beta & \alpha \end{bmatrix}.\label{temp1} }\]

Also, we have

\[H_k U = \begin{bmatrix} \frac{H_k y_k}{\sqrt{s_k^T y_k}} & \frac{s_k}{\sqrt{s_k^T B_k s_k}} \end{bmatrix}=: \begin{bmatrix} a_1 & a_2 \end{bmatrix}\label{temp2}\]


\[\ V^T H_k = \begin{bmatrix} \frac{y_k^T H_k}{\sqrt{s_k^T y_k}} \newline -\frac{(B_k s_k)^T H_k}{\sqrt{s_k^T B_k s_k}} \end{bmatrix} =: \begin{bmatrix} b_1^T \newline b_2^T \end{bmatrix}.\label{temp3}\]

Plug \eqref{temp1}, \eqref{temp2} and \eqref{temp3} into \eqref{hhh}, we obtain

\begin{align*} H_{k+1}&=H_k-H_k U(I+V^{T}H_k U)^{-1}V^{T}H_k \newline &= H_k - \begin{bmatrix} a_1 & a_2 \end{bmatrix} \frac{1}{\beta^2} \begin{bmatrix} 0 & -\beta \newline \beta & \alpha \end{bmatrix} \begin{bmatrix} b_1^T \newline b_2^T \end{bmatrix} \newline &=H_k- \frac{1}{\beta}(a_2 b_1^T-a_1 b_2^T)-\frac{\alpha}{\beta^2}(a_2 b_2^T) \newline &=H_k-\frac{H_k y_k s_k^T+s_k y_k^T H_k}{s_k^T y_k} + \frac{s_k s_k^T}{s_k^T y_k}\left(1+\frac{y_k^T H_k y_k}{s_k^T y_k}\right) \newline &=\left(I-\frac{s_ky_k^T}{s_k^Ty_k}\right)H_k\left(I-\frac{y_ks_k^T}{s_k^Ty_k}\right)+\frac{s_ks_k^T}{s_k^Ty_k}, \end{align*}

which is precisely the same as \eqref{hh}.

Moreover, define $\rho_k := \frac{1}{s_k^T y_k}$, we can simply write

\[H_{k+1} = (I-\rho_k s_k y_k^T)H_k(I-\rho_k y_k s_k^T)+\rho_k s_k s_k^T.\nonumber\]


- End -