The gradient and the Hessian are handy tools when the objective function is continuously differentiable: they give us first-order and second-order derivative information about the function.

Let $f:\mathbb{R}^n\to\mathbb{R}$ be a continuously differentiable function. The gradient of $f$ is defined as

\[\nabla f(x):=\begin{bmatrix} \frac{\partial f}{\partial x_1} \newline \vdots \newline \frac{\partial f}{\partial x_n} \end{bmatrix} \quad\text{where}\quad x=\begin{bmatrix} x_1 \newline \vdots \newline x_n \end{bmatrix}.\]

And the Hessian of $f$ is defined as

\[\nabla^2 f(x):= \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \newline \vdots & \ddots & \vdots \newline \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}.\]

It is easy to see that if $f$ is differentiable, then $\nabla f(x)\in\mathbb{R}^n$; if $f$ is twice continuously differentiable, then $\nabla^2 f(x)\in\mathbb{R}^{n\times n}$ and is symmetric (by the symmetry of mixed partial derivatives). A straightforward extension: if $f:\mathbb{R}^n\to\mathbb{R}^n$ is differentiable, then its derivative $\nabla f(x)\in\mathbb{R}^{n\times n}$ (the Jacobian matrix); in particular, $\nabla (\nabla f(x))=\nabla^2 f(x)$.
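Each of the results below can be sanity-checked numerically with finite differences. As an illustrative sketch (the helper names `num_grad` and `num_hess` are ours, not standard), in Python with NumPy:

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def num_hess(f, x, eps=1e-4):
    """Central-difference approximation of the Hessian of f at x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

# Quick check on f(x) = x^T x, whose gradient is 2x and Hessian 2I.
x0 = np.array([1.0, -2.0, 0.5])
f = lambda x: float(x @ x)
print(np.allclose(num_grad(f, x0), 2 * x0, atol=1e-5))          # True
print(np.allclose(num_hess(f, x0), 2 * np.eye(3), atol=1e-3))   # True
```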

Next, we present some results about gradients and Hessians that are widely used in linear algebra, numerical optimization, and machine learning.

Result 1. Let $f(x)=b^Tx$ where $b\in\mathbb{R}^n$, then $\nabla f(x) = b$ and $\nabla^2 f(x) = \mathbf{0}$.

Proof: Since $f(x)=\sum_i b_ix_i$, we have $\frac{\partial f}{\partial x_k}=b_k$ for each $k$, so $\nabla f(x)=b$; all second partial derivatives vanish.

Result 2. Let $f(x)=Ax$ where $A\in\mathbb{R}^{n\times n}$, then $\nabla f(x) = A$.

Proof: The $i$th component of $f$ is $f_i(x)=\sum_j a_{ij}x_j$, so $\frac{\partial f_i}{\partial x_j}=a_{ij}$.
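A quick numerical spot-check of results 1 and 2 (illustrative only; the matrix and vectors are arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
b = rng.standard_normal(n)
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
eps = 1e-6

# Result 1: the gradient of b^T x is b.
grad = np.array([(b @ (x + eps * e) - b @ (x - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
print(np.allclose(grad, b))  # True

# Result 2: the derivative (Jacobian) of A x is A; column j is the
# directional derivative of A x along the j-th coordinate.
jac = np.column_stack([(A @ (x + eps * e) - A @ (x - eps * e)) / (2 * eps)
                       for e in np.eye(n)])
print(np.allclose(jac, A))  # True
```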

Result 3. Let $f(x)=x^TAx$ where $A\in\mathbb{R}^{n\times n}$, then

\[\nabla f(x) = (A+A^T)x \nonumber\]

and

\[\nabla^2 f(x) = A+A^T. \nonumber\]

Proof: First of all, $f(x)=x^TAx=\sum_{i,j}a_{ij}x_ix_j$. To compute the gradient, we need $\frac{\partial f}{\partial x_k}$ for $k=1,\cdots,n$. From the definition of $f$, the only terms involving $x_k$ are:

\[\sum_{i\neq k}a_{ik}x_ix_k + \sum_{j\neq k}a_{kj}x_kx_j + a_{kk}x_k^2. \nonumber\]

Then we have

\[\frac{\partial f}{\partial x_k}=\sum_{i\neq k} a_{ik}x_i + \sum_{j\neq k}a_{kj}x_j + 2a_{kk}x_k =\sum_{i=1}^n a_{ik}x_i + \sum_{j=1}^na_{kj}x_j =(A^Tx)_k+(Ax)_k, \nonumber\]

which means

\[\nabla f(x)=(A+A^T)x.\nonumber\]

And by result 2, we immediately have $\nabla^2 f(x)=A+A^T$.
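As a numerical sketch of result 3 (with a deliberately non-symmetric random matrix, so that $(A+A^T)x$ and $2Ax$ differ):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))   # non-symmetric in general
x = rng.standard_normal(n)
eps = 1e-6

f = lambda x: float(x @ A @ x)    # f(x) = x^T A x

# Central-difference gradient; should match (A + A^T) x, not 2 A x.
grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
print(np.allclose(grad, (A + A.T) @ x, atol=1e-6))  # True
```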

Result 4. Let $f(x)=x^TAx$ where $A\in\mathbb{R}^{n\times n}$ is symmetric, then $\nabla f(x) = 2Ax$ and $\nabla^2 f(x) = 2A$.

Proof: Follows directly from result 3, since $A=A^T$ gives $A+A^T=2A$.

Result 5. Let $f(x)=g(h(x))$ where both $g:\mathbb{R}\to\mathbb{R}$ and $h:\mathbb{R}^n\to\mathbb{R}$ are differentiable functions mapping to real values, then

\[\nabla f(x) = g'(h(x))\nabla h(x)\nonumber\]

and

\[\nabla^2 f(x)=g^{\prime\prime}(h(x))\nabla h(x)\nabla h(x)^T+g'(h(x))\nabla^2h(x). \nonumber\]

Proof: By chain rule, for $i$th component in $\nabla f(x)$:

\[\nabla_i f(x)=\frac{\partial f}{\partial x_i}=\frac{\partial g(h(x))}{\partial x_i}=\frac{dg}{dh}\frac{\partial h(x)}{\partial x_i}=g'(h(x))\nabla_ih(x), \label{re}\]

where we write $g'(h(x))$ because $g$ is a function of a single real variable. This tells us that $\nabla f(x) = g'(h(x))\nabla h(x)$.

Next, by \eqref{re}, for $i,j=1,\cdots, n$ we have

\[(\nabla^2 f(x))_{ij}=\frac{\partial \nabla_i f(x)}{\partial x_j}=g^{\prime\prime}(h(x))\nabla_j h(x) \nabla_i h(x)+ g'(h(x))\frac{\partial^2 h}{\partial x_i \partial x_j}.\nonumber\]

This is equivalent to saying that $\nabla^2 f(x)=g^{\prime\prime}(h(x))\nabla h(x)\nabla h(x)^T+g'(h(x))\nabla^2h(x)$.
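To illustrate result 5, take $g(t)=e^t$ and $h(x)=x^Tx$, so $\nabla f(x)=2e^{x^Tx}x$ and $\nabla^2 f(x)=e^{x^Tx}(4xx^T+2I)$; both can be checked against finite differences (a sketch; the test point is arbitrary):

```python
import numpy as np

x = np.array([0.3, -0.2, 0.1])
n = x.size
h = float(x @ x)

# Result 5 with g(t) = exp(t), h(x) = x^T x:
grad_exact = np.exp(h) * 2 * x                                  # g'(h) * grad h
hess_exact = np.exp(h) * (4 * np.outer(x, x) + 2 * np.eye(n))   # g''(h) gh gh^T + g'(h) Hh

f = lambda x: float(np.exp(x @ x))
eps = 1e-5
grad_num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])
print(np.allclose(grad_num, grad_exact, atol=1e-6))  # True

hess_num = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei = eps * np.eye(n)[i]; ej = eps * np.eye(n)[j]
        hess_num[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                          - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
print(np.allclose(hess_num, hess_exact, atol=1e-4))  # True
```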

Result 6. Let $f(x)=\lVert x\rVert_2^2=x^Tx$, then $\nabla f(x)=2x$ and $\nabla^2 f(x)=2I$.

Proof: Follows directly from result 4 with $A=I$.

Result 7. Let $f(x)=\lVert x\rVert_2=\sqrt{x^Tx}$ with $x\neq 0$, then

\[\nabla f(x)=\frac{x}{\lVert x\rVert_2} \nonumber\]

and

\[\nabla^2 f(x)=\frac{1}{\lVert x\rVert_2}I-\frac{1}{\lVert x\rVert_2^3}xx^T. \nonumber\]

Proof: We can view $f(x)=\sqrt{x^Tx}$ as a composite function, i.e., $f(x)=g(h(x))$, where $g(t)=\sqrt{t}$ and $h(x)=x^Tx$. According to results 5 and 6, we have

\[\nabla f(x) = g'(h(x))\nabla h(x)=\frac{1}{2\sqrt{x^Tx}}\cdot 2x = \frac{x}{\lVert x\rVert_2}.\nonumber\]

The Hessian follows similarly: since $g^{\prime\prime}(t)=-\frac{1}{4}t^{-3/2}$, result 5 gives

\[\nabla^2 f(x)=-\frac{1}{4(x^Tx)^{3/2}}(2x)(2x)^T+\frac{1}{2\sqrt{x^Tx}}\cdot 2I=\frac{1}{\lVert x\rVert_2}I-\frac{1}{\lVert x\rVert_2^3}xx^T.\nonumber\]
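Numerically, the gradient of $\lVert x\rVert_2$ matches $x/\lVert x\rVert_2$, and the Hessian $I/\lVert x\rVert_2 - xx^T/\lVert x\rVert_2^3$ annihilates $x$ itself, since the norm is unchanged to first order along the ray through $x$. A quick check at an arbitrary point:

```python
import numpy as np

x = np.array([1.0, 2.0, -2.0])   # ||x||_2 = 3, chosen for a clean norm
n, nrm = x.size, np.linalg.norm(x)

grad_exact = x / nrm
hess_exact = np.eye(n) / nrm - np.outer(x, x) / nrm ** 3

f = lambda x: float(np.linalg.norm(x))
eps = 1e-6
grad_num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])
print(np.allclose(grad_num, grad_exact, atol=1e-8))   # True
print(np.allclose(hess_exact @ x, np.zeros(n)))       # True: Hessian kills x
```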

Result 8. Let $f(x)=\lVert Ax-b\rVert_2^2$ where $A\in\mathbb{R}^{n\times n}$, then $\nabla f(x)=2A^TAx-2A^Tb$ and $\nabla^2 f(x)=2A^TA$.

Proof: A simple way to see this is to rewrite $f(x)$ as

\[f(x)=(Ax-b)^T(Ax-b)=x^TA^TAx-2b^TAx+b^Tb, \nonumber\]

then it follows from results 1 and 4.
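Result 8 underlies least squares: setting $\nabla f(x)=0$ yields the normal equations $A^TAx=A^Tb$. A numerical sketch with random data (for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)

f = lambda x: float((A @ x - b) @ (A @ x - b))   # ||Ax - b||_2^2

# Central-difference gradient vs. 2 A^T A x - 2 A^T b.
eps = 1e-6
grad_num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])
print(np.allclose(grad_num, 2 * A.T @ A @ x - 2 * A.T @ b, atol=1e-5))  # True

# The minimizer solves the normal equations, so the gradient vanishes there.
x_star = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(2 * A.T @ A @ x_star - 2 * A.T @ b, np.zeros(n), atol=1e-8))  # True
```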

Result 9. Let $f(x)=\lVert Ax-b\rVert_2$ where $A\in\mathbb{R}^{n\times n}$ and $Ax\neq b$, then

\[\nabla f(x)=\frac{A^TAx-A^Tb}{\lVert Ax-b\rVert_2} \nonumber\]

and

\[\nabla^2 f(x)=\frac{A^TA}{\lVert Ax-b\rVert_2}-\frac{(A^TAx-A^Tb)(A^TAx-A^Tb)^T}{\lVert Ax-b\rVert_2^3}. \nonumber\]

Proof: We can view $f(x)=\sqrt{(Ax-b)^T(Ax-b)}$ as a composite function, i.e., $f(x)=g(h(x))$, where $g(t)=\sqrt{t}$ and $h(x)=(Ax-b)^T(Ax-b)$. According to results 5 and 8, we have

\[\nabla f(x) = g'(h(x))\nabla h(x)=\frac{2A^TAx-2A^Tb}{2\sqrt{(Ax-b)^T(Ax-b)}} = \frac{A^TAx-A^Tb}{\lVert Ax-b\rVert_2}.\nonumber\]

The Hessian follows similarly from result 5: with $g^{\prime\prime}(t)=-\frac{1}{4}t^{-3/2}$, $\nabla h(x)=2(A^TAx-A^Tb)$, and $\nabla^2 h(x)=2A^TA$, we obtain

\[\nabla^2 f(x)=\frac{A^TA}{\lVert Ax-b\rVert_2}-\frac{(A^TAx-A^Tb)(A^TAx-A^Tb)^T}{\lVert Ax-b\rVert_2^3}.\nonumber\]
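Numerically, $\nabla\lVert Ax-b\rVert_2$ agrees with $(A^TAx-A^Tb)/\lVert Ax-b\rVert_2$; a quick check with random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)

nrm = np.linalg.norm(A @ x - b)
grad_exact = (A.T @ A @ x - A.T @ b) / nrm

f = lambda x: float(np.linalg.norm(A @ x - b))
eps = 1e-6
grad_num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])
print(np.allclose(grad_num, grad_exact, atol=1e-6))  # True
```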

 


- End -