In this post, we will derive the bias-variance decomposition for the regression problem.

Let $f(\cdot)$ denote the regression model, $\textbf{x}$ the input feature, and $y$ the output value. Let $\mathcal{L}(y, f(\textbf{x};\theta))$ be the loss function, which is usually chosen to be the squared loss in regression. Then the expected risk is

\begin{align} \mathcal{R}(\theta)&:=\mathbb{E}_{(\textbf{x},y)\sim p(\textbf{x},y)}[\mathcal{L}(y, f(\textbf{x};\theta))] \label{expectedrisk} \newline &=\iint (y-f(\textbf{x};\theta))^2 p(\textbf{x},y) \, d\textbf{x} \, dy. \nonumber \end{align}
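As a quick sanity check, the expected risk can be approximated by Monte Carlo sampling. The sketch below uses an assumed toy distribution ($x\sim\text{Uniform}(0,2\pi)$, $y=\sin(x)+\varepsilon$ with Gaussian noise $\varepsilon$) and an arbitrary linear model; both are illustrative choices of mine, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data-generating distribution: x ~ Uniform(0, 2*pi), y = sin(x) + Gaussian noise.
def sample_xy(n, noise_std=0.3):
    x = rng.uniform(0.0, 2 * np.pi, size=n)
    y = np.sin(x) + rng.normal(0.0, noise_std, size=n)
    return x, y

# A candidate regression model f(x; theta); here a fixed linear model for illustration.
def f(x, theta=(0.0, 0.5)):
    a, b = theta
    return a + b * x

# Monte Carlo estimate of the expected risk R(theta) = E[(y - f(x; theta))^2].
x, y = sample_xy(100_000)
risk = np.mean((y - f(x)) ** 2)
print(f"estimated expected risk: {risk:.4f}")
```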

Our goal is to choose the model $f(\cdot)$ so as to minimize the expected risk $\mathcal{R}(\theta)$. Using the calculus of variations, the functional derivative of $\mathcal{R}$ with respect to $f$ is

\[\frac{\delta \mathcal{R}(\theta)}{\delta f}= -2\int (y-f(\textbf{x};\theta))p(\textbf{x},y) \, dy, \nonumber\]

Setting this derivative to zero gives the optimal solution for the regression model:

\[\label{bestf} f^*(\textbf{x};\theta)=\frac{\int yp(\textbf{x},y) \, dy}{p(\textbf{x})}=\int y p(y|\textbf{x}) \, dy = \mathbb{E}_{y\sim p(y|\textbf{x})}[y],\]

which is the conditional expectation of $y$ given $\textbf{x}$.
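To see \eqref{bestf} in action, here is a minimal numerical sketch, assuming the same toy distribution as above so that $f^*(\textbf{x})=\mathbb{E}[y|\textbf{x}]=\sin(x)$ is known in closed form: the conditional mean should attain a lower risk than the other predictors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same assumed toy distribution: y = sin(x) + noise, so E[y | x] = sin(x) exactly.
n = 200_000
x = rng.uniform(0.0, 2 * np.pi, size=n)
y = np.sin(x) + rng.normal(0.0, 0.3, size=n)

predictors = {
    "conditional mean f*(x) = sin(x)": np.sin(x),
    "constant mean of y": np.full(n, y.mean()),
    "linear fit": np.polyval(np.polyfit(x, y, 1), x),
}

for name, pred in predictors.items():
    print(f"{name:35s} risk = {np.mean((y - pred) ** 2):.4f}")
# The conditional mean attains the smallest risk, close to the noise variance 0.3^2 = 0.09.
```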

For simplicity, let $f:=f(\textbf{x};\theta)$ and $f^*:=f^*(\textbf{x};\theta)$. Observe that

\begin{align*} \mathcal{L}=(y-f)^2 &= (y-f^* + f^*-f)^2 \newline &=(y-f^*)^2 + 2(y-f^*)(f^*-f) + (f^*-f)^2. \end{align*}

Plugging this into \eqref{expectedrisk}, we have

\begin{align*} \mathcal{R}(\theta)= \mathbb{E}[(y-f^*)^2] + 2 \mathbb{E}[(y-f^*)(f^*-f)] + \mathbb{E}[(f^*-f)^2], \end{align*}

where

\begin{align*} \mathbb{E}[(y-f^*)^2]&=\iint (y-f^*)^2p(\textbf{x},y) \, d\textbf{x}\, dy =\iint (y-f^*)^2 p(y|\textbf{x}) p(\textbf{x}) \, d\textbf{x}\, dy \newline & =\int \left(\int (y-f^*)^2 p(y|\textbf{x})\, dy\right) p(\textbf{x}) \, d\textbf{x} \newline &=\int \mathbb{E}_{y\sim p(y|\textbf{x})}\left[\left(y-\mathbb{E}_{y\sim p(y|\textbf{x})}[y]\right)^2\right]\, p(\textbf{x}) \, d\textbf{x} \newline &=\int Var_{y\sim p(y|\textbf{x})} [y] \, p(\textbf{x}) \, d\textbf{x}, \newline \mathbb{E}[(y-f^*)(f^*-f)] &= \iint (y-f^*)(f^*-f)p(\textbf{x},y) \, d\textbf{x}\, dy = \iint (y-f^*)(f^*-f)p(y|\textbf{x}) p(\textbf{x}) \, d\textbf{x}\, dy \newline &=\int \left(\int (y-f^*)p(y|\textbf{x})\, dy\right)(f^*-f) p(\textbf{x}) \, d\textbf{x} \newline &=\int \mathbb{E}_{y\sim p(y|\textbf{x})}\left[y-\mathbb{E}_{y\sim p(y|\textbf{x})}[y]\right] \, (f^*-f)\, p(\textbf{x}) \, d\textbf{x} \newline &= 0, \newline \mathbb{E}[(f^*-f)^2] &= \iint (f^*-f)^2p(\textbf{x},y) \, d\textbf{x}\, dy \newline &=\int \left(\int p(\textbf{x},y) \, dy \right)(f^*-f)^2 d\textbf{x} \newline &=\int (f^*-f)^2 p(\textbf{x})\, d\textbf{x} \newline &=\mathbb{E}_{\textbf{x}\sim p(\textbf{x})} [(f-f^*)^2]. \end{align*}

Therefore, we can write

\begin{align}\label{decomp1} \mathcal{R}(\theta)=\mathbb{E}_{\textbf{x}\sim p(\textbf{x})} [(f-f^*)^2] + \int Var_{y\sim p(y|\textbf{x})} [y] \, p(\textbf{x}) \, d\textbf{x}. \end{align}

From \eqref{decomp1}, we see two things:

  • The second term is the variance of the output value (label) averaged over the input distribution $p(\textbf{x})$. It represents the intrinsic variability of the data (it depends only on the data distribution) and can be regarded as a noise term, since it reflects the discrepancy between the actual labels and the optimal predictions $f^*$. It is independent of the choice of model $f$, so this part of the cost is irreducible: it is the minimum achievable value of the expected risk.

  • The first term represents the gap between our current model $f$ and the optimal model $f^*$. This is our real target, and it can be reduced by applying various machine learning algorithms to learn a better $f$. It also confirms that the optimal least-squares predictor is given by the conditional expectation derived in \eqref{bestf}:

\begin{align*} \min_{f}\, \mathcal{R} \quad \Leftrightarrow \quad f^* = \mathbb{E}_{y\sim p(y|\textbf{x})}[y]=\int y p(y|\textbf{x}) \, dy . \end{align*}
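Decomposition \eqref{decomp1} can also be checked numerically for a fixed model: under the toy distribution used above, the risk $\mathcal{R}(\theta)$ should equal $\mathbb{E}_{\textbf{x}\sim p(\textbf{x})}[(f-f^*)^2]$ plus the noise term. The sketch below is only an illustration of this identity, with the same assumed setup as before.

```python
import numpy as np

rng = np.random.default_rng(2)
noise_std = 0.3

# Assumed toy distribution: x ~ Uniform(0, 2*pi), y = sin(x) + noise, so f*(x) = sin(x).
n = 500_000
x = rng.uniform(0.0, 2 * np.pi, size=n)
y = np.sin(x) + rng.normal(0.0, noise_std, size=n)

# A fixed (suboptimal) model f and the optimal model f*.
f = 0.5 * x - 0.5
f_star = np.sin(x)

risk = np.mean((y - f) ** 2)        # R(theta)
gap = np.mean((f - f_star) ** 2)    # E_x[(f - f*)^2]
noise = np.mean((y - f_star) ** 2)  # E_x[Var(y | x)], about 0.09 here

print(f"risk = {risk:.4f}, gap + noise = {gap + noise:.4f}")  # the two should agree
```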

In practice, we only have a particular finite dataset $\mathcal{D}$, which consists of i.i.d. samples drawn from the true distribution $p(\textbf{x},y)$. Because the data are limited, we do not know $f^*$ exactly, and thus we cannot choose an $f$ that makes the first term in \eqref{decomp1} vanish. Instead, different samplings of $\mathcal{D}$ result in different learned models $f_{\mathcal{D}}:=f_{\mathcal{D}}(\textbf{x};\theta)$. The quality of a machine learning algorithm and of the learned model $f_{\mathcal{D}}$ can then be evaluated by averaging over an ensemble of datasets, i.e., via $\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}]$.

For a single input $\textbf{x}$, the expected gap between the model $f_\mathcal{D}$ learned from a random dataset $\mathcal{D}$ and the optimal model $f^*$ can be written as:

\begin{align*} \mathbb{E}_{\mathcal{D}}\left[(f_\mathcal{D}-f^*)^2\right] &= \mathbb{E}_{\mathcal{D}}\left[(f_\mathcal{D}-\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}]+\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}]-f^*)^2\right] \newline &=\mathbb{E}_{\mathcal{D}}\left[(f_\mathcal{D}-\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}])^2\right] + 2 \mathbb{E}_{\mathcal{D}}\left[(f_\mathcal{D}-\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}])(\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}]-f^*)\right] + \mathbb{E}_{\mathcal{D}}\left[(\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}]-f^*)^2\right] \newline &=\left(\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}]-f^*\right)^2 + \mathbb{E}_{\mathcal{D}}\left[(f_\mathcal{D}-\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}])^2\right] \newline &= (bias)^2 + variance, \end{align*}

where the third equality holds because the cross term vanishes.

  • The squared bias term measures how much the average prediction of the model over all datasets differs from the desired regression function $f^*$. It answers the question of whether the model is able to fit the data well.

  • The variance term measures how much the models learned from different datasets scatter around their average prediction. It tells us whether the model is sensitive to the particular choice of dataset, and hence how likely the model is to overfit.
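The single-input decomposition above can also be estimated by simulation: repeatedly draw a training set $\mathcal{D}$, learn $f_{\mathcal{D}}$, and evaluate it at one fixed point $\textbf{x}_0$. The degree-3 polynomial least-squares learner and the toy distribution below are illustrative assumptions, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(3)
noise_std, n_train, n_datasets, degree = 0.3, 30, 2000, 3
x0 = 2.0                # the single fixed input x at which we decompose
f_star_x0 = np.sin(x0)  # optimal prediction f*(x0) under the assumed toy distribution

preds = np.empty(n_datasets)
for d in range(n_datasets):
    # Draw a fresh dataset D and learn f_D (here: a degree-3 polynomial least-squares fit).
    x = rng.uniform(0.0, 2 * np.pi, size=n_train)
    y = np.sin(x) + rng.normal(0.0, noise_std, size=n_train)
    coeffs = np.polyfit(x, y, degree)
    preds[d] = np.polyval(coeffs, x0)

expected_gap = np.mean((preds - f_star_x0) ** 2)  # E_D[(f_D - f*)^2]
bias_sq = (preds.mean() - f_star_x0) ** 2         # (bias)^2
variance = preds.var()                            # variance

print(f"E_D[(f_D - f*)^2]   = {expected_gap:.4f}")
print(f"(bias)^2 + variance = {bias_sq + variance:.4f}")  # should match
```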

Combining the discussion for a single input $\textbf{x}$ with \eqref{decomp1}, and averaging the risk of the learned model $f_{\mathcal{D}}$ over the ensemble of datasets, we obtain the following decomposition:

\begin{align}\label{decomp2} \mathbb{E}_{\mathcal{D}}[\mathcal{R}(\theta)]=(bias)^2 + variance + noise, \end{align}

where

\begin{align*} (bias)^2 &= \mathbb{E}_{\textbf{x}\sim p(\textbf{x})} \left[\left(\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}]-f^*\right)^2\right] = \int \left(\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}]-f^*\right)^2 p(\textbf{x}) \, d\textbf{x}, \newline variance &=\mathbb{E}_{\textbf{x}\sim p(\textbf{x})} \left[\mathbb{E}_{\mathcal{D}}\left[(f_\mathcal{D}-\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}])^2\right]\right] = \int \mathbb{E}_{\mathcal{D}}\left[(f_\mathcal{D}-\mathbb{E}_{\mathcal{D}}[f_{\mathcal{D}}])^2\right] p(\textbf{x}) \, d\textbf{x}, \newline noise &= \int Var_{y\sim p(y|\textbf{x})} [y] \, p(\textbf{x}) \, d\textbf{x}= \iint (y-f^*)^2p(\textbf{x},y) \, d\textbf{x}\, dy. \end{align*}
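As a final check, \eqref{decomp2} can be verified end to end by averaging over test inputs as well: the sketch below (same illustrative toy distribution and polynomial learner as before) estimates the three terms and compares their sum with the dataset-averaged risk $\mathbb{E}_{\mathcal{D}}[\mathcal{R}(\theta)]$.

```python
import numpy as np

rng = np.random.default_rng(4)
noise_std, n_train, n_datasets, degree = 0.3, 30, 1000, 3

# Fixed grid of test inputs standing in for the expectation over x ~ p(x) = Uniform(0, 2*pi).
x_test = np.linspace(0.0, 2 * np.pi, 200)
f_star = np.sin(x_test)

# Learn f_D on many independently sampled datasets and record its predictions on x_test.
preds = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    x = rng.uniform(0.0, 2 * np.pi, size=n_train)
    y = np.sin(x) + rng.normal(0.0, noise_std, size=n_train)
    preds[d] = np.polyval(np.polyfit(x, y, degree), x_test)

bias_sq = np.mean((preds.mean(axis=0) - f_star) ** 2)  # (bias)^2, averaged over x
variance = np.mean(preds.var(axis=0))                  # variance, averaged over x
noise = noise_std ** 2                                 # E_x[Var(y|x)], known for this toy setup

# Dataset-averaged expected risk E_D[R(theta)], estimated with fresh labels on the test grid.
y_test = f_star + rng.normal(0.0, noise_std, size=(n_datasets, x_test.size))
risk = np.mean((y_test - preds) ** 2)

print(f"(bias)^2 + variance + noise = {bias_sq + variance + noise:.4f}")
print(f"estimated E_D[R]            = {risk:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, which is exactly the content of \eqref{decomp2} under these assumed choices.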

 


- End -