Maximum likelihood estimation (MLE) and Bayesian estimation are two approaches to estimating the parameters of a statistical model. They are closely related, but each takes a different view of the parameter. In MLE, the model parameter $\theta$ is treated as a fixed but unknown value and is estimated from the data only; formally, this amounts to dropping the prior, i.e., treating $P(\theta)$ as a constant (written as $P(\theta)=1$). In contrast, Bayesian estimation treats $\theta$ as a random variable with its own distribution, and the prior $P(\theta)$ is chosen in advance, before seeing any data. Once the prior is fixed, we can estimate $\theta$ by maximizing the posterior, which is known as maximum a posteriori (MAP) estimation. In other words, MAP takes the prior into account, while MLE does not. As a result, the two methods suit different scenarios.

First, consider the general setting: given data $\mathcal{D}= \{x_1,\cdots,x_N \}$ generated from some distribution with parameter $\theta$, our task is to estimate the unknown $\theta$. In other words, given the observation $\mathcal{D}$, we would like to find the $\theta$ with the largest posterior probability. This can be formalized as:

\[\arg\max_{\theta} P(\theta|\mathcal{D}) = \arg\max_{\theta} \frac{P(\mathcal{D}|\theta)P(\theta)}{P(\mathcal{D})} = \arg\max _{\theta} P(\mathcal{D}|\theta)P(\theta), \label{prob}\]

where the first equality follows from Bayes' formula and the second holds because the normalizing constant $P(\mathcal{D})$ is the same for all $\theta$, so dropping it does not change the maximizer.

MLE and MAP solve problem \eqref{prob} in different ways. Consider a concrete example: coin flipping. A single flip follows a Bernoulli distribution with parameter $\theta$ (the probability of a Head). Let $X$ be a random variable such that $X=1$ corresponds to a Head and $X=0$ to a Tail. Then its probability mass function (p.m.f.) can be written as

\[P(X = x) = \theta^x (1-\theta)^{1-x}, \quad \text{where } x = 0, 1. \label{coin_prob}\]

Suppose we have collected a dataset $\mathcal{D}$ of $N$ independent coin flips. Next, we estimate the parameter $\theta$ with MLE and with MAP in turn.

MLE (frequentist viewpoint)

MLE assumes the unknown parameter $\theta$ is fixed. Plugging $P(\theta)=1$ into \eqref{prob}, we obtain the following problem:

\[\max _{\theta} \ \mathcal{L}(\theta):= P(\mathcal{D}|\theta), \label{mle}\]

where $\mathcal{L}(\theta)$ is usually called the likelihood function. Hence, the goal of MLE is to find the model parameter $\theta$ under which the observed data $\mathcal{D}$ is most probable. This hints at why MLE tends to overfit: it maximizes how well $\theta$ fits the data, irrespective of how plausible such a $\theta$ is.

In the above example, utilizing the fact that samples are independent and \eqref{coin_prob}, we have

\[\mathcal{L}(\theta) = P(X_1=x_1,\cdots,X_N=x_N|\theta) = \prod_{i=1}^N P(X_i=x_i|\theta) = \prod_{i=1}^N \theta^{x_i} (1-\theta)^{1-x_i}. \nonumber\]

Taking the logarithm of both sides, we obtain

\[\log(\mathcal{L}(\theta)) = \sum_{i=1}^N \left[x_i \log (\theta) + (1-x_i) \log(1-\theta)\right]. \nonumber\]

Let $\widehat{\theta}_{MLE} := \arg\max_{\theta} \mathcal{L} (\theta) = \arg\max_{\theta} \log(\mathcal{L}(\theta))$. Setting the derivative of the log-likelihood at $\widehat{\theta}_{MLE}$ to zero, we obtain

\[\sum_{i=1}^N\left[\frac{x_i}{\widehat{\theta}_{MLE}} - \frac{1-x_i}{1-\widehat{\theta}_{MLE}}\right] = 0 \quad \Rightarrow \quad \widehat{\theta}_{MLE} = \frac{\sum_{i=1}^Nx_i}{N}. \label{tmle}\]
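
As a sketch of the algebra behind the last implication, write $\theta$ for $\widehat{\theta}_{MLE}$ and $S=\sum_{i=1}^N x_i$ (the shorthand $S$ is introduced here only for illustration), and assume $0<S<N$ so that the maximizer lies in the open interval $(0,1)$. Multiplying the stationarity condition by $\theta(1-\theta)$ gives

\[(1-\theta)S - \theta(N-S) = 0 \quad \Longleftrightarrow \quad S - N\theta = 0 \quad \Longleftrightarrow \quad \theta = \frac{S}{N}. \nonumber\]

The second derivative of the log-likelihood, $-S/\theta^2 - (N-S)/(1-\theta)^2$, is negative on $(0,1)$, so the log-likelihood is concave and this stationary point is a maximizer.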

In summary, MLE returns the maximizer of the likelihood function $\mathcal{L}(\theta)$ (the joint probability of the i.i.d. samples). From the expression for $\widehat{\theta}_{MLE}$, it is clear that the estimate of $\theta$ is precisely the relative frequency of Heads in the experiments. This matches the frequentist interpretation of probability.
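
The closed-form result can be checked numerically. The following is a minimal sketch, not a definitive implementation: it simulates coin flips with a hypothetical true bias of 0.7 (a value chosen only for illustration), computes the fraction of Heads, and compares it with a brute-force maximization of the log-likelihood over a grid. All function names are made up for this example.

```python
import math
import random


def bernoulli_log_likelihood(theta, data):
    """Log-likelihood of i.i.d. Bernoulli samples (each x is 0 or 1)."""
    heads = sum(data)
    tails = len(data) - heads
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def mle_closed_form(data):
    """Closed-form maximizer: the fraction of heads in the sample."""
    return sum(data) / len(data)


def mle_grid_search(data, steps=9999):
    """Brute-force check: maximize the log-likelihood over a grid in (0, 1)."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]
    return max(grid, key=lambda t: bernoulli_log_likelihood(t, data))


random.seed(0)     # fixed seed so the run is reproducible
true_theta = 0.7   # hypothetical "true" bias, used only to simulate data
data = [1 if random.random() < true_theta else 0 for _ in range(200)]

theta_closed = mle_closed_form(data)
theta_grid = mle_grid_search(data)
print(theta_closed, theta_grid)
```

The two printed values should agree up to the grid resolution, which illustrates that the sample frequency is indeed the maximizer of the likelihood.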

MAP (Bayesian viewpoint)

In MAP, we convert a prior probability into a posterior probability via Bayes' theorem and maximize that posterior. Before observing data, we capture our knowledge or assumptions about $\theta$ in the form of a prior probability distribution $P(\theta)$. By incorporating the extra evidence provided by the observed data $\mathcal{D}$, we then maximize the posterior probability $P(\theta | \mathcal{D})$. The problem becomes:

\[\max _{\theta} P(\mathcal{D}|\theta)P(\theta) = \max _{\theta} \mathcal{L}(\theta)P(\theta),\label{be}\]

which is exactly \eqref{prob}. Note that $P(\mathcal{D}|\theta)P(\theta)$ is not the actual posterior, since it omits the normalizing constant $P(\mathcal{D})$. Comparing \eqref{mle} and \eqref{be}, we see why MAP alleviates the overfitting problem of MLE: $P(\theta)$ acts as a penalty term that penalizes a choice of $\theta$ for which the likelihood is very large but the value itself is unlikely to be true.

The critical point here is that we have a judgment about the distribution of $\theta$ even though we have not seen any data. For example, in our coin-flipping problem, it is evident that $\theta\in[0,1]$, and it is natural to think that $\theta$ lies in a neighborhood of 0.5 with higher probability. Let us assume that $\theta$ follows a beta distribution with parameters $\alpha$ and $\beta$ (the beta distribution is defined on the unit interval $[0,1]$ and is often used to describe prior knowledge about a probability of success; for instance, $\alpha=\beta>1$ gives a prior peaked at 0.5). More formally, we assume the prior distribution of $\theta$ is

\[P(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}. \nonumber\]

To keep things general, we use the symbols $\alpha$ and $\beta$ instead of specific values, but in practice their values must be fixed in advance.

Our next step is to choose the best $\theta$ based on the sample information. In this case

\[P(\mathcal{D}|\theta)P(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}\prod_{i=1}^N \theta^{x_i} (1-\theta)^{1-x_i}\,\, \propto\,\, \theta^{\sum x_i+\alpha-1} (1-\theta)^{N-\sum x_i+\beta -1}, \nonumber\]

where the last expression is, up to a normalizing constant, the density of another beta distribution with $\alpha'=\sum x_i+\alpha$ and $\beta' = N-\sum x_i + \beta$. Therefore, the maximum is attained at the mode of this new beta distribution (assuming $\alpha'>1$ and $\beta'>1$), and we obtain

\[\widehat{\theta}_{MAP} = \frac{\alpha'-1}{\alpha'+\beta'-2}=\frac{\sum x_i+\alpha-1}{N+\alpha+\beta-2}. \label{tmap}\]
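
The MAP formula can be checked the same way as the MLE. The sketch below assumes a hypothetical sample of 8 Heads in 10 flips and a hypothetical prior with $\alpha=\beta=2$; these values and the function names are illustrative only. It compares the closed-form mode with a grid search over the unnormalized log-posterior.

```python
import math


def map_closed_form(data, alpha, beta):
    """Mode of the Beta(alpha', beta') posterior; assumes alpha' > 1 and beta' > 1."""
    heads = sum(data)
    n = len(data)
    return (heads + alpha - 1.0) / (n + alpha + beta - 2.0)


def log_unnormalized_posterior(theta, data, alpha, beta):
    """log of L(theta) * P(theta), with the constant Gamma terms dropped."""
    heads = sum(data)
    tails = len(data) - heads
    return ((heads + alpha - 1.0) * math.log(theta)
            + (tails + beta - 1.0) * math.log(1.0 - theta))


def map_grid_search(data, alpha, beta, steps=9999):
    """Brute-force check: maximize the log-posterior over a grid in (0, 1)."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]
    return max(grid, key=lambda t: log_unnormalized_posterior(t, data, alpha, beta))


data = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical sample: 8 heads in 10 flips
alpha, beta = 2.0, 2.0                  # hypothetical prior peaked at 0.5

theta_map = map_closed_form(data, alpha, beta)
theta_map_grid = map_grid_search(data, alpha, beta)
print(theta_map, theta_map_grid)
```

For this sample the closed form gives $(8+2-1)/(10+2+2-2) = 9/12 = 0.75$, whereas the sample frequency is $0.8$: the prior pulls the estimate toward 0.5.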

MLE vs. MAP

From the above analysis, i.e., \eqref{tmle} and \eqref{tmap}, we have

\[\widehat{\theta}_{MLE} = \frac{\sum_{i=1}^Nx_i}{N}\quad\text{and}\quad\widehat{\theta}_{MAP} =\frac{\sum_{i=1}^N x_i+\alpha-1}{N+\alpha+\beta-2}. \nonumber\]
  • As $N\to\infty$ with $\alpha$ and $\beta$ fixed, $\widehat{\theta}_{MAP}-\widehat{\theta}_{MLE}\to 0$: the prior contributes only the constants $\alpha-1$ and $\alpha+\beta-2$, so the difference is of order $1/N$. In other words, the MAP estimate and the MLE coincide in the limit of infinitely many samples. This tells us that with sufficiently many samples, the prior knowledge becomes insignificant, and the result is almost the same as estimating the parameter from the sample information alone.

  • By contrast, suppose we have a tiny sample size, e.g., $N=1$. Then we have

    \[\widehat{\theta}_{MLE}= \begin{cases} 1, \quad x_1 = 1 \\ 0, \quad x_1 = 0 \end{cases} \quad \text{and}\quad \widehat{\theta}_{MAP}= \begin{cases} \frac{\alpha}{\alpha+\beta-1}, \quad x_1 = 1 \\ \frac{\alpha-1}{\alpha+\beta-1}, \quad x_1 = 0 \end{cases}. \nonumber\]

    Therefore, when the sample size is small, MLE tends to return an extreme estimate (overfitting), while MAP gives a more moderate estimate because it uses the prior knowledge.
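
Both regimes discussed above can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the prior $\alpha=\beta=2$ and the head frequency of 0.7 are hypothetical values picked for illustration, and the function names are made up for this example.

```python
def mle(heads, n):
    """Sample frequency of heads."""
    return heads / n


def map_estimate(heads, n, alpha, beta):
    """Mode of the beta posterior; assumes the posterior parameters exceed 1."""
    return (heads + alpha - 1.0) / (n + alpha + beta - 2.0)


alpha, beta = 2.0, 2.0  # hypothetical prior peaked at 0.5

# Small-sample regime: a single flip that lands heads.
# MLE jumps to 1, while MAP moves only part of the way from 0.5.
print(mle(1, 1), map_estimate(1, 1, alpha, beta))

# Large-sample regime: keep the head frequency at 0.7 and let N grow.
# The gap between the two estimates shrinks roughly like 1/N.
gaps = []
for n in (10, 100, 1000, 10000):
    heads = (7 * n) // 10
    gaps.append(abs(mle(heads, n) - map_estimate(heads, n, alpha, beta)))
print(gaps)
```

With a single Head, the MLE is $1$ while the MAP estimate is $\alpha/(\alpha+\beta-1) = 2/3$ for this prior; as $N$ grows, the printed gaps decrease toward zero.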

