# EM算法

Suppose we have samples ${X_i,i=1,\cdots,n}$ from $p$-dimension normal distribution, and some observations have partial missing values. Without loss of generality, assume missing values occure at indexes $1,\cdots,d$, and $d < p$. Then how do we estimate the mean and variance of the underlying distribution? Of course, we could estimate by just using complete observations. But there are ways to use those observations with missing values. EM algorithm is helpful to handle this.

EM algorithm is an iteration algorithm. It can calculate posterior mode(or MLE) by simple iterations. In an iteration, there are two steps, E(Expectation) step and M(Maximization) step. Let $p(\theta|Y)$ the distribution density function of $\theta$ based on observation data $Y$, called observed posterior distribution. Set $p(\theta|Y,Z)$ the distribution density function of $\theta$ after adding data $Z$, called added posterior distribution. $p(Z|\theta,Y)$ stands for conditional distribution density function of additional data given $\theta$ and observation data $Y$. We want to calculate the mode of $p(\theta|Y)$. Denote $\theta^{(i)}$ the estimation before the (i+1)-th iteration, then two steps of (i+1)-th iteration are

E step: taking expectation of $p(\theta|Y,Z)$ or $\log{p(\theta|Y,Z)}$ on conditional distribution of $Z$, say

\begin{align*}
Q(\theta|\theta^{(i)},Y)&=E_Z[\log{p(\theta|Y,Z)|\theta^{(i)},Y]}\\
&=\int{\log{[p(\theta|Y,Z)]p(Z|\theta^{(i)},Y)}}dZ.
\end{align*}

M step: maximize $Q(\theta|\theta^{(i)},Y)$, set

$Q(\theta^{(i+1)}|\theta^{(i)},Y)=\mathop{max}_{\theta}Q(\theta|\theta^{(i)},Y),$

an iteration is formed by $\theta^{(i)}\to\theta^{(i+1)}$. Iteration can be stopped when $\parallel\theta^{(i+1)}-\theta^{(i)}\parallel$ or $\parallel Q(\theta^{(i+1)}|\theta^{(i)},Y)- Q(\theta^{(i)}|\theta^{(i)},Y)\parallel$ is small enough.

There is theorem guarantees after every iteration, EM algorithm increases posterior density function values, i.e.

$p(\theta^{(i+1)}|Y)\ge{p(\theta^{(i)}|Y)}.$

Proof can be found in some advanced statistics books.

The problem in the beginning can now be solved. Using EM algorithm, we add missing values as latent data, and with current estimation of mean and variance, we could get the distribution of missing values(subvector of a multivariate normal distribution vector is still a multivariate normal distribution vector), then we maximize conditional expectation of missing values and get the new estimation of mean and variance. Estimation can be obtained by iteration.

Now we present an example. Data in Table 1 from model: $y_{ij}$ is the $j$-th observation in $i$-th row, and

$(y_{ij}|\mu_i,\sigma)\sim{N(\mu_i,\sigma^2)},j=1,\cdots,n_i;i=1,\cdots,m,$

 row $i$ observation value 1 62,60,63,59 2 63,67,71,64,65,66 3 68,66,71,67,68,68 4 56,62,60,61,63,64,63,59

and all the $y_{ij}$’s are independent. Set

$\mu_i\sim{ N(\mu,\tau^2)}, \theta=(\mu,\log{\sigma},\log{\tau}),$

$Y={y_{ij},j=1,\cdots,n_i;i=1,\cdots,m},$

$Z=(\mu_1,\cdots,\mu_m),$

$n=\sum\limits_{i=1}^m\mu_i.$

$\theta$ is the unknown parameters we are interested in, set its prior distribution

$p(\mu,\log{\sigma},\log{\tau})\propto{\tau}.$

We know nothing about $\theta$, so we guess the above formula. By Bayes formula, we get posterior distribution

$p(\mu_1,\cdots,\mu_m,\mu,\log{\sigma},\log{\tau}|Y)\propto{p(\mu,\log{\sigma},\log{\tau})\prod\limits_{i=1}^mp(\mu_i|\mu,\tau)\prod\limits_{i=1}^m{\prod\limits_{j=1}^{n_i}p(y_{ij}|\mu,\sigma),}}$

extend and take the logarithm

\begin{align*}
&\log{p(\mu_1,\cdots,\mu_m,\mu,\log{\sigma},\log{\tau}|Y)}\\
&\propto{-n\log{\sigma}-(m-1) \log{\tau}-\frac{1}{2\tau^2}\sum\limits_{i=1}^m(\mu_i-\mu)^2-\frac{1}{2\sigma^2}\sum\limits_{i=1}^m{\sum\limits_{j=1}^{n_i}(\mu_i-y_{ij})^2}},(*)
\end{align*}

Maximizing $\log{p(\mu,\log{\sigma},\log{\tau}|Y)}$ directly to estimate $\mu$, $\sigma$, $\tau$ is hard, we add $Z$ and use EM algorithm to maximize it.

E step: set the current estimation of $\theta$ is $\theta^{(t)}=(\mu^{(t)},\sigma^{(t)},\tau^{(t)})$, we take the expectation of formula(*) given $Y$ and $\theta^{(t)}$. Normal distribution is the conjugated prior distribution of $\mu_i$, so

$(\mu_i|\theta^{(t)},Y)\sim{N(\hat{\mu}_i^{(t)},V_i^{(t)})},$

where

$\hat{\mu}_i^{(t)}=(\frac{\mu}{(\tau^{(t)} )^2}+\frac{n_i\hat{y}_i}{(\sigma^{(t)} )^2})/(\frac{1}{(\tau^{(t)} )^2}+\frac{n_i}{(\sigma^{(t)} )^2})$

$V_i^{(t)}=(\frac{1}{(\tau^{(t)} )^2}+\frac{n_i}{(\sigma^{(t)} )^2})^{-1}$

here $\hat{y}_i=\frac{1}{n_i}\sum\limits_{j=1}^{n_i}y_{ij}$. For any $C$ independent of $\mu_i$, we have

\begin{align*}
E[(\mu_i –C)^2|\theta^{(t)},Y]&=[E(\mu_i|\theta^{(t)},Y)-C]^2+Var(\mu_i|\theta^{(t)},Y)\\
&=(\hat{\mu}_i^{(t)}-C)^2+ V_i^{(t)}
\end{align*}

So

\begin{align*}
Q(\theta|\theta^{(t)},Y)=&-n\log{\sigma}-(m-1) \log{\tau}-\frac{1}{2\tau^2}\sum\limits_{i=1}^m[(\hat{\mu}_i^{(t)}-\mu)^2+ V_i^{(t)}]\\
-&\frac{1}{2\sigma^2}\sum\limits_{i=1}^m{\sum\limits_{j=1}^{n_i}[(\hat{\mu}_i^{(t)}-y_{ij})^2+ V_i^{(t)}]}+C_1,
\end{align*}

where $C_1$ is a quantity independent of $\theta$.

M step: maximize $Q(\theta|\theta^{(t)},Y)$ and we get

$\mu^{(t+1)}=\frac{1}{m}\sum\limits_{i=1}^m{\hat{\mu}_i^{(t)}},$

$\sigma^{(t+1)}=\{\frac{1}{n}\sum\limits_{i=1}^m{\sum\limits_{j=1}^{n_i}[(\hat{\mu}_i^{(t)}-y_{ij})^2+ V_i^{(t)}]}\}^{1/2},$

$\tau^{(t+1)}=\{\frac{1}{m-1}\sum\limits_{i=1}^m[(\hat{\mu}_i^{(t)}-\mu)^2+ V_i^{(t)}]\}^{1/2}.$

So we can estimate $\theta$ by iteration now.