4  Latent variables

4.1 Setting

Many statistical models simplify when written in terms of an unobserved latent variable \(U\) in addition to the observed data \(Y\). The latent variable

  • may really exist: for example, when \(Y = I(U > c)\) for some continuous \(U\) (“do you earn more than \(c\) per year?”);

  • may be a human construct: for example, something called IQ is said to underlie scores on intelligence tests, but is IQ just a cultural construct? (the “Mismeasure of Man” debate, etc.);

  • may just be a mathematical / computational device: for example, latent variables are used in the implementation of MCMC or EM algorithms.

Some prominent examples include models with random effects, hidden variables in probit regression, and mixture models.

Example 4.1 (Velocity of galaxies) The galaxies data set of the MASS R package provides the velocities, in km/sec, of 82 galaxies, moving away from our own galaxy, from 6 well-separated conic sections of an ‘unfilled’ survey of the Corona Borealis region. If galaxies are indeed super-clustered, the distribution of their velocities, estimated from the red-shift in their light spectra, would be multimodal, and unimodal otherwise.

Figure 4.1 shows two kernel density estimators with a Gaussian kernel but different bandwidth selection procedures. We can think of each observation as having a latent, unobserved variable indicating the supercluster to which the galaxy belongs. Clearly, depending on which density estimator is used, we may end up with different inferences.

Code
cols <- hcl.colors(3)
data("galaxies", package = "MASS")
## Fix typo see `?galaxies`
galaxies[78] <- 26960
## Rescale to 1000km/s
galaxies <- galaxies / 1000
## Empty plot, rug of the data, and two kernel density estimates
plot(x = c(0, 40), y = c(0, 0.25), type = "n", bty = "l",
     xlab = "velocity of galaxy (1000km/s)", ylab = "density")
rug(galaxies)
lines(density(galaxies, bw = "nrd0"), col = cols[2])  ## Silverman's rule-of-thumb bandwidth
lines(density(galaxies, bw = "SJ"), col = cols[3])    ## Sheather-Jones bandwidth
legend("topleft", legend = c('bw = "nrd0"', 'bw = "SJ"'),
       col = cols[2:3], lty = 1)
Figure 4.1: Density of galaxy velocities in 1000km/s, using two kernel density estimators with gaussian kernel but different bandwidth selection procedures.

4.2 Latent variable models

Let \([U]\) and \(D\) denote discrete random variables, and \((U)\) and \(X\) continuous ones, where \(U\) is latent and \(D\) and \(X\) are observed.

Then in notation for graphical models:

  • \([U] \to X\) or \([U]\to D\): finite mixture models, hidden Markov models, changepoint models, etc.

  • \((U) \to D\): data coarsening (censoring, truncation, \(\ldots\))

  • \((U)\to X\) or \((U)\to D\): variance components and other hierarchical models (generalized nonlinear mixed models, exploratory / confirmatory factor analysis, etc.)

Example 4.2 (Probit regression) For example, a generalized linear model for binary data with a probit link function can be written as \[ \begin{aligned} U_i & \stackrel{\text{ind}}{\sim}\mathop{\mathrm{N}}(x_i^\top \beta, 1) \\ Y_i & = I(U_i \geq 0) \, . \end{aligned} \] Since \(P(Y_i = 1) = P(U_i \geq 0) = \Phi(x_i^\top \beta)\), the log-likelihood of the probit regression model is \[ \sum_{i = 1}^n \left\{ y_i \log \Phi(x_i^\top \beta) + (1 - y_i)\log\{1 - \Phi(x_i^\top\beta)\} \right\} \, . \]

Different assumptions for the distribution of the latent variable \(U\) give rise to different models (e.g. a logistic distribution gives rise to logistic regression, an extreme-value distribution to complementary log-log regression, etc.).
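
A minimal simulation sketch of this latent-variable construction (the sample size, design matrix, and value of \(\beta\) below are arbitrary illustrations): data generated by thresholding \(U_i\) at zero and fitted with glm() and a probit link recover \(\beta\) approximately.

Code
## Simulate from the latent-variable representation of probit regression
set.seed(1)
n <- 5000
x <- cbind(1, runif(n, -2, 2))               ## intercept and one covariate
beta <- c(-0.5, 1)                           ## illustrative coefficients
u <- rnorm(n, mean = x %*% beta, sd = 1)     ## latent U_i ~ N(x_i' beta, 1)
y <- as.integer(u >= 0)                      ## observed Y_i = I(U_i >= 0)
## Fit the probit regression to the observed data only
fit <- glm(y ~ x - 1, family = binomial("probit"))
coef(fit)                                    ## close to beta for large n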

4.3 Finite mixture models

Settings like that of Example 4.1 are common in practice; observations are taken from a population composed of distinct sub-populations, but it is unknown from which of those sub-populations each observation comes. A natural model for such settings is a finite mixture model. A \(p\)-component finite mixture model has density or probability mass function \[ f(y \,;\,\pi, \theta) = \sum_{r=1}^p \pi_r f_r(y \,;\,\theta) \quad (0\leq \pi_r\leq 1 ; \sum_{r=1}^p \pi_r = 1) \,, \] where \(\pi_r\) is the probability that \(y\) is from the \(r\)th component, and \(f_r(y \,;\,\theta)\) is its density or probability mass function conditional on this event (component density).

We can represent the mixture model using an indicator variable \(U\), taking values in \(\{1, \ldots, p\}\) with probabilities \(\pi_1, \ldots, \pi_p\), respectively, indicating from which component \(Y\) is drawn.
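
This representation also suggests a direct way to simulate from a mixture: draw the label \(U\) first, and then draw \(Y\) from the selected component. A minimal sketch, with arbitrary illustrative choices of \(p\), the mixing probabilities, and the normal component parameters:

Code
## Simulate n observations from a 3-component normal mixture via the
## indicator representation of U
set.seed(1)
n <- 500
prob <- c(0.5, 0.3, 0.2)                           ## pi_1, ..., pi_p
mu <- c(-2, 0, 3)                                  ## component means
sigma <- c(0.5, 1, 0.7)                            ## component standard deviations
u <- sample(1:3, n, replace = TRUE, prob = prob)   ## latent component labels
y <- rnorm(n, mean = mu[u], sd = sigma[u])         ## Y drawn from component u
hist(y, breaks = 30, freq = FALSE, main = "", xlab = "y")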

Mixture models are a widely used class of models for density estimation and clustering. The number of components \(p\) is often unknown and must also be determined.

Such models are non-regular for likelihood inference, for the following reasons:

  • The ordering of the components is not identifiable. In other words, changing the order of the components does not change the model.

  • Setting \(\pi_r = 0\) eliminates the unknown parameters in \(f_r(y \,;\,\theta)\) from the model.

  • Depending on the specification of the component distributions, the maximum of the log-likelihood can be \(+\infty\), and can be approached for various values of \(\theta\) (illustrated below).

Typically, implementations impose constraints on the parameters to overcome the above issues.
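
As a quick illustration of the unbounded likelihood, consider a two-component normal mixture for the rescaled galaxies data of Example 4.1, with \(\mu_1\) fixed at the first observation, \(\sigma_1\) shrinking towards zero, and the remaining parameters set to arbitrary values:

Code
## Log-likelihood of a two-component normal mixture as sigma_1 -> 0
mix_loglik <- function(y, pi1, mu1, s1, mu2, s2) {
  sum(log(pi1 * dnorm(y, mu1, s1) + (1 - pi1) * dnorm(y, mu2, s2)))
}
s1 <- 10^(-(1:6))
sapply(s1, function(s)
  mix_loglik(galaxies, pi1 = 0.5, mu1 = galaxies[1], s1 = s,
             mu2 = mean(galaxies), s2 = sd(galaxies)))
## the log-likelihood increases without bound as sigma_1 -> 0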

4.4 Expectation-Maximization

Derivation of the EM algorithm

Suppose that the aim is to use the observed value \(y\) of \(Y\) for inference on \(\theta\) when we cannot easily compute \[ f(y \,;\,\theta) = \int f(y\mid u \,;\,\theta) f(u \,;\,\theta) du\,. \]

If we also had observations on \(U\), inference could be based on the complete data log-likelihood \(\log f(y, u \,;\,\theta)\), which, by the definition of conditional density, satisfies \[ \log f(y, u \,;\,\theta) = \log f(y \,;\,\theta) + \log f(u \mid y \,;\,\theta) \, . \tag{4.1}\]

On the other hand, the incomplete data log-likelihood (sometimes also called the observable data log-likelihood) is simply \[ \ell(\theta) = \log f(y \,;\,\theta) \, . \]

Taking expectations on both sides of (4.1) with respect to \(f(u \mid y \,;\,\theta')\) gives \[ \mathop{\mathrm{E}}\left( \log f(Y, U \,;\,\theta) \mid Y = y \,;\,\theta'\right) = \ell(\theta) + \mathop{\mathrm{E}}\left( \log f(U \mid Y \,;\,\theta)\mid Y = y \,;\,\theta'\right) \, , \tag{4.2}\] which we write as \[ Q(\theta\,;\,\theta') = \ell(\theta) + C(\theta \,;\,\theta') \, . \]

Let’s fix \(\theta'\), and consider how \(Q(\theta \,;\,\theta')\) and \(C(\theta \,;\,\theta')\) depend on \(\theta\). Note that, by Jensen’s inequality, \(C(\theta' \,;\,\theta')\geq C(\theta \,;\,\theta')\), with equality when \(\theta = \theta'\). Hence, \[ Q(\theta \,;\,\theta')\geq Q(\theta' \,;\,\theta') \implies \ell(\theta) - \ell(\theta') \geq C(\theta' \,;\,\theta') - C(\theta \,;\,\theta') \geq 0 \, . \tag{4.3}\]
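
To see where Jensen’s inequality enters, note that the difference \(C(\theta' \,;\,\theta') - C(\theta \,;\,\theta')\) is a Kullback–Leibler divergence: \[ C(\theta' \,;\,\theta') - C(\theta \,;\,\theta') = \mathop{\mathrm{E}}\left(\left. \log \frac{f(U \mid Y \,;\,\theta')}{f(U \mid Y \,;\,\theta)} \right| Y = y \,;\,\theta'\right) \geq -\log \mathop{\mathrm{E}}\left(\left. \frac{f(U \mid Y \,;\,\theta)}{f(U \mid Y \,;\,\theta')} \right| Y = y \,;\,\theta'\right) = 0 \,, \] because \(-\log\) is convex and the ratio inside the last expectation integrates to one against \(f(u \mid y \,;\,\theta')\).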

Under mild smoothness conditions, \(C(\theta \,;\,\theta')\) has a stationary point at \(\theta := \theta'\), so if \(Q(\theta \,;\,\theta')\) is stationary at \(\theta := \theta'\), so too is \(\ell(\theta)\).

Hence, we can now formulate the Expectation-Maximization (EM) algorithm for maximizing \(\ell(\theta)\). Starting from an initial value \(\theta'\) of \(\theta\):

  1. Compute \(Q(\theta \,;\,\theta') = \mathop{\mathrm{E}}\left( \log f(Y,U \,;\,\theta) \mid Y = y \,;\,\theta'\right)\).

  2. With \(\theta'\) fixed, maximize \(Q(\theta \,;\,\theta')\) with respect to \(\theta\), and let \(\theta^\dagger\) be the maximizer.

  3. Check if the algorithm has converged, based on \(\ell(\theta^\dagger) - \ell(\theta')\) if available, or \(\|\theta^\dagger - \theta'\|\), or both. If the algorithm has converged, stop and return \(\theta^\dagger\) as the value of the maximum likelihood estimator \(\hat\theta\). Otherwise, set \(\theta' := \theta^\dagger\) and go to 1.

Steps 1 and 2 are the Expectation (E) and maximization (M) steps, respectively.

The M-step ensures that \(Q(\theta^\dagger \,;\,\theta') \geq Q(\theta' \,;\,\theta')\), so (4.3) implies that \(\ell(\theta^\dagger) \geq \ell(\theta')\). The log likelihood never decreases between EM iterations!

Convergence

If \(\ell(\theta)\) has only one stationary point, and if \(Q(\theta \,;\, \theta')\) eventually reaches a stationary value at \(\hat\theta\), then \(\hat\theta\) must maximize \(\ell(\theta)\). Otherwise, the algorithm may converge to a local maximum of the log likelihood or to a saddlepoint.

Note that the EM algorithm never decreases the log likelihood, so it is generally more stable than Newton–Raphson-type algorithms. The rate of convergence depends on the closeness of \(Q(\theta \,;\,\theta')\) and \(\ell(\theta)\).

Similarly to (4.2), we can write \[ -{\partial^2\ell(\theta)\over\partial \theta\partial \theta^\top} = \mathop{\mathrm{E}}\left(\left. -{\partial^2\log f(Y, U \,;\,\theta)\over\partial \theta\partial \theta^\top} \right| Y=y \,;\,\theta\right) - \mathop{\mathrm{E}}\left(\left. -{\partial^2\log f(U \mid Y \,;\,\theta)\over\partial \theta\partial \theta^\top} \right| Y=y \,;\,\theta\right), \] or \(J(\theta) = I_c(\theta \,;\,y) - I_m(\theta \,;\,y)\). The latter expression is often referred to as the missing information principle: the observed information equals the complete-data information minus the missing information.

The rate of convergence of EM is slow if the largest eigenvalue of \(I_c(\theta \,;\,y)^{-1}I_m(\theta \,;\,y)\) is close to one. Roughly, this occurs if the missing information is a high proportion of the total.

Example 4.3 (Negative binomial) For a toy example, suppose that conditional on \(U = u\), \(Y\) is a Poisson variable with mean \(u\), and that \(U\) is gamma with mean \(\theta\) and variance \(\theta^2/\nu\). Inference is required for \(\theta\) with the shape parameter \(\nu > 0\) assumed known. The complete data log-likelihood is \[ y\log u -u - \log y! + \nu\log\nu -\nu\log\theta + (\nu-1)\log u -\nu u/\theta -\log\Gamma(\nu) \,, \] and hence (4.2) is \[ Q(\theta \,;\,\theta') = (y+\nu-1)\mathop{\mathrm{E}}(\log U\mid Y=y \,;\,\theta') -(1+\nu/\theta)\mathop{\mathrm{E}}(U\mid Y=y \,;\,\theta') -\nu\log\theta \] plus terms that depend neither on \(U\) nor on \(\theta\).

The E-step, that is the computation of \(Q(\theta \,;\,\theta')\), involves two expectations, but fortunately \(\mathop{\mathrm{E}}(\log U \mid Y = y \,;\, \theta')\) does not appear in terms that involve \(\theta\) and so is not required. To compute \(\mathop{\mathrm{E}}(U\mid Y = y \,;\,\theta')\), note that \(Y\) and \(U\) have joint density \[ f(y\mid u)f(u \,;\,\theta) = {u^y\over y!} e^{-u} \times {\nu^\nu u^{\nu-1}\over\theta^\nu \Gamma(\nu)} e^{-\nu u/\theta},\quad y=0,1,\ldots,\ u>0,\quad \theta>0 \, , \] so the marginal density of \(Y\) is \[ f(y \,;\,\theta) = \int_0^\infty f(y\mid u)f(u \,;\,\theta,\nu) du = {\Gamma(y+\nu)\nu^\nu \over\Gamma(\nu)y!} {\theta^y\over (\theta+\nu)^{y+\nu}} \quad (y = 0,1,\ldots)\, . \]

Hence, the conditional density \(f(u \mid y \,;\,\theta')\) is gamma with shape parameter \(y + \nu\) and mean \(\mathop{\mathrm{E}}(U\mid Y=y \,;\,\theta') = (y + \nu)/(1 + \nu / \theta')\). Ignoring terms that do not depend on \(\theta\), we can take \[ Q(\theta \,;\,\theta') = -(1+\nu/\theta)(y+\nu)/(1+\nu/\theta') -\nu\log\theta \, . \]

The M-step involves maximization of \(Q(\theta \,;\,\theta')\) over \(\theta\) for fixed \(\theta'\). Differentiating with respect to \(\theta\), we find that the maximizing value is \[ \theta^\dagger = \theta'(y+\nu)/(\theta'+\nu) \,. \tag{4.4}\] Hence, the EM algorithm boils down to choosing an initial \(\theta'\), updating it to \(\theta^\dagger\) using (4.4), setting \(\theta' = \theta^\dagger\), and iterating to convergence.
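
A minimal numerical sketch of this iteration (the values of \(y\), \(\nu\), and the starting point below are arbitrary illustrations):

Code
## EM iteration (4.4) for the Poisson-gamma toy example
y <- 7; nu <- 2
theta <- 0.5                                     ## initial value theta'
repeat {
  theta_new <- theta * (y + nu) / (theta + nu)   ## update (4.4)
  if (abs(theta_new - theta) < 1e-10) break
  theta <- theta_new
}
theta                                            ## converges to y

Setting \(\theta^\dagger = \theta'\) in (4.4) shows that its fixed points are \(\theta = 0\) and \(\theta = y\), and for \(y > 0\) the iteration converges to \(\theta = y\) from any \(\theta' > 0\); this is also the maximizer of the incomplete-data log-likelihood here. One can further check that the derivative of the update map at \(\theta = y\) is \(\nu/(y + \nu)\), which equals \(I_c(\hat\theta \,;\,y)^{-1} I_m(\hat\theta \,;\,y)\), in line with the earlier remark on the rate of convergence.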

4.5 EM for mixture models

Consider the \(p\)-component mixture density \(f(y \,;\,\pi, \theta) = \sum_{r=1}^p \pi_r f_r(y \,;\,\theta)\). The contribution of a single observation to the complete data log-likelihood (assuming we knew from which component \(y\) comes) has the form \[ \log f(y, u \,;\,\pi, \theta) = \sum_{r=1}^p I(u=r)\left\{ \log \pi_r + \log f_r(y \,;\,\theta)\right\} \, . \]

Hence, for the E-step, we must compute the expectation of \(\log f(y,u \,;\,\pi, \theta)\) over the conditional distribution \[ w_r(y \,;\,\pi', \theta') = P(U = r \mid Y = y \,;\,\pi', \theta') = {\pi'_r f_r(y \,;\,\theta')\over \sum_{s=1}^p \pi'_s f_s(y \,;\,\theta')} \quad (r=1,\ldots,p)\,. \tag{4.5}\] This probability can be regarded as the weight attributable to component \(r\) if \(y\) has been observed.

The expected value of \(I(U = r)\) with respect to (4.5) is \(w_r(y \,;\,\pi', \theta')\), so the expected value of the complete-data log likelihood based on a random sample \((y_1, u_1), \ldots, (y_n, u_n)\) is \[ \begin{aligned} Q(\pi, \theta \,;\,\pi', \theta') & = \sum_{j=1}^n \sum_{r=1}^p w_r(y_j \,;\,\pi', \theta')\left\{ \log\pi_r + \log f_r(y_j \,;\,\theta)\right\}\\ &= \sum_{r=1}^p \left\{ \sum_{j=1}^n w_r(y_j \,;\,\pi', \theta')\right\}\log\pi_r + \sum_{r=1}^p \sum_{j=1}^n w_r(y_j \,;\,\pi', \theta')\log f_r(y_j \,;\,\theta) \, . \end{aligned} \]

The M-step of the algorithm entails maximizing \(Q(\pi, \theta \,;\,\pi', \theta')\) over \(\pi\) and \(\theta\) for fixed \(\pi'\) and \(\theta'\). As the \(\pi_r\) do not usually appear in the component densities \(f_r\), the maximizing values \(\pi_r^\dagger\) are obtained from the first term of \(Q\), which corresponds to a multinomial log likelihood. Thus, \[ \pi_r^\dagger = \frac{1}{n} \sum_{j=1}^n w_r(y_j \,;\,\pi', \theta') \,, \] which is the average weight for component \(r\).

Estimates of the parameters of the component distributions are obtained from the weighted log likelihoods that form the second term of \(Q(\pi, \theta \,;\,\pi', \theta')\). For example, if \(f_r\) is the normal density with mean \(\mu_r\) and variance \(\sigma^2_r\), simple calculations give the weighted estimates \[ \mu_r^\dagger = {\sum_{j=1}^n w_r(y_j \,;\,\pi', \theta') y_j\over \sum_{j=1}^n w_r(y_j \,;\,\pi', \theta')}\,, \quad \sigma^{2\dagger}_r = {\sum_{j=1}^n w_r(y_j \,;\,\pi', \theta')(y_j-\mu_r^\dagger)^2\over \sum_{j=1}^n w_r(y_j \,;\,\pi', \theta')} \quad (r=1,\ldots,p) \, . \]

Given initial values for the parameters, the EM algorithm simply involves computing the weights \(w_r(y_j \,;\,\pi', \theta')\) at these initial values, updating to obtain \(\pi^\dagger, \theta^\dagger\), and checking convergence using the log likelihood, \(\|\theta^\dagger - \theta'\| + \|\pi^\dagger - \pi'\|\), or both. If convergence is not yet attained, \(\pi', \theta'\) are replaced by \(\pi^\dagger, \theta^\dagger\) and the cycle repeated.
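
The updates above can be coded directly. The sketch below is only illustrative (the function name, the crude block-based starting values, and the stopping rule are ad hoc choices, and EM may only find a local maximum); a mature implementation, such as the mclust package used in Example 4.4 below, handles degeneracies and model selection.

Code
## Hand-rolled EM for a p-component normal mixture, using the weights (4.5)
## and the weighted updates above; applied to the rescaled galaxies data
normmix_em <- function(y, p, maxit = 500, tol = 1e-8) {
  n <- length(y)
  ## Crude starting values: split the sorted data into p blocks
  groups <- split(sort(y), cut(seq_len(n), p))
  pr <- rep(1 / p, p)
  mu <- sapply(groups, mean)
  sigma2 <- sapply(groups, var)
  ll_old <- -Inf
  for (it in seq_len(maxit)) {
    ## E-step: weights w_{jr} = P(U_j = r | Y_j = y_j)
    dens <- sapply(1:p, function(r) pr[r] * dnorm(y, mu[r], sqrt(sigma2[r])))
    w <- dens / rowSums(dens)
    ## M-step: weighted updates for pi, mu and sigma^2
    pr <- colMeans(w)
    mu <- colSums(w * y) / colSums(w)
    sigma2 <- colSums(w * outer(y, mu, "-")^2) / colSums(w)
    ## Log-likelihood at the parameter values used in the E-step
    ll <- sum(log(rowSums(dens)))
    if (ll - ll_old < tol) break
    ll_old <- ll
  }
  dens <- sapply(1:p, function(r) pr[r] * dnorm(y, mu[r], sqrt(sigma2[r])))
  list(pi = pr, mu = mu, sigma2 = sigma2, loglik = sum(log(rowSums(dens))))
}
normmix_em(galaxies, p = 4)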

Example 4.4 (Velocity of galaxies (revisited)) We now revisit Example 4.1 and fit a mixture model with normal component densities, where each component has its own mean and variance. This can be done using the mclust R package, which provides the EM algorithm for Gaussian mixture models. The code chunk below fits all mixture models with 1 to 10 components, and plots the values of BIC for the 10 models (note that mclust reports the negative of the BIC as we defined it). The model with 4 components has the best BIC value.

Code
library("mclust")
gal_mix <- Mclust(galaxies, G = 1:10, modelNames = "V")
plot(gal_mix, what = "BIC")

Figure 4.2 shows the estimated mixture density.

Code
plot(x = c(0, 40), y = c(0, 0.25), type = "n", bty = "l",
     xlab = "velocity of galaxy (1000km/s)", ylab = "density")
rug(galaxies)
lines(density(galaxies, bw = "nrd0"), col = cols[2])
lines(density(galaxies, bw = "SJ"), col = cols[3])
legend("topleft", legend = c('GMM', 'bw = "nrd0"', 'bw = "SJ"'),
       col = cols, lty = 1)
## Add the density of the fitted 4-component normal mixture
ra <- seq(0, 40, length.out = 1000)
lines(ra, dens(ra, modelName = "V", parameters = gal_mix$parameters),
      col = cols[1])
Figure 4.2: Density of galaxy velocities in 1000km/s, using two kernel density estimators with gaussian kernel but different bandwidth selection procedures, and a mixture of 4 normal densities.

One of the advantages of using EM to fit mixture models is that the weights (4.5) from the E-step can be used to determine both the component to which each observation has been assigned and the uncertainty around that assignment. For example,

Code
## Classification: assignment to the component with the largest weight (4.5);
## uncertainty: one minus that largest weight
par(mfrow = c(1, 2))
plot(gal_mix, what = "classification")
plot(gal_mix, what = "uncertainty")

4.6 Exponential families

Suppose that the complete-data log likelihood is the logarithm of the density or probability mass function of an exponential family distribution, that is \[ \log f(y,u \,;\,\theta) = s(y,u)^\top\theta - \kappa(\theta) + c(y,u) \, . \tag{4.6}\]

In order to implement the EM algorithm, we need the expected value of \(\log f(y,u \,;\,\theta)\) with respect to \(f(u\mid y \,;\,\theta')\). The final term in (4.6) can be ignored. Hence, the M-step involves maximizing \[ Q(\theta \,;\,\theta') = \mathop{\mathrm{E}}\left( s(y,U)^\top\theta\mid Y = y \,;\,\theta'\right) - \kappa(\theta) \, , \] or, equivalently, solving for \(\theta\) the equation \[ \mathop{\mathrm{E}}\left( s(y,U)\mid Y=y \,;\,\theta'\right) = {\partial\kappa(\theta)\over\partial\theta} \, . \]

The likelihood equation for \(\theta\) based on the complete data is \(s(y,u) = \partial\kappa(\theta)/\partial\theta\). So, the EM algorithm simply replaces \(s(y , u)\) by its conditional expectation \(\mathop{\mathrm{E}}\left( s(y,U)\mid Y=y \,;\,\theta'\right)\) and solves the likelihood equation. Thus, a routine to fit the complete-data model can readily be adapted for missing data if the conditional expectations are available.
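
As a brief sketch of this recipe, consider right-censored exponential observations (a data-coarsening setting, as in Section 4.2). With rate \(\lambda\), the complete-data log-likelihood is \(n\log\lambda - \lambda\sum_i u_i\), so the complete-data likelihood equation gives \(\lambda = n/\sum_i u_i\); by the lack-of-memory property, the E-step replaces \(u_i\) for a censored observation by \(y_i + 1/\lambda'\). The simulated data below (sample size, true rate, and censoring distribution) are arbitrary illustrations.

Code
## EM for right-censored exponential data (rate parametrization)
set.seed(1)
n <- 200
u <- rexp(n, rate = 0.5)                   ## latent survival times
cens <- rexp(n, rate = 0.25)               ## censoring times
y <- pmin(u, cens)                         ## observed times
d <- as.integer(u <= cens)                 ## event indicator (1 = uncensored)
lambda <- 1                                ## initial value lambda'
repeat {
  ## E-step: E(U_i | data; lambda') is y_i if uncensored and
  ## y_i + 1 / lambda' if censored
  s <- sum(y) + sum(1 - d) / lambda
  ## M-step: complete-data MLE n / sum(u_i), with sum(u_i) replaced by s
  lambda_new <- n / s
  if (abs(lambda_new - lambda) < 1e-10) break
  lambda <- lambda_new
}
c(em = lambda, closed_form = sum(d) / sum(y))   ## the two estimates agree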