Model overview

smoothEMr fits a Gaussian mixture model with a structured (smoothness) prior on the component means. Let $X_i \in \mathbb{R}^d$ be the observed vector for $i = 1,\dots,n$, and let $z_i \in \{1,\dots,K\}$ be the latent component label. The observation model is

$$ X_i \mid z_i = k \sim N(\mu_k, \Sigma), \qquad \Pr(z_i = k) = \pi_k . $$

To encourage ordered / smooth transitions across components $k = 1,\dots,K$, we place a multivariate normal prior on the stacked mean vector $U = (\mu_1^\top, \dots, \mu_K^\top)^\top$:

$$ U \sim N(0, Q_\lambda^{-1}), $$

where the precision matrix $Q_\lambda$ encodes the desired structure across components. For computational efficiency, we require $Q_\lambda$ to be sparse, so the prior is a GMRF (Gaussian Markov random field). Many common GMRF priors can be interpreted as discrete analogues of continuous Gaussian-process priors. Equivalently, the M-step solves a penalized optimization problem with quadratic penalty $\tfrac{1}{2} U^\top Q_\lambda U$.

The default choice in smoothEMr is a $q$-th order random-walk (RW) prior, with

$$ Q_\lambda = \lambda Q_{\mathrm{rw}q}, $$

where $Q_{\mathrm{rw}q}$ is the precision matrix induced by $q$-th order discrete differences. Larger $\lambda$ enforces stronger smoothness across neighboring components.

Equivalently, letting $\mu_{\cdot,j} = (\mu_{1,j},\dots,\mu_{K,j})^\top \in \mathbb{R}^K$, the RW2 penalty can be written as $$ \sum_{k=3}^K \big(\mu_{k,j} - 2\mu_{k-1,j} + \mu_{k-2,j}\big)^2 = \mu_{\cdot,j}^\top Q_{\mathrm{rw}2}\,\mu_{\cdot,j}, \qquad Q_{\mathrm{rw}2} = D_2^\top D_2, $$ where $D_2 \in \mathbb{R}^{(K-2)\times K}$ is the second-difference matrix with rows $(0,\dots,0,\,1,\,-2,\,1,\,0,\dots,0)$.
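In R, the second-difference matrix and the induced RW2 precision can be built with `diff()`. The snippet below is a standalone illustration (not part of smoothEMr's exported API) that verifies the identity above:

```r
# Build the RW2 precision matrix from the second-difference matrix
# and check that the quadratic form equals the sum of squared
# second differences.
K <- 8
D2 <- diff(diag(K), differences = 2)   # (K-2) x K second-difference matrix
Q_rw2 <- t(D2) %*% D2                  # RW2 structure matrix

mu <- sin(seq(0, pi, length.out = K))  # an arbitrary mean curve
pen_quad <- drop(t(mu) %*% Q_rw2 %*% mu)      # mu' Q mu
pen_diff <- sum(diff(mu, differences = 2)^2)  # sum of squared 2nd differences
all.equal(pen_quad, pen_diff)          # TRUE: the two forms agree
```

Because `Q_rw2` is banded (bandwidth 2), it stays sparse for any $K$, which is what makes the GMRF computations cheap.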

By default, smoothEMr assumes $\Sigma$ is diagonal and shared across components (homoskedasticity), but the framework can be extended to accommodate heteroskedasticity or other covariance structures.
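The generative model above can be sketched in a few lines of R. All quantities here (the circular mean curve, the diagonal `Sigma_diag`) are illustrative choices, not smoothEMr defaults:

```r
# Simulate from the model: smooth means along k, shared diagonal
# covariance, labels drawn from pi.
set.seed(1)
K <- 30; d <- 2; n <- 500
t_k <- seq(0, 2 * pi, length.out = K)
mu  <- cbind(cos(t_k), sin(t_k))       # K x d smooth mean curve
pi_k <- rep(1 / K, K)                  # uniform mixing weights
Sigma_diag <- c(0.05, 0.05)            # shared diagonal covariance

z <- sample(K, n, replace = TRUE, prob = pi_k)   # latent labels
X <- mu[z, ] + sweep(matrix(rnorm(n * d), n, d), 2, sqrt(Sigma_diag), `*`)
```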

Algorithm overview (smooth-EM)

Fix a smoothing strength $\lambda$ and a sparse precision matrix $Q_\lambda = \lambda Q_{\mathrm{rw}q}$. Let $U = (\mu_1^\top,\dots,\mu_K^\top)^\top$ stack all component means. We seek a MAP estimate by maximizing the log-posterior (up to an additive constant) over $(\pi, U, \Sigma)$:

$$ J(\pi,U,\Sigma;\lambda) = \sum_{i=1}^n \log\Big(\sum_{k=1}^K \pi_k\, p(X_i \mid \mu_k,\Sigma)\Big) - \tfrac{1}{2} U^\top Q_\lambda U + \text{const}(\lambda), $$ where $\text{const}(\lambda)$ collects terms that do not depend on $(\pi,U,\Sigma)$.

The smooth-EM algorithm performs coordinate ascent on an ELBO (evidence lower bound) $\mathrm{ELBO}(\Gamma,\pi,U,\Sigma;\lambda)$ of the form $$ \mathrm{ELBO}(\Gamma,\pi,U,\Sigma;\lambda) = \sum_{i=1}^n \sum_{k=1}^K \gamma_{ik} \big\{ \log \pi_k + \log p(X_i \mid \mu_k,\Sigma) \big\} + H(\Gamma) - \tfrac{1}{2} U^\top Q_\lambda U + \text{const}(\lambda), $$ where $H(\Gamma) = -\sum_{i,k}\gamma_{ik}\log\gamma_{ik}$ is the entropy term.

Relationship between ELBO and the objective. For any $\Gamma$, $$ J(\pi,U,\Sigma;\lambda) \ge \mathrm{ELBO}(\Gamma,\pi,U,\Sigma;\lambda), $$ and equality holds when $\gamma_{ik} = \Pr(z_i = k \mid X_i,\pi,U,\Sigma)$ (i.e., $\Gamma$ equals the posterior responsibilities). Thus, maximizing the ELBO provides a monotone lower-bound ascent procedure for the MAP objective $J$.
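This bound is easy to check numerically. The standalone snippet below uses a toy one-dimensional two-component mixture with $\lambda = 0$ (the penalty term appears identically on both sides of the inequality, so it does not affect the comparison):

```r
# Verify: ELBO <= J for any responsibilities G, with equality when G
# equals the exact posterior responsibilities.
set.seed(1)
x <- c(rnorm(50, -2), rnorm(50, 2))
pi_k <- c(0.5, 0.5); mu <- c(-1.5, 1.8); sigma <- 1

lik <- cbind(pi_k[1] * dnorm(x, mu[1], sigma),
             pi_k[2] * dnorm(x, mu[2], sigma))   # n x K matrix pi_k p(x | k)
J <- sum(log(rowSums(lik)))                      # log-likelihood (lambda = 0)

elbo <- function(G) sum(G * log(lik)) - sum(G * log(G))  # ELBO with entropy

G_post <- lik / rowSums(lik)           # exact posterior responsibilities
G_unif <- matrix(0.5, length(x), 2)    # a mismatched alternative

all.equal(elbo(G_post), J)             # equality at the posterior
elbo(G_unif) < J                       # strict bound otherwise
```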

Given current parameter values, each iteration alternates:

  1. E-step (update $\Gamma$): $$ \gamma_{ik} \propto \pi_k\, p(X_i \mid \mu_k,\Sigma), \quad \text{normalized over } k = 1,\dots,K. $$

  2. M-step (update $\pi, U, \Sigma$): update $\pi$ and $\Sigma$ in closed form given $\Gamma$, and update $U$ via $$ U \leftarrow \arg\max_U \Big\{ \sum_{i,k} \gamma_{ik}\log p(X_i \mid \mu_k,\Sigma) - \tfrac{1}{2} U^\top Q_\lambda U \Big\}. $$

With $\lambda$ fixed, the ELBO is non-decreasing over iterations; moreover, because each E-step makes the bound tight at the current parameters, the MAP objective $J$ is also non-decreasing.
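The two steps above can be sketched in a few lines of R for one-dimensional data. This is an illustration only (smoothEMr's internals differ); note that with a Gaussian observation model the penalized mean update reduces to the sparse linear system $(\mathrm{diag}(n_k)/\sigma^2 + \lambda Q)\,\mu = s/\sigma^2$, where $n_k = \sum_i \gamma_{ik}$ and $s_k = \sum_i \gamma_{ik} x_i$:

```r
# Minimal 1-d smooth-EM sketch with an RW2 penalty.
set.seed(2)
x <- c(rnorm(100, -2, 0.5), rnorm(100, 2, 0.5))
K <- 10; lambda <- 5
Q <- crossprod(diff(diag(K), differences = 2))   # RW2 structure matrix
mu <- seq(min(x), max(x), length.out = K)        # crude initialization
pi_k <- rep(1 / K, K); sigma <- sd(x)

pen_loglik <- numeric(25)
for (it in 1:25) {
  lik <- sapply(1:K, function(k) pi_k[k] * dnorm(x, mu[k], sigma))
  pen_loglik[it] <- sum(log(rowSums(lik))) -
    lambda / 2 * drop(t(mu) %*% Q %*% mu)        # MAP objective J at current params
  ## E-step: responsibilities, normalized over k
  G <- lik / rowSums(lik)
  ## M-step: pi and sigma in closed form; mu via the penalized linear system
  n_k <- colSums(G)
  pi_k <- n_k / length(x)
  s <- colSums(G * x)                            # weighted sums per component
  mu <- drop(solve(diag(n_k) / sigma^2 + lambda * Q, s / sigma^2))
  sigma <- sqrt(sum(G * outer(x, mu, "-")^2) / length(x))
}
```

The trace `pen_loglik` is non-decreasing, matching the monotonicity guarantee above.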

A simple 2D example

We simulate data from a spiral (swiss-roll) embedded in 2D space, then fit a smoothEM model.

sim <- simulate_swiss_roll_1d_2d(n = 1500, sigma = 0.2, seed = 123)

plot(sim$obs$x, sim$obs$y, pch = 16, cex = 0.35, col = "grey80",
     xlab = "x", ylab = "y", main = "Noisy 2D swiss-roll (spiral)")

pal <- colorRampPalette(c("navy", "cyan", "yellow", "red"))(256)
t_scaled <- (sim$t - min(sim$t)) / (max(sim$t) - min(sim$t))  # in [0,1]
col_t <- pal[pmax(1, pmin(256, 1 + floor(255 * t_scaled)))]
points(sim$truth$x, sim$truth$y, pch = 16, cex = 0.25, col = col_t)

To fit the model, we first initialize the smoothEM algorithm with a chosen initialization method. Here we use "fiedler" initialization and fix the smoothing parameter lambda at 10.

set.seed(123)
fit <- initialize_csmoothEM(
  X = as.matrix(sim$obs),
  method = "fiedler",
  adaptive = FALSE,
  k = 15,
  K = 100,
  lambda = 10
)
fit
#> Fitted csmoothEM object with RW(2) prior along K = 100
#> -----
#>   n = 1500, d = 2, modelName = homoskedastic
#>   iter = 1; init_method = fiedler; adaptive = none;
#>   lambda_vec: range=[10, 10], mean=10, relative = TRUE
#>   ELBO last = -10472.720320; penLogLik last = -10383.239352; ML/C last = -10472.704642

The key function do_csmoothEM() runs smooth-EM updates for a given number of iterations.

fit <- fit |>
  do_csmoothEM(iter = 1)

plot(fit)

We can also inspect ELBO values over iterations:

plot(fit, plot_type = "elbo")

Adaptive estimation of the smoothing parameter $\lambda$

Optionally, smoothEMr can update $\lambda$ during the EM iterations. Unlike the MAP updates performed when $\lambda$ is fixed, here we aim to optimize the marginal likelihood $$ \ell(\lambda) = \log p(X \mid \eta, \lambda) = \log \int p(X \mid \mu, \eta)\, p(\mu \mid \lambda)\, d\mu, $$ where $\eta$ collects the remaining parameters $(\pi, \Sigma)$.

The marginal likelihood is generally intractable. Using the ELBO as a surrogate for $\log p(X \mid \mu, \eta)$ yields an approximate marginal likelihood $$ \widetilde{\ell}(\lambda) = \log \int \exp\big(\mathrm{ELBO}(q; \mu, \eta)\big)\, p(\mu \mid \lambda)\, d\mu. $$

Because the ELBO is (locally) quadratic in $\mu$ and the prior $p(\mu \mid \lambda)$ is Gaussian, we can further obtain a closed-form Laplace approximation. Let $$ \widehat{\mu} = \arg\max_{\mu}\big\{\mathrm{ELBO}(q;\mu,\eta) + \log p(\mu \mid \lambda)\big\}, \qquad H = -\nabla_\mu^2\big[\mathrm{ELBO}(q;\mu,\eta) + \log p(\mu \mid \lambda)\big]\Big|_{\mu=\widehat{\mu}}. $$ Then the collapsed objective is $$ C(q;\eta,\lambda) = \mathrm{ELBO}(q;\widehat{\mu},\eta) + \log p(\widehat{\mu} \mid \lambda) + \frac{K}{2}\log(2\pi) - \frac{1}{2}\log|H|. $$
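The collapsed objective can be sketched for a single coordinate with the responsibilities held fixed, in which case the ELBO is exactly quadratic in $\mu$ and the Laplace approximation is exact. Everything below is illustrative: `n_k`, `m_k`, and `sigma2` stand in for fixed E-step summaries, and we adopt the intrinsic-GMRF convention of using the rank $K-2$ of $Q_{\mathrm{rw}2}$ in the $\lambda$-dependent part of the prior normalizing constant (smoothEMr's actual implementation may differ):

```r
# Evaluate the collapsed (Laplace) objective over a lambda grid for one
# coordinate, with responsibilities fixed.
set.seed(3)
K <- 20
n_k <- rpois(K, 30) + 1                     # effective counts per component
m_k <- sin(seq(0, 2 * pi, length.out = K)) + rnorm(K, sd = 0.3)  # targets
sigma2 <- 0.3^2
Q <- crossprod(diff(diag(K), differences = 2))   # RW2 structure matrix

collapsed <- function(lambda) {
  H <- diag(n_k) / sigma2 + lambda * Q          # negative Hessian at the mode
  mu_hat <- drop(solve(H, n_k * m_k / sigma2))  # mode of the quadratic objective
  elbo_q <- -sum(n_k * (m_k - mu_hat)^2) / (2 * sigma2)  # quadratic ELBO part
  log_prior <- ((K - 2) / 2) * log(lambda) -
    (lambda / 2) * drop(t(mu_hat) %*% Q %*% mu_hat)
  elbo_q + log_prior - 0.5 * as.numeric(determinant(H, logarithm = TRUE)$modulus)
}

lam_grid <- 10^seq(-2, 3, length.out = 30)
cvals <- sapply(lam_grid, collapsed)
lam_grid[which.max(cvals)]                  # lambda maximizing the collapsed objective
```

The objective tends to $-\infty$ as $\lambda \to 0$ (vanishing prior mass) and decays for very large $\lambda$ (poor fit), so the maximizer lies in the interior of a sufficiently wide grid.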

To implement adaptive estimation of $\lambda$, we just need to set adaptive = "ml" in the initialization step.

set.seed(123)
fit_ml <- initialize_csmoothEM(
  X = as.matrix(sim$obs),
  method = "fiedler",
  adaptive = "ml"
)
fit_ml
#> Fitted csmoothEM object with RW(2) prior along K = 50
#> -----
#>   n = 1500, d = 2, modelName = homoskedastic
#>   iter = 1; init_method = fiedler; adaptive = ml;
#>   lambda_vec: range=[1.16, 1.39], mean=1.28, relative = TRUE
#>   ELBO last = -10548.204455; penLogLik last = -10500.506958; ML/C last = -10525.414589
fit_ml <- fit_ml |>
  do_csmoothEM(iter = 30)

plot(fit_ml)

plot(fit_ml, plot_type = "elbo")