Overview

The package considers the following functional adaptive shrinkage (FASH) scenario. Given $N$ sets of series data $\boldsymbol{y} = \{y_i(t_j): j \in [n_i]\}_{i=1}^{N}$, where $n_i$ is the length of the $i$-th series, we assume that each series $y_i$ consists of $n_i$ noisy measurements of an underlying function $f_i(t)$ over the continuous treatment $t$ at levels $\{t_j\}_{j=1}^{n_i}$.
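To make the data layout concrete, here is a minimal simulation sketch in Python (purely illustrative, not the package interface) that generates $N$ series of unequal lengths observed on their own grids:

```python
import numpy as np

rng = np.random.default_rng(1)

N = 5                                        # number of series (hypothetical)
series = []
for i in range(N):
    n_i = int(rng.integers(8, 16))           # length n_i of the i-th series
    t = np.sort(rng.uniform(0, 10, n_i))     # treatment levels t_1, ..., t_{n_i}
    f = np.sin(0.5 * t) * rng.normal()       # an (unknown) underlying function f_i(t)
    y = f + rng.normal(0, 0.3, n_i)          # noisy measurements y_i(t_j)
    series.append({"t": t, "y": y})
```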

Unlike traditional smoothing methods that assume each $f_i(t)$ has a separate prior $g_i$, FASH assumes that all $f_i(t)$ are i.i.d. with a common prior $g_f$. Generalizing the idea from Stephens (2017) and Urbut et al. (2018), the prior $g_f$ takes the form of a finite mixture of Gaussian processes (GPs):
$$g_f|\boldsymbol{\pi} = \sum_{k=0}^{K} \pi_k\,\text{GP}(m_k, C_k),$$
where $\boldsymbol{\pi} = [\pi_0, \ldots, \pi_K]^T$ is the prior mixing weight vector, $m_k$ is the mean function, and $C_k$ is the covariance function of the $k$-th GP.
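To see what a draw from this mixture prior looks like, the following illustrative Python sketch first picks a component $k$ with probability $\pi_k$ and then draws a function from the corresponding GP on a grid; the zero means and squared-exponential covariances here are placeholders (the operator-based covariances actually used by FASH are introduced below):

```python
import numpy as np

def se_cov(t, amp, ls):
    """Squared-exponential covariance matrix on grid t (placeholder kernel)."""
    d = t[:, None] - t[None, :]
    return amp**2 * np.exp(-0.5 * (d / ls) ** 2)

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50)

pi = np.array([0.5, 0.3, 0.2])               # prior mixing weights
covs = [se_cov(t, a, l) + 1e-8 * np.eye(len(t))   # small jitter for numerical stability
        for a, l in [(0.1, 5.0), (1.0, 2.0), (2.0, 0.5)]]

k = rng.choice(len(pi), p=pi)                # pick a mixture component with probability pi_k
f = rng.multivariate_normal(np.zeros(len(t)), covs[k])  # draw f ~ GP(0, C_k) on the grid
```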

Rather than integrating out the prior mixing weights $\boldsymbol{\pi}$ with a given prior $p(\boldsymbol{\pi})$ as
$$g_f = \int (g_f|\boldsymbol{\pi})\, p(\boldsymbol{\pi})\, d\boldsymbol{\pi},$$
FASH optimizes $\hat{\boldsymbol{\pi}}$ by maximizing the marginal likelihood of the data $\boldsymbol{y}$:
$$\hat{\boldsymbol{\pi}} = \arg\max_{\boldsymbol{\pi}} \sum_{i=1}^{N} \log\left(\sum_{k=0}^{K} \pi_k \mathbf{L}_{ik}\right),$$
where $\mathbf{L}_{ik}$ denotes the marginal likelihood of the $i$-th series under the $k$-th GP component.
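This is a standard mixture-proportion estimation problem: given the $N \times (K+1)$ matrix of marginal likelihoods $\mathbf{L}$, the objective is concave in $\boldsymbol{\pi}$ over the simplex and can be maximized with EM or faster convex solvers (as in Stephens, 2017). A minimal EM sketch in Python, assuming $\mathbf{L}$ has already been computed:

```python
import numpy as np

def estimate_pi(L, n_iter=1000, tol=1e-10):
    """EM updates for the mixing weights pi, given the N x (K+1) matrix
    L[i, k] = marginal likelihood of series i under GP component k."""
    N, K1 = L.shape
    pi = np.full(K1, 1.0 / K1)                   # start from uniform weights
    for _ in range(n_iter):
        resp = pi * L                            # unnormalized responsibilities
        resp /= resp.sum(axis=1, keepdims=True)  # E-step: per-series component weights
        pi_new = resp.mean(axis=0)               # M-step: average the responsibilities
        if np.max(np.abs(pi_new - pi)) < tol:
            return pi_new
        pi = pi_new
    return pi
```

Each EM iteration never decreases the log marginal likelihood above; implementations in this family often swap EM for a faster sequential quadratic programming solver, but the maximizer is the same.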

Then the prior $g_f$ is determined as:
$$\hat{g}_f = \int (g_f|\boldsymbol{\pi})\, \delta_{\hat{\boldsymbol{\pi}}}(\boldsymbol{\pi})\, d\boldsymbol{\pi} = g_f|\hat{\boldsymbol{\pi}}.$$

Based on the estimated prior $\hat{g}_f$, FASH then obtains the posterior $p(f_i(t)|\boldsymbol{y})$ for each series by:
$$p(f_i(t)|\boldsymbol{y}, \hat{\boldsymbol{\pi}}) = \sum_{k=0}^{K} \tilde{\pi}_{ik}\, p_k(f_i(t)|\boldsymbol{y}_i),$$

where $p_k(f_i(t)|\boldsymbol{y}_i)$ is the posterior of $f_i(t)$ given $\boldsymbol{y}_i$ under the $k$-th GP component, and $\tilde{\pi}_{ik} = \hat{\pi}_k \mathbf{L}_{ik} / \sum_{k'=0}^{K} \hat{\pi}_{k'} \mathbf{L}_{ik'}$ is the corresponding posterior mixing weight.
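Concretely, once $\hat{\boldsymbol{\pi}}$, the marginal likelihoods, and the per-component posteriors are available, mixture posterior summaries are just weighted combinations. A small illustrative sketch for one series (the per-component posterior means `mu_i` are assumed to be precomputed on a common grid):

```python
import numpy as np

def posterior_summary(L_i, pi_hat, mu_i):
    """Posterior mixing weights and posterior mean for one series.

    L_i    : (K+1,)   marginal likelihoods of series i under each component
    pi_hat : (K+1,)   estimated prior mixing weights
    mu_i   : (K+1, m) per-component posterior means of f_i(t) on a grid of size m
    """
    w = pi_hat * L_i
    w = w / w.sum()          # posterior mixing weights pi-tilde_{ik}
    post_mean = w @ mu_i     # E[f_i(t) | y_i] = sum_k pi-tilde_{ik} E_k[f_i(t) | y_i]
    return w, post_mean
```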

With the posterior, FASH aims to simultaneously answer any subset of the following questions:

  • What is the estimated function $f_i(t)$ for each series $y_i$? (Smoothing)
  • Under false discovery rate (FDR) control, which functions $f_i(t)$ belong to the base model $S_0$ (a subset of the function space $S$)? (Hypothesis testing; see the sketch after this list)
  • Is there any clustering structure in the estimated functions $f_i(t)$ in terms of their behaviors? (Clustering)
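For the hypothesis-testing question, one standard empirical-Bayes route (in the spirit of Stephens, 2017) is to let component $k = 0$ be the base model itself (i.e., $\sigma_0 = 0$, so $f_i \in S_0$ exactly), treat each series' posterior weight on that component as its local false discovery rate, and control the FDR by thresholding the running average of the sorted local rates. A hypothetical sketch:

```python
import numpy as np

def fdr_select(lfdr, alpha=0.05):
    """Indices of series declared non-null at FDR level alpha.

    lfdr[i] = posterior weight of series i on the k = 0 (base-model) component.
    """
    order = np.argsort(lfdr)
    cum_fdr = np.cumsum(lfdr[order]) / np.arange(1, len(lfdr) + 1)  # estimated FDR of top-j set
    n_sig = int(np.sum(cum_fdr <= alpha))    # cum_fdr is non-decreasing, so this counts a prefix
    return order[:n_sig]
```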

LGP Prior

For now, let's assume the mean function $m_k$ is zero, and each GP component is defined through the following ordinary differential equation (ODE):
$$Lf(t) = \sigma_k W(t),$$
where $W(t)$ is a Gaussian white noise process and $L$ is a known $p$-th order linear differential operator. Given the $L$ operator, the covariance function $C_k$ is completely specified by the single standard deviation parameter $\sigma_k$.

This prior shrinks the function $f$ toward the base model $S_0 = \text{Null}\{L\}$, the set of functions that satisfy $Lf = 0$. The smaller $\sigma_k$ is, the stronger the shrinkage. By choosing different $L$ operators, this one-parameter GP family can produce priors that encode different kinds of shapes. Some examples are discussed in Zhang et al. (2023) and Zhang et al. (2024).

The above one-parameter family of GP priors is flexible and interpretable. By choosing the $L$ operator, we select the type of base model to shrink the function toward. In other words, it specifies the center of the shrinkage (analogous to a null hypothesis).

Example: Integrated Wiener Process

For example, when $L = \frac{d^2}{dt^2}$, the prior is called a second-order Integrated Wiener Process (IWP) prior, which shrinks the function toward the base model $S_0 = \text{Null}\{L\} = \text{span}\{1, t\}$.

When all the observations are Gaussian, the posterior mean $\mathbb{E}(f|\boldsymbol{y}_i)$ using the second-order IWP is exactly the cubic smoothing spline estimate of Kimeldorf and Wahba (1970).
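To make this concrete: with $L = \frac{d^2}{dt^2}$ and zero initial conditions at $t = 0$, the covariance of the resulting integrated Wiener process is $C(s,t) = \sigma^2\big(\tfrac{\min(s,t)^2 \max(s,t)}{2} - \tfrac{\min(s,t)^3}{6}\big)$, and with Gaussian observations the posterior mean is ordinary conjugate GP regression, as in the illustrative Python sketch below (the exact smoothing-spline correspondence additionally requires a diffuse prior on the null space $\text{span}\{1, t\}$, omitted here for brevity):

```python
import numpy as np

def iwp2_cov(s, t, sigma=1.0):
    """Covariance of a second-order IWP with zero initial conditions at t = 0."""
    ss, tt = np.meshgrid(s, t, indexing="ij")
    lo, hi = np.minimum(ss, tt), np.maximum(ss, tt)
    return sigma**2 * (lo**2 * hi / 2.0 - lo**3 / 6.0)

rng = np.random.default_rng(0)
t_obs = np.linspace(0.5, 10, 30)
y = np.sin(t_obs) + rng.normal(0, 0.2, t_obs.size)   # Gaussian observations of f
t_new = np.linspace(0.5, 10, 200)

noise_var = 0.2**2
K_oo = iwp2_cov(t_obs, t_obs) + noise_var * np.eye(t_obs.size)
K_no = iwp2_cov(t_new, t_obs)
post_mean = K_no @ np.linalg.solve(K_oo, y)          # E(f | y) on the prediction grid
```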

Computation Issue

To simplify the posterior computation for each GP component, we apply the following two tricks:

  • Finite Element Method: The finite element method approximates each GP $f(t)$ as a linear combination of basis functions: $f(t) = \sum_{l=1}^{m} w_l \psi_l(t)$, where the $m$ basis functions $\psi_l(t)$ are fixed and the weights $\boldsymbol{w}$ follow a Gaussian distribution. This simplifies the computation of each $p(f_i(t)|\boldsymbol{y}_i, \sigma_k)$ to $p(\boldsymbol{w}|\boldsymbol{y}_i, \sigma_k)$. The weights not only have smaller dimension than the function $f(t)$, but also have a sparse precision matrix (see the sketch after this list). See Zhang et al. (2023) and Lindgren et al. (2011) for more details.
  • Laplace Approximation: An efficient way to compute the posterior of the weights $\boldsymbol{w}$ is the Laplace approximation, as discussed in Rue et al. (2009). It approximates the posterior with a Gaussian distribution centered at the posterior mode, with covariance matrix equal to the inverse of the negative Hessian of the log posterior at the mode: $p_G(\boldsymbol{w}|\boldsymbol{y}, \sigma_k) = \mathcal{N}(\hat{\boldsymbol{w}}, \hat{V})$.
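As an illustration of the first point, the sketch below builds piecewise-linear ("hat") basis functions on a set of knots and evaluates $f(t) \approx \sum_l w_l \psi_l(t)$; each evaluation point touches at most two basis functions, which is what makes the resulting matrices sparse. The hat functions here are only for illustration; the basis and the prior precision of $\boldsymbol{w}$ actually used for the operator-based priors follow Zhang et al. (2023):

```python
import numpy as np

def hat_basis(t, knots):
    """n x m design matrix A with A[j, l] = psi_l(t_j), where psi_l is the
    piecewise-linear hat function equal to 1 at knots[l] and 0 at the other knots."""
    eye = np.eye(len(knots))
    return np.column_stack([np.interp(t, knots, eye[l]) for l in range(len(knots))])

knots = np.linspace(0, 10, 12)                            # m = 12 basis functions
t = np.sort(np.random.default_rng(0).uniform(0, 10, 40))
A = hat_basis(t, knots)                                   # f(t_j) ≈ (A @ w)[j], w Gaussian
assert np.allclose(A.sum(axis=1), 1.0)                    # hat functions sum to 1 everywhere
```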

In this way, the complicated integration required in the posterior computation is replaced by a simpler optimization task involving sparse matrices. When the observations are Gaussian, the Laplace approximation is exact. When the observations are not Gaussian, it still provides a reasonable approximation at very small computational cost.
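The optimization step can be illustrated with a small Newton-iteration sketch: for a non-Gaussian likelihood (Poisson counts, chosen here purely as an example) and a Gaussian prior with precision matrix $Q$ on the weights, the Laplace approximation finds the posterior mode and uses the curvature there as the Gaussian precision. In the Gaussian-observation case the log posterior is exactly quadratic, so a single Newton step recovers the exact posterior, matching the exactness noted above.

```python
import numpy as np

def laplace_poisson(A, y, Q, n_iter=50, tol=1e-10):
    """Gaussian (Laplace) approximation to p(w | y) when y_j ~ Poisson(exp((A @ w)_j))
    and the prior is w ~ N(0, Q^{-1}); A is the basis design matrix."""
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        mu = np.exp(A @ w)
        grad = A.T @ (y - mu) - Q @ w            # gradient of the log posterior
        H = A.T @ (mu[:, None] * A) + Q          # negative Hessian of the log posterior
        step = np.linalg.solve(H, grad)
        w = w + step                             # Newton update toward the mode w-hat
        if np.max(np.abs(step)) < tol:
            break
    mu = np.exp(A @ w)
    V = np.linalg.inv(A.T @ (mu[:, None] * A) + Q)   # V-hat: inverse curvature at the mode
    return w, V                                      # p_G(w | y) = N(w-hat, V-hat)
```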