Fitting sGP Models to the Lynx Dataset: An Intermediate Tutorial

Last updated: 2024-12-06

Checks: 7 0

Knit directory: online_tut/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20241120)

The command set.seed(20241120) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 5962975

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 5962975. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    code/.DS_Store

Untracked files:
    Untracked:  analysis/_includes/

Unstaged changes:
    Modified:   analysis/_site.yml

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/lynx.rmd) and HTML (docs/lynx.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	945ac04	Ziang Zhang	2024-11-26	Build site.
Rmd	dd9bbd7	Ziang Zhang	2024-11-26	workflowr::wflow_publish("analysis/lynx.rmd")
html	83ea80b	Ziang Zhang	2024-11-24	Build site.
Rmd	6b13b76	Ziang Zhang	2024-11-24	workflowr::wflow_publish("analysis/lynx.rmd")
html	3e653ad	Ziang Zhang	2024-11-21	Build site.
html	601ceb7	Ziang Zhang	2024-11-21	Build site.
html	1cf9d1d	Ziang Zhang	2024-11-21	Build site.
Rmd	d910a7c	Ziang Zhang	2024-11-21	workflowr::wflow_publish("analysis/lynx.rmd")
html	0f15650	Ziang Zhang	2024-11-21	Build site.
Rmd	64fcf99	Ziang Zhang	2024-11-21	workflowr::wflow_publish("analysis/lynx.rmd")
html	8e9829b	Ziang Zhang	2024-11-20	Build site.
Rmd	b167966	Ziang Zhang	2024-11-20	workflowr::wflow_publish("analysis/lynx.rmd")

require(Matrix)
require(tidyverse)

Introduction:

In this tutorial, we will use the lynx dataset as an example, which can be accessed directly from R. Let’s load the dataset and visualize it:

data <- data.frame(year = seq(1821, 1934, by = 1), y = as.numeric(lynx))
data$x <- data$year - min(data$year)
plot(lynx)

Version	Author	Date
8e9829b	Ziang Zhang	2024-11-20

Based on a visual examination of the dataset, we can observe an obvious 10-year quasi-periodic behavior in the lynx count with evolving amplitudes over time. Therefore, we will consider fitting the following model:

\[ \begin{equation} \begin{aligned} y_i|\lambda_i &\sim \text{Poisson}(\lambda_i) ,\\ \log(\lambda_i) &= \eta_i = \beta_0 + g(x_i) + \xi_i,\\ g &\sim \text{sGP}_{\alpha} \big(\sigma\big), \ \alpha = \frac{2\pi}{10}, \\ \xi_i &\sim N(0,\sigma_\xi). \end{aligned} \end{equation} \] Here, each \(y_i\) represents the lynx count, \(x_i\) represents the number of years since 1821, and \(\xi_i\) is an observation-level random intercept to account for overdispersion effect.

Prior Elicitation:

To specify the priors for the sGP boundary conditions and the intercept parameter, we assume independent normal priors with mean 0 and variance 1000. For the overdispersion parameter \(\sigma_\xi\), we assign an exponential prior with \(P(\sigma_\xi > 1) = 0.01\).

To determine the prior for the standard deviation parameter \(\sigma\) of the sGP, we employ the concept of predictive standard deviation (PSD). We start with an exponential prior on the 50 years PSD: \[P(\sigma(50)>1) = 0.01.\]

Overall, we highly recommend users to work on the PSD scale when specifying the prior for the sGP model as it is more interpretable and easier to elicit. However if the users hope to see the prior on the original standard deviation scale, the function prior_conversion_sgp in the BayesGP package can be used to convert the prior from the PSD scale to the original standard deviation scale.

require(BayesGP)

Loading required package: BayesGP

prior_PSD <- list(u = 1, alpha = 0.01)
prior_SD <- prior_conversion_sgp(d = 50, a = 2*pi/10, prior = prior_PSD)
prior_SD

$u
[1] 0.1256637

$alpha
[1] 0.01

Based on the above computation, the corresponding exponential prior on the original SD parameter \(\sigma\) is: \[P(\sigma>0.126) = 0.01.\] Note that this corresponds to an Exponential distribution with median:

INLA::inla.pc.qprec(0.5, u=prior_SD$u, alpha = prior_SD$alpha)^{-0.5}

[1] 0.01891427

Inference in the easy way:

The easy way to fit this model is to use the BayesGP package, which provides a simple interface to fit the sGP model using its sB-spline approximation.

fit <- BayesGP::model_fit(
  formula = y ~ f(x = year, model = "sgp",
                  period = 10,
                  sd.prior = list(param = prior_PSD, h = 50)) +
                  f(x, model = "iid", sd.prior = 1), 
  data = data,
  family = "poisson")

Note that we do not need to manually convert the prior for PSD to SD when using the BayesGP package. The package automatically handles this conversion internally when h is specified in the sd.prior argument.

Let’s take a quick look a the model fit:

par(mfrow = c(2,2))
plot(fit)

Version	Author	Date
945ac04	Ziang Zhang	2024-11-26

par(mfrow = c(1,1))

Inference from scratch:

Now, for users who wish to do more exercises with the sGP, we will demonstrate how to fit the Bayesian hierarchical model from scratch using the sGPfit package. We will consider fitting both the exact sGP model using the state-space representation and the approximate sGP model using the seasonal B-spline (sB-spline) approximation.

Preparation:

To carry out our inference from scratch in a fully Bayesian manner, we will load the following required packages:

require(sGPfit)
require(aghq)
require(TMB)

If you haven’t installed the prototype package sGPfit yet, you can download it from GitHub using the following command: devtools::install_git("https://github.com/AgueroZZ/sGPfit").

In this example, we will also use the aghq package for posterior approximation. To utilize it, we need to compile a C++ template for the model and load the output. Execute the following code to compile the C++ file and load the resulting library:

compile(file = "code/tut.cpp")
dyn.load(dynlib("code/tut"))

Once the C++ file is compiled and the library is loaded, we can proceed with the remaining inference procedures.

Exact State-Space Approach:

First, we need to set up the design matrix X for the two boundary conditions and the intercept parameter:

x <- data$x
a <- 2*pi/10
X <- as(cbind(cos(a*x),sin(a*x),1), "dgTMatrix")

To use the state-space representation, we need to create two additional matrices: \(B\) and \(Q\). The matrix \(B\) is the design matrix of each value of \([g(x_i),g'(x_i)]\), which will be (mostly) a diagonal matrix in this case. The matrix \(Q\) is the precision matrix of \([g(x),g'(x)]\), which can be constructed using the joint_prec_construct function.

n = length(x)
B <- Matrix::Diagonal(n = 2*n)[,1:(2*n)]
B <- B[seq(1,2*n,by = 2),][, -c(1:2)]
Q <- joint_prec_construct(a = a, t_vec = x[-1], sd = 1)
Q <- as(as(Q, "matrix"),"dgTMatrix")

With the matrices ready, the model can be fitted using aghq as follows:

tmbdat <- list(
    # Design matrix
    B = B,
    X = X,
    # Precision matrix
    P = Q,
    logPdet = as.numeric(determinant(Q, logarithm = T)$modulus),
    # Response
    y = data$y,
    # Prior
    u = prior_SD$u,
    alpha = prior_SD$alpha,
    u_over = prior_PSD$u,
    alpha_over = prior_PSD$alpha,
    betaprec = 0.001
  )

tmbparams <- list(
    W = c(rep(0, (ncol(B) + ncol(X) + length(data$y)))), 
    theta = 0,
    theta_over = 0
  )
  
ff <- TMB::MakeADFun(
    data = tmbdat,
    parameters = tmbparams,
    random = "W",
    DLL = "tut",
    silent = TRUE
  )
ff$he <- function(w) numDeriv::jacobian(ff$gr,w)
fitted_mod <- aghq::marginal_laplace_tmb(ff,5,c(0,0))

Approximate sB-Spline Approach:

Implementing the exact state-space representation can be computationally expensive, when the sample size is very large with irregularly spaced locations. In such cases, the sB-spline approximation can be a more efficient alternative.

To use the sB-spline approach, the sGP is approximated as \(\tilde{g}_k(x) = \sum_{i=1}^k w_i \psi_i(x)\), where \(\{\psi_i, i \in [k]\}\) is a set of sB-spline basis functions, and \(\boldsymbol{w} = \{w_i, i \in [k]\}\) is a set of Gaussian weights.

In this case, the design matrix \(B\) will be defined with element \(B_{ij} = \psi_j(x_i)\), which can be constructed using the Compute_B_sB function. The matrix \(Q\) will be the precision matrix of the Gaussian weights \(\boldsymbol{w}\), which can be constructed using the Compute_Q_sB function. As an example, let’s consider \(k = 30\) sB-spline basis functions constructed with equal spacing:

B2 <- Compute_B_sB(x = data$x, a = a, region = range(data$x), k = 30)
B2 <- as(B2,"dgTMatrix")
Q2 <- Compute_Q_sB(a = a, k = 30, region = range(data$x))
Q2 <- as(as(Q2, "matrix"),"dgTMatrix")

Once these matrices are constructed, the inference can be carried out in a similar manner as before:

tmbdat <- list(
    # Design matrix
    B = B2,
    X = X,
    # Precision matrix
    P = Q2,
    logPdet = as.numeric(determinant(Q2, logarithm = T)$modulus),
    # Response
    y = data$y,
    # Prior
    u = prior_SD$u,
    alpha = prior_SD$alpha,
    u_over = prior_PSD$u,
    alpha_over = prior_PSD$alpha,
    betaprec = 0.001
  )

tmbparams <- list(
    W = c(rep(0, (ncol(B2) + ncol(X) + length(data$y)))), 
    theta = 0,
    theta_over = 0
  )
  
ff <- TMB::MakeADFun(
    data = tmbdat,
    parameters = tmbparams,
    random = "W",
    DLL = "tut",
    silent = TRUE
  )
ff$he <- function(w) numDeriv::jacobian(ff$gr,w)

fitted_mod_sB <- aghq::marginal_laplace_tmb(ff,5,c(0,0))

Comparison:

First, let’s obtain the posterior samples and summary using the state-space representation:

## Posterior samples:
samps1 <- sample_marginal(fitted_mod, M = 3000)
g_samps <- B %*% samps1$samps[1:ncol(B),] + X %*% samps1$samps[(ncol(B) + 1):(ncol(B) + ncol(X)),]

## Posterior summary:
mean <- apply(as.matrix(g_samps), MARGIN = 1, mean)
upper <- apply(as.matrix(g_samps), MARGIN = 1, quantile, p = 0.975)
lower <- apply(as.matrix(g_samps), MARGIN = 1, quantile, p = 0.025)

Next, let’s plot the posterior results obtained from the state-space representation:

par(mfrow = c(2,2))
## Plotting
plot(log(data$y) ~ data$x, xlab = "time", ylab = "Posterior of g(x)", ylim = c(3.1,9))
lines(upper ~ data$x, type = "l", col = "red", lty = "dashed")
lines(mean ~ data$x, type = "l", col = "blue")
lines(lower ~ data$x, type = "l", col = "red", lty = "dashed")

## Posterior of the SD parameter:
prec_marg <- fitted_mod$marginals[[1]]
logpostsigma <- compute_pdf_and_cdf(prec_marg,list(totheta = function(x) -2*log(x),fromtheta = function(x) exp(-x/2)),interpolation = 'spline')
plot(pdf_transparam ~ transparam, data = logpostsigma, type = 'l', xlab = "SD", ylab = "Post")
## Posterior of the Overdispersion parameter:
prec_marg <- fitted_mod$marginals[[2]]
logpostsigma <- compute_pdf_and_cdf(prec_marg,list(totheta = function(x) -2*log(x),fromtheta = function(x) exp(-x/2)),interpolation = 'spline')
plot(pdf_transparam ~ transparam, data = logpostsigma, type = 'l', xlab = "Overdispersion", ylab = "Post")
par(mfrow = c(1,1))

Version	Author	Date
945ac04	Ziang Zhang	2024-11-26

Now, let’s obtain the posterior samples and summary using the seasonal B-spline approach:

## Posterior samples:
samps2 <- sample_marginal(fitted_mod_sB, M = 3000)
g_samps_2 <- B2 %*% samps2$samps[1:ncol(B2),] + X %*% samps2$samps[(ncol(B2) + 1):(ncol(B2) + ncol(X)),]

## Posterior summary:
mean2 <- apply(as.matrix(g_samps_2), MARGIN = 1, mean)
upper2 <- apply(as.matrix(g_samps_2), MARGIN = 1, quantile, p = 0.975)
lower2 <- apply(as.matrix(g_samps_2), MARGIN = 1, quantile, p = 0.025)

Finally, let’s plot the posterior results obtained from the seasonal B-spline approach:

par(mfrow = c(2,2))
## Plotting
plot(log(data$y) ~ data$x, xlab = "time", ylab = "Posterior of g(x)", ylim = c(3.1,9))
lines(upper2 ~ data$x, type = "l", col = "red", lty = "dashed")
lines(mean2 ~ data$x, type = "l", col = "blue")
lines(lower2 ~ data$x, type = "l", col = "red", lty = "dashed")
## Posterior of the SD parameter:
prec_marg <- fitted_mod_sB$marginals[[1]]
logpostsigma <- compute_pdf_and_cdf(prec_marg,list(totheta = function(x) -2*log(x),fromtheta = function(x) exp(-x/2)),interpolation = 'spline')
plot(pdf_transparam ~ transparam, data = logpostsigma, type = 'l', xlab = "SD", ylab = "Post")
## Posterior of the Overdispersion parameter:
prec_marg <- fitted_mod_sB$marginals[[2]]
logpostsigma <- compute_pdf_and_cdf(prec_marg,list(totheta = function(x) -2*log(x),fromtheta = function(x) exp(-x/2)),interpolation = 'spline')
plot(pdf_transparam ~ transparam, data = logpostsigma, type = 'l', xlab = "Overdispersion", ylab = "Post")
par(mfrow = c(1,1))

Version	Author	Date
945ac04	Ziang Zhang	2024-11-26
83ea80b	Ziang Zhang	2024-11-24
8e9829b	Ziang Zhang	2024-11-20

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.7.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TMB_1.9.15      aghq_0.4.1      sGPfit_1.0.1    BayesGP_0.1.3  
 [5] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [9] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
[13] ggplot2_3.5.1   tidyverse_2.0.0 Matrix_1.6-4    workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1    hdrcde_3.4          bitops_1.0-9       
 [4] fastmap_1.2.0       RCurl_1.98-1.16     INLA_23.09.09      
 [7] pracma_2.4.4        promises_1.3.0      digest_0.6.37      
[10] timechange_0.3.0    lifecycle_1.0.4     sf_1.0-17          
[13] cluster_2.1.6       statmod_1.5.0       processx_3.8.4     
[16] magrittr_2.0.3      compiler_4.3.1      rlang_1.1.4        
[19] sass_0.4.9          tools_4.3.1         utf8_1.2.4         
[22] yaml_2.3.10         data.table_1.16.2   knitr_1.48         
[25] sp_2.1-4            mclust_6.1.1        classInt_0.4-10    
[28] rainbow_3.8         KernSmooth_2.23-24  fda_6.2.0          
[31] numDeriv_2016.8-1.1 withr_3.0.2         grid_4.3.1         
[34] pcaPP_2.0-5         fansi_1.0.6         git2r_0.33.0       
[37] e1071_1.7-16        colorspace_2.1-1    scales_1.3.0       
[40] MASS_7.3-60         cli_3.6.3           mvtnorm_1.3-1      
[43] rmarkdown_2.28      generics_0.1.3      mvQuad_1.0-8       
[46] rstudioapi_0.16.0   httr_1.4.7          tzdb_0.4.0         
[49] fds_1.8             DBI_1.2.3           cachem_1.1.0       
[52] proxy_0.4-27        splines_4.3.1       vctrs_0.6.5        
[55] jsonlite_1.8.9      callr_3.7.6         hms_1.1.3          
[58] jquerylib_0.1.4     units_0.8-5         glue_1.8.0         
[61] ps_1.8.0            stringi_1.8.4       gtable_0.3.6       
[64] later_1.3.2         munsell_0.5.1       pillar_1.9.0       
[67] htmltools_0.5.8.1   deSolve_1.40        R6_2.5.1           
[70] fmesher_0.1.7       ks_1.14.3           rprojroot_2.0.4    
[73] evaluate_1.0.1      lattice_0.22-6      highr_0.11         
[76] httpuv_1.6.15       bslib_0.8.0         class_7.3-22       
[79] Rcpp_1.0.13-1       whisker_0.4.1       xfun_0.48          
[82] fs_1.6.4            getPass_0.2-4       pkgconfig_2.0.3