diff --git a/src/functions-reference/embedded_laplace.qmd b/src/functions-reference/embedded_laplace.qmd
index 63346dbfb..247d19b44 100644
--- a/src/functions-reference/embedded_laplace.qmd
+++ b/src/functions-reference/embedded_laplace.qmd
@@ -4,23 +4,29 @@ pagetitle: Embedded Laplace Approximation
# Embedded Laplace Approximation
-The embedded Laplace approximation can be used to approximate certain
-marginal and conditional distributions that arise in latent Gaussian models.
-A latent Gaussian model observes the following hierarchical structure:
+The embedded Laplace approximation can be used to approximate certain marginal
+and conditional distributions that arise in latent Gaussian models.
+The approximation replaces explicit sampling of the high-dimensional Gaussian
+latent variables with a local Gaussian approximation and, in doing so,
+marginalizes out the latent Gaussian variables.
+Inference can then be performed on the remaining, often low-dimensional,
+parameters. The embedded Laplace approximation in Stan is best suited for
+latent Gaussian models when jointly sampling over all model parameters is
+expensive and the conditional posterior of the Gaussian latent variables is
+reasonably close to Gaussian.
+
+For observed data $y$, latent Gaussian variables $\theta$, and hyperparameters $\phi$, a latent Gaussian model observes the following hierarchical structure:
\begin{eqnarray}
\phi &\sim& p(\phi), \\
\theta &\sim& \text{MultiNormal}(0, K(\phi)), \\
y &\sim& p(y \mid \theta, \phi).
\end{eqnarray}
-In this formulation, $y$ represents the
-observed data, and $p(y \mid \theta, \phi)$ is the likelihood function that
-specifies how observations are generated conditional on the latent Gaussian
-variables $\theta$ and hyperparameters $\phi$.
+In this formulation, $p(y \mid \theta, \phi)$ is the likelihood function that
+specifies how observations are generated conditional on $\theta$ and $\phi$.
$K(\phi)$ denotes the prior covariance matrix for the latent Gaussian variables
-$\theta$ and is parameterized by $\phi$.
-The prior $p(\theta \mid \phi)$ is restricted to be a multivariate normal
-centered at 0. That said, we can always pick a likelihood that offsets $\theta$,
-which is equivalently to specifying a prior mean.
+$\theta$ and is parameterized by $\phi$. The prior on $\theta$ is centered at 0;
+however, an offset can always be added when specifying the likelihood function
+$p(y \mid \theta, \phi)$.
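+For example, a hypothetical likelihood function (the names `y` and `mu` here are
+illustrative, not part of the interface) can fold a prior mean `mu` for $\theta$
+into the likelihood:
+
+```stan
+functions {
+  // Poisson log likelihood with an offset mu folded in; this is
+  // equivalent to giving theta a prior mean of mu.
+  real offset_likelihood(vector theta, array[] int y, vector mu) {
+    return poisson_log_lpmf(y | mu + theta);
+  }
+}
+```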
To sample from the joint posterior $p(\phi, \theta \mid y)$, we can either
use a standard method, such as Markov chain Monte Carlo, or we can follow
@@ -34,7 +40,7 @@ are typically available in closed form and so they must be approximated.
The marginal posterior can be written as $p(\phi \mid y) \propto p(y \mid \phi) p(\phi)$,
where $p(y \mid \phi) = \int p(y \mid \phi, \theta) p(\theta) \text{d}\theta$
is called the marginal likelihood. The Laplace method approximates
-$p(y \mid \phi, \theta) p(\theta)$ with a normal distribution centered at
+$p(y \mid \phi, \theta) p(\theta)$ with a normal distribution centered at the mode,
$$
\theta^* = \underset{\theta}{\text{argmax}} \ \log p(\theta \mid y, \phi),
$$
@@ -53,7 +59,7 @@ using one of Stan's algorithms. The marginal posterior is lower
dimensional and likely to have a simpler geometry leading to more
efficient inference. On the other hand each marginal likelihood
computation is more costly, and the combined change in efficiency
-depends on the case.
+depends on the application.
To obtain posterior draws for $\theta$, we sample from the normal
approximation to $p(\theta \mid y, \phi)$ in `generated quantities`.
@@ -62,7 +68,11 @@ then $p(\theta \mid y, \phi)$ produces samples from the joint posterior
$p(\theta, \phi \mid y)$.
The Laplace approximation is especially useful if $p(y \mid \phi, \theta)$ is
-log-concave. Stan's embedded Laplace approximation is restricted to the case
+log-concave, e.g., Poisson, binomial, negative binomial, and Bernoulli.
+(The normal distribution is also log concave; however, when the likelihood is
+normal, marginalization can be performed exactly and does not require an
+approximation.)
+Stan's embedded Laplace approximation is restricted to the case
where the prior $p(\theta \mid \phi)$ is multivariate normal.
Furthermore, the likelihood $p(y \mid \phi, \theta)$ must be computed using
only operations which support higher-order derivatives
@@ -74,33 +84,36 @@ In the `model` block, we increment `target` with `laplace_marginal`, a function
that approximates the log marginal likelihood $\log p(y \mid \phi)$.
The signature of the function is:
-\index{{\tt \bfseries laplace\_marginal\_tol }!{\tt (function likelihood\_function, tuple(...) likelihood\_arguments, function covariance\_function, tuple(...), vector theta\_init covariance\_arguments): real}|hyperpage}
+\index{{\tt \bfseries laplace\_marginal }!{\tt (function likelihood\_function, tuple(...) likelihood\_arguments, int hessian\_block\_size, function covariance\_function, tuple(...) covariance\_arguments): real}|hyperpage}
-
+`real` **`laplace_marginal`**`(function likelihood_function, tuple(...) likelihood_arguments, int hessian_block_size, function covariance_function, tuple(...) covariance_arguments)`
-`real` **`laplace_marginal`**`(function likelihood_function, tuple(...) likelihood_arguments, function covariance_function, tuple(...) covariance_arguments)`
-
-Which returns an approximation to the log marginal likelihood $p(y \mid \phi)$.
+which returns an approximation to the log marginal likelihood $p(y \mid \phi)$.
{{< since 2.37 >}}
-This function takes in the following arguments.
+The embedded Laplace functions accept two functors whose user-defined arguments are passed to `laplace_marginal` as tuples.
-1. `likelihood_function` - user-specified log likelihood whose first argument is the vector of latent Gaussian variables `theta`
-2. `likelihood_arguments` - A tuple of the log likelihood arguments whose internal members will be passed to the covariance function
-3. `covariance_function` - Prior covariance function
-4. `covariance_arguments` A tuple of the arguments whose internal members will be passed to the the covariance function
+1. `likelihood_function` - user-specified log likelihood whose first argument is the vector of latent Gaussian variables $\theta$.
+The subsequent arguments are user defined.
+ - `real likelihood_function(vector theta, likelihood_argument_1, likelihood_argument_2, ...)`
+2. `likelihood_arguments` - A tuple of arguments whose internal members are passed to the log likelihood function.
+This tuple does NOT include the latent variable $\theta$.
+3. `hessian_block_size` - the block size of the Hessian of the log likelihood, $\partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$.
+4. `covariance_function` - A function that returns the covariance matrix of the multivariate normal prior on $\theta$.
+ - `matrix covariance_function(covariance_argument_1, covariance_argument_2, ...)`
+5. `covariance_arguments` - A tuple of arguments whose internal members will be passed to the covariance function.
Below we go over each argument in more detail.
## Specifying the log likelihood function {#laplace-likelihood_spec}
The first step to use the embedded Laplace approximation is to write down a
-function in the `functions` block which returns the log joint likelihood
+function in the `functions` block which returns the log likelihood
$\log p(y \mid \theta, \phi)$.
There are a few constraints on this function:
-1. The function return type must be `real`
+1. The function return type must be `real`.
2. The first argument must be the latent Gaussian variable $\theta$ and must
have type `vector`.
@@ -124,7 +137,7 @@ as data or parameter.
The tuple after `likelihood_function` contains the arguments that get passed
to `likelihood_function` *excluding $\theta$*. For instance, if a user defined
-likelihood uses a real and a matrix the likelihood function's signature would
+likelihood uses a real and a matrix, the likelihood function's signature would
first have a vector and then a real and matrix argument.
```stan
@@ -149,6 +162,13 @@ for example,
real likelihood_function(vector theta, data vector x, ...)
```
+In addition to the likelihood function, users must specify the block size
+of the Hessian, $\partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$.
+The Hessian is often block diagonal and this structure can be taken advantage of for fast computation.
+For example, if $y_i$ only depends on $\theta_i$, then the Hessian is diagonal and `hessian_block_size=1`.
+On the other hand, if the Hessian is not block diagonal, we can always set
+`hessian_block_size=n` where $n$ is the size of $\theta$.
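+As a sketch (with hypothetical likelihood and covariance functions and
+illustrative argument names), a model with a diagonal Hessian might call:
+
+```stan
+model {
+  // Each y[i] depends only on theta[i], so the Hessian is
+  // diagonal and hessian_block_size is set to 1.
+  target += laplace_marginal(likelihood_function, (y, x), 1,
+                             covariance_function, (alpha, rho, x));
+}
+```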
+
## Specifying the covariance function
The argument `covariance_function` returns the prior covariance matrix
@@ -159,20 +179,6 @@ It's return type must be a matrix of size $n \times n$ where $n$ is the size of
matrix covariance_function(...)
```
-
-
The `...` represents a set of optional
variadic arguments. There is no type restrictions for the variadic arguments
`...` and each argument can be passed as data or parameter. The variables
@@ -198,51 +204,77 @@ It also possible to specify control parameters, which can help improve the
optimization that underlies the Laplace approximation, using `laplace_marginal_tol`
with the following signature:
-\index{{\tt \bfseries laplace\_marginal\_tol }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init, real tol, int max\_steps, int hessian\_block\_size, int solver, int max\_steps\_linesearch): real}|hyperpage}
-
-
-\index{{\tt \bfseries laplace\_marginal\_tol }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init, real tol, int max\_steps, int hessian\_block\_size, int solver, int max\_steps\_linesearch): real}|hyperpage}
-
-`real` **`laplace_marginal_tol`**`(function likelihood_function, tuple(...), function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch)`
\newline
+```stan
+real laplace_marginal_tol(function likelihood_function, tuple(...),
+                          int hessian_block_size,
+                          function covariance_function, tuple(...),
+                          tuple(vector theta_init, real tol, int max_steps,
+                                int solver, int max_steps_linesearch,
+                                int allow_fallback))
+```
Returns an approximation to the log marginal likelihood $p(y \mid \phi)$
and allows the user to tune the control parameters of the approximation.
-* `theta_init`: the initial guess for the Newton solver when finding the mode
+* `theta_init`: the initial guess for a Newton solver when finding the mode
of $p(\theta \mid y, \phi)$. By default, it is a zero-vector.
* `tol`: the tolerance $\epsilon$ of the optimizer. Specifically, the optimizer
stops when $||\nabla \log p(\theta \mid y, \phi)|| \le \epsilon$. By default,
-the value is $\epsilon = 10^{-6}$.
+the value is $\epsilon \approx 1.49 \times 10^{-8}$, which is the square-root of machine precision.
* `max_num_steps`: the maximum number of steps taken by the optimizer before
it gives up (in which case the Metropolis proposal gets rejected). The default
-is 100 steps.
+is 500 steps.
-* `hessian_block_size`: the size of the blocks, assuming the Hessian
-$\partial \log p(y \mid \theta, \phi) \ \partial \theta$ is block-diagonal.
-The structure of the Hessian is determined by the dependence structure of $y$
-on $\theta$. By default, the Hessian is treated as diagonal
-(`hessian_block_size=1`). If the Hessian is not block diagonal, then set
-`hessian_block_size=n`, where `n` is the size of $\theta$.
-
-* `solver`: choice of Newton solver. The optimizer used to compute the
+* `solver`: choice of Newton solver. The optimizer underlying the
Laplace approximation does one of three matrix decompositions to compute a
-Newton step. The problem determines which decomposition is numerical stable.
-By default (`solver=1`), the solver makes a Cholesky decomposition of the
-negative Hessian, $- \partial \log p(y \mid \theta, \phi) / \partial \theta$.
-If `solver=2`, the solver makes a Cholesky decomposition of the covariance
-matrix $K(\phi)$.
-If the Cholesky decomposition cannot be computed for neither the negative
-Hessian nor the covariance matrix, use `solver=3` which uses a more expensive
-but less specialized approach.
+Newton step. The problem determines which decomposition is numerically stable.
+By default (`solver=1`), the solver attempts a Cholesky decomposition of the
+negative Hessian of the log likelihood,
+$- \partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$.
+This operation is legal if the negative Hessian is positive-definite,
+which will always be true when the likelihood is log concave.
+If `solver=2`, the solver makes a Cholesky decomposition of the covariance matrix $K(\phi)$.
+Since a covariance matrix is always positive-definite, computing its
+Cholesky decomposition is always a legal operation, at least in theory.
+In practice, we may not be able to compute the Cholesky decomposition of the
+negative Hessian or the covariance matrix, either because it does not exist or
+because of numerical issues.
+In that case, we can use `solver=3` which uses a more expensive but less
+specialized approach to compute a Newton step.
* `max_steps_linesearch`: maximum number of steps in linesearch. The linesearch
-method tries to insure that the Newton step leads to a decrease in the
-objective function. If the Newton step does not improve the objective function,
-the step is repeatedly halved until the objective function decreases or the
-maximum number of steps in the linesearch is reached. By default,
-`max_steps_linesearch=0`, meaning no linesearch is performed.
+adjusts the step size to ensure that a Newton step leads to an increase in
+the objective function (i.e., $\log p(\theta \mid \phi, y)$).
+If a standard Newton step does not improve the objective function,
+the step is adjusted iteratively until the objective function increases
+or the maximum number of steps in the linesearch is reached.
+By default, `max_steps_linesearch=1000`.
+Setting `max_steps_linesearch=0` results in no linesearch.
+
+* `allow_fallback`: if the user-specified solver fails, this flag determines whether to fall back to the next solver. For example, if the user specifies `solver=1` but the Cholesky decomposition of the negative Hessian $- \partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$ fails, the optimizer will try `solver=2` instead.
+By default, `allow_fallback = 1` (TRUE).
+
+The embedded Laplace approximation provides a helper function, `generate_laplace_options(int theta_size)`, which generates this tuple of control parameters with default values. This can be useful for quickly setting up the control parameters in the `transformed data` block and reusing them within the model.
+
+```stan
+tuple(vector[theta_size], real, int, int, int, int, int) laplace_ops
+  = generate_laplace_options(theta_size);
+// Modify solver type
+laplace_ops.5 = 2;
+// Turn off fallback
+laplace_ops.7 = 0;
+```
+
+The arguments stored in the `laplace_ops` tuple are:
+```
+laplace_ops = {theta_init,
+ tol,
+ max_num_steps,
+ hessian_block_size,
+ solver,
+ max_steps_linesearch,
+ allow_fallback}
+```
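+
+As a sketch (with hypothetical likelihood and covariance functions and
+illustrative argument names), the generated tuple can then be passed as the
+final argument of `laplace_marginal_tol`:
+
+```stan
+model {
+  target += laplace_marginal_tol(likelihood_function, (y, x), 1,
+                                 covariance_function, (alpha, rho, x),
+                                 laplace_ops);
+}
+```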
{{< since 2.37 >}}
@@ -253,18 +285,18 @@ approximation of $p(\theta \mid \phi, y)$ using `laplace_latent_rng`.
The signature for `laplace_latent_rng` follows closely
the signature for `laplace_marginal`:
-
-\index{{\tt \bfseries laplace\_latent\_rng }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init): vector}|hyperpage}
+
+\index{{\tt \bfseries laplace\_latent\_rng }!{\tt (function likelihood\_function, tuple(...), int hessian\_block\_size, function covariance\_function, tuple(...)): vector}|hyperpage}
-`vector` **`laplace_latent_rng`**`(function likelihood_function, tuple(...), function covariance_function, tuple(...))`
\newline
+`vector` **`laplace_latent_rng`**`(function likelihood_function, tuple(...), int hessian_block_size, function covariance_function, tuple(...))`
\newline
-Draws approximate samples from the conditional posterior $p(\theta \mid y, \phi)$.
+Draws samples from the Laplace approximation to the conditional posterior $p(\theta \mid y, \phi)$.
{{< since 2.37 >}}
Once again, it is possible to specify control parameters:
-\index{{\tt \bfseries laplace\_latent\_tol\_rng }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init, real tol, int max\_steps, int hessian\_block\_size, int solver, int max\_steps\_linesearch): vector}|hyperpage}
+\index{{\tt \bfseries laplace\_latent\_tol\_rng }!{\tt (function likelihood\_function, tuple(...), int hessian\_block\_size, function covariance\_function, tuple(...), tuple(...) laplace\_ops): vector}|hyperpage}
-`vector` **`laplace_latent_tol_rng`**`(function likelihood_function, tuple(...), function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch)`
\newline
+`vector` **`laplace_latent_tol_rng`**`(function likelihood_function, tuple(...), int hessian_block_size, function covariance_function, tuple(...), tuple(...) laplace_ops)`
\newline
Draws approximate samples from the conditional posterior $p(\theta \mid y, \phi)$
and allows the user to tune the control parameters of the approximation.
{{< since 2.37 >}}
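+
+For example, draws for $\theta$ can be generated alongside the fitted
+hyperparameters (the likelihood and covariance functions and their argument
+names are hypothetical, matching those used in the `model` block):
+
+```stan
+generated quantities {
+  // Approximate posterior draws for the latent Gaussian variables.
+  vector[N] theta = laplace_latent_rng(likelihood_function, (y, x), 1,
+                                       covariance_function, (alpha, rho, x));
+}
+```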
diff --git a/src/functions-reference/functions_index.qmd b/src/functions-reference/functions_index.qmd
index 870df9d52..ad18c3d42 100644
--- a/src/functions-reference/functions_index.qmd
+++ b/src/functions-reference/functions_index.qmd
@@ -1713,7 +1713,7 @@ pagetitle: Alphabetical Index
**laplace_latent_rng**:
- -
[`(function likelihood_function, tuple(...), function covariance_function, tuple(...)) : vector`](embedded_laplace.qmd#index-entry-6d0685309664591fc32d3e2a2304af7aa5459e1c) (embedded_laplace.html)
+ - [`(function likelihood_function, tuple(...), int hessian_block_size, function covariance_function, tuple(...)) : vector`](embedded_laplace.qmd#index-entry-cacb1d5344e246ec89c460e00a6f1065f4a8a1c1) (embedded_laplace.html)
**laplace_latent_tol_bernoulli_logit_rng**:
@@ -1731,11 +1731,6 @@ pagetitle: Alphabetical Index
- [`(array[] int y, array[] int y_index, vector m, function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch) : vector`](embedded_laplace.qmd#index-entry-97f692748ebfabf0574b7a8e2edaceb940ee4b6b) (embedded_laplace.html)
-**laplace_marginal**:
-
- - [`(function likelihood_function, tuple(...) likelihood_arguments, function covariance_function, tuple(...) covariance_arguments) : real`](embedded_laplace.qmd#index-entry-6da15f0ed076016d814cdc278127896f99d29633) (embedded_laplace.html)
-
-
**laplace_marginal_bernoulli_logit**:
- [distribution statement](embedded_laplace.qmd#index-entry-1d93ab799518d0aac88e63c01f9655f36c7cbeb6) (embedded_laplace.html)
@@ -1781,11 +1776,6 @@ pagetitle: Alphabetical Index
- [`(array[] int y | array[] int y_index, vector m, function covariance_function, tuple(...)) : real`](embedded_laplace.qmd#index-entry-c092314a5f45deef27ce0e8930a7d28c87ca601d) (embedded_laplace.html)
-**laplace_marginal_tol**:
-
- - [`(function likelihood_function, tuple(...), function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch) : real`](embedded_laplace.qmd#index-entry-0f4bd0330deef2db7884dc5a4c933f181e1f2a8c) (embedded_laplace.html)
-
-
**laplace_marginal_tol_bernoulli_logit**:
- [distribution statement](embedded_laplace.qmd#index-entry-92d43f7c3643c7d85966bbfc0b2a71546facb37c) (embedded_laplace.html)
diff --git a/src/reference-manual/laplace.qmd b/src/reference-manual/laplace.qmd
index 092b8c479..06ac96432 100644
--- a/src/reference-manual/laplace.qmd
+++ b/src/reference-manual/laplace.qmd
@@ -14,14 +14,14 @@ to the constrained space before outputting them.
Given the estimate of the mode $\widehat{\theta}$,
the Hessian $H(\widehat{\theta})$ is computed using
-central finite differences of the model functor.
+central finite differences of the model functor.
Next the algorithm computes the Cholesky factor of the negative inverse Hessian:
$R^{-1} = \textrm{chol}(-H(\widehat{\theta})) \backslash \mathbf{1}$.
Each draw is generated on the unconstrained scale by sampling
-$\theta^{\textrm{std}(m)} \sim \textrm{normal}(0, \textrm{I})$
+$\theta^{\textrm{std}(m)} \sim \textrm{normal}(0, \textrm{I})$
and defining draw $m$ to be
diff --git a/src/reference-manual/laplace_embedded.qmd b/src/reference-manual/laplace_embedded.qmd
index 3912ccc8a..5a9fdd542 100644
--- a/src/reference-manual/laplace_embedded.qmd
+++ b/src/reference-manual/laplace_embedded.qmd
@@ -4,173 +4,191 @@ pagetitle: Embedded Laplace Approximation
# Embedded Laplace Approximation
-Stan provides functions to perform an embedded Laplace
-approximation for latent Gaussian models, following the procedure described
-by @RasmussenWilliams:2006 and @Rue:2009. This approach is often refered to
-as the integrated or nested Laplace approximation, although the exact details
-of the method can vary. The details of Stan's implementation can be found in
-references [@Margossian:2020; @Margossian:2023].
-
-A standard approach to fit a latent Gaussian model would be to perform inference
-jointly over the latent Gaussian variables and the hyperparameters.
-Instead, the embedded Laplace approximation can be used to do *approximate*
-marginalization of the latent Gaussian variables; we can then
-use any inference over the remaining hyperparameters, for example Hamiltonian
-Monte Carlo sampling.
-
-Formally, consider a latent Gaussian model,
+The embedded Laplace approximation replaces explicit sampling of high-dimensional Gaussian latent variables with a local Gaussian approximation, marginalizing them out so that inference proceeds over the remaining hyperparameters alone.
+This approach is often referred to as the integrated or nested Laplace approximation, although the exact details of the method can vary.
+The details of Stan's implementation can be found in references [@Margossian:2020; @Margossian:2023].
+
+A standard approach to fit a latent Gaussian model would be to perform inference jointly over the latent Gaussian variables and the hyperparameters.
+Instead, the embedded Laplace approximation can be used to do *approximate* marginalization of the latent Gaussian variables; we can then use any inference over the remaining hyperparameters.
+By marginalizing out the latent variables, the sampler explores a lower-dimensional, better-behaved marginal posterior.
+Individual iterations are more expensive (each requires an inner optimization), but the sampler typically needs far fewer iterations to achieve the same effective sample size.
+
+For complete function signatures and the built-in likelihood wrappers (Poisson, Negative Binomial, Bernoulli), see the [Embedded Laplace functions reference](../functions-reference/embedded_laplace.qmd).
+For worked examples with full data blocks, see the [Gaussian Processes chapter](../stan-users-guide/gaussian-processes.qmd#poisson-gp-using-an-embedded-laplace-approximation).
+
+
+## Latent Gaussian models
+
+The embedded Laplace approximation is used for latent Gaussian models. A latent Gaussian model is defined by three main components:
+
+- $\phi$: hyperparameters (e.g., GP kernel length-scale and magnitude, or variance components in a hierarchical model),
+- $\theta$: latent Gaussian variables (the high-dimensional quantity to be marginalized out),
+- $y$: observed data.
+
+These components are related through a hierarchical structure.
+The hyperparameters $\phi$ are given a prior $p(\phi)$.
+The latent variables $\theta$ have a multivariate normal prior with covariance matrix $K(\phi)$.
+The observations $y$ are generated according to a likelihood $p(y \mid \theta, \phi)$.
+The prior on $\theta$ is centered at zero; an offset can be incorporated into the likelihood function if a non-zero mean is needed.
+
\begin{eqnarray*}
\phi & \sim & p(\phi) \\
\theta & \sim & \text{Multi-Normal}(0, K(\phi)) \\
y & \sim & p(y \mid \theta, \phi).
\end{eqnarray*}
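+For example, a Poisson count model with a Gaussian process prior fits this
+structure, with hyperparameters $\phi = (\alpha, \rho)$ controlling a
+covariance kernel $K$:
+
+\begin{eqnarray*}
+\alpha, \rho & \sim & p(\alpha, \rho) \\
+\theta & \sim & \text{Multi-Normal}(0, K(\alpha, \rho)) \\
+y_i & \sim & \text{Poisson}(\exp(\theta_i)).
+\end{eqnarray*}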
-The motivation for marginalization is to bypass the challenging geometry of the joint
-posterior $p(\phi, \theta \mid y)$. This geometry (e.g. funnels) often frustrates
-inference algorithms, including Hamiltonian Monte Carlo sampling and approximate
-methods such as variational inference. On the other hand, the marginal posterior
-$p(\phi \mid y)$ is often well-behaved and in many cases low-dimensional.
-Furthermore, the conditional posterior $p(\theta \mid \phi, y)$ can be well
-approximated by a normal distribution, if the likelihood $p(y \mid \theta, \phi)$
-is log concave.
+The generative model above defines a joint distribution over all three quantities, $p(\phi, \theta, y) = p(\phi) \, p(\theta \mid \phi) \, p(y \mid \theta, \phi)$.
+After observing data $y$, Bayes' theorem gives the joint posterior $p(\phi, \theta \mid y) \propto p(\phi) \, p(\theta \mid \phi) \, p(y \mid \theta, \phi)$.
+
+Sampling directly from the joint posterior $p(\phi, \theta \mid y)$ of this model is often difficult.
+Challenging geometries (e.g., funnels) frustrate inference algorithms, including Hamiltonian Monte Carlo and variational inference.
+However, the marginal posterior $p(\phi \mid y)$ is often well-behaved and low-dimensional, making it much easier to sample.
+The Laplace approximation allows the conditional posterior $p(\theta \mid \phi, y)$ to be approximated by a normal distribution when the likelihood $p(y \mid \theta, \phi)$ is log concave.
## Approximation of the conditional posterior and marginal likelihood
-The Laplace approximation is the normal distribution that matches the mode
-and curvature of the conditional posterior $p(\theta \mid y, \phi)$.
-The mode,
+The two-step inference strategy for using the embedded Laplace approximation in a latent Gaussian model requires approximations to both the conditional posterior $p(\theta \mid y, \phi)$ and the marginal likelihood $p(y \mid \phi)$.
+The Laplace approximation is the normal distribution that matches the mode and curvature of the conditional posterior $p(\theta \mid y, \phi)$.
+The mode, defined as the value of $\theta$ that maximizes the conditional posterior, is estimated by a Newton solver,
$$
\theta^* = \underset{\theta}{\text{argmax}} \ p(\theta \mid y, \phi),
$$
-is estimated by a Newton solver. Since the approximation is normal,
-the curvature is matched by setting the covariance to the negative Hessian
-of the log conditional posterior, evaluated at the mode,
+
+Since the approximation is normal, the curvature is matched by setting the covariance to the negative Hessian of the log conditional posterior, evaluated at the mode,
+
$$
\Sigma^* = - \left . \frac{\partial^2}{\partial \theta^2}
\log p (\theta \mid \phi, y) \right |_{\theta =\theta^*}.
$$
-The resulting Laplace approximation is then,
+
+The resulting Laplace approximation is a multivariate normal centered at the mode, with covariance determined by the local curvature,
+
$$
\hat p_\mathcal{L} (\theta \mid y, \phi) = \text{Multi-Normal}(\theta^*, \Sigma^*)
\approx p(\theta \mid y, \phi).
$$
-This approximation implies another approximation for the marginal likelihood,
+
+This approximation also yields an approximation to the marginal likelihood, obtained by evaluating the prior, likelihood, and approximate posterior at the mode $\theta^*$,
+
$$
\hat p_\mathcal{L}(y \mid \phi) := \frac{p(\theta^* \mid \phi) \
p(y \mid \theta^*, \phi) }{ \hat p_\mathcal{L} (\theta^* \mid \phi, y) }
\approx p(y \mid \phi).
$$
-Hence, a strategy to approximate the posterior of the latent Gaussian model
-is to first estimate the marginal posterior
-$\hat p_\mathcal{L}(\phi \mid y) \propto p(\phi) p_\mathcal{L} (y \mid \phi)$
+
+Hence, a strategy to approximate the posterior of the latent Gaussian model is to first estimate the marginal posterior $\hat p_\mathcal{L}(\phi \mid y) \propto p(\phi) p_\mathcal{L} (y \mid \phi)$
using any algorithm supported by Stan.
-Approximate posterior draws for the latent Gaussian variables are then
-obtained by first drawing $\phi \sim \hat p_\mathcal{L}(\phi \mid y)$ and
-then $\theta \sim \hat p_\mathcal{L}(\theta \mid \phi, y)$.
+Approximate posterior draws for the latent Gaussian variables are then obtained by first drawing $\phi \sim \hat p_\mathcal{L}(\phi \mid y)$ and then $\theta \sim \hat p_\mathcal{L}(\theta \mid \phi, y)$.
## Trade-offs of the approximation
-The embedded Laplace approximation presents several trade-offs with standard
-inference over the joint posterior $p(\theta, \phi \mid y)$. The main
-advantage of the embedded Laplace approximation is that it side-steps the
-intricate geometry of hierarchical models. The marginal posterior
-$p(\phi \mid y)$ can then be handled by Hamiltonian Monte Carlo sampling
-without extensive tuning or reparameterization, and the mixing time is faster,
-meaning we can run shorter chains to achieve a desired precision. In some cases,
-approximate methods, e.g. variational inference, which
-work poorly on the joint $p(\theta, \phi \mid y)$ work well on the marginal
-posterior $p(\phi \mid y)$.
-
-On the other hand, the embedded Laplace approximation presents certain
-disadvantages. First, we need to perform a Laplace approximation each time
-the log marginal likelihood is evaluated, meaning each iteration
-can be expensive. Secondly, the approximation can introduce non-negligible
-error, especially with non-conventional likelihoods (note the prior
-is always multivariate normal). How these trade-offs are resolved depends on
-the application; see @Margossian:2020 for some examples.
+The embedded Laplace approximation presents several trade-offs with standard inference over the joint posterior $p(\theta, \phi \mid y)$.
+The main advantage of the embedded Laplace approximation is that it side-steps the intricate geometry of hierarchical models.
+The marginal posterior $p(\phi \mid y)$ can then be handled by Hamiltonian Monte Carlo sampling without extensive tuning or reparameterization, and mixing is faster, meaning we can run shorter chains to achieve a desired precision.
+An additional benefit is that approximate methods (e.g., variational inference) that work poorly on the joint $p(\theta, \phi \mid y)$ can work well on the marginal posterior $p(\phi \mid y)$.
+
+On the other hand, the embedded Laplace approximation presents certain disadvantages.
+First, we need to perform a Laplace approximation each time the log marginal likelihood is evaluated, meaning each iteration can be expensive.
+Secondly, the approximation can introduce non-negligible error, especially with non-conventional likelihoods (note the prior is always multivariate normal).
+How these trade-offs are resolved depends on the application; see @Margossian:2020 for some examples.
+### When the approximation is appropriate
+
+The quality of the Laplace approximation depends on how close the true conditional posterior $p(\theta \mid y, \phi)$ is to Gaussian.
+
+**Works well.** Log-concave likelihoods, e.g., a Poisson with a log link or a negative binomial with a log link.
+These produce unimodal conditional posteriors when combined with a Gaussian prior.
+The approximation error is typically negligible for these likelihoods, especially with moderate-to-large counts [@Kuss:2005; @Vanhatalo:2010; @Cseke:2011; @Vehtari:2016].
+
+**Works adequately.** Bernoulli with logit link.
+The curvature information from binary observations is weaker, so the Gaussian approximation is less accurate than for count data.
+The embedded Laplace is still useful when $\theta$ is high-dimensional and joint sampling is infeasible; see @Vehtari:2016 and @Margossian:2020 for discussion.
+
+**Not appropriate.** The normal distribution is log-concave, but when the likelihood is normal, marginalization can be performed exactly and no approximation is needed.
+For likelihoods that are not log-concave in $\theta$, the conditional posterior may be multimodal and the Laplace approximation will miss modes.
+When $\theta$ is low-dimensional (a few dozen or fewer), the overhead of the inner optimization may not pay for itself and standard joint HMC sampling is often adequate.
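
The scope of the approximation can be made concrete with a toy example. The following Python sketch (illustrative only, not Stan code; all names are hypothetical) applies the Laplace approximation to a one-dimensional latent Gaussian model with a Poisson likelihood and log link: it finds the mode of the log joint by Newton's method, then approximates the log marginal likelihood with a Gaussian integral around the mode.

```python
import math

def laplace_log_marginal(y, sigma, tol=1e-10, max_steps=100):
    """Laplace approximation of log p(y) for the toy model
    theta ~ normal(0, sigma), y ~ poisson(exp(theta))."""
    def psi(t):   # log joint, dropping terms constant in theta
        return -0.5 * t * t / sigma**2 + y * t - math.exp(t)
    def grad(t):
        return -t / sigma**2 + y - math.exp(t)
    def hess(t):
        return -1.0 / sigma**2 - math.exp(t)

    # Newton iteration for the mode theta*.
    t = 0.0
    for _ in range(max_steps):
        t_new = t - grad(t) / hess(t)
        converged = abs(psi(t_new) - psi(t)) <= tol
        t = t_new
        if converged:
            break

    # Gaussian integral around the mode:
    # log p(y) ~= Psi(t*) + const + 0.5 log(2 pi) - 0.5 log(-Psi''(t*)).
    const = -math.lgamma(y + 1) - math.log(sigma) - 0.5 * math.log(2 * math.pi)
    return (psi(t) + const
            + 0.5 * math.log(2 * math.pi) - 0.5 * math.log(-hess(t)))
```

Because the Poisson likelihood with log link is log-concave, the objective is unimodal and the Newton iteration is well behaved.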
## Details of the approximation
+
+When the embedded Laplace approximation does not converge or produces unexpected results, the solver configuration may need adjustment.
+This section describes the internals of the Newton solver and the options available for tuning it.
+
### Tuning the Newton solver
-A critical component of the embedded Laplace approximation is the Newton solver
-used to estimate the mode $\theta^*$ of $p(\theta \mid \phi, y)$. The objective
-function being maximized is
+
+A critical component of the embedded Laplace approximation is the Newton solver used to estimate the mode $\theta^*$ of $p(\theta \mid \phi, y)$.
+The objective function being maximized is the log joint density of the prior and the likelihood:
+
$$
\Psi(\theta) = \log p(\theta \mid \phi) + \log p(y \mid \theta, \phi),
$$
-and convergence is declared if the change in the objective is sufficiently
-small between two iterations
+
+Convergence is declared when the change in the objective between successive iterations falls below a *tolerance* $\Delta$.
+
$$
| \Psi (\theta^{(i + 1)}) - \Psi (\theta^{(i)}) | \le \Delta,
$$
-for some *tolerance* $\Delta$. The solver also stops after reaching a
-pre-specified *maximum number of steps*: in that case, Stan throws an exception
-and rejects the current proposal. This is not a problem, as
-long as these exceptions are rare and confined to early phases of the warmup.
-The Newton iteration can be augmented with a linesearch step to insure that
-at each iteration the objective function $\Psi$ decreases. Specifically,
-suppose that
+The solver also stops after reaching a pre-specified *maximum number of steps*.
+In that case, Stan throws a warning but still returns the parameters from the last iteration.
+If you see this warning, check the diagnostics to understand why the solver failed to converge.
+
+To help with cases where a Newton step fails to improve the objective, the Newton iteration can be augmented with a linesearch.
+Specifically, suppose that the objective decreases after a Newton step,
+
$$
\Psi (\theta^{(i + 1)}) < \Psi (\theta^{(i)}).
$$
-This can indicate that the Newton step is too large and that we skipped a region
-where the objective function decreases. In that case, we can reduce the step
-length by a factor of 2, using
+
+This indicates that the Newton step was too large and skipped over a region where the objective function increases.
+In that case, the solver can fall back to a linesearch to find an acceptable step.
+A Wolfe linesearch seeks a step size $\alpha_k$ along the search direction $p_i$ such that the accepted step sufficiently increases the objective while flattening its slope; together, these conditions push the iterates toward the mode.
+The two Wolfe conditions, written below in the standard form for minimizing an objective $f$ (here $f = -\Psi$) with constants $0 < c_1 < c_2 < 1$, are
+
+$$
+f(x_i + \alpha_k p_i) \le f(x_i) + c_1 \alpha_k \nabla f(x_i)^T p_i,
+$$
+$$
+-p_i^T \nabla f(x_i + \alpha_k p_i) \le -c_2 p_i^T \nabla f(x_i).
+$$
+
+A simpler alternative is to reduce the step length by a factor of 2, using
$$
\theta^{(i + 1)} \leftarrow \frac{\theta^{(i + 1)} + \theta^{(i)}}{2}.
$$
-We repeat this halving of steps until
-$\Psi (\theta^{(i + 1)}) \ge \Psi (\theta^{(i)})$, or until a maximum number
-of linesearch steps is reached. By default, this maximum is set to 0, which
-means the Newton solver performs no linesearch. For certain problems, adding
-a linsearch can make the optimization more stable.
+The halving is repeated until $\Psi (\theta^{(i + 1)}) \ge \Psi (\theta^{(i)})$, or until a maximum number of linesearch steps is reached.
+For certain problems, adding a linesearch can make the optimization more stable.
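
The control flow described in this section (a convergence tolerance, a maximum number of Newton steps, and an optional linesearch that halves a step which decreases the objective) can be sketched in Python. This is a schematic of the logic only, not Stan's implementation; the function and argument names are hypothetical.

```python
import math

def newton_mode(psi, grad, hess, theta0, tol=1e-8, max_steps=100,
                max_steps_linesearch=0):
    """Newton mode-finder: stop when the change in the objective falls
    below tol, give up after max_steps, and optionally halve a step
    that decreases psi, up to max_steps_linesearch times."""
    theta = theta0
    for _ in range(max_steps):
        prev = psi(theta)
        proposal = theta - grad(theta) / hess(theta)
        for _ in range(max_steps_linesearch):
            if psi(proposal) >= prev:
                break
            proposal = 0.5 * (proposal + theta)  # halve the step length
        theta = proposal
        if abs(psi(theta) - prev) <= tol:
            return theta
    raise RuntimeError("Newton solver reached max_steps without converging")

# Example: Psi(t) = -t^2 / 2 + 3 t - exp(t)
# (standard normal prior, Poisson likelihood with log link, y = 3).
mode = newton_mode(lambda t: -0.5 * t * t + 3 * t - math.exp(t),
                   lambda t: -t + 3 - math.exp(t),
                   lambda t: -1.0 - math.exp(t),
                   theta0=0.0, max_steps_linesearch=10)
```

At the returned mode the gradient of the objective is numerically zero.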
+
+### Solver strategies
+
-The embedded Laplace approximation uses a custom Newton solver, specialized
-to find the mode of $p(\theta \mid \phi, y)$.
-A keystep for efficient optimization is to insure all matrix inversions are
-numerically stable. This can be done using the Woodburry-Sherman-Morrison
-formula and requires one of three matrix decompositions:
+The embedded Laplace approximation uses a custom Newton solver, specialized to find the mode of $p(\theta \mid \phi, y)$.
+A key step for efficient optimization is to ensure all matrix inversions are numerically stable.
+This can be done using the Woodbury-Sherman-Morrison formula and requires one of three matrix decompositions:
-1. Cholesky decomposition of the Hessian of the negative log likelihood
-$W = - \partial^2_\theta \log p(y \mid \theta, \phi)$
+1. Cholesky decomposition of the Hessian of the negative log likelihood $W = - \partial^2_\theta \log p(y \mid \theta, \phi)$.
2. Cholesky decomposition of the prior covariance matrix $K(\phi)$.
3. LU-decomposition of $I + KW$, where $I$ is the identity matrix.
-The first solver (1) should be used if the negative log likelihood is
-positive-definite. Otherwise the user should rely on (2). In rarer cases where
-it is not numerically safe to invert the covariance matrix $K$, users can
-use the third solver as a last-resort option.
+The first solver (1) should be used when the Hessian $W$ of the negative log likelihood is positive-definite.
+Otherwise, rely on solver (2).
+In the rarer case where it is not numerically safe to invert the covariance matrix $K$, solver (3) is available as a last resort.
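
To illustrate the algebra behind these decompositions, the following Python sketch numerically checks the Woodbury-Sherman-Morrison identity $(K^{-1} + W)^{-1} = K - K\,(W^{-1} + K)^{-1} K$ on a small $2 \times 2$ example. This only verifies the identity; Stan's solvers instead work through the Cholesky and LU factorizations listed above.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inv2(A):
    """Inverse of a 2 x 2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[ A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det,  A[0][0] / det]]

def madd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def msub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

K = [[2.0, 0.5], [0.5, 1.0]]    # prior covariance (symmetric positive-definite)
W = [[3.0, 0.0], [0.0, 0.25]]   # Hessian of the negative log likelihood

# Direct form: invert K^-1 + W head-on.
direct = inv2(madd(inv2(K), W))
# Woodbury form: K - K (W^-1 + K)^-1 K.
woodbury = msub(K, matmul(K, matmul(inv2(madd(inv2(W), K)), K)))
```

The two forms agree to floating-point precision; which one is numerically safer depends on the conditioning of $K$ and $W$, which is exactly what the choice of solver (1), (2), or (3) addresses.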
### Sparse Hessian of the log likelihood
-A key step to speed up computation is to take advantage of the sparsity of
-the Hessian of the log likelihood,
+A key step to speed up computation is to take advantage of the sparsity of $H$, the Hessian of the log likelihood with respect to the latent variables,
$$
H = \frac{\partial^2}{\partial \theta^2} \log p(y \mid \theta, \phi).
$$
-For example, if the observations $(y_1, \cdots, y_N)$ are conditionally
-independent and each depends on only depend on one component of $\theta$,
-such that
+For example, if the observations $(y_1, \cdots, y_N)$ are conditionally independent and each depends on only one component of $\theta$, the log likelihood decomposes into a sum of per-observation terms,
$$
\log p(y \mid \theta, \phi) = \sum_{i = 1}^N \log p(y_i \mid \theta_i, \phi),
$$
-then the Hessian is diagonal. This leads to faster calculations of the Hessian
-and subsequently sparse matrix operations. This case is common in Gaussian
-process models, and certain hierarchical models.
+and the Hessian is diagonal.
+This leads to faster calculations of the Hessian and subsequently sparse matrix operations.
+This case is common in Gaussian process models and certain hierarchical models.
-Stan's suite of functions for the embedded Laplace approximation are not
-equipped to handle arbitrary sparsity structures; instead, they work on
-block-diagonal Hessians, and the user can specify the size $B$ of these blocks.
-The user is responsible for working out what $B$ is. If the Hessian is dense,
-then we simply set $B = N$.
+Stan's suite of functions for the embedded Laplace approximation exploits block-diagonal structure in the Hessian, where the user specifies the block size $B$.
+The user is responsible for working out the correct value of $B$.
+If the Hessian is dense, we simply set $B = N$; the diagonal case above corresponds to $B = 1$.
+Arbitrary sparsity patterns beyond block-diagonal structure are not currently supported.
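
The diagonal case ($B = 1$) can be checked numerically. The Python sketch below (illustrative only, not Stan code) evaluates a finite-difference Hessian of a conditionally independent Poisson log likelihood with a log link; the off-diagonal entries vanish and the diagonal entries equal $-e^{\theta_i}$.

```python
import math

def log_lik(theta, y):
    """Conditionally independent Poisson observations with a log link:
    log p(y | theta) = sum_i [ y_i theta_i - exp(theta_i) - log(y_i!) ]."""
    return sum(yi * ti - math.exp(ti) - math.lgamma(yi + 1)
               for ti, yi in zip(theta, y))

def fd_hessian(f, theta, eps=1e-5):
    """Dense central finite-difference Hessian of f at theta."""
    n = len(theta)
    def shifted(i, di, j, dj):
        t = list(theta)
        t[i] += di
        t[j] += dj
        return f(t)
    return [[(shifted(i, eps, j, eps) - shifted(i, eps, j, -eps)
              - shifted(i, -eps, j, eps) + shifted(i, -eps, j, -eps))
             / (4.0 * eps * eps)
             for j in range(n)] for i in range(n)]

theta = [0.3, -0.7, 1.1]
y = [2, 0, 5]
H = fd_hessian(lambda t: log_lik(t, y), theta)
# Off-diagonal entries are (numerically) zero and diagonal entries
# equal -exp(theta_i), so this likelihood corresponds to B = 1.
```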
-NOTE: currently, there is no support for sparse prior covariance matrix.
-We expect this to be supported in future versions of Stan.
diff --git a/src/stan-users-guide/gaussian-processes.qmd b/src/stan-users-guide/gaussian-processes.qmd
index 58f05db3b..ffbde0880 100644
--- a/src/stan-users-guide/gaussian-processes.qmd
+++ b/src/stan-users-guide/gaussian-processes.qmd
@@ -476,7 +476,19 @@ parameters {
vector[N] eta;
}
model {
- // ...
+ vector[N] f;
+ {
+ matrix[N, N] L_K;
+ matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);
+
+ // diagonal elements
+ for (n in 1:N) {
+ K[n, n] = K[n, n] + delta;
+ }
+
+ L_K = cholesky_decompose(K);
+ f = L_K * eta;
+ }
rho ~ inv_gamma(5, 5);
alpha ~ std_normal();
a ~ std_normal();
@@ -510,7 +522,7 @@ the marginal likelihood as follows:
$$
\hat p_\mathcal{L}(y \mid \rho, \alpha, a)
= \frac{p(f^* \mid \alpha, \rho) p(y \mid f^*, a)}{
- \hat p_\mathcal{L}(f \mid \rho, \alpha, a, y)},
+ \hat p_\mathcal{L}(f^* \mid \rho, \alpha, a, y)},
$$
where $f^*$ is the mode of $p(f \mid \rho, \alpha, a, y)$, obtained via
numerical optimization.
@@ -533,7 +545,24 @@ functions {
}
```
-We then increment `target` in the model block with the approximation to
+The embedded Laplace relies on calculations of the log likelihood's Hessian,
+$\partial^2 \log p(y \mid f, a, \rho, \alpha) / \partial f^2$, and these
+calculations can be much faster when the Hessian is sparse. In particular,
+it is expected that the Hessian is block diagonal. In the `transformed data`
+block we can specify the block size of the Hessian.
+```stan
+transformed data {
+ int hessian_block_size = 1;
+}
+```
+For example, if $y_i$ depends only on $f_i$, then the Hessian of the log
+likelihood is diagonal and the block size is 1. On the other hand, if the
+Hessian is dense, then we set the Hessian block size
+to $N$, where $N$ is the dimension of $f$. Currently, Stan does not check
+the block size of the Hessian, so the user is responsible for specifying
+it correctly.
+
+Finally, we increment `target` in the model block with the approximation to
$\log p(y \mid \rho, \alpha, a)$.
```stan
model {
@@ -541,7 +570,7 @@ model {
alpha ~ std_normal();
sigma ~ std_normal();
- target += laplace_marginal(ll_function, (a, y),
+ target += laplace_marginal(ll_function, (a, y), hessian_block_size,
cov_function, (rho, alpha, x, N, delta));
}
```
@@ -549,51 +578,56 @@ Notice that we do not need to construct $f$ explicitly, since it is
marginalized out. Instead, we recover the GP function in `generated quantities`:
```stan
generated quantities {
- vector[N] f = laplace_latent_rng(ll_function, (a, y),
+ vector[N] f = laplace_latent_rng(ll_function, (a, y), hessian_block_size,
cov_function, (rho, alpha, x, N, delta));
}
```
Users can set the control parameters of the embedded Laplace approximation,
via `laplace_marginal_tol` and `laplace_latent_tol_rng`. When using these
-functions, the user must set *all* the control parameters.
+functions, the user must set *all* the control options and store them in a tuple.
+These control parameters mostly concern the numerical optimizer used to find
+the mode $f^*$ of $p(f \mid \rho, \alpha, a, y)$.
```stan
transformed data {
-// ...
+ tuple(vector[N], real, int, int, int, int) laplace_ops;
+ laplace_ops.1 = rep_vector(0, N); // starting point for Laplace optimizer
+ laplace_ops.2 = 1.49e-8; // tolerance for optimizer
+ laplace_ops.3 = 500; // maximum number of steps for optimizer.
+ laplace_ops.4 = 1; // solver type (1, 2, or 3); default is 1.
+ laplace_ops.5 = 1000; // max number of steps for linesearch.
+ laplace_ops.6 = 1; // allow_fallback (1: TRUE, 0: FALSE)
+}
+```
+If users want to depart from the defaults for only some of the control
+parameters, a tuple with the default values (as above) can be created with the
+helper function `generate_laplace_options()`, and the relevant control
+parameters can then be modified,
+```stan
+transformed data {
+ tuple(vector[N], real, int, int, int, int) laplace_ops =
+ generate_laplace_options(N);
- vector[N] f_init = rep_vector(0, N); // starting point for optimizer.
- real tol = 1e-6; // optimizer's tolerance for Laplace approx.
- int max_num_steps = 1e3; // maximum number of steps for optimizer.
- int hessian_block_size = 1; // when hessian of log likelihood is block
- // diagonal, size of block (here 1).
- int solver = 1; // which Newton optimizer to use; default is 1,
- // use 2 and 3 only for special cases.
- max_steps_linesearch = 0; // if >= 1, optimizer does a lineseach with
- // specified number of steps.
+ laplace_ops.2 = 1e-6; // make tolerance of the optimizer less strict.
}
-
-// ...
-
+```
+The tuple `laplace_ops` is then passed to `laplace_marginal_tol`
+and `laplace_latent_tol_rng`.
+```stan
model {
// ...
- target += laplace_marginal(ll_function, (a, y),
- cov_function, (rho, alpha, x, N, delta),
- f_init, tol, max_num_steps, hessian_block_size,
- solver, max_steps_linesearch);
+ target += laplace_marginal_tol(ll_function, (a, y), hessian_block_size,
+ cov_function, (rho, alpha, x, N, delta),
+ laplace_ops);
}
generated quantities {
- vector[N] f = laplace_latent_rng(ll_function, (a, y),
+ vector[N] f = laplace_latent_tol_rng(ll_function, (a, y), hessian_block_size,
cov_function, (rho, alpha, x, N, delta),
- f_init, tol, max_num_steps,
- hessian_block_size, solver,
- max_steps_linesearch);
+ laplace_ops);
}
```
-For details about the control parameters, see @Margossian:2023.
-
Stan also provides support for a limited menu of built-in functions, including
the Poisson distribution with a log link and and prior mean $m$. When using such
@@ -696,13 +730,19 @@ functions {
// ...
+transformed data {
+ int hessian_block_size = 1;
+}
+
+// ...
+
model {
- target += laplace_marginal(ll_function, (a, z),
+ target += laplace_marginal(ll_function, (a, z), hessian_block_size,
cov_function, (rho, alpha, x, N, delta));
}
generated quantities {
- vector[N] f = laplace_latent_rng(ll_function, (a, z),
+ vector[N] f = laplace_latent_rng(ll_function, (a, z), hessian_block_size,
cov_function, (rho, alpha, x, N, delta));
}
```