diff --git a/src/functions-reference/embedded_laplace.qmd b/src/functions-reference/embedded_laplace.qmd index 63346dbfb..247d19b44 100644 --- a/src/functions-reference/embedded_laplace.qmd +++ b/src/functions-reference/embedded_laplace.qmd @@ -4,23 +4,29 @@ pagetitle: Embedded Laplace Approximation # Embedded Laplace Approximation -The embedded Laplace approximation can be used to approximate certain -marginal and conditional distributions that arise in latent Gaussian models. -A latent Gaussian model observes the following hierarchical structure: +The embedded Laplace approximation can be used to approximate certain marginal +and conditional distributions that arise in latent Gaussian models. +Embedded Laplace replaces explicit sampling of high-dimensional Gaussian latent +variables with a local Gaussian approximation. +In doing so, it marginalizes out the latent Gaussian variables. +Inference can then be performed on the remaining, often low-dimensional, +parameters. The embedded Laplace approximation in Stan is best suited for +latent Gaussian models when jointly sampling over all model parameters is +expensive and the conditional posterior of the Gaussian latent variables is +reasonably close to Gaussian. + +For observed data $y$, latent Gaussian variables $\theta$, and hyperparameters $\phi$, a latent Gaussian model observes the following hierarchical structure: \begin{eqnarray} \phi &\sim& p(\phi), \\ \theta &\sim& \text{MultiNormal}(0, K(\phi)), \\ y &\sim& p(y \mid \theta, \phi). \end{eqnarray} -In this formulation, $y$ represents the -observed data, and $p(y \mid \theta, \phi)$ is the likelihood function that -specifies how observations are generated conditional on the latent Gaussian -variables $\theta$ and hyperparameters $\phi$. +In this formulation, $p(y \mid \theta, \phi)$ is the likelihood function that +specifies how observations are generated conditional on $\theta$ and $\phi$. 
$K(\phi)$ denotes the prior covariance matrix for the latent Gaussian variables -$\theta$ and is parameterized by $\phi$. -The prior $p(\theta \mid \phi)$ is restricted to be a multivariate normal -centered at 0. That said, we can always pick a likelihood that offsets $\theta$, -which is equivalently to specifying a prior mean. +$\theta$ and is parameterized by $\phi$. The prior on $\theta$ is centered at 0; +however, an offset can always be added when specifying the likelihood function +$p(y \mid \theta, \phi)$. To sample from the joint posterior $p(\phi, \theta \mid y)$, we can either use a standard method, such as Markov chain Monte Carlo, or we can follow @@ -34,7 +40,7 @@ are typically available in closed form and so they must be approximated. The marginal posterior can be written as $p(\phi \mid y) \propto p(y \mid \phi) p(\phi)$, where $p(y \mid \phi) = \int p(y \mid \phi, \theta) p(\theta) \text{d}\theta$ is called the marginal likelihood. The Laplace method approximates -$p(y \mid \phi, \theta) p(\theta)$ with a normal distribution centered at +$p(y \mid \phi, \theta) p(\theta)$ with a normal distribution centered at the mode, $$ \theta^* = \underset{\theta}{\text{argmax}} \ \log p(\theta \mid y, \phi), $$ @@ -53,7 +59,7 @@ using one of Stan's algorithms. The marginal posterior is lower dimensional and likely to have a simpler geometry leading to more efficient inference. On the other hand, each marginal likelihood computation is more costly, and the combined change in efficiency -depends on the case. +depends on the application. To obtain posterior draws for $\theta$, we sample from the normal approximation to $p(\theta \mid y, \phi)$ in `generated quantities`. @@ -62,7 +68,11 @@ then $p(\theta \mid y, \phi)$ produces samples from the joint posterior $p(\theta, \phi \mid y)$. The Laplace approximation is especially useful if $p(y \mid \phi, \theta)$ is
Stan's embedded Laplace approximation is restricted to the case +log-concave, e.g., Poisson, binomial, negative-binomial, and Bernoulli. +(The normal distribution is also log-concave; however, when the likelihood is +normal, marginalization can be performed exactly and does not require an +approximation.) +Stan's embedded Laplace approximation is restricted to the case where the prior $p(\theta \mid \phi)$ is multivariate normal. Furthermore, the likelihood $p(y \mid \phi, \theta)$ must be computed using only operations which support higher-order derivatives @@ -74,33 +84,36 @@ In the `model` block, we increment `target` with `laplace_marginal`, a function that approximates the log marginal likelihood $\log p(y \mid \phi)$. The signature of the function is: -\index{{\tt \bfseries laplace\_marginal\_tol }!{\tt (function likelihood\_function, tuple(...) likelihood\_arguments, function covariance\_function, tuple(...), vector theta\_init covariance\_arguments): real}|hyperpage} +\index{{\tt \bfseries laplace\_marginal }!{\tt (function likelihood\_function, tuple(...) likelihood\_arguments, int hessian_block_size, function covariance\_function, tuple(...)): real}|hyperpage} - +`real` **`laplace_marginal`**`(function likelihood_function, tuple(...) likelihood_arguments, int hessian_block_size, function covariance_function, tuple(...) covariance_arguments)` -`real` **`laplace_marginal`**`(function likelihood_function, tuple(...) likelihood_arguments, function covariance_function, tuple(...) covariance_arguments)` - -Which returns an approximation to the log marginal likelihood $p(y \mid \phi)$. +which returns an approximation to the log marginal likelihood $p(y \mid \phi)$. {{< since 2.37 >}} -This function takes in the following arguments. +The embedded Laplace functions accept two functors whose user-defined arguments are passed in as tuples to `laplace_marginal`. -1. 
`likelihood_function` - user-specified log likelihood whose first argument is the vector of latent Gaussian variables `theta` -2. `likelihood_arguments` - A tuple of the log likelihood arguments whose internal members will be passed to the covariance function -3. `covariance_function` - Prior covariance function -4. `covariance_arguments` A tuple of the arguments whose internal members will be passed to the the covariance function +1. `likelihood_function` - user-specified log likelihood whose first argument is the vector of latent Gaussian variables $\theta$. +The subsequent arguments are user-defined. + - `real likelihood_function(vector theta, likelihood_arguments_1, likelihood_arguments_2, ...)` +2. `likelihood_arguments` - A tuple of arguments whose internal members are passed to the log likelihood function. +This tuple does NOT include the latent variable $\theta$. +3. `hessian_block_size` - the block size of the Hessian of the log likelihood, $\partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$. +4. `covariance_function` - A function that returns the covariance matrix of the multivariate normal prior on $\theta$. + - `matrix covariance_function(covariance_argument_1, covariance_argument_2, ...)` +5. `covariance_arguments` - A tuple of the arguments whose internal members will be passed to the covariance function. Below we go over each argument in more detail. ## Specifying the log likelihood function {#laplace-likelihood_spec} The first step to use the embedded Laplace approximation is to write down a -function in the `functions` block which returns the log joint likelihood +function in the `functions` block which returns the log likelihood $\log p(y \mid \theta, \phi)$. There are a few constraints on this function: -1. The function return type must be `real` +1. The function return type must be `real`. 2. The first argument must be the latent Gaussian variable $\theta$ and must have type `vector`. @@ -124,7 +137,7 @@ as data or parameter. 
The tuple after `likelihood_function` contains the arguments that get passed to `likelihood_function` *excluding $\theta$*. For instance, if a user-defined -likelihood uses a real and a matrix the likelihood function's signature would +likelihood uses a real and a matrix, the likelihood function's signature would first have a vector and then a real and matrix argument. ```stan @@ -149,6 +162,13 @@ for example, real likelihood_function(vector theta, data vector x, ...) ``` +In addition to the likelihood function, users must specify the block size +of the Hessian, $\partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$. +The Hessian is often block diagonal and this structure can be exploited for fast computation. +For example, if $y_i$ only depends on $\theta_i$, then the Hessian is diagonal and `hessian_block_size=1`. +On the other hand, if the Hessian is not block diagonal, we can always set +`hessian_block_size=n` where $n$ is the size of $\theta$. + ## Specifying the covariance function The argument `covariance_function` returns the prior covariance matrix @@ -159,20 +179,6 @@ Its return type must be a matrix of size $n \times n$ where $n$ is the size of matrix covariance_function(...) ``` - - The `...` represents a set of optional variadic arguments. There are no type restrictions for the variadic arguments `...` and each argument can be passed as data or parameter. 
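+As an illustration, a likelihood and covariance function pair might be declared as follows. This is a sketch only: the function names, the Poisson likelihood, and the exponentiated quadratic kernel are illustrative choices, not part of the interface.
+
+```stan
+functions {
+  // Log likelihood: the first argument must be the latent vector theta.
+  // Here each count y[i] depends only on theta[i] (Poisson with log link).
+  real poisson_log_likelihood(vector theta, data array[] int y) {
+    return poisson_log_lpmf(y | theta);
+  }
+  // Covariance function: must return the n x n prior covariance matrix.
+  matrix se_covariance(data array[] real x, real alpha, real rho) {
+    int n = size(x);
+    // Exponentiated quadratic kernel with a small diagonal jitter
+    // for numerical stability.
+    return gp_exp_quad_cov(x, alpha, rho)
+           + diag_matrix(rep_vector(1e-8, n));
+  }
+}
+```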
The variables @@ -198,51 +204,77 @@ It is also possible to specify control parameters, which can help improve the optimization that underlies the Laplace approximation, using `laplace_marginal_tol` with the following signature: -\index{{\tt \bfseries laplace\_marginal\_tol }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init, real tol, int max\_steps, int hessian\_block\_size, int solver, int max\_steps\_linesearch): real}|hyperpage} - - -\index{{\tt \bfseries laplace\_marginal\_tol }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init, real tol, int max\_steps, int hessian\_block\_size, int solver, int max\_steps\_linesearch): real}|hyperpage} - -`real` **`laplace_marginal_tol`**`(function likelihood_function, tuple(...), function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch)`
\newline +```stan +real laplace_marginal_tol(function likelihood_function, tuple(...), + int hessian_block_size, + function covariance_function, tuple(...), + tuple(vector theta_init, real tol, int max_steps, int solver, + int max_steps_linesearch, int allow_fallback)) +``` Returns an approximation to the log marginal likelihood $p(y \mid \phi)$ and allows the user to tune the control parameters of the approximation. -* `theta_init`: the initial guess for the Newton solver when finding the mode +* `theta_init`: the initial guess for a Newton solver when finding the mode of $p(\theta \mid y, \phi)$. By default, it is a zero-vector. * `tol`: the tolerance $\epsilon$ of the optimizer. Specifically, the optimizer stops when $||\nabla \log p(\theta \mid y, \phi)|| \le \epsilon$. By default, -the value is $\epsilon = 10^{-6}$. +the value is $\epsilon \approx 1.49 \times 10^{-8}$, which is the square-root of machine precision. * `max_num_steps`: the maximum number of steps taken by the optimizer before it gives up (in which case the Metropolis proposal gets rejected). The default -is 100 steps. +is 500 steps. -* `hessian_block_size`: the size of the blocks, assuming the Hessian -$\partial \log p(y \mid \theta, \phi) \ \partial \theta$ is block-diagonal. -The structure of the Hessian is determined by the dependence structure of $y$ -on $\theta$. By default, the Hessian is treated as diagonal -(`hessian_block_size=1`). If the Hessian is not block diagonal, then set -`hessian_block_size=n`, where `n` is the size of $\theta$. - -* `solver`: choice of Newton solver. The optimizer used to compute the +* `solver`: choice of Newton solver. The optimizer underlying the Laplace approximation does one of three matrix decompositions to compute a -Newton step. The problem determines which decomposition is numerical stable. -By default (`solver=1`), the solver makes a Cholesky decomposition of the -negative Hessian, $- \partial \log p(y \mid \theta, \phi) / \partial \theta$. 
-If `solver=2`, the solver makes a Cholesky decomposition of the covariance -matrix $K(\phi)$. -If the Cholesky decomposition cannot be computed for neither the negative -Hessian nor the covariance matrix, use `solver=3` which uses a more expensive -but less specialized approach. +Newton step. The problem determines which decomposition is numerically stable. +By default (`solver=1`), the solver attempts a Cholesky decomposition of the +negative Hessian of the log likelihood, +$- \partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$. +This operation is legal if the negative Hessian is positive-definite, +which will always be true when the likelihood is log concave. +If `solver=2`, the solver makes a Cholesky decomposition of the covariance matrix $K(\phi)$. +Since a covariance matrix is always positive-definite, computing its +Cholesky decomposition is always a legal operation, at least in theory. +In practice, we may not be able to compute the Cholesky decomposition of the +negative Hessian or the covariance matrix, either because it does not exist or +because of numerical issues. +In that case, we can use `solver=3`, which uses a more expensive but less +specialized approach to compute a Newton step. * `max_steps_linesearch`: maximum number of steps in linesearch. The linesearch -method tries to insure that the Newton step leads to a decrease in the -objective function. If the Newton step does not improve the objective function, -the step is repeatedly halved until the objective function decreases or the -maximum number of steps in the linesearch is reached. By default, -`max_steps_linesearch=0`, meaning no linesearch is performed. +adjusts the step size to ensure that a Newton step leads to an increase in +the objective function (i.e., $f(\theta) = p(\theta \mid \phi, y)$). 
+If a standard Newton step does not improve the objective function, +the step is adjusted iteratively until the objective function increases +or the maximum number of steps in the linesearch is reached. +By default, `max_steps_linesearch=1000`. +Setting `max_steps_linesearch=0` results in no linesearch. + +* `allow_fallback`: If the user-specified solver fails, this flag determines whether to fall back to the next solver. For example, if the user specifies `solver=1` but the Cholesky decomposition of the negative Hessian $- \partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$ fails, the optimizer will try `solver=2` instead. +By default, `allow_fallback = 1` (TRUE). + +The embedded Laplace approximation's options have a helper callable `generate_laplace_options(int theta_size)` that will generate the tuple for you. This can be useful for quickly setting up the control parameters in the `transformed data` block to reuse within the model. + +```stan +tuple(vector[theta_size], real, int, int, int, int, int) laplace_ops = generate_laplace_options(theta_size); +// Modify solver type +laplace_ops.5 = 2; +// Turn off fallback +laplace_ops.7 = 0; +``` + +The arguments stored in the `laplace_ops` tuple are: +``` +laplace_ops = {theta_init, + tol, + max_num_steps, + hessian_block_size, + solver, + max_steps_linesearch, + allow_fallback} +``` {{< since 2.37 >}} @@ -253,18 +285,18 @@ approximation of $p(\theta \mid \phi, y)$ using `laplace_latent_rng`. 
The signature for `laplace_latent_rng` follows closely the signature for `laplace_marginal`: - -\index{{\tt \bfseries laplace\_latent\_rng }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init): vector}|hyperpage} + +\index{{\tt \bfseries laplace\_latent\_rng }!{\tt (function likelihood\_function, tuple(...), int hessian_block_size, function covariance\_function, tuple(...)): vector}|hyperpage} -`vector` **`laplace_latent_rng`**`(function likelihood_function, tuple(...), function covariance_function, tuple(...))`
\newline +`vector` **`laplace_latent_rng`**`(function likelihood_function, tuple(...), int hessian_block_size, function covariance_function, tuple(...))`
\newline -Draws approximate samples from the conditional posterior $p(\theta \mid y, \phi)$. +Draws samples from the Laplace approximation to the conditional posterior $p(\theta \mid y, \phi)$. {{< since 2.37 >}} Once again, it is possible to specify control parameters: -\index{{\tt \bfseries laplace\_latent\_tol\_rng }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init, real tol, int max\_steps, int hessian\_block\_size, int solver, int max\_steps\_linesearch): vector}|hyperpage} +\index{{\tt \bfseries laplace\_latent\_tol\_rng }!{\tt (function likelihood\_function, tuple(...), int hessian_block_size, function covariance\_function, tuple(...), tuple(...) laplace_ops): vector}|hyperpage} -`vector` **`laplace_latent_tol_rng`**`(function likelihood_function, tuple(...), function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch)`
\newline +`vector` **`laplace_latent_tol_rng`**`(function likelihood_function, tuple(...), int hessian_block_size, function covariance_function, tuple(...), tuple(...) laplace_ops)`
\newline Draws approximate samples from the conditional posterior $p(\theta \mid y, \phi)$ and allows the user to tune the control parameters of the approximation. {{< since 2.37 >}} diff --git a/src/functions-reference/functions_index.qmd b/src/functions-reference/functions_index.qmd index 870df9d52..ad18c3d42 100644 --- a/src/functions-reference/functions_index.qmd +++ b/src/functions-reference/functions_index.qmd @@ -1713,7 +1713,7 @@ pagetitle: Alphabetical Index **laplace_latent_rng**: - -
[`(function likelihood_function, tuple(...), function covariance_function, tuple(...)) : vector`](embedded_laplace.qmd#index-entry-6d0685309664591fc32d3e2a2304af7aa5459e1c) (embedded_laplace.html)
+ -
[`(function likelihood_function, tuple(...), int hessian_block_size, function covariance_function, tuple(...)) : vector`](embedded_laplace.qmd#index-entry-cacb1d5344e246ec89c460e00a6f1065f4a8a1c1) (embedded_laplace.html)
**laplace_latent_tol_bernoulli_logit_rng**: @@ -1731,11 +1731,6 @@ pagetitle: Alphabetical Index -
[`(array[] int y, array[] int y_index, vector m, function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch) : vector`](embedded_laplace.qmd#index-entry-97f692748ebfabf0574b7a8e2edaceb940ee4b6b) (embedded_laplace.html)
-**laplace_marginal**: - - -
[`(function likelihood_function, tuple(...) likelihood_arguments, function covariance_function, tuple(...) covariance_arguments) : real`](embedded_laplace.qmd#index-entry-6da15f0ed076016d814cdc278127896f99d29633) (embedded_laplace.html)
- - **laplace_marginal_bernoulli_logit**: -
[distribution statement](embedded_laplace.qmd#index-entry-1d93ab799518d0aac88e63c01f9655f36c7cbeb6) (embedded_laplace.html)
@@ -1781,11 +1776,6 @@ pagetitle: Alphabetical Index -
[`(array[] int y | array[] int y_index, vector m, function covariance_function, tuple(...)) : real`](embedded_laplace.qmd#index-entry-c092314a5f45deef27ce0e8930a7d28c87ca601d) (embedded_laplace.html)
-**laplace_marginal_tol**: - - -
[`(function likelihood_function, tuple(...), function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch) : real`](embedded_laplace.qmd#index-entry-0f4bd0330deef2db7884dc5a4c933f181e1f2a8c) (embedded_laplace.html)
- - **laplace_marginal_tol_bernoulli_logit**: -
[distribution statement](embedded_laplace.qmd#index-entry-92d43f7c3643c7d85966bbfc0b2a71546facb37c) (embedded_laplace.html)
diff --git a/src/reference-manual/laplace.qmd b/src/reference-manual/laplace.qmd index 092b8c479..06ac96432 100644 --- a/src/reference-manual/laplace.qmd +++ b/src/reference-manual/laplace.qmd @@ -14,14 +14,14 @@ to the constrained space before outputting them. Given the estimate of the mode $\widehat{\theta}$, the Hessian $H(\widehat{\theta})$ is computed using -central finite differences of the model functor. +central finite differences of the model functor. Next the algorithm computes the Cholesky factor of the negative inverse Hessian: $R^{-1} = \textrm{chol}(-H(\widehat{\theta})) \backslash \mathbf{1}$. Each draw is generated on the unconstrained scale by sampling -$\theta^{\textrm{std}(m)} \sim \textrm{normal}(0, \textrm{I})$ +$\theta^{\textrm{std}(m)} \sim \textrm{normal}(0, \textrm{I})$ and defining draw $m$ to be diff --git a/src/reference-manual/laplace_embedded.qmd b/src/reference-manual/laplace_embedded.qmd index 3912ccc8a..5a9fdd542 100644 --- a/src/reference-manual/laplace_embedded.qmd +++ b/src/reference-manual/laplace_embedded.qmd @@ -4,173 +4,191 @@ pagetitle: Embedded Laplace Approximation # Embedded Laplace Approximation -Stan provides functions to perform an embedded Laplace -approximation for latent Gaussian models, following the procedure described -by @RasmussenWilliams:2006 and @Rue:2009. This approach is often refered to -as the integrated or nested Laplace approximation, although the exact details -of the method can vary. The details of Stan's implementation can be found in -references [@Margossian:2020; @Margossian:2023]. - -A standard approach to fit a latent Gaussian model would be to perform inference -jointly over the latent Gaussian variables and the hyperparameters. -Instead, the embedded Laplace approximation can be used to do *approximate* -marginalization of the latent Gaussian variables; we can then -use any inference over the remaining hyperparameters, for example Hamiltonian -Monte Carlo sampling. 
- -Formally, consider a latent Gaussian model, +The embedded Laplace approximation replaces explicit sampling of high-dimensional Gaussian latent variables with a local Gaussian approximation, marginalizing them out so that inference proceeds over the remaining hyperparameters alone. +This approach is often referred to as the integrated or nested Laplace approximation, although the exact details of the method can vary. +The details of Stan's implementation can be found in references [@Margossian:2020; @Margossian:2023]. + +A standard approach to fit a latent Gaussian model would be to perform inference jointly over the latent Gaussian variables and the hyperparameters. +Instead, the embedded Laplace approximation can be used to do *approximate* marginalization of the latent Gaussian variables; we can then use any inference over the remaining hyperparameters. +By marginalizing out the latent variables, the sampler explores a lower-dimensional, better-behaved marginal posterior. +Individual iterations are more expensive (each requires an inner optimization), but the sampler typically needs far fewer iterations to achieve the same effective sample size. + +For complete function signatures and the built-in likelihood wrappers (Poisson, Negative Binomial, Bernoulli), see the [Embedded Laplace functions reference](../functions-reference/embedded_laplace.qmd). +For worked examples with full data blocks, see the [Gaussian Processes chapter](../stan-users-guide/gaussian-processes.qmd#poisson-gp-using-an-embedded-laplace-approximation). + + +## Latent Gaussian models + +The embedded Laplace approximation is used for latent Gaussian models. A latent Gaussian model is defined by three main components: + +- $\phi$: hyperparameters (e.g., GP kernel length-scale and magnitude, or variance components in a hierarchical model), +- $\theta$: latent Gaussian variables (the high-dimensional quantity to be marginalized out), +- $y$: observed data. 
+ +These components are related through a hierarchical structure. +The hyperparameters $\phi$ are given a prior $p(\phi)$. +The latent variables $\theta$ have a multivariate normal prior with covariance matrix $K(\phi)$. +The observations $y$ are generated according to a likelihood $p(y \mid \theta, \phi)$. +The prior on $\theta$ is centered at zero; an offset can be incorporated into the likelihood function if a non-zero mean is needed. + \begin{eqnarray*} \phi & \sim & p(\phi) \\ \theta & \sim & \text{Multi-Normal}(0, K(\phi)) \\ y & \sim & p(y \mid \theta, \phi). \end{eqnarray*} -The motivation for marginalization is to bypass the challenging geometry of the joint -posterior $p(\phi, \theta \mid y)$. This geometry (e.g. funnels) often frustrates -inference algorithms, including Hamiltonian Monte Carlo sampling and approximate -methods such as variational inference. On the other hand, the marginal posterior -$p(\phi \mid y)$ is often well-behaved and in many cases low-dimensional. -Furthermore, the conditional posterior $p(\theta \mid \phi, y)$ can be well -approximated by a normal distribution, if the likelihood $p(y \mid \theta, \phi)$ -is log concave. +The generative model above defines a joint distribution over all three quantities, $p(\phi, \theta, y) = p(\phi) \, p(\theta \mid \phi) \, p(y \mid \theta, \phi)$. +After observing data $y$, Bayes' theorem gives the joint posterior $p(\phi, \theta \mid y) \propto p(\phi) \, p(\theta \mid \phi) \, p(y \mid \theta, \phi)$. + +Sampling directly from the joint posterior $p(\phi, \theta \mid y)$ of this model is often difficult. +Challenging geometries (e.g., funnels) frustrate inference algorithms, including Hamiltonian Monte Carlo and variational inference. +However, the marginal posterior $p(\phi \mid y)$ is often well-behaved and low-dimensional, making it much easier to sample. 
+The Laplace approximation allows the conditional posterior $p(\theta \mid \phi, y)$ to be approximated by a normal distribution when the likelihood $p(y \mid \theta, \phi)$ is log-concave. ## Approximation of the conditional posterior and marginal likelihood -The Laplace approximation is the normal distribution that matches the mode -and curvature of the conditional posterior $p(\theta \mid y, \phi)$. -The mode, +The two-step inference strategy for using the embedded Laplace approximation in a latent Gaussian model requires approximations to both the conditional posterior $p(\theta \mid y, \phi)$ and the marginal likelihood $p(y \mid \phi)$. +The Laplace approximation is the normal distribution that matches the mode and curvature of the conditional posterior $p(\theta \mid y, \phi)$. +The mode, defined as the value of $\theta$ that maximizes the conditional posterior, is estimated by a Newton solver, $$ \theta^* = \underset{\theta}{\text{argmax}} \ p(\theta \mid y, \phi), $$ -is estimated by a Newton solver. Since the approximation is normal, -the curvature is matched by setting the covariance to the negative Hessian -of the log conditional posterior, evaluated at the mode, + +Since the approximation is normal, the curvature is matched by setting the covariance to the inverse of the negative Hessian of the log conditional posterior, evaluated at the mode, + $$ \Sigma^* = \left( - \left . \frac{\partial^2}{\partial \theta^2} \log p (\theta \mid \phi, y) \right |_{\theta =\theta^*} \right)^{-1}. $$ -The resulting Laplace approximation is then, + +The resulting Laplace approximation is a multivariate normal centered at the mode with covariance given by the inverse curvature, + $$ \hat p_\mathcal{L} (\theta \mid y, \phi) = \text{Multi-Normal}(\theta^*, \Sigma^*) \approx p(\theta \mid y, \phi). 
$$ -This approximation implies another approximation for the marginal likelihood, + +This approximation also yields an approximation to the marginal likelihood, obtained by evaluating the prior, likelihood, and approximate posterior at the mode $\theta^*$, + $$ \hat p_\mathcal{L}(y \mid \phi) := \frac{p(\theta^* \mid \phi) \ p(y \mid \theta^*, \phi) }{ \hat p_\mathcal{L} (\theta^* \mid \phi, y) } \approx p(y \mid \phi). $$ -Hence, a strategy to approximate the posterior of the latent Gaussian model -is to first estimate the marginal posterior -$\hat p_\mathcal{L}(\phi \mid y) \propto p(\phi) p_\mathcal{L} (y \mid \phi)$ + +Hence, a strategy to approximate the posterior of the latent Gaussian model is to first estimate the marginal posterior $\hat p_\mathcal{L}(\phi \mid y) \propto p(\phi) p_\mathcal{L} (y \mid \phi)$ using any algorithm supported by Stan. -Approximate posterior draws for the latent Gaussian variables are then -obtained by first drawing $\phi \sim \hat p_\mathcal{L}(\phi \mid y)$ and -then $\theta \sim \hat p_\mathcal{L}(\theta \mid \phi, y)$. +Approximate posterior draws for the latent Gaussian variables are then obtained by first drawing $\phi \sim \hat p_\mathcal{L}(\phi \mid y)$ and then $\theta \sim \hat p_\mathcal{L}(\theta \mid \phi, y)$. ## Trade-offs of the approximation -The embedded Laplace approximation presents several trade-offs with standard -inference over the joint posterior $p(\theta, \phi \mid y)$. The main -advantage of the embedded Laplace approximation is that it side-steps the -intricate geometry of hierarchical models. The marginal posterior -$p(\phi \mid y)$ can then be handled by Hamiltonian Monte Carlo sampling -without extensive tuning or reparameterization, and the mixing time is faster, -meaning we can run shorter chains to achieve a desired precision. In some cases, -approximate methods, e.g. 
variational inference, which -work poorly on the joint $p(\theta, \phi \mid y)$ work well on the marginal -posterior $p(\phi \mid y)$. - -On the other hand, the embedded Laplace approximation presents certain -disadvantages. First, we need to perform a Laplace approximation each time -the log marginal likelihood is evaluated, meaning each iteration -can be expensive. Secondly, the approximation can introduce non-negligible -error, especially with non-conventional likelihoods (note the prior -is always multivariate normal). How these trade-offs are resolved depends on -the application; see @Margossian:2020 for some examples. +The embedded Laplace approximation presents several trade-offs with standard inference over the joint posterior $p(\theta, \phi \mid y)$. +The main advantage of the embedded Laplace approximation is that it side-steps the intricate geometry of hierarchical models. +The marginal posterior $p(\phi \mid y)$ can then be handled by Hamiltonian Monte Carlo sampling without extensive tuning or reparameterization, and the mixing time is faster, meaning we can run shorter chains to achieve a desired precision. +One additional benefit is that approximate methods, e.g., variational inference, that work poorly on the joint $p(\theta, \phi \mid y)$ can work well on the marginal posterior $p(\phi \mid y)$. + +On the other hand, the embedded Laplace approximation presents certain disadvantages. +First, we need to perform a Laplace approximation each time the log marginal likelihood is evaluated, meaning each iteration can be expensive. +Secondly, the approximation can introduce non-negligible error, especially with non-conventional likelihoods (note the prior is always multivariate normal). +How these trade-offs are resolved depends on the application; see @Margossian:2020 for some examples. 
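+As a minimal sketch of this strategy (the variable and function names are illustrative placeholders, and the signatures follow the embedded Laplace functions reference), the marginal likelihood is approximated in the `model` block and latent draws are recovered in `generated quantities`:
+
+```stan
+model {
+  alpha ~ normal(0, 1);
+  rho ~ inv_gamma(5, 5);
+  // Increment the target with the approximate log marginal likelihood
+  // log p(y | phi); theta is marginalized out by the Laplace approximation.
+  target += laplace_marginal(likelihood_function, (y, x),
+                             1,  // hessian_block_size for a diagonal Hessian
+                             covariance_function, (x, alpha, rho));
+}
+generated quantities {
+  // Draw the latent vector from the normal approximation to p(theta | y, phi).
+  vector[n] theta = laplace_latent_rng(likelihood_function, (y, x),
+                                       1,
+                                       covariance_function, (x, alpha, rho));
+}
+```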
+### When the approximation is appropriate + +The quality of the Laplace approximation depends on how close the true conditional posterior $p(\theta \mid y, \phi)$ is to Gaussian. + +**Works well.** Log-concave likelihoods, e.g., a Poisson with log link or a negative binomial with log link. +These produce unimodal conditional posteriors when combined with a Gaussian prior. +The approximation error is typically negligible for these likelihoods, especially with moderate-to-large counts [@Kuss:2005; @Vanhatalo:2010; @Cseke:2011; @Vehtari:2016]. + +**Works adequately.** Bernoulli with logit link. +The curvature information from binary observations is weaker, so the Gaussian approximation is less accurate than for count data. +The embedded Laplace approximation is still useful when $\theta$ is high-dimensional and joint sampling is infeasible; see @Vehtari:2016 and @Margossian:2020 for discussion. + +**Not appropriate.** The normal distribution is log-concave, but when the likelihood is normal, marginalization can be performed exactly and no approximation is needed. +For likelihoods that are not log-concave in $\theta$, the conditional posterior may be multimodal and the Laplace approximation will miss modes. +When $\theta$ is low-dimensional (a few dozen or fewer), the overhead of the inner optimization may not pay for itself and standard joint HMC sampling is often adequate. ## Details of the approximation +When the embedded Laplace approximation does not converge or produces unexpected results, the solver configuration may need adjustment. +This section describes the internals of the Newton solver and the options available for tuning it. + ### Tuning the Newton solver -A critical component of the embedded Laplace approximation is the Newton solver -used to estimate the mode $\theta^*$ of $p(\theta \mid \phi, y)$. 
The objective
-function being maximized is
+A critical component of the embedded Laplace approximation is the Newton solver used to estimate the mode $\theta^*$ of $p(\theta \mid \phi, y)$.
+The objective function being maximized is the log joint density of the prior and likelihood,
+
 $$
 \Psi(\theta) = \log p(\theta \mid \phi) + \log p(y \mid \theta, \phi),
 $$
-and convergence is declared if the change in the objective is sufficiently
-small between two iterations
+
+Convergence is declared when the change in the objective between successive iterations falls below a *tolerance* $\Delta$,
+
 $$
 | \Psi (\theta^{(i + 1)}) - \Psi (\theta^{(i)}) | \le \Delta,
 $$
-for some *tolerance* $\Delta$. The solver also stops after reaching a
-pre-specified *maximum number of steps*: in that case, Stan throws an exception
-and rejects the current proposal. This is not a problem, as
-long as these exceptions are rare and confined to early phases of the warmup.
-The Newton iteration can be augmented with a linesearch step to insure that
-at each iteration the objective function $\Psi$ decreases. Specifically,
-suppose that
+The solver also stops after reaching a pre-specified *maximum number of steps*.
+In that case, Stan throws a warning, but still returns the last iteration's parameters.
+If you see this warning, you should check the diagnostics to understand why the solver failed to converge.
+
+To handle cases where the Newton step does not improve the objective, the Newton iteration is augmented with a Wolfe linesearch to ensure that at each iteration the objective function $\Psi$ increases.
+Specifically, suppose the objective decreases after a Newton step, indicating the step overshot the mode,
+
 $$
 \Psi (\theta^{(i + 1)}) < \Psi (\theta^{(i)}).
 $$
-This can indicate that the Newton step is too large and that we skipped a region
-where the objective function decreases.
In that case, we can reduce the step
-length by a factor of 2, using
+
+This can indicate that the Newton step at iteration $i$ is too large and that we overshot the mode.
+In that case, we fall back to a linesearch along the Newton direction $p_i$ to find a step size $\alpha_k$ satisfying the Wolfe conditions.
+Writing $f = -\Psi$ for the negated objective, the Wolfe conditions require that an accepted step both sufficiently decreases $f$ and sufficiently flattens the slope of $f$ relative to the current position,
+
+$$
+\begin{aligned}
+f(\theta^{(i)} + \alpha_k p_i) &\le f(\theta^{(i)}) + c_1 \alpha_k \nabla f(\theta^{(i)})^T p_i, \\
+-p^T_i \nabla f(\theta^{(i)} + \alpha_k p_i) &\le -c_2 p^T_i \nabla f(\theta^{(i)}),
+\end{aligned}
+$$
+
+for constants $0 < c_1 < c_2 < 1$. Together these conditions push the algorithm towards a minimum of $f$, that is, a maximum of $\Psi$.
+In its simplest form, the linesearch reduces the step length by halving,
+
 $$
 \theta^{(i + 1)} \leftarrow \frac{\theta^{(i + 1)} + \theta^{(i)}}{2}.
 $$
-We repeat this halving of steps until
-$\Psi (\theta^{(i + 1)}) \ge \Psi (\theta^{(i)})$, or until a maximum number
-of linesearch steps is reached. By default, this maximum is set to 0, which
-means the Newton solver performs no linesearch. For certain problems, adding
-a linsearch can make the optimization more stable.
+We repeat this halving of steps until $\Psi (\theta^{(i + 1)}) \ge \Psi (\theta^{(i)})$, or until a maximum number of linesearch steps is reached.
+For certain problems, adding a linesearch can make the optimization and solver more stable.
+
+### Solver Strategies

-The embedded Laplace approximation uses a custom Newton solver, specialized
-to find the mode of $p(\theta \mid \phi, y)$.
-A keystep for efficient optimization is to insure all matrix inversions are
-numerically stable. This can be done using the Woodburry-Sherman-Morrison
-formula and requires one of three matrix decompositions:
+The embedded Laplace approximation uses a custom Newton solver, specialized to find the mode of $p(\theta \mid \phi, y)$.
+A key step for efficient optimization is to ensure all matrix inversions are numerically stable.
+This can be done using the Sherman-Morrison-Woodbury formula and requires one of three matrix decompositions:

-1. Cholesky decomposition of the Hessian of the negative log likelihood
-$W = - \partial^2_\theta \log p(y \mid \theta, \phi)$
+1. Cholesky decomposition of the Hessian of the negative log likelihood $W = - \partial^2_\theta \log p(y \mid \theta, \phi)$.
 2. Cholesky decomposition of the prior covariance matrix $K(\phi)$.
 3. LU-decomposition of $I + KW$, where $I$ is the identity matrix.

-The first solver (1) should be used if the negative log likelihood is
-positive-definite. Otherwise the user should rely on (2). In rarer cases where
-it is not numerically safe to invert the covariance matrix $K$, users can
-use the third solver as a last-resort option.
+The first solver (1) should be used if the Hessian of the negative log likelihood, $W$, is positive-definite.
+Otherwise the user should rely on (2).
+In rarer cases where it is not numerically safe to invert the covariance matrix $K$, users can use the third solver as a last-resort option.

 ### Sparse Hessian of the log likelihood

-A key step to speed up computation is to take advantage of the sparsity of
-the Hessian of the log likelihood,
+A key step to speed up computation is to take advantage of the sparsity of $H$, the Hessian of the log likelihood with respect to the latent variables,
 $$
 H = \frac{\partial^2}{\partial \theta^2} \log p(y \mid \theta, \phi).
 $$
-For example, if the observations $(y_1, \cdots, y_N)$ are conditionally
-independent and each depends on only depend on one component of $\theta$,
-such that
+For example, if the observations $(y_1, \cdots, y_N)$ are conditionally independent and each depends on only one component of $\theta$, the log likelihood decomposes into a sum of per-observation terms,
 $$
 \log p(y \mid \theta, \phi) = \sum_{i = 1}^N \log p(y_i \mid \theta_i, \phi),
 $$
-then the Hessian is diagonal. This leads to faster calculations of the Hessian
-and subsequently sparse matrix operations.
This case is common in Gaussian
-process models, and certain hierarchical models.
+and the Hessian is diagonal.
+This leads to faster calculations of the Hessian and subsequently sparse matrix operations.
+This case is common in Gaussian process models and certain hierarchical models.

-Stan's suite of functions for the embedded Laplace approximation are not
-equipped to handle arbitrary sparsity structures; instead, they work on
-block-diagonal Hessians, and the user can specify the size $B$ of these blocks.
-The user is responsible for working out what $B$ is. If the Hessian is dense,
-then we simply set $B = N$.
+Stan's suite of functions for the embedded Laplace approximation exploits block-diagonal structure in the Hessian, where the user specifies the block size $B$.
+The user is responsible for working out what $B$ is.
+If the Hessian is dense, then we simply set $B = N$; the diagonal case above corresponds to $B = 1$.
+Arbitrary sparsity patterns beyond block-diagonal structure are not currently supported.

-NOTE: currently, there is no support for sparse prior covariance matrix.
-We expect this to be supported in future versions of Stan.
diff --git a/src/stan-users-guide/gaussian-processes.qmd b/src/stan-users-guide/gaussian-processes.qmd
index 58f05db3b..ffbde0880 100644
--- a/src/stan-users-guide/gaussian-processes.qmd
+++ b/src/stan-users-guide/gaussian-processes.qmd
@@ -476,7 +476,19 @@ parameters {
   vector[N] eta;
 }
 model {
-  // ...
+  vector[N] f;
+  {
+    matrix[N, N] L_K;
+    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);
+
+    // add a small jitter delta to the diagonal for numerical stability
+    for (n in 1:N) {
+      K[n, n] = K[n, n] + delta;
+    }
+
+    L_K = cholesky_decompose(K);
+    f = L_K * eta;
+  }
   rho ~ inv_gamma(5, 5);
   alpha ~ std_normal();
   a ~ std_normal();
@@ -510,7 +522,7 @@
 the marginal likelihood as follows:
 $$
 \hat p_\mathcal{L}(y \mid \rho, \alpha, a)
  = \frac{p(f^* \mid \alpha, \rho) p(y \mid f^*, a)}{
-       \hat p_\mathcal{L}(f \mid \rho, \alpha, a, y)},
+       \hat p_\mathcal{L}(f^* \mid \rho, \alpha, a, y)},
 $$
 where $f^*$ is the mode of $p(f \mid \rho, \alpha, a, y)$, obtained via
 numerical optimization.
@@ -533,7 +545,24 @@ functions {
 }
 ```

-We then increment `target` in the model block with the approximation to
+The embedded Laplace approximation relies on calculations of the log
+likelihood's Hessian, $\partial^2 \log p(y \mid f, a) / \partial f^2$, and
+these calculations can be much faster when the Hessian is sparse. In
+particular, it is expected that the Hessian is block diagonal. In the
+`transformed data` block we can specify the block size of the Hessian.
+```stan
+transformed data {
+  int hessian_block_size = 1;
+}
+```
+For example, if $y_i$ depends only on $f_i$, then the Hessian of the log
+likelihood is diagonal and the block size is 1. On the other hand, if the
+Hessian is not sparse, then we set the Hessian block size
+to $N$, where $N$ is the dimension of $f$. Currently, Stan does not check
+the block size of the Hessian and so the user is responsible for correctly
+specifying the block size.
+
+Finally, we increment `target` in the model block with the approximation to
 $\log p(y \mid \rho, \alpha, a)$.
```stan
model {
@@ -541,7 +570,7 @@
   alpha ~ std_normal();
   sigma ~ std_normal();

-  target += laplace_marginal(ll_function, (a, y),
+  target += laplace_marginal(ll_function, (a, y), hessian_block_size,
                              cov_function, (rho, alpha, x, N, delta));
 }
 ```
@@ -549,51 +578,56 @@
 Notice that we do not need to construct $f$ explicitly, since it is
 marginalized out. Instead, we recover the GP function in `generated quantities`:
 ```stan
 generated quantities {
-  vector[N] f = laplace_latent_rng(ll_function, (a, y),
+  vector[N] f = laplace_latent_rng(ll_function, (a, y), hessian_block_size,
                                    cov_function, (rho, alpha, x, N, delta));
 }
 ```
 Users can set the control parameters of the embedded Laplace approximation,
 via `laplace_marginal_tol` and `laplace_latent_tol_rng`. When using these
-functions, the user must set *all* the control parameters.
+functions, the user must set *all* the control options and store them in a tuple.
+These control parameters mostly concern the numerical optimizer used to find
+the mode $f^*$ of $p(f \mid \rho, \alpha, a, y)$.
 ```stan
 transformed data {
-// ...
+  tuple(vector[N], real, int, int, int, int) laplace_ops;
+  laplace_ops.1 = rep_vector(0, N);  // starting point for Laplace optimizer
+  laplace_ops.2 = 1.49e-8;  // tolerance for optimizer
+  laplace_ops.3 = 500;  // maximum number of steps for optimizer
+  laplace_ops.4 = 1;  // solver type (1, 2, or 3)
+  laplace_ops.5 = 1000;  // maximum number of steps for linesearch
+  laplace_ops.6 = 1;  // allow_fallback (1: TRUE, 0: FALSE)
+}
+```
+If users want to depart from the defaults for only some of the control
+parameters, a tuple with the default values (as above) can be created with the
+helper function `generate_laplace_options()`, and the specific control
+parameter can then be modified,
+```stan
+transformed data {
+  tuple(vector[N], real, int, int, int, int) laplace_ops =
+    generate_laplace_options(N);

-  vector[N] f_init = rep_vector(0, N); // starting point for optimizer.
-  real tol = 1e-6; // optimizer's tolerance for Laplace approx.
-  int max_num_steps = 1e3; // maximum number of steps for optimizer.
-  int hessian_block_size = 1; // when hessian of log likelihood is block
-                              // diagonal, size of block (here 1).
-  int solver = 1; // which Newton optimizer to use; default is 1,
-                  // use 2 and 3 only for special cases.
-  max_steps_linesearch = 0; // if >= 1, optimizer does a lineseach with
-                            // specified number of steps.
+  laplace_ops.2 = 1e-6;  // make the optimizer's tolerance less strict
 }
-
-// ...
-
+```
+The tuple `laplace_ops` is then passed to `laplace_marginal_tol`
+and `laplace_latent_tol_rng`.
+```stan
 model {
   // ...
-  target += laplace_marginal(ll_function, (a, y),
-                             cov_function, (rho, alpha, x, N, delta),
-                             f_init, tol, max_num_steps, hessian_block_size,
-                             solver, max_steps_linesearch);
+  target += laplace_marginal_tol(ll_function, (a, y), hessian_block_size,
+                                 cov_function, (rho, alpha, x, N, delta),
+                                 laplace_ops);
 }

 generated quantities {
-  vector[N] f = laplace_latent_rng(ll_function, (a, y),
+  vector[N] f = laplace_latent_tol_rng(ll_function, (a, y), hessian_block_size,
                                    cov_function, (rho, alpha, x, N, delta),
-                                   f_init, tol, max_num_steps,
-                                   hessian_block_size, solver,
-                                   max_steps_linesearch);
+                                   laplace_ops);
 }
 ```

-For details about the control parameters, see @Margossian:2023.
-
 Stan also provides support for a limited menu of built-in functions, including
-the Poisson distribution with a log link and and prior mean $m$. When using such
+the Poisson distribution with a log link and prior mean $m$. When using such
@@ -696,13 +730,19 @@ functions {

 // ...

+transformed data {
+  int hessian_block_size = 1;
+}
+
+// ...
+
 model {
-  target += laplace_marginal(ll_function, (a, z),
+  target += laplace_marginal(ll_function, (a, z), hessian_block_size,
                              cov_function, (rho, alpha, x, N, delta));
 }

 generated quantities {
-  vector[N] f = laplace_latent_rng(ll_function, (a, z),
+  vector[N] f = laplace_latent_rng(ll_function, (a, z), hessian_block_size,
                                    cov_function, (rho, alpha, x, N, delta));
 }
 ```
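The diagonal-Hessian structure that justifies `hessian_block_size = 1` in the examples above can be checked numerically. The sketch below is illustrative Python, not part of Stan; it finite-differences a conditionally independent Poisson log-link log likelihood and confirms that the off-diagonal Hessian entries vanish while the diagonal entries equal $-\exp(\theta_i)$:

```python
import math

def log_lik(theta, y):
    # Conditionally independent Poisson observations with a log link:
    # log p(y | theta) = sum_i [ y_i * theta_i - exp(theta_i) - log(y_i!) ]
    return sum(yi * ti - math.exp(ti) - math.lgamma(yi + 1.0)
               for yi, ti in zip(y, theta))

def hessian(f, x, eps=1e-5):
    # Dense central-difference Hessian; fine for this tiny check.
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                z = list(x)
                z[i] += si * eps
                z[j] += sj * eps
                return f(z)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * eps * eps)
    return H

y = [3, 0, 5]
theta = [0.7, -0.2, 1.1]
H = hessian(lambda t: log_lik(t, y), theta)
# Off-diagonal entries vanish and diagonal entries are -exp(theta_i),
# so the Hessian block size B is 1 for this likelihood.
for i in range(3):
    for j in range(3):
        expected = -math.exp(theta[i]) if i == j else 0.0
        assert abs(H[i][j] - expected) < 1e-4
```

If each observation instead depended on a group of $B$ latent variables, the same check would show dense $B \times B$ blocks on the diagonal, and the block size passed to the `laplace_marginal` functions would be $B$.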