---
title: "Using custom priors, likelihood, or movements in outbreaker2"
author: "Thibaut Jombart"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Using custom priors, likelihood, or movements in outbreaker2}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 5,
  fig.path = "figs-customisation/"
)
```
In this vignette, we show how custom functions for priors, likelihood, or
movement of parameters and augmented data can be used in *outbreaker2*. In all
cases, the process is similar:

1. write your own function with the right arguments
2. pass this function as an argument to a `custom...` function
3. pass the result to *outbreaker2*
Note that steps 2-3 can be merged into a single step by passing the function
directly to the arguments of `outbreaker`. Also note that **all priors and
likelihoods are expected on a log scale**. Finally, note that while the various
`custom...` functions will try, to some extent, to check that the provided
functions are valid, such tests are very difficult to implement. In short: you
are using these custom features at your own risk - make sure these functions
work before passing them to *outbreaker2*.
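As a minimal sketch of this workflow (the names `my_pi_prior` and `my_data` are
hypothetical, and the chunk is not evaluated):

```{r, workflow_sketch, eval = FALSE}
library(outbreaker2)

## step 1: a custom log-prior for 'pi', taking an `outbreaker_param`
## list and returning a value on the log scale
my_pi_prior <- function(param) {
  dbeta(param$pi, shape1 = 2, shape2 = 1, log = TRUE)
}

## step 2: wrap it using custom_priors()
my_priors <- custom_priors(pi = my_pi_prior)

## step 3: pass the result to outbreaker() alongside the data
## (`my_data` stands for an outbreaker_data() object)
## res <- outbreaker(data = my_data, priors = my_priors)
```
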
# Customising priors
Priors of *outbreaker2* must be a function of an `outbreaker_param` list (see
`?outbreaker_param`). Here, we decide to use a step function rather than the
default Beta function as a prior for *pi*, the reporting probability, and a flat
prior between 0 and 1 for the mutation rate (which is technically a probability
in the basic genetic model used in *outbreaker2*).
We start by defining two functions: an auxiliary function `f`, which returns
values on the natural scale and which we can use to plot the prior
distribution, and then a function `f_pi`, which will be used for the
customisation.
```{r, f_pi}
f <- function(pi) {
  ifelse(pi < 0.8, 0, 5)
}

f_pi <- function(param) {
  log(f(param$pi))
}

plot(f, type = "s", col = "blue",
     xlab = expression(pi), ylab = expression(p(pi)),
     main = expression(paste("New prior for ", pi)))
```
While `f` is a useful function to visualise the prior, `f_pi` is the function
which will be passed to `outbreaker`. To do so, we pass it to `custom_priors`:
```{r, custom_prior}
library(outbreaker2)
f_mu <- function(param) {
  if (param$mu < 0 || param$mu > 1) {
    return(-Inf)
  } else {
    return(0.0)
  }
}

priors <- custom_priors(pi = f_pi, mu = f_mu)
priors
```
Note that `custom_priors` does more than just add the custom function to a
list. For instance, the following customisations are all wrong, and are
rightfully rejected:
```{r, wrong_prior, error = TRUE}
## wrong: not a function
## should be pi = function(x){0.0}
custom_priors(pi = 0.0)
## wrong: two arguments
custom_priors(pi = function(x, y){0.0})
```
We can now use the new priors to run `outbreaker` on the `fake_outbreak` data
(see [*introduction vignette*](introduction.html)):
```{r, run_custom_priors}
dna <- fake_outbreak$dna
dates <- fake_outbreak$sample
w <- fake_outbreak$w
data <- outbreaker_data(dna = dna, dates = dates, w_dens = w)
## we set the seed to ensure results won't change
set.seed(1)
res <- outbreaker(data = data, priors = priors)
```
We can check the results first by looking at the traces, and then by plotting
the posterior distributions of `pi` and `mu`, respectively:
```{r, traces_custom_priors}
plot(res)
plot(res, "pi", burnin = 500)
plot(res, "mu", burnin = 500)
plot(res, "pi", type = "density", burnin = 500)
plot(res, "mu", type = "hist", burnin = 500)
```
Note that we use a density plot and a histogram here for illustrative purposes
only; there is no reason to prefer one or the other for a specific parameter.
Interestingly, the trace of `pi` suggests that the MCMC oscillates between two
different states, at either bound of the interval on which the prior is
positive (the log-prior is `-Inf` outside (0.8; 1)). This may be a consequence
of the step function, which creates sharp 'cliffs' in the posterior landscape.
What should one do to derive good samples from the posterior distribution in
this kind of situation? There are several options, which in fact apply to
typical cases of multi-modal posterior distributions:

- Avoid 'cliffs', i.e. sharp drops in the posterior landscape, typically
  created by using step functions in likelihoods and priors.
- Use larger samples, i.e. run more MCMC iterations (see the sketch after this
  list).
- Use a different sampler, better suited than Metropolis-Hastings to sampling
  from multi-modal distributions.
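For instance, running a longer, more heavily thinned chain is only a matter of
configuration; a sketch (not evaluated here; the numbers are arbitrary):

```{r, longer_run_sketch, eval = FALSE}
## run a longer chain and thin it more aggressively to reduce the
## autocorrelation of the retained samples; numbers are arbitrary
long_config <- create_config(n_iter = 5e4, sample_every = 100)
res_long <- outbreaker(data = data, config = long_config, priors = priors)
```
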
Because we know what the real transmission tree is for this dataset, we can
assess how the new priors impacted the inference of the transmission tree.
```{r, tree_custom_priors}
summary(res, burnin = 500)
tree <- summary(res, burnin = 500)$tree
comparison <- data.frame(case = 1:30,
                         inferred = paste(tree$from),
                         true = paste(fake_outbreak$ances),
                         stringsAsFactors = FALSE)
comparison$correct <- comparison$inferred == comparison$true
comparison
mean(comparison$correct)
```
# Customising the likelihood
Customising the likelihood works identically to customising priors. The only
difference is that custom likelihood functions take two arguments (`data` and
`param`) instead of one. The function used to specify custom likelihoods is
`custom_likelihoods`. Each custom function will correspond to a specific
likelihood component:
```{r, likelihood_components}
custom_likelihoods()
```
See `?custom_likelihoods` for details of these components, and the section
'Extending the model' for adding new components. As for `custom_priors`, a few
checks are performed by `custom_likelihoods`:
```{r, wrong_likelihood, error = TRUE}
## wrong: not a function
custom_likelihoods(genetic = "fubar")
## wrong: only one argument
custom_likelihoods(genetic = function(x){ 0.0 })
```
A trivial customisation is to disable some or all of the likelihood components
of the model by returning a finite constant. Here, we apply this to two cases:
first, we will disable all likelihood components as a sanity check, making sure
that the transmission tree landscape is explored freely by the MCMC. Second, we
will recreate the [Wallinga & Teunis (2004)](http://dx.doi.org/10.1093/aje/kwh255)
model, by disabling specific components.
## A null model
```{r, null_model}
f_null <- function(data, param) {
  return(0.0)
}

null_model <- custom_likelihoods(genetic = f_null,
                                 timing_sampling = f_null,
                                 timing_infections = f_null,
                                 reporting = f_null,
                                 contact = f_null)
null_model
```
We also specify settings via the `config` argument, so as to disable the
detection of imported cases, reduce the number of iterations, and retain a
sample at every iteration:
```{r, run_null_model, cache = TRUE}
null_config <- list(find_import = FALSE,
                    n_iter = 500,
                    sample_every = 1)

set.seed(1)

res_null <- outbreaker(data = data,
                       config = null_config,
                       likelihoods = null_model)
```
```{r, res_null_model}
plot(res_null)
plot(res_null, "pi")
plot(res_null, "mu")
```
By typical MCMC standards, these traces look appalling: they have not reached
stationarity (i.e. the same mean and variance over time), and are grossly
autocorrelated in parts. Fair enough, as these are only the first 500
iterations of the MCMC, so that autocorrelation is expected. In fact, what we
observe here is literally the random walk across the posterior landscape, which
in this case is shaped only by the priors.
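As a quick, informal check of this autocorrelation, we can inspect the
autocorrelation functions of the raw traces (a simple sketch using base R's
`acf`):

```{r, null_acf}
## autocorrelation of the raw traces of the null model: strong, slowly
## decaying autocorrelation is expected for such a short, unthinned run
acf(res_null$pi, main = "Autocorrelation of pi (null model)")
acf(res_null$mu, main = "Autocorrelation of mu (null model)")
```
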
We can check that transmission trees are indeed freely explored:
```{r, null_trees}
plot(res_null, type = "alpha")
```
Do **not** try to render the corresponding network using `plot(..., type =
"network")`, as the force-directed layout algorithm will go insane. However, this
network can be visualised using *igraph*, extracting the edges and nodes from
the plot (without displaying it):
```{r, null_net}
## extract nodes and edges from the visNetwork object
temp <- plot(res_null, type = "network", min_support = 0)
class(temp)
head(temp$x$edges)
head(temp$x$nodes)
## make an igraph object
library(igraph)
net_null <- graph_from_data_frame(temp$x$edges,
                                  vertices = temp$x$nodes[1:4])

plot(net_null, layout = layout_in_circle,
     main = "Null model, posterior trees")
```
We can derive similar diagnostics for the number of generations between cases
(`kappa`), constrained only by the default settings to lie between 1 and 5, and
for the infection dates (*t_inf*):
```{r, res_null_diag}
plot(res_null, type = "kappa")
plot(res_null, type = "t_inf")
```
Finally, we can verify that the distributions of `mu` and `pi` match their
priors, respectively an exponential distribution with rate 1000 and a Beta with
parameters 10 and 1. Here, we get a qualitative assessment by comparing the
observed distributions (histograms) to the densities of similarly sized random
samples from the priors:
```{r, res_null_priors}
par(xpd = TRUE)

hist(res_null$mu, prob = TRUE, col = "grey",
     border = "white",
     main = "Distribution of mu")

invisible(replicate(30,
    points(density(rexp(500, 1000)), type = "l", col = "blue")))

hist(res_null$pi, prob = TRUE, col = "grey",
     border = "white", main = "Distribution of pi")

invisible(replicate(30,
    points(density(rbeta(500, 10, 1)), type = "l", col = "blue")))
```
## A model using symptom onset only
We can use data and likelihood customisation to change the default *outbreaker2*
model into a [Wallinga & Teunis (2004)](http://dx.doi.org/10.1093/aje/kwh255)
model. To do so, we need to:
- Remove the DNA sequences from the data; alternatively we could also specify a
'null' function (i.e. returning a finite constant, as above) for the genetic
likelihood.
- Disable all likelihood components other than `timing_infections` using
`custom_likelihoods`. This means that the dates provided will be treated as
dates of symptom onset, and the timing distribution `w` will be taken as the
serial interval.
- Disable the detection of imported cases, and force all `kappa` values to be
  1.
While these are fairly major changes, they are straightforward to implement. We
first create the dataset and custom likelihood functions:
```{r, wt_custom}
onset_data <- outbreaker_data(dates = fake_outbreak$onset,
                              w_dens = fake_outbreak$w)

wt_model <- custom_likelihoods(timing_sampling = f_null,
                               reporting = f_null)
```
To fix parameters or augmented data (here, fix all `kappa` values to 1), we set
the initial states to the desired values and disable the corresponding moves:
```{r, wt_config}
wt_config <- create_config(init_kappa = 1, move_kappa = FALSE,
                           init_pi = 1, move_pi = FALSE,
                           move_mu = FALSE)
```
We can now run the analyses for this new model:
```{r, run_wt, cache = TRUE}
set.seed(1)
res_wt <- outbreaker(data = onset_data,
                     config = wt_config,
                     likelihoods = wt_model)
```
```{r, res_wt}
plot(res_wt)
plot(res_wt, burnin = 500)
plot(res_wt, burnin = 500, type = "alpha")
summary(res_wt)
```
As before for the 'null' model, the transmission tree is very poorly resolved in
this case. We use the same approach to visualise it: extract nodes and edges
from the `visNetwork` object, use this information to create an `igraph` object,
and visualise the result using a circular layout:
```{r, wt_net}
## extract nodes and edges from the visNetwork object
temp <- plot(res_wt, type = "network", min_support = 0.05)
class(temp)
head(temp$x$edges)
head(temp$x$nodes)
## make an igraph object
net_wt <- graph_from_data_frame(temp$x$edges,
                                vertices = temp$x$nodes[1:4])

plot(net_wt, layout = layout_in_circle,
     main = "WT model, posterior trees")
```
# Customising movements
Customising movements works in a similar way to customising priors and
likelihoods. In practice, this type of customisation is more complex, as it
will most likely require evaluating likelihoods and priors, and therefore
require the user to know which functions to call, and how. These are documented
in the [API vignette](Rcpp_API.html). In the following, we provide two
examples:

- a (fake) Gibbs sampler for the movement of the mutation rate `mu`
- a new 'naive' movement of ancestries in which infectors are picked at random
from all cases
But before getting into these, we need to explain how movements happen in
*outbreaker2*.
## Movements in *outbreaker2*
At the core of the `outbreaker` function, movements are implemented as a list of
functions, which are all evaluated in turn during every iteration of the
MCMC. All movement functions must obey two rules:
- The first argument must be an `outbreaker_param` object (typically called
  `param` in the original code); see `?create_param` for details.
- All movement functions must return a valid `outbreaker_param` object (see the
  minimal skeleton after this list).
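As an illustration of these two rules, here is a minimal, do-nothing movement
function (the name `move_nothing` is hypothetical, and the chunk is not
evaluated):

```{r, move_skeleton, eval = FALSE}
## minimal skeleton of a custom movement: it takes an `outbreaker_param`
## object as its first argument and must return one; this particular
## move leaves all parameters and augmented data unchanged
move_nothing <- function(param) {
  return(param)
}
```
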
However, a new difficulty compared to prior or likelihood customisation is that
different movements may require different components of the model, and a
different set of arguments after `param`. In fact, this can be seen by examining
the arguments of all the default movement functions:
```{r, move_defaults}
lapply(custom_moves(), args)
```
To handle this difficulty, *outbreaker2* transforms every movement function
before running the MCMC into a new function with a single parameter `param`,
attaching all other required arguments to the function's environment. The
function achieving this transformation is called `bind_moves`. This function
'knows' what these components are for the known moves listed above. For new,
unknown moves, it attaches the following components, provided they are used as
arguments of the new function:

- `data`: the processed data; see `?outbreaker_data`
- `config`: the configuration list; see `?create_config`
- `likelihoods`: a list of custom likelihood functions; see
`?custom_likelihoods`
- `priors`: a list of custom prior functions; see `?custom_priors`
See examples in `?bind_moves` for details of how this process works.
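For instance, a new move declaring `data` and `config` as extra arguments would
have these two objects attached to its environment; a sketch (the name
`move_example` is hypothetical, and the chunk is not evaluated):

```{r, move_data_sketch, eval = FALSE}
## hypothetical custom move with extra arguments: because `data` and
## `config` are named arguments, bind_moves() will attach the processed
## data and the configuration list to the function's environment
move_example <- function(param, data, config) {
  ## e.g. use data$N (number of cases) or settings in config here
  return(param)
}
```
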
## A (fake) Gibbs sampler for `mu`
A Gibbs sampler assumes that the conditional distribution of a parameter is
known and can be sampled from directly. Here, we use this principle to provide
a toy example of a custom movement for `mu`, assuming that this conditional
distribution is always an exponential distribution with rate 1000. This is easy
to implement; to make sure that the function is actually used, we set a global
variable which is changed whenever the function is called.
```{r, custom_move_mu}
move_mu <- function(param, config) {
  NEW_MOVE_HAS_BEEN_USED <<- TRUE
  param$mu <- rexp(1, 1000)
  return(param)
}
moves <- custom_moves(mu = move_mu)
quick_config <- list(n_iter = 500, sample_every = 1, find_import = FALSE)
```
Note that the new movement function `move_mu` has two arguments, and that we do
not specify `config` ourselves. Internally, what happens is:
```{r, bind_moves}
## bind quick_config to function
move_mu_intern <- bind_to_function(move_mu, config = quick_config)
## new function has just one argument
move_mu_intern
## 'config' is in the function's environment
names(environment(move_mu_intern))
## 'config' is actually 'quick_config'
identical(environment(move_mu_intern)$config, quick_config)
```
We perform a quick run using this new movement:
```{r, run_custom_move_mu}
NEW_MOVE_HAS_BEEN_USED <- FALSE
set.seed(1)
res_move_mu <- outbreaker(data, quick_config, moves = moves)
NEW_MOVE_HAS_BEEN_USED
plot(res_move_mu)
plot(res_move_mu, "pi")
plot(res_move_mu, "mu")
```
This short, unthinned trace clearly has not mixed well (as is to be
expected). But while we see the effect of accept/reject sampling (Metropolis
algorithm) for `pi`, with a lot of autocorrelation, the trace of `mu` shows
complete independence between successive values. While the Gibbs sampler used
here is not correct, the general result is: a Gibbs sampler will be more
efficient than the classical Metropolis(-Hastings) algorithm for a given number
of iterations.
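One informal way to see this contrast is to compare the autocorrelation of the
two traces (a quick sketch using base R's `acf` on the raw chains):

```{r, gibbs_acf}
## autocorrelation of the traces: 'pi' (Metropolis accept/reject) vs
## 'mu' (Gibbs-like move); the latter should show essentially none
acf(res_move_mu$pi, main = "Autocorrelation of pi")
acf(res_move_mu$mu, main = "Autocorrelation of mu")
```
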
## A new movement of ancestries
Moves of ancestries are done in two ways in *outbreaker2*: by picking ancestors
at random from any prior case, and by swapping cases on a transmission
link. Here, we implement a new move, which proposes infectors that were
infected on the same day as the current infector. As before, we will use global
variables to keep track of the resulting movements (see `N_ACCEPT` and
`N_REJECT`).
```{r, new_move_ances}
api <- get_cpp_api()
new_move_ances <- function(param, data, custom_likelihoods = NULL) {

  for (i in 1:data$N) {
    current_ances <- param$alpha[i]
    if (!is.na(current_ances)) {

      ## find cases infected on the same day as the current infector
      current_t_inf <- param$t_inf[current_ances]
      pool <- which(param$t_inf == current_t_inf)
      pool <- setdiff(pool, i)

      if (length(pool) > 0) {

        ## propose a new infector; note that `sample(pool, 1)` would
        ## misbehave when `pool` has length 1, hence `sample.int`
        current_ll <- api$cpp_ll_all(data, param, i = i, custom_likelihoods)
        param$alpha[i] <- pool[sample.int(length(pool), 1)]
        new_ll <- api$cpp_ll_all(data, param, i = i, custom_likelihoods)

        ## likelihood ratio - no correction, the move is symmetric
        ratio <- exp(new_ll - current_ll)

        ## accept / reject
        if (runif(1) <= ratio) { # accept
          N_ACCEPT <<- N_ACCEPT + 1
        } else { # reject: restore the previous infector
          N_REJECT <<- N_REJECT + 1
          param$alpha[i] <- current_ances
        }
      }
    }
  }

  return(param)
}
moves <- custom_moves(new_move = new_move_ances)
```
We can now use this new move in our transmission tree reconstruction, disabling
the moves of `kappa`; note that this new move, being written in R, is likely to
be computationally intensive.
```{r, run_new_move, cache = TRUE}
N_ACCEPT <- 0
N_REJECT <- 0
set.seed(1)
res_new_move <- outbreaker(data, list(move_kappa = FALSE), moves = moves)
N_ACCEPT
N_REJECT
```
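From these counters, we can also compute the overall acceptance rate of the new
ancestry move (a simple sketch):

```{r, new_move_acceptance}
## proportion of proposals accepted by the new ancestry move
N_ACCEPT / (N_ACCEPT + N_REJECT)
```
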
```{r, res_new_move}
plot(res_new_move)
plot(res_new_move, type = "alpha")
summary(res_new_move)
```
Results show a switch to a new mode at about 5,000 iterations. Let us compare
the consensus tree to the actual one (stored in `fake_outbreak$ances`):
```{r, check_new_move}
summary(res_new_move, burnin = 5000)
tree2 <- summary(res_new_move, burnin = 5000)$tree
comparison <- data.frame(case = 1:30,
                         inferred = paste(tree2$from),
                         true = paste(fake_outbreak$ances),
                         stringsAsFactors = FALSE)
comparison$correct <- comparison$inferred == comparison$true
comparison
mean(comparison$correct)
```