---
title: "Introduction to outbreaker2"
author: "Thibaut Jombart"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: true
toc_depth: 2
vignette: >
%\VignetteIndexEntry{Introduction to outbreaker2}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width=8,
fig.height=5,
fig.path="figs-introduction/"
)
```
This tutorial provides a worked example of outbreak reconstruction using
*outbreaker2*. For installation guidelines, a general overview of the package's
functionalities as well as other resources, see the 'overview' vignette:
```{r, eval=FALSE}
vignette("Overview", package = "outbreaker2")
```
We will be analysing a small simulated outbreak distributed with the package,
`fake_outbreak`. This dataset contains simulated dates of onsets, partial
contact tracing data and pathogen genome sequences for 30 cases:
```{r, data}
library(ape)
library(outbreaker2)
col <- "#6666cc"
fake_outbreak
```
Here, we will use the dates of case isolation `$sample`, DNA sequences `$dna`, contact tracing data `$ctd` and the empirical distribution of the generation time `$w`, which can be visualised as:
```{r, w}
plot(fake_outbreak$w, type = "h", xlim = c(0, 5),
lwd = 30, col = col, lend = 2,
xlab = "Days after infection",
ylab = "p(new case)",
main = "Generation time distribution")
```
# Running the analysis with defaults
By default, *outbreaker2* uses the settings defined by `create_config()`; see
the documentation of this function for details. Note that the main function of
*outbreaker2* is called `outbreaker` (without number). The function's arguments are:
```{r}
args(outbreaker)
```
The only mandatory input really is the data. For most cases, customising the
method will be done through `config` and the function `create_config()`, which
creates default and alters settings such as prior parameters, length and rate of
sampling from the MCMC, and definition of which parameters should be estimated
('moved'). The last arguments of `outbreaker` are used to specify custom prior,
likelihood, and movement functions, and are detailed in the '*Customisation*'
vignette.
Let us run the analysis with default settings:
```{r, first_run, cache = TRUE}
dna <- fake_outbreak$dna
dates <- fake_outbreak$sample
ctd <- fake_outbreak$ctd
w <- fake_outbreak$w
data <- outbreaker_data(dna = dna, dates = dates, ctd = ctd, w_dens = w)
## we set the seed to ensure results won't change
set.seed(1)
res <- outbreaker(data = data)
```
This analysis will take around 40 seconds on a modern computer. Note that
*outbreaker2* is slower than *outbreaker* for the same number of iterations, but
the two implementations are actually different. In particular, *outbreaker2*
performs many more moves than the original package for each iteration of the
MCMC, resulting in more efficient mixing. In short: *outbreaker2* is slower, but
it requires far less iterations.
Results are stored in a `data.frame` with the special class `outbreaker_chains`:
```{r}
class(res)
dim(res)
res
```
Each row of `res` contains a sample from the MCMC. For each, informations about
the step (iteration of the MCMC), log-values of posterior, likelihood and
priors, and all parameters and augmented data are returned. Ancestries
(i.e. indices of the most recent ancestral case for a given case), are indicated
by `alpha_[index of the case]`, dates of infections by `t_inf_[index of the
case]`, and number of generations between cases and their infector / ancestor by
`kappa_[index of the case]`:
```{r}
names(res)
```
# Analysing the results
## Graphics
Results can be visualised using `plot`, which has several options and can be
used to derive various kinds of graphics (see `?plot.outbreaker_chains`). The
basic plot shows the trace of the log-posterior values, which is useful to
assess mixing:
```{r, basic_trace}
plot(res)
```
The second argument of `plot` can be used to visualise traces of any
other column in `res`:
```{r, traces}
plot(res, "prior")
plot(res, "mu")
plot(res, "t_inf_15")
```
`burnin` can be used to discard the first iterations prior to mixing:
```{r, basic_trace_burn}
## compare this to plot(res)
plot(res, burnin = 2000)
```
`type` indicates the type of graphic to plot; roughly:
- `trace` for traces of the MCMC (default)
- `hist`, `density` to assess distributions of quantitative values
- `alpha`, `network` to visualise ancestries / transmission tree; note that
`network` opens up an interactive plot and requires a web browser with
Javascript enabled; the argument `min_support` is useful to select only the
most supported ancestries and avoid displaying too many links
- `kappa` to visualise the distributions generations between cases and their
ancestor / infector
Here are a few examples:
```{r, many_plots}
plot(res, "mu", "hist", burnin = 2000)
plot(res, "mu", "density", burnin = 2000)
plot(res, type = "alpha", burnin = 2000)
plot(res, type = "t_inf", burnin = 2000)
plot(res, type = "kappa", burnin = 2000)
plot(res, type = "network", burnin = 2000, min_support = 0.01)
```
## Using `summary`
The summary of results derives various distributional statistics for posterior,
likelihood and prior densities, as well as for the quantitative parameters. It
also builds a consensus tree, by finding for each case the most frequent
infector / ancestor in the posterior samples. The corresponding frequencies are
reported as 'support'. The most frequent value of kappa is also reported as 'generations':
```{r, summary}
summary(res)
```
# Customising settings and priors
As said before, most customisation can be achieved via `create_config`.
In the following, we make the following changes to the defaults:
- increase the number of iterations to 30,000
- set the sampling rate to 20
- use a star-like initial tree
- disable to movement of `kappa`, so that we assume that all cases have
observed
- set a lower rate for the exponential prior of `mu` (10 instead of 1000)
```{r, config2, cache = TRUE}
config2 <- create_config(n_iter = 3e4,
sample_every = 20,
init_tree ="star",
move_kappa = FALSE,
prior_mu = 10)
set.seed(1)
res2 <- outbreaker(data, config2)
plot(res2)
plot(res2, burnin = 2000)
```
We can see that the burnin is around 2,500 iterations (i.e. after the initial
step corresponding to a local optimum). We get the consensus tree from the new
results, and compare the inferred tree to the actual ancestries stored in the
dataset (`fake_outbreak$ances`):
```{r, res2}
summary(res2, burnin = 3000)
tree2 <- summary(res2, burnin = 3000)$tree
comparison <- data.frame(case = 1:30,
inferred = paste(tree2$from),
true = paste(fake_outbreak$ances),
stringsAsFactors = FALSE)
comparison$correct <- comparison$inferred == comparison$true
comparison
mean(comparison$correct)
```
Let's visualise the posterior trees:
```{r, net2}
plot(res2, type = "network", burnin = 3000, min_support = 0.01)
```