This tutorial provides a worked example of outbreak reconstruction using outbreaker2. For installation guidelines, a general overview of the package’s functionalities as well as other resources, see the ‘overview’ vignette:
We will be analysing a small simulated outbreak distributed with the
package, fake_outbreak
. This dataset contains simulated
dates of onsets, partial contact tracing data and pathogen genome
sequences for 30 cases:
library(ape)
#>
#> Attaching package: 'ape'
#> The following objects are masked from 'package:igraph':
#>
#> degree, edges, mst, ring
library(outbreaker2)
col <- "#6666cc"
fake_outbreak
#> $onset
#> [1] 0 2 4 4 6 6 6 7 7 8 8 8 8 9 9 9 9 10 10 10 10 10 10 10 10
#> [26] 10 10 10 11 11
#>
#> $sample
#> [1] 3 5 6 6 7 9 8 9 9 9 11 10 10 10 10 11 11 12 11 13 12 13 11 12 11
#> [26] 11 13 12 14 14
#>
#> $dna
#> 30 DNA sequences in binary format stored in a matrix.
#>
#> All sequences of same length: 10000
#>
#> Labels:
#> ...
#>
#> Base composition:
#> a c g t
#> 0.251 0.242 0.251 0.256
#> (Total: 300 kb)
#>
#> $w
#> [1] 0.04255319 0.21276596 0.42553191 0.31914894
#>
#> $ances
#> [1] NA 1 2 NA 3 4 4 5 6 6 7 8 9 5 5 7 7 8 9 10 11 11 13 13 13
#> [26] 17 17 NA 10 13
#>
#> $ctd
#> i j
#> 1 4 7
#> 2 10 29
#> 3 13 30
#> 4 11 22
#> 5 13 25
#> 6 16 7
#> 7 13 23
#> 8 10 6
#> 9 18 8
#> 10 14 5
#> 11 17 26
#> 12 11 21
#> 13 3 5
#> 14 17 27
#> 15 2 3
#> 16 17 7
#> 17 19 9
#> 18 12 8
#> 19 22 6
#> 20 30 9
#> 21 17 20
#> 22 21 29
#> 23 11 14
#> 24 18 9
#> 25 25 29
#> 26 1 17
#> 27 23 7
#> 28 25 3
#> 29 3 9
#> 30 1 4
#> 31 13 4
#> 32 21 27
#> 33 16 2
#> 34 29 3
#> 35 21 5
Here, we will use the dates of case isolation $sample
,
DNA sequences $dna
, contact tracing data $ctd
and the empirical distribution of the generation time $w
,
which can be visualised as:
plot(fake_outbreak$w, type = "h", xlim = c(0, 5),
lwd = 30, col = col, lend = 2,
xlab = "Days after infection",
ylab = "p(new case)",
main = "Generation time distribution")
By default, outbreaker2 uses the settings defined by
create_config()
; see the documentation of this function for
details. Note that the main function of outbreaker2 is called
outbreaker
(without number). The function’s arguments
are:
args(outbreaker)
#> function (data = outbreaker_data(), config = create_config(),
#> priors = custom_priors(), likelihoods = custom_likelihoods(),
#> moves = custom_moves())
#> NULL
The only mandatory input really is the data. For most cases,
customising the method will be done through config
and the
function create_config()
, which creates default and alters
settings such as prior parameters, length and rate of sampling from the
MCMC, and definition of which parameters should be estimated (‘moved’).
The last arguments of outbreaker
are used to specify custom
prior, likelihood, and movement functions, and are detailed in the
‘Customisation’ vignette.
Let us run the analysis with default settings:
dna <- fake_outbreak$dna
dates <- fake_outbreak$sample
ctd <- fake_outbreak$ctd
w <- fake_outbreak$w
data <- outbreaker_data(dna = dna, dates = dates, ctd = ctd, w_dens = w)
## we set the seed to ensure results won't change
set.seed(1)
res <- outbreaker(data = data)
This analysis will take around 40 seconds on a modern computer. Note that outbreaker2 is slower than outbreaker for the same number of iterations, but the two implementations are actually different. In particular, outbreaker2 performs many more moves than the original package for each iteration of the MCMC, resulting in more efficient mixing. In short: outbreaker2 is slower, but it requires far less iterations.
Results are stored in a data.frame
with the special
class outbreaker_chains
:
class(res)
#> [1] "outbreaker_chains" "data.frame"
dim(res)
#> [1] 201 98
res
#>
#>
#> ///// outbreaker results ///
#>
#> class: outbreaker_chains data.frame
#> dimensions 201 rows, 98 columns
#> ancestries not shown: alpha_1 - alpha_30
#> infection dates not shown: t_inf_1 - t_inf_30
#> intermediate generations not shown: kappa_1 - kappa_30
#>
#> /// head //
#> step post like prior mu pi eps
#> 1 1 -1198.0669 -1199.4211 1.354240 0.0001000000 0.9000000 0.5000000
#> 2 50 -576.3898 -578.4358 2.046006 0.0001322360 0.9719080 0.6238018
#> 3 100 -579.1822 -580.9235 1.741283 0.0001218749 0.9395509 0.5810956
#> lambda
#> 1 0.05000000
#> 2 0.06582716
#> 3 0.10193745
#>
#> ...
#> /// tail //
#> step post like prior mu pi eps
#> 199 9900 -563.6538 -565.2708 1.616983 0.0001236515 0.9266639 0.7634021
#> 200 9950 -567.0488 -569.0806 2.031785 0.0001179255 0.9703719 0.7057554
#> 201 10000 -564.4037 -565.6457 1.242090 0.0001456115 0.8888590 0.6301657
#> lambda
#> 199 0.07471296
#> 200 0.07418330
#> 201 0.08802858
Each row of res
contains a sample from the MCMC. For
each, informations about the step (iteration of the MCMC), log-values of
posterior, likelihood and priors, and all parameters and augmented data
are returned. Ancestries (i.e. indices of the most recent ancestral case
for a given case), are indicated by
alpha_[index of the case]
, dates of infections by
t_inf_[index of the case]
, and number of generations
between cases and their infector / ancestor by
kappa_[index of the case]
:
names(res)
#> [1] "step" "post" "like" "prior" "mu" "pi"
#> [7] "eps" "lambda" "alpha_1" "alpha_2" "alpha_3" "alpha_4"
#> [13] "alpha_5" "alpha_6" "alpha_7" "alpha_8" "alpha_9" "alpha_10"
#> [19] "alpha_11" "alpha_12" "alpha_13" "alpha_14" "alpha_15" "alpha_16"
#> [25] "alpha_17" "alpha_18" "alpha_19" "alpha_20" "alpha_21" "alpha_22"
#> [31] "alpha_23" "alpha_24" "alpha_25" "alpha_26" "alpha_27" "alpha_28"
#> [37] "alpha_29" "alpha_30" "t_inf_1" "t_inf_2" "t_inf_3" "t_inf_4"
#> [43] "t_inf_5" "t_inf_6" "t_inf_7" "t_inf_8" "t_inf_9" "t_inf_10"
#> [49] "t_inf_11" "t_inf_12" "t_inf_13" "t_inf_14" "t_inf_15" "t_inf_16"
#> [55] "t_inf_17" "t_inf_18" "t_inf_19" "t_inf_20" "t_inf_21" "t_inf_22"
#> [61] "t_inf_23" "t_inf_24" "t_inf_25" "t_inf_26" "t_inf_27" "t_inf_28"
#> [67] "t_inf_29" "t_inf_30" "kappa_1" "kappa_2" "kappa_3" "kappa_4"
#> [73] "kappa_5" "kappa_6" "kappa_7" "kappa_8" "kappa_9" "kappa_10"
#> [79] "kappa_11" "kappa_12" "kappa_13" "kappa_14" "kappa_15" "kappa_16"
#> [85] "kappa_17" "kappa_18" "kappa_19" "kappa_20" "kappa_21" "kappa_22"
#> [91] "kappa_23" "kappa_24" "kappa_25" "kappa_26" "kappa_27" "kappa_28"
#> [97] "kappa_29" "kappa_30"
Results can be visualised using plot
, which has several
options and can be used to derive various kinds of graphics (see
?plot.outbreaker_chains
). The basic plot shows the trace of
the log-posterior values, which is useful to assess mixing:
The second argument of plot
can be used to visualise
traces of any other column in res
:
burnin
can be used to discard the first iterations prior
to mixing:
type
indicates the type of graphic to plot; roughly:
trace
for traces of the MCMC (default)
hist
, density
to assess distributions
of quantitative values
alpha
, network
to visualise ancestries
/ transmission tree; note that network
opens up an
interactive plot and requires a web browser with Javascript enabled; the
argument min_support
is useful to select only the most
supported ancestries and avoid displaying too many links
kappa
to visualise the distributions generations
between cases and their ancestor / infector
Here are a few examples:
summary
The summary of results derives various distributional statistics for posterior, likelihood and prior densities, as well as for the quantitative parameters. It also builds a consensus tree, by finding for each case the most frequent infector / ancestor in the posterior samples. The corresponding frequencies are reported as ‘support’. The most frequent value of kappa is also reported as ‘generations’:
summary(res)
#> $step
#> first last interval n_steps
#> 1 10000 50 201
#>
#> $post
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -1198.1 -564.5 -561.8 -565.2 -559.2 -553.3
#>
#> $like
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -1199.4 -566.4 -563.7 -567.1 -561.2 -555.4
#>
#> $prior
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.6565 1.7928 2.0318 1.9224 2.1461 2.2984
#>
#> $mu
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 9.733e-05 1.301e-04 1.448e-04 1.475e-04 1.646e-04 2.108e-04
#>
#> $pi
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.8329 0.9449 0.9704 0.9593 0.9828 0.9995
#>
#> $tree
#> from to time support generations
#> 1 NA 1 -1 NA NA
#> 2 1 2 1 1.0000000 1
#> 3 2 3 3 1.0000000 1
#> 4 NA 4 2 NA NA
#> 5 3 5 4 1.0000000 1
#> 6 10 6 8 0.9950249 1
#> 7 4 7 5 1.0000000 1
#> 8 5 8 6 0.9601990 1
#> 9 4 9 5 0.9502488 1
#> 10 9 10 6 0.9950249 1
#> 11 7 11 7 0.8109453 1
#> 12 8 12 7 0.8358209 1
#> 13 9 13 7 1.0000000 1
#> 14 5 14 7 0.9900498 1
#> 15 5 15 7 0.7263682 1
#> 16 7 16 8 0.9950249 1
#> 17 7 17 7 0.9701493 1
#> 18 8 18 9 0.9651741 1
#> 19 9 19 8 1.0000000 1
#> 20 10 20 10 1.0000000 1
#> 21 11 21 10 1.0000000 1
#> 22 11 22 10 1.0000000 1
#> 23 13 23 9 1.0000000 1
#> 24 13 24 9 1.0000000 1
#> 25 13 25 9 1.0000000 1
#> 26 17 26 9 0.9751244 1
#> 27 17 27 10 1.0000000 1
#> 28 NA 28 9 NA NA
#> 29 10 29 10 1.0000000 1
#> 30 13 30 10 1.0000000 1
As said before, most customisation can be achieved via
create_config
. In the following, we make the following
changes to the defaults:
increase the number of iterations to 30,000
set the sampling rate to 20
use a star-like initial tree
disable to movement of kappa
, so that we assume that
all cases have observed
set a lower rate for the exponential prior of mu
(10
instead of 1000)
config2 <- create_config(n_iter = 3e4,
sample_every = 20,
init_tree ="star",
move_kappa = FALSE,
prior_mu = 10)
set.seed(1)
res2 <- outbreaker(data, config2)
plot(res2)
We can see that the burnin is around 2,500 iterations (i.e. after the
initial step corresponding to a local optimum). We get the consensus
tree from the new results, and compare the inferred tree to the actual
ancestries stored in the dataset (fake_outbreak$ances
):
summary(res2, burnin = 3000)
#> $step
#> first last interval n_steps
#> 3020 30000 20 1350
#>
#> $post
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -543.7 -525.3 -523.1 -523.3 -520.8 -513.7
#>
#> $like
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -548.0 -529.6 -527.5 -527.6 -525.2 -518.1
#>
#> $prior
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 2.456 4.277 4.434 4.364 4.528 4.604
#>
#> $mu
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 7.752e-05 1.219e-04 1.358e-04 1.382e-04 1.541e-04 2.270e-04
#>
#> $pi
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.7877 0.9643 0.9814 0.9740 0.9917 1.0000
#>
#> $tree
#> from to time support generations
#> 1 NA 1 -1 NA NA
#> 2 1 2 1 0.9992593 1
#> 3 2 3 3 0.9992593 1
#> 4 NA 4 2 NA NA
#> 5 3 5 4 0.9970370 1
#> 6 4 6 5 0.9614815 1
#> 7 4 7 5 1.0000000 1
#> 8 5 8 6 0.9400000 1
#> 9 6 9 6 1.0000000 1
#> 10 6 10 7 1.0000000 1
#> 11 7 11 7 0.8311111 1
#> 12 8 12 8 0.8311111 1
#> 13 9 13 7 1.0000000 1
#> 14 5 14 7 0.9844444 1
#> 15 5 15 7 0.7281481 1
#> 16 7 16 8 0.9992593 1
#> 17 7 17 7 0.9881481 1
#> 18 8 18 9 0.9829630 1
#> 19 9 19 9 1.0000000 1
#> 20 10 20 10 0.9755556 1
#> 21 11 21 10 1.0000000 1
#> 22 11 22 10 1.0000000 1
#> 23 13 23 9 1.0000000 1
#> 24 13 24 10 1.0000000 1
#> 25 13 25 9 1.0000000 1
#> 26 17 26 9 1.0000000 1
#> 27 17 27 10 1.0000000 1
#> 28 NA 28 9 NA NA
#> 29 10 29 11 1.0000000 1
#> 30 13 30 11 1.0000000 1
tree2 <- summary(res2, burnin = 3000)$tree
comparison <- data.frame(case = 1:30,
inferred = paste(tree2$from),
true = paste(fake_outbreak$ances),
stringsAsFactors = FALSE)
comparison$correct <- comparison$inferred == comparison$true
comparison
#> case inferred true correct
#> 1 1 NA NA TRUE
#> 2 2 1 1 TRUE
#> 3 3 2 2 TRUE
#> 4 4 NA NA TRUE
#> 5 5 3 3 TRUE
#> 6 6 4 4 TRUE
#> 7 7 4 4 TRUE
#> 8 8 5 5 TRUE
#> 9 9 6 6 TRUE
#> 10 10 6 6 TRUE
#> 11 11 7 7 TRUE
#> 12 12 8 8 TRUE
#> 13 13 9 9 TRUE
#> 14 14 5 5 TRUE
#> 15 15 5 5 TRUE
#> 16 16 7 7 TRUE
#> 17 17 7 7 TRUE
#> 18 18 8 8 TRUE
#> 19 19 9 9 TRUE
#> 20 20 10 10 TRUE
#> 21 21 11 11 TRUE
#> 22 22 11 11 TRUE
#> 23 23 13 13 TRUE
#> 24 24 13 13 TRUE
#> 25 25 13 13 TRUE
#> 26 26 17 17 TRUE
#> 27 27 17 17 TRUE
#> 28 28 NA NA TRUE
#> 29 29 10 10 TRUE
#> 30 30 13 13 TRUE
mean(comparison$correct)
#> [1] 1
Let’s visualise the posterior trees: