SEEPS provides functionality for large, detailed, stochastic simulations of pathogen outbreaks. While initially developed for HIV, it can be used for any pathogen. Our parameter choices here are taken from HIV-1 models. The workflow of a SEEPS experiment is generally split into four stages:
gen_transmission_history_exponential_constant
The first step of a SEEPS simulation is to obtain a transmission history. SEEPS tracks an “active” population of infections and records their ancestral relation (who infected whom and when).
Let’s look at the following piece of code and break down the arguments:
set.seed(1947)
offspring_rate_fn <- SEEPS::get_biphasic_HIV_rate(
params = list("R0" = 5)
)
# Simulate
transmission_history <- SEEPS::gen_transmission_history_exponential_constant(
offspring_rate_fn = offspring_rate_fn,
maximum_population_target = 50,
minimum_population = 15,
total_steps = 48,
)
SEEPS is structured around two basic simulation modes: an exponential growth phase, and a constant population (maintenance) phase. This simulation function is ideal for cases where all observations are taken at a single point in time. We will explore the full API in [].
Heterogeneity in the rate of generating secondary infections is a
common source of variance. Here, this is user specified. Here, we
consider the age of the infection as the primary source of
heterogeneity. The exponential growth is encoded by an offspring
generation rate function f
, which determines the expected
number of offspring that should be generated by an infection of
x
months old.
For HIV infections, a simple model is the biphasic rate function where the majority of new infections occur in the first 3 months of infection, and then a smaller number of infections occur in the following 21 months. The total density is normalized so R0 infections are obtained over an average total infectious period.
This function encodes the growth rate R0.
SEEPS rate functions are implemented as functions, which accept a
vector of ages of all “active” infections, and return a vector of the
expected number of offspring per time step. All rate functions should
accept a params
argument that contains any additional
constants needed for the rate function. For HIV, we often use the
following rate function:
ages <- c(1, 2, 5)
offspring_probs <- SEEPS::biphasic_HIV_rate(6, ages, params = list("R0" = 5))
print(offspring_probs)
## [1] 0.04950495 0.04950495 1.32013201
The number of possible offspring is limited by the maximum active
population size (maximum_population_target
). This is a user
specified parameter. This value is chosen based on the available
computational resources, and desired simulation size to attain a desired
sample density. Often, we think of a “sufficiently large” population, so
our sample density is small (e.g. taking 15 tips from a population of
1000 cases), or having very good sample density (measured in X% of the
population). SEEPS will never simulate more active infections than this
quantity.
To avoid a time step during the exponential growth phase where the number of offspring is not consistent with the exponential growth assumptions, SEEPS simulators will stop at 90% of the maximum population size. This behavior can be adjusted (as we shall see below), but we find 90% to be a good balance between computational efficiency and accuracy.
In some cases, it is desirable to ensure that our population exceeds a specified minimum:
The parameter minimum_population
is used to ensure that
the population is at least this large, before exiting an exponential
growth phase.
Once the exponential growth phase is complete, the population is
maintained at a constant size. This models the application of a public
health control program which prevents further growth in the number of
active cases, but replaces treated infections to maintain the population
size. The length of the constant growth phase is specified by
total_steps
.
This section provides only a brief overview of the sampling methods available within SEEPS. See the detailed overview of available sampling methods for more details.
The full tree is rarely accessible, only samples from it are ever practically accessible. There are two modes of sampling we suggest for new users: uniform (independent) sampling and (iterative) contact tracing (with restarts). Contact tracing is more expensive, but gives a much more realistic sample.
random_ids
and offers additional
support for proportional random sampling.contact_traced_uniform_restarts_ids
which accepts the
parental histories/transmission times of the full tree, list of active
individuals to sample from, and a single parameter p
that
represents the discovery rate of iterative contact tracing. The
algorithm is restarted with a new seed until
minimum_sample_size
infections are sampled.sample <- SEEPS::contact_traced_uniform_restarts_ids(
active = transmission_history$active,
parents = transmission_history$parents,
minimum_sample_size = 6,
p = 0.4
)
Contact tracing is a random process, the list sample
contains both the final sample, but additional information about the
sampling process.
## $found
## [,1]
## [1,] 97
## [2,] 104
## [3,] 92
## [4,] 88
## [5,] 89
## [6,] 94
## [7,] 82
## [8,] 85
## [9,] 91
##
## $samples
## [1] 82 91 92 94 97 104
##
## $status
## [1] "Found enough data with 1 attempt(s)"
##
## $success
## [1] TRUE
##
## $group_ids
## [1] 1 1 1 1 1 1
success
flag should be checked to ensure that the
sampling process was successful. If success
is
FALSE
, then the sampling process was not successful, and
the sample should be discarded and repeated. Failure usually indicates
an issue with the input data structure.status
attribute reports a human-readable status
message. This should not be used for downstream analysis, but is useful
for debugging. Use success
and group_ids
attributes for downstream analysis.found
attribute lists both currently active and
inactive infections that were sampled discovered/linked via contact
tracing. Only active individuals can be sampled and are reported in the
samples
attribute.group_ids
attribute
lists which attempt at contact tracing discovered each individual. This
is useful for downstream analysis, as these often are interpreted
(sometimes erroneously) as “clusters” of infections.The full tree is rarely accessable, only samples from it are ever practically accessable. SEEPS offers two methods for extracting a subtree, the end goal is to obtain a phylogeny.
reduce_transmission_history
reconstructs the geneology
for a sample from the full tree using only the contact network. Edge
lengths correspond to times between infections. Transmission chains are
not reconstructed along edges.reduce_transmission_history_bpb
reconstructs a more
detailed genealogy, recording all transmission events necessary to model
the within-host diversity of each host. This function is paired with
geneology_to_phylogeny_bpb
which simulates within host
diversity along the genealogy. The bpb
suffix denotes that
this functionality is derived from BioPhyBreak, a
package for simulating within-host diversity.genealogy <- SEEPS::reduce_transmission_history_bpb(
sample = sample$samples,
parents = transmission_history$parents,
current_step = transmission_history$t_end
)
print(genealogy)
## $parents
## [1] 0 1 2 3 4 5 6 7 8 9 10 10 11 12 12 14 16
##
## $transmission_times
## [1] 0 1 2 3 5 23 27 30 31 33 34 35 37 37 38 40 42
##
## $sample_times
## [1] 1.001 2.001 3.001 5.001 23.001 27.001 30.001 31.001 33.001 55.000
## [11] 37.001 38.001 55.000 55.000 55.000 55.000 55.000
##
## $samples_available
## [1] 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1
##
## $transformed_sample_indices
## [1] 0 0 0 0 0 0 0 0 0 82 0 0 91 92 94 97 104
Now we can simulate within host diversity as we know where the transmission events occurred.
phylogeny <- SEEPS::geneology_to_phylogeny_bpb(
transmission_history = genealogy$parents,
infection_times = genealogy$transmission_times,
leaf_sample_ids = genealogy$transformed_sample_indices,
sample_times = genealogy$sample_times,
a = 5, b = 5
)
# print(phylogeny)
# print(phylogeny[,6])
# print(sum(phylogeny[,6]))
This phylogeny list contains 4 fields:
names_newick
contains the (relabeled) names of the
tips. Includes all unsampled tips.newick_string
contains a newick string of the
phylogeny, with the tips labeled using names_newick
tree
an ape tree representation of the newick
string.phylogeny
a phylogeny matrix that encodes the ancestral
relationship within the tree. This is the object we want to carry
forward for downstream analysis.We can convert the phylogeny to a Newick string and load it into ape to visualize.
newick_string <- SEEPS::phylogeny_to_newick(phylogeny$phylogeny,
mode = "mean", label_mode = "abs"
)
tree <- ape::read.tree(text = newick_string)
plot(ape::ladderize(tree))
box("outer")
title(
main = "Simulated phylogeny for 6 sampled infections",
xlab = "Time from index infection (Months)"
)
axis(1, col = "black", col.ticks = "black", col.axis = "black", las = 1)
Most of the tips are unsampled, so we can remove them from the tree and reveal the discoverable relationship between the sequences.
tree <- ape::drop.tip(tree, "") # unsampled tips have no index
plot(ape::ladderize(tree))
box("outer")
title(
main = "Observable phylogeny for 6 sampled infections",
xlab = "Time from most recent common ancestor (Months)"
)
axis(1, col = "black", col.ticks = "black", col.axis = "black", las = 1)
We could alternatively store only the distance relationship between the samples.
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1 32 1.0010000 0.3734455 0.0018690916 0 0
## [2,] 2 32 2.0010000 1.3734455 0.0068740830 0 0
## [3,] 3 31 3.0010000 1.9947840 0.0099838767 0 0
## [4,] 4 30 5.0010000 3.4439264 0.0172368222 0 0
## [5,] 5 29 23.0010000 16.6349876 0.0832579696 0 0
## [6,] 6 27 27.0010000 1.7993404 0.0090056832 0 0
## [7,] 7 28 30.0010000 5.4709745 0.0273821801 0 0
## [8,] 8 25 31.0010000 0.9999708 0.0050048454 0 0
## [9,] 9 24 33.0010000 1.6482525 0.0082494895 0 0
## [10,] 10 26 55.0000000 27.2799548 0.1365359389 1 82
## [11,] 11 22 37.0010000 2.4823942 0.0124243614 0 0
## [12,] 12 19 38.0010000 1.6213194 0.0081146896 0 0
## [13,] 13 22 55.0000000 20.4813942 0.1025092015 1 91
## [14,] 14 18 55.0000000 16.1211358 0.0806861461 1 92
## [15,] 15 21 55.0000000 19.8310839 0.0992544039 1 94
## [16,] 16 19 55.0000000 18.6203194 0.0931945382 1 97
## [17,] 17 18 55.0000000 16.1211358 0.0806861461 1 104
## [18,] 18 20 38.8788642 3.4479771 0.0172570959 0 0
## [19,] 19 20 36.3796806 0.9487936 0.0047487037 0 0
## [20,] 20 21 35.4308870 0.2619709 0.0013111620 0 0
## [21,] 21 23 35.1689161 3.6777947 0.0184073309 0 0
## [22,] 22 23 34.5186058 3.0274844 0.0151525334 0 0
## [23,] 23 24 31.4911214 0.1383739 0.0006925602 0 0
## [24,] 24 25 31.3527475 1.3517183 0.0067653387 0 0
## [25,] 25 26 30.0010292 2.2809840 0.0114163051 0 0
## [26,] 26 27 27.7200452 2.5183856 0.0126044982 0 0
## [27,] 27 28 25.2016596 0.6716341 0.0033615227 0 0
## [28,] 28 29 24.5300255 18.1640131 0.0909107293 0 0
## [29,] 29 30 6.3660124 4.8089389 0.0240686975 0 0
## [30,] 30 31 1.5570736 0.5508576 0.0027570373 0 0
## [31,] 31 33 1.0062160 1.0162071 0.0050861079 0 0
## [32,] 32 33 0.6275545 0.6375456 0.0031909103 0 0
## [33,] 33 0 0.0000000 0.0000000 0.0000000000 0 0
## [34,] 0 0 0.0000000 0.0000000 0.0000000000 0 0
mean_distances <- SEEPS::geneology_to_distance_matrix(phylogeny$phylogeny, mode = "mean")
print(mean_distances * 0.0067 / 12 * 300) # subs / (year * site) * months/year * # of sites
## 82 91 92 94 97 104
## 82 0.000000 9.138785 9.138785 9.138785 9.138785 9.138785
## 91 9.138785 0.000000 7.875474 7.875474 7.875474 7.875474
## 92 9.138785 7.875474 0.000000 6.643413 6.555653 5.400581
## 94 9.138785 7.875474 6.643413 0.000000 6.643413 6.643413
## 97 9.138785 7.875474 6.555653 6.643413 0.000000 6.555653
## 104 9.138785 7.875474 5.400581 6.643413 6.555653 0.000000
This matrix can be constructed using either expected or sampled distances. We can sample the distances and update the phylogeny object. Note that sampling edge lengths results in the tree no longer being ultametric.
sampled_phylogeny <- SEEPS::stochastify_transmission_history(
transmission_history = phylogeny$phylogeny,
rate = 0.0067 / 12 * 300
) # subs / (year * site) * months/year * # of sites
sampled_distances <- SEEPS::geneology_to_distance_matrix(sampled_phylogeny$geneology, mode = "mu")
print(sampled_distances)
## 82 91 92 94 97 104
## 82 0 11 12 12 13 12
## 91 11 0 7 7 8 7
## 92 12 7 0 4 5 4
## 94 12 7 4 0 5 4
## 97 13 8 5 5 0 5
## 104 12 7 4 4 5 0
More details about exporting and other post-processing methods in the Exporting data vignette.