SEEPS

SEEPS provides functionality for large, detailed, stochastic simulations of pathogen outbreaks. While initially developed for HIV, it can be used for any pathogen. Our parameter choices here are taken from HIV-1 models. The workflow of a SEEPS experiment is generally split into four stages:

Simulate the pathogen transmission network (e.g. an outbreak) from a single index case.
Sample a collection of tips from the transmission network.
Extract the subtree. Optionally, process the tree to simulate within-host diversity.
Process, export, or analyze the subtree.

Simulate a transmission network with `gen_transmission_history_exponential_constant`

The first step of a SEEPS simulation is to obtain a transmission history. SEEPS tracks an “active” population of infections and records their ancestral relation (who infected whom and when).

Let’s look at the following piece of code and break down the arguments:

set.seed(1947)
offspring_rate_fn <- SEEPS::get_biphasic_HIV_rate(
    params = list("R0" = 5)
)

# Simulate
transmission_history <- SEEPS::gen_transmission_history_exponential_constant(
    offspring_rate_fn = offspring_rate_fn,
    maximum_population_target = 50,
    minimum_population = 15,
    total_steps = 48,
)

SEEPS is structured around two basic simulation modes: an exponential growth phase, and a constant population (maintenance) phase. This simulation function is ideal for cases where all observations are taken at a single point in time. We will explore the full API in [].

Exponential growth / offspring generation rate

Heterogeneity in the rate of generating secondary infections is a common source of variance. Here, this is user specified. Here, we consider the age of the infection as the primary source of heterogeneity. The exponential growth is encoded by an offspring generation rate function f, which determines the expected number of offspring that should be generated by an infection of x months old.

For HIV infections, a simple model is the biphasic rate function where the majority of new infections occur in the first 3 months of infection, and then a smaller number of infections occur in the following 21 months. The total density is normalized so R₀ infections are obtained over an average total infectious period.

This function encodes the growth rate R₀.

SEEPS rate functions are implemented as functions, which accept a vector of ages of all “active” infections, and return a vector of the expected number of offspring per time step. All rate functions should accept a params argument that contains any additional constants needed for the rate function. For HIV, we often use the following rate function:

ages <- c(1, 2, 5)
offspring_probs <- SEEPS::biphasic_HIV_rate(6, ages, params = list("R0" = 5))
print(offspring_probs)

## [1] 0.04950495 0.04950495 1.32013201

Maximum target population size

The number of possible offspring is limited by the maximum active population size (maximum_population_target). This is a user specified parameter. This value is chosen based on the available computational resources, and desired simulation size to attain a desired sample density. Often, we think of a “sufficiently large” population, so our sample density is small (e.g. taking 15 tips from a population of 1000 cases), or having very good sample density (measured in X% of the population). SEEPS will never simulate more active infections than this quantity.

To avoid a time step during the exponential growth phase where the number of offspring is not consistent with the exponential growth assumptions, SEEPS simulators will stop at 90% of the maximum population size. This behavior can be adjusted (as we shall see below), but we find 90% to be a good balance between computational efficiency and accuracy.

Minimum population

In some cases, it is desirable to ensure that our population exceeds a specified minimum:

A low R₀ may result in a premature extinction event.
A downstream analysis may assume that a minimum number of infections are present.

The parameter minimum_population is used to ensure that the population is at least this large, before exiting an exponential growth phase.

Length of the constant growth phase

Once the exponential growth phase is complete, the population is maintained at a constant size. This models the application of a public health control program which prevents further growth in the number of active cases, but replaces treated infections to maintain the population size. The length of the constant growth phase is specified by total_steps.

Sample a collection of tips

This section provides only a brief overview of the sampling methods available within SEEPS. See the detailed overview of available sampling methods for more details.

The full tree is rarely accessible, only samples from it are ever practically accessible. There are two modes of sampling we suggest for new users: uniform (independent) sampling and (iterative) contact tracing (with restarts). Contact tracing is more expensive, but gives a much more realistic sample.

Uniform sampling uses random_ids and offers additional support for proportional random sampling.
Iterative contact tracing with restarts uses contact_traced_uniform_restarts_ids which accepts the parental histories/transmission times of the full tree, list of active individuals to sample from, and a single parameter p that represents the discovery rate of iterative contact tracing. The algorithm is restarted with a new seed until minimum_sample_size infections are sampled.

sample <- SEEPS::contact_traced_uniform_restarts_ids(
    active = transmission_history$active,
    parents = transmission_history$parents,
    minimum_sample_size = 6,
    p = 0.4
)

Contact tracing is a random process, the list sample contains both the final sample, but additional information about the sampling process.

print(sample)

## $found
##       [,1]
##  [1,]   97
##  [2,]  104
##  [3,]   92
##  [4,]   88
##  [5,]   89
##  [6,]   94
##  [7,]   82
##  [8,]   85
##  [9,]   91
## 
## $samples
## [1]  82  91  92  94  97 104
## 
## $status
## [1] "Found enough data with 1 attempt(s)"
## 
## $success
## [1] TRUE
## 
## $group_ids
## [1] 1 1 1 1 1 1

The success flag should be checked to ensure that the sampling process was successful. If success is FALSE, then the sampling process was not successful, and the sample should be discarded and repeated. Failure usually indicates an issue with the input data structure.
The status attribute reports a human-readable status message. This should not be used for downstream analysis, but is useful for debugging. Use success and group_ids attributes for downstream analysis.
The found attribute lists both currently active and inactive infections that were sampled discovered/linked via contact tracing. Only active individuals can be sampled and are reported in the samples attribute.
As this algorithm will restart contact tracing with a new index case when no new infections are found, the group_ids attribute lists which attempt at contact tracing discovered each individual. This is useful for downstream analysis, as these often are interpreted (sometimes erroneously) as “clusters” of infections.

Extract the subtree

The full tree is rarely accessable, only samples from it are ever practically accessable. SEEPS offers two methods for extracting a subtree, the end goal is to obtain a phylogeny.

reduce_transmission_history reconstructs the geneology for a sample from the full tree using only the contact network. Edge lengths correspond to times between infections. Transmission chains are not reconstructed along edges.
reduce_transmission_history_bpb reconstructs a more detailed genealogy, recording all transmission events necessary to model the within-host diversity of each host. This function is paired with geneology_to_phylogeny_bpb which simulates within host diversity along the genealogy. The bpb suffix denotes that this functionality is derived from BioPhyBreak, a package for simulating within-host diversity.

genealogy <- SEEPS::reduce_transmission_history_bpb(
    sample = sample$samples,
    parents = transmission_history$parents,
    current_step = transmission_history$t_end
)
print(genealogy)

## $parents
##  [1]  0  1  2  3  4  5  6  7  8  9 10 10 11 12 12 14 16
## 
## $transmission_times
##  [1]  0  1  2  3  5 23 27 30 31 33 34 35 37 37 38 40 42
## 
## $sample_times
##  [1]  1.001  2.001  3.001  5.001 23.001 27.001 30.001 31.001 33.001 55.000
## [11] 37.001 38.001 55.000 55.000 55.000 55.000 55.000
## 
## $samples_available
##  [1] 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1
## 
## $transformed_sample_indices
##  [1]   0   0   0   0   0   0   0   0   0  82   0   0  91  92  94  97 104

Now we can simulate within host diversity as we know where the transmission events occurred.

phylogeny <- SEEPS::geneology_to_phylogeny_bpb(
    transmission_history = genealogy$parents,
    infection_times = genealogy$transmission_times,
    leaf_sample_ids = genealogy$transformed_sample_indices,
    sample_times = genealogy$sample_times,
    a = 5, b = 5
)
# print(phylogeny)
# print(phylogeny[,6])
# print(sum(phylogeny[,6]))

This phylogeny list contains 4 fields:

names_newick contains the (relabeled) names of the tips. Includes all unsampled tips.
newick_string contains a newick string of the phylogeny, with the tips labeled using names_newick
tree an ape tree representation of the newick string.
phylogeny a phylogeny matrix that encodes the ancestral relationship within the tree. This is the object we want to carry forward for downstream analysis.

Exporting

Newick string or ape tree

We can convert the phylogeny to a Newick string and load it into ape to visualize.

newick_string <- SEEPS::phylogeny_to_newick(phylogeny$phylogeny,
    mode = "mean", label_mode = "abs"
)

tree <- ape::read.tree(text = newick_string)
plot(ape::ladderize(tree))
box("outer")
title(
    main = "Simulated phylogeny for 6 sampled infections",
    xlab = "Time from index infection (Months)"
)
axis(1, col = "black", col.ticks = "black", col.axis = "black", las = 1)

Most of the tips are unsampled, so we can remove them from the tree and reveal the discoverable relationship between the sequences.

tree <- ape::drop.tip(tree, "") # unsampled tips have no index
plot(ape::ladderize(tree))
box("outer")
title(
    main = "Observable phylogeny for 6 sampled infections",
    xlab = "Time from most recent common ancestor (Months)"
)
axis(1, col = "black", col.ticks = "black", col.axis = "black", las = 1)

Pairwise distance matrix

We could alternatively store only the distance relationship between the samples.

print(phylogeny$phylogeny)

##       [,1] [,2]       [,3]       [,4]         [,5] [,6] [,7]
##  [1,]    1   32  1.0010000  0.3734455 0.0018690916    0    0
##  [2,]    2   32  2.0010000  1.3734455 0.0068740830    0    0
##  [3,]    3   31  3.0010000  1.9947840 0.0099838767    0    0
##  [4,]    4   30  5.0010000  3.4439264 0.0172368222    0    0
##  [5,]    5   29 23.0010000 16.6349876 0.0832579696    0    0
##  [6,]    6   27 27.0010000  1.7993404 0.0090056832    0    0
##  [7,]    7   28 30.0010000  5.4709745 0.0273821801    0    0
##  [8,]    8   25 31.0010000  0.9999708 0.0050048454    0    0
##  [9,]    9   24 33.0010000  1.6482525 0.0082494895    0    0
## [10,]   10   26 55.0000000 27.2799548 0.1365359389    1   82
## [11,]   11   22 37.0010000  2.4823942 0.0124243614    0    0
## [12,]   12   19 38.0010000  1.6213194 0.0081146896    0    0
## [13,]   13   22 55.0000000 20.4813942 0.1025092015    1   91
## [14,]   14   18 55.0000000 16.1211358 0.0806861461    1   92
## [15,]   15   21 55.0000000 19.8310839 0.0992544039    1   94
## [16,]   16   19 55.0000000 18.6203194 0.0931945382    1   97
## [17,]   17   18 55.0000000 16.1211358 0.0806861461    1  104
## [18,]   18   20 38.8788642  3.4479771 0.0172570959    0    0
## [19,]   19   20 36.3796806  0.9487936 0.0047487037    0    0
## [20,]   20   21 35.4308870  0.2619709 0.0013111620    0    0
## [21,]   21   23 35.1689161  3.6777947 0.0184073309    0    0
## [22,]   22   23 34.5186058  3.0274844 0.0151525334    0    0
## [23,]   23   24 31.4911214  0.1383739 0.0006925602    0    0
## [24,]   24   25 31.3527475  1.3517183 0.0067653387    0    0
## [25,]   25   26 30.0010292  2.2809840 0.0114163051    0    0
## [26,]   26   27 27.7200452  2.5183856 0.0126044982    0    0
## [27,]   27   28 25.2016596  0.6716341 0.0033615227    0    0
## [28,]   28   29 24.5300255 18.1640131 0.0909107293    0    0
## [29,]   29   30  6.3660124  4.8089389 0.0240686975    0    0
## [30,]   30   31  1.5570736  0.5508576 0.0027570373    0    0
## [31,]   31   33  1.0062160  1.0162071 0.0050861079    0    0
## [32,]   32   33  0.6275545  0.6375456 0.0031909103    0    0
## [33,]   33    0  0.0000000  0.0000000 0.0000000000    0    0
## [34,]    0    0  0.0000000  0.0000000 0.0000000000    0    0

mean_distances <- SEEPS::geneology_to_distance_matrix(phylogeny$phylogeny, mode = "mean")
print(mean_distances * 0.0067 / 12 * 300) # subs / (year * site) * months/year * # of sites

##           82       91       92       94       97      104
## 82  0.000000 9.138785 9.138785 9.138785 9.138785 9.138785
## 91  9.138785 0.000000 7.875474 7.875474 7.875474 7.875474
## 92  9.138785 7.875474 0.000000 6.643413 6.555653 5.400581
## 94  9.138785 7.875474 6.643413 0.000000 6.643413 6.643413
## 97  9.138785 7.875474 6.555653 6.643413 0.000000 6.555653
## 104 9.138785 7.875474 5.400581 6.643413 6.555653 0.000000

This matrix can be constructed using either expected or sampled distances. We can sample the distances and update the phylogeny object. Note that sampling edge lengths results in the tree no longer being ultametric.

sampled_phylogeny <- SEEPS::stochastify_transmission_history(
    transmission_history = phylogeny$phylogeny,
    rate = 0.0067 / 12 * 300
) # subs / (year * site) * months/year * # of sites
sampled_distances <- SEEPS::geneology_to_distance_matrix(sampled_phylogeny$geneology, mode = "mu")
print(sampled_distances)

##     82 91 92 94 97 104
## 82   0 11 12 12 13  12
## 91  11  0  7  7  8   7
## 92  12  7  0  4  5   4
## 94  12  7  4  0  5   4
## 97  13  8  5  5  0   5
## 104 12  7  4  4  5   0

More details about exporting and other post-processing methods in the Exporting data vignette.