--- title: "Exporting Data" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Exporting Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(SEEPS) require(ape) ``` In this tutorial, we'll look at how to export data from a SEEPS simulation. SEEPS generates three kinds of data for export: 1. Trees (newick string format) 2. Pairwise distance matrices (matrix) 3. Simulated sequences (fasta) Trees and pairwise distance matrices are generated from a genealogy or a phylogeny, which are coded in an array of the same structure. ## Sample data First, let's generate some data for a population of 150 individuals with 15 samples (10% sampling coverage). ```{r} set.seed(1947) params <- list( "rate_function_parameters" = list("R0" = 2), "minimum_population" = 5, "maximum_population_target" = 150, "total_steps_after_exp_phase" = 48, "mutation_rate" = 0.0067, "a" = 5, "b" = 5 ) biphasic_rate_function <- get_biphasic_HIV_rate( params = params[["rate_function_parameters"]] ) simulator_result <- gen_transmission_history_exponential_constant( minimum_population = params[["minimum_population"]], offspring_rate_fn = biphasic_rate_function, total_steps = params[["total_steps_after_exp_phase"]], maximum_population_target = params[["maximum_population_target"]], spike_root = FALSE ) target_sample <- random_fixed_size_ids( # Random sampling active = simulator_result[["active"]], minimum_size = params[["minimum_population"]], spike_root = FALSE ) transmission_history <- reduce_transmission_history( samples = target_sample[["samples"]], parents = simulator_result[["parents"]], current_step = simulator_result[["t_end"]], spike_root = FALSE ) # Get the sequence length V3_sequence <- SEEPS::lookup_sequence_by_name( organism_name = "HIV1", # nolint: object_name_linter region_name = "Env-Clinical" ) # Randomize edge lengths geneology <- stochastify_transmission_history( transmission_history = transmission_history$geneology, rate = params[["mutation_rate"]] / 12 * length(V3_sequence) ) # Provide in rate per sequence per year # Get the rate model for the V3 region rate_model <- SEEPS::get_V3_rate_model(nonzero_I = TRUE) print(geneology) sequences_fasta <- SEEPS::generate_sequences( phylogeny = geneology$geneology, branch_rate = params[["mutation_rate"]] / 12, root_sequence = V3_sequence, rate_model = rate_model, rate_per_nt = FALSE ) # Distances in phylogeny are already in per nt ``` ```{r} newick_tree <- SEEPS::phylogeny_to_newick(geneology$geneology, mode = "mean", label_mode = "abs") sequences_df <- SEEPS::fasta_string_to_dataframe(sequences_fasta) ``` ```{r} matrix <- SEEPS::build_distance_matrix_from_df(sequences_df) ``` The last 3 lines of the above code are the only lines that are needed to prepare data for export. ## Trees A phylogeny or a genealogy can be converted to a Newick tree using `phylogeny_to_newick`. This function takes one required and two optional arguments. * The `mode` option selects between the expected number `mean` or the sampled number of mutations `mu` for the branch lengths. * The `label_mode` option determines whether to use `abs` (values in 1:number of simulated infections) or `local` (values of 1:sample size). The returned Newick string can be read into ape and visualized. ```{r} print(newick_tree) ``` ```{r, fig.alt="A ladderized phylogenetic tree plot"} tree <- ape::read.tree(text = newick_tree) plot(ape::ladderize(tree)) ``` Alternatively, we can export the tree to a text file using standard library tools. ```{r, eval=FALSE} fileConnection <- file("tree.txt", "w") writeLines(newick_tree, fileConnection) close(fileConnection) ``` ## Sequences (fasta) Sequence simulation is available with SeqGen. By default, `generate_sequences` returns a string that represents a `fasta` 2-line text file. This file can be immediately written out to storage. ```{r} print(sequences_fasta) ``` The root sequence used to seed SeqGen is also returned in the fasta output, with the name/identifier of `root`. However, a serialization and reloading step is not necessary, if you are continuing to work with the sequences in R (to either run downstream analyses, or add sequencing artifacts). The `fasta_string_to_dataframe` function converts the string to a data frame, can be further manipulated before being written out to storage. ```{r} print(sequences_df) ``` ## Pairwise distance matrices A tree-free reduced representation of the evolutionary history of a set of samples is a pairwise distance matrix. Distances can be calculated either using the simulated tree (cophenetic distance metric) or calculated from simulated sequences (TN93). ### Tree-based distance matrix As a phylogeny or genealogy specifies the necessary branch lengths and topology of a tree, it can be used to calculate a distance matrix. ```{r} distance_matrix <- SEEPS::geneology_to_distance_matrix( geneology = geneology$geneology ) print(distance_matrix) ``` ### Sequence-based distance matrix The output of the sequence simulator can be converted directly to a distance matrix, as we did above. ```{r, eval=FALSE} sequences_df <- SEEPS::fasta_string_to_dataframe(sequences_fasta) matrix <- SEEPS::build_distance_matrix_from_df(sequences_df) ``` This gives a matrix. ```{r} print(matrix) ``` SeqGen uses the phylogenetic tree to determine when to branch a sequence, using a GTR+$\Gamma$ model to select for mutations. The tree-based distance matrix approach does not use information about the originating sequence, resulting in different mutational patterns. These distances are rescaled by the length of the sequences (so the entries take values in $[0, \text{len}(\text{root sequence})]$ rather than $[0,1]$), so they can be compared with the number of mutations in the tree-based distance matrix.