Real-time outbreak analysis: Ebola as a case study - part 2


Introduction

This practical is the second of a three-part series simulating the early assessment and reconstruction of an Ebola Virus Disease (EVD) outbreak. Please make sure you have gone through part 1 before starting part 2. In part 2 of the practical, we introduce various aspects of analysis of the early stage of an outbreak, including growth rate estimation, contact tracing data, delays, and estimates of transmissibility. Part 3 of the practical will give an introduction to transmission chain reconstruction using outbreaker2.

Note: This practical is derived from earlier practicals called Ebola simulation part 1: early outbreak assessment and Ebola simulation part 2: outbreak reconstruction

Learning outcomes

By the end of this practical (part 2), you should be able to:

  • Estimate & interpret the growth rate & doubling time of the epidemic

  • Estimate the serial interval from data on pairs infector / infected individuals

  • Estimate & interpret the reproduction number of the epidemic

  • Forecast short-term future incidence

Context: A novel EVD outbreak in a fictional country in West Africa

A new EVD outbreak has been notified in a fictional country in West Africa. The Ministry of Health is in charge of coordinating the outbreak response, and has contracted you as a consultant in epidemic analysis to inform the response in real time. You have already read in the data and carried out a descriptive analysis (part 1 of the practical). Now let’s do some statistical analyses!

Required packages

The following packages, available on CRAN or GitHub, are needed for this analysis. You should have installed them in part 1; if not, install them as follows:

# install.packages("remotes")
# install.packages("readxl")
# install.packages("outbreaks")
# install.packages("incidence")
# remotes::install_github("reconhub/epicontacts@ttree")
# install.packages("distcrete")
# install.packages("epitrix")
# remotes::install_github("annecori/EpiEstim")
# remotes::install_github("reconhub/projections")
# install.packages("ggplot2")
# install.packages("magrittr")
# install.packages("binom")
# install.packages("ape")
# install.packages("outbreaker2")
# install.packages("here")

Once the packages are installed, you may need to open a new R session. Then load the libraries as follows:

library(readxl)
library(outbreaks)
library(incidence)
library(epicontacts)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
library(distcrete)
library(epitrix)
library(EpiEstim)
library(projections)
library(ggplot2)
library(magrittr)
library(binom)
library(ape)
library(outbreaker2)
## 
## Attaching package: 'outbreaker2'

## The following object is masked from 'package:epicontacts':
## 
##     cases_pal
library(here)

Read in the data processed in part 1

i_daily <- readRDS(here("data/clean/i_daily.rds"))
i_weekly <- readRDS(here("data/clean/i_weekly.rds"))
linelist <- readRDS(here("data/clean/linelist.rds"))
linelist_clean <- readRDS(here("data/clean/linelist_clean.rds"))
contacts <- readRDS(here("data/clean/contacts.rds"))

Estimating the growth rate using a log-linear model

The simplest model of incidence is probably the log-linear model, i.e. a linear regression on log-transformed incidences. For this we will work with weekly incidence, to avoid having too many issues with zero incidence (which cannot be logged).
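Under this model, incidence is assumed to grow (or decline) exponentially, so that log-incidence is a linear function of time:

$$\log(I_t) = a + r t$$

where $I_t$ is the incidence at time $t$ and $r$ is the daily growth rate; the doubling time follows as $\log(2)/r$, which is how it is computed below.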

Visualise the log-transformed incidence:

ggplot(as.data.frame(i_weekly)) + 
  geom_point(aes(x = dates, y = log(counts))) + 
  scale_x_incidence(i_weekly) +
  xlab("date") +
  ylab("log weekly incidence") + 
  theme_minimal()

What does this plot tell you about the epidemic?

In the incidence package, the function fit will estimate the parameters of this model from an incidence object (here, i_weekly). Apply it to the data and store the result in a new object called f. You can print and use f to derive estimates of the growth rate r and the doubling time, and add the corresponding model to the incidence plot:

Fit a log-linear model to the weekly incidence data:

f <- incidence::fit(i_weekly)
## Warning in incidence::fit(i_weekly): 1 dates with incidence of 0 ignored
## for fitting
f
## <incidence_fit object>
## 
## $model: regression of log-incidence over time
## 
## $info: list containing the following items:
##   $r (daily growth rate):
## [1] 0.04145251
## 
##   $r.conf (confidence interval):
##           2.5 %     97.5 %
## [1,] 0.02582225 0.05708276
## 
##   $doubling (doubling time in days):
## [1] 16.72148
## 
##   $doubling.conf (confidence interval):
##         2.5 %   97.5 %
## [1,] 12.14285 26.84302
## 
##   $pred: data.frame of incidence predictions (12 rows, 5 columns)
plot(i_weekly, fit = f)

Looking at the plot and fit, do you think this is a reasonable fit?

Finding a suitable threshold date for the log-linear model, based on the observed delays

Using the plot of the log(incidence) that you plotted above, and thinking about why exponential growth may not be observed in the most recent weeks, choose a threshold date and fit the log-linear model to a suitable section of the epicurve where you think we can more reliably estimate the growth rate, r, and the doubling time.

You may want to examine how long after onset of symptoms cases are hospitalised; this may inform the threshold date you choose, as follows:

summary(as.numeric(linelist_clean$date_of_hospitalisation - linelist_clean$date_of_onset))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    2.00    3.53    5.00   22.00
  • Assuming that cases are only ever reported on or after hospitalisation, we see that it takes about 3.5 days on average (but up to 22 days) for cases to be hospitalised, so delays in reporting can be long, and it’s sensible to assume that the last 2-3 weeks of data are likely to be incomplete.

  • Exponential growth is observed only up until early-to-mid June

  • This is likely due to the delay between onset and reporting: cases with the most recent onsets have not yet been reported and are missing from the linelist

  • This can result in potentially erroneous interpretation of the recent trends in incidence when just looking at the epicurve.

# how many weeks should we discard at the end of the epicurve
n_weeks_to_discard <- 2
min_date <- min(i_daily$dates)
max_date <- max(i_daily$dates) - n_weeks_to_discard * 7
# weekly truncated incidence
i_weekly_trunc <- subset(i_weekly, 
                         from = min_date, 
                         to = max_date) # discard last few weeks of data
# daily truncated incidence (not used for the linear regression but may be used later)
i_daily_trunc <- subset(i_daily, 
                         from = min_date, 
                         to = max_date) # remove last two weeks of data

Refit and plot your log-linear model as before but using the truncated data i_weekly_trunc.

f <- incidence::fit(i_weekly_trunc)
f
## <incidence_fit object>
## 
## $model: regression of log-incidence over time
## 
## $info: list containing the following items:
##   $r (daily growth rate):
## [1] 0.04773599
## 
##   $r.conf (confidence interval):
##           2.5 %     97.5 %
## [1,] 0.03141233 0.06405965
## 
##   $doubling (doubling time in days):
## [1] 14.52043
## 
##   $doubling.conf (confidence interval):
##         2.5 %   97.5 %
## [1,] 10.82034 22.06609
## 
##   $pred: data.frame of incidence predictions (11 rows, 5 columns)
plot(i_weekly_trunc, fit = f)

Look at the summary statistics of your fit:

summary(f$model)
## 
## Call:
## stats::lm(formula = log(counts) ~ dates.x, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79781 -0.44508 -0.00138  0.35848  0.69880 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.296579   0.320461   0.925    0.379    
## dates.x     0.047736   0.007216   6.615 9.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5298 on 9 degrees of freedom
## Multiple R-squared:  0.8294, Adjusted R-squared:  0.8105 
## F-statistic: 43.76 on 1 and 9 DF,  p-value: 9.754e-05

You can look at the goodness of fit (adjusted R-squared), the estimated slope (growth rate), and the corresponding doubling time as below:

# does the model fit the data well? 
adjRsq_model_fit <- summary(f$model)$adj.r.squared
# what is the estimated growth rate of the epidemic?
daily_growth_rate <- f$model$coefficients['dates.x']
# confidence interval:
daily_growth_rate_CI <- confint(f$model, 'dates.x', level=0.95)
# what is the doubling time of the epidemic? 
doubling_time_days <- log(2) / daily_growth_rate
# confidence interval (note the rev(): the upper bound of r gives the lower bound of the doubling time):
doubling_time_days_CI <- log(2) / rev(daily_growth_rate_CI)

Although the log-linear model is a simple and quick method for early epidemic assessment, care must be taken to fit only up to the point where there is epidemic growth. Note that it may be difficult to define this point.

  • The log-linear model can’t be readily applied if there are days with no cases, as you can’t take the log of 0. Here we have aggregated to weekly data to avoid this issue; weeks with 0 cases are ignored in the fit (see the illustration below).
  • Here we had quite a few weeks of data to work with, but early in an outbreak a similar analysis may lead to very uncertain estimates of the growth rate and doubling time due to the small sample size.
  • If data on reporting delays had been available, a more informed decision about the number of weeks/days to discard could have been made.
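To illustrate the first point, you can try fitting the daily rather than the weekly incidence: expect a warning that many more dates are ignored, one for each day with zero cases (a quick throwaway check; f_daily_check is not used later):

# fitting the daily incidence drops every day with zero cases,
# which is why we work with weekly counts instead
f_daily_check <- incidence::fit(i_daily_trunc)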

Contact Tracing - Looking at contacts

Contact tracing is one of the pillars of an Ebola outbreak response. It involves identifying and following up any at-risk individuals who have had contact with a known case (i.e. who may have been infected). For Ebola, contacts are followed up for 21 days (the upper bound of the incubation period). This ensures that contacts who become symptomatic can be isolated quickly, reducing the chance of further transmission. We use the full linelist here rather than linelist_clean, in which entries with date errors were discarded: even if a case’s dates are erroneous, its contacts may still be valid.

Using the function make_epicontacts in the epicontacts package, create a new epicontacts object called epi_contacts. Make sure you check the column names of the relevant to and from arguments.

epi_contacts <- make_epicontacts(linelist = linelist, contacts = contacts, 
                                 id = "case_id", # name of identifier in linelist
                                 from = "infector", # name of 'from' col in the contacts
                                 to = "case_id",  # name of 'to' col in the contacts
                                 directed = TRUE)
epi_contacts
## 
## /// Epidemiological Contacts //
## 
##   // class: epicontacts
##   // 169 cases in linelist; 60 contacts;  directed 
## 
##   // linelist
## 
## # A tibble: 169 x 11
##    id    generation date_of_infecti… date_of_onset date_of_hospita…
##    <chr>      <dbl> <date>           <date>        <date>          
##  1 d1fa…          0 NA               2014-04-07    2014-04-17      
##  2 5337…          1 2014-04-09       2014-04-15    2014-04-20      
##  3 f5c3…          1 2014-04-18       2014-04-21    2014-04-25      
##  4 6c28…          2 NA               2014-04-27    2014-04-27      
##  5 0f58…          2 2014-04-22       2014-04-26    2014-04-29      
##  6 4973…          0 2014-03-19       2014-04-25    2014-05-02      
##  7 f914…          3 NA               2014-05-03    2014-05-04      
##  8 881b…          3 2014-04-26       2014-05-01    2014-05-05      
##  9 e66f…          2 NA               2014-04-21    2014-05-06      
## 10 20b6…          3 NA               2014-05-05    2014-05-06      
## # … with 159 more rows, and 6 more variables: date_of_outcome <date>,
## #   outcome <chr>, gender <chr>, hospital <chr>, lon <dbl>, lat <dbl>
## 
##   // contacts
## 
## # A tibble: 60 x 3
##    from   to     source 
##    <chr>  <chr>  <chr>  
##  1 d1fafd 53371b other  
##  2 f5c3d8 0f58c4 other  
##  3 0f58c4 881bd4 other  
##  4 f5c3d8 d58402 other  
##  5 20b688 d8a13d funeral
##  6 2ae019 a3c8b8 other  
##  7 20b688 974bc1 funeral
##  8 2ae019 72b905 funeral
##  9 40ae5f b8f2fd funeral
## 10 f1f60f 09e386 other  
## # … with 50 more rows

You can easily plot these contacts, but with a little bit of tweaking (see ?vis_epicontacts) you can customise, for example, node shapes by gender and edge colours by source of exposure (or other variables of interest):

# for example, look at the reported source of infection of the contacts.
table(epi_contacts$contacts$source, useNA = "ifany")
## 
## funeral   other 
##      20      40
p <- plot(epi_contacts, node_shape = "gender", shapes = c(m = "male", f = "female"), node_color = "gender", edge_color = "source", selector = FALSE)
p

Using the function match (see ?match), check that the visualised contacts are indeed cases.

match(contacts$case_id, linelist$case_id)
##  [1]   2   5   8  14  15  16  18  20  21  22  24  25  26  27  30  33  37
## [18]  38  40  46  48  51  58  59  62  64  68  69  70  71  73  75  79  84
## [35]  86  88  90  94  95  96  98 103 108 115 116 122 126 131 133 142 145
## [52] 146 147 148 153 155 157 160 162 166
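All values returned are indices (no NAs), so every contact matches a case in the linelist. You can run a similar check on the infectors; a small sketch (assuming, as in the make_epicontacts call above, that infector identifiers are stored in the infector column of contacts):

# TRUE if every infector is also a known case in the linelist
all(!is.na(match(contacts$infector, linelist$case_id)))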

Once you are happy that these are all indeed cases, look at the network:

  • does it look like there is super-spreading (heterogeneous transmission)?

  • looking at the gender of the cases, can you deduce anything from this? Are there any visible differences by gender?

  • There seems to be some superspreading, with several secondary cases linked to a single case (11f8ea infected 5 other individuals). There do not seem to be any obvious differences between genders.

Estimating transmissibility ($R$)

Branching process model

The transmissibility of the disease can be assessed through the estimation of the reproduction number R, defined as the expected number of secondary cases per infected case. In the early stages of an outbreak, and assuming a large population with no immunity, this quantity is also the basic reproduction number $R_0$, i.e. $R$ in a large fully susceptible population.

The package EpiEstim implements a Bayesian estimation of $R$, using dates of onset of symptoms and information on the serial interval distribution, i.e. the distribution of the time between symptom onset in an infector and symptom onset in the cases they infect (see Cori et al., 2013, AJE 178: 1505–1512).

Briefly, EpiEstim uses a simple model describing incidence on a given day as Poisson distributed, with a mean determined by the total force of infection on that day:

$$I_t \sim \mathrm{Poisson}(\lambda_t)$$

where $I_t$ is the incidence (based on symptom onset) on day $t$ and $\lambda_t$ is the force of infection on that day. Denoting $R$ the reproduction number and $w()$ the discrete serial interval distribution, we have:

$$\lambda_t = R \sum_{s=1}^t I_s w(t-s)$$

The likelihood (probability of observing the data given the model and parameters) is defined as a function of R:

$$L(I) = p(I|R) = \prod_{t = 1}^{T} f_P(I_t, \lambda_t)$$

where $f_P(.,\mu)$ is the probability mass function of a Poisson distribution with mean $\mu$.
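Note that, as a function of $R$, this likelihood is proportional to $R^{\sum_t I_t} e^{-R \sum_t \Lambda_t}$, where $\Lambda_t = \sum_{s=1}^t I_s w(t-s)$. A gamma prior on $R$ is therefore conjugate, and the posterior distribution of $R$ is itself a gamma distribution; this is the property used later in this practical when sampling plausible values of $R$.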

Estimating the serial interval

As data were collected on pairs of infector and infected individuals, these should be sufficient to estimate the serial interval distribution. Had that not been the case, we could have used estimates from past outbreaks instead.

The function get_pairwise can be used to extract the serial interval (i.e. the difference in dates of onset between infectors and infected individuals):

si_obs <- get_pairwise(epi_contacts, "date_of_onset")
summary(si_obs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.500   9.117  12.250  30.000
hist(si_obs, breaks = 0:30,
     xlab = "Days after symptom onset", ylab = "Frequency",
     main = "Serial interval (empirical distribution)",
     col = "grey", border = "white")

What do you think of this distribution? Make any adjustment you would deem necessary, and then use the function fit_disc_gamma from the package epitrix to fit a discretised Gamma distribution to these data. Your results should approximately look like:

si_fit <- fit_disc_gamma(si_obs, w = 1)
si_fit
## $mu
## [1] 8.612892
## 
## $cv
## [1] 0.7277355
## 
## $sd
## [1] 6.267907
## 
## $ll
## [1] -183.4215
## 
## $converged
## [1] TRUE
## 
## $distribution
## A discrete distribution
##   name: gamma
##   parameters:
##     shape: 1.88822148063956
##     scale: 4.5613778865727

si_fit contains various information about the fitted delay distribution, including the estimated distribution in the $distribution slot. You can compare this distribution to the empirical data in the following plot:

si <- si_fit$distribution
si
## A discrete distribution
##   name: gamma
##   parameters:
##     shape: 1.88822148063956
##     scale: 4.5613778865727
## compare fitted distribution
hist(si_obs, xlab = "Days after symptom onset", ylab = "Density",
     main = "Serial interval: fit to data", col = "salmon", border = "white",
     ylim = c(0, 0.15), freq = FALSE, breaks = 0:35)
points(0:60, si$d(0:60), col = "#9933ff", pch = 20)
points(0:60, si$d(0:60), col = "#9933ff", type = "l", lty = 2)

Would you trust this estimate of the serial interval? How would you compare it to actual estimates from the West African EVD outbreak (WHO Ebola Response Team (2014) NEJM 371:1481–1495), which had a mean of 15.3 days and a standard deviation of 9.3 days?

  • skewed distribution with much shorter mean than estimated in NEJM 371:1481–1495
  • large majority of pairs have SI <15 days
  • cases may remember more precisely their recent exposures which may lead to underestimation of the serial interval
  • when estimating the serial interval in real-time, longer serial intervals may not yet have been observed because of right censoring
  • this estimate is based on few observations so there is uncertainty on the serial interval estimates

Estimation of the Reproduction Number

Now that we have estimates of the serial interval, we can use this information to estimate transmissibility of the disease (as measured by $R_0$). Make sure you use a daily (not weekly) incidence object truncated to the period where you have decided there is exponential growth (i_daily_trunc).

Using the estimates of the mean and standard deviation of the serial interval you just obtained, use the function estimate_R to estimate the reproduction number (see ?estimate_R) and store the result in a new object R.

Before using estimate_R, you need to create a config object with the make_config function, in which you specify the time window over which to estimate the reproduction number, as well as the mean_si and std_si to use. For the time window, use t_start = 2 (you can only estimate R from day 2 onwards, as you are conditioning on the past observed incidence) and t_end = length(i_daily_trunc$counts) to estimate R up to the last date of your truncated incidence i_daily_trunc.

config <- make_config(mean_si = si_fit$mu, # mean of the si distribution estimated earlier
                      std_si = si_fit$sd,  # standard deviation of si dist estimated earlier
                      t_start = 2,         # starting day of time window
                      t_end = length(i_daily_trunc$counts)) # final day of time window
R <- # use estimate_R using method = "parametric_si"
plot(R, legend = FALSE)  
R <- estimate_R(i_daily_trunc, method = "parametric_si", config = config)
plot(R, legend = FALSE)

## TableGrob (3 x 1) "arrange": 3 grobs
##       z     cells    name           grob
## incid 1 (1-1,1-1) arrange gtable[layout]
## R     2 (2-2,1-1) arrange gtable[layout]
## SI    3 (3-3,1-1) arrange gtable[layout]

Extract the median and 95% credible intervals (95%CrI) for the reproduction number as follows:

R_median <- R$R$`Median(R)`
R_CrI <- c(R$R$`Quantile.0.025(R)`, R$R$`Quantile.0.975(R)`)
R_median
## [1] 1.278192
R_CrI
## [1] 1.068374 1.513839

Interpret these results: what do you make of the reproduction number? What does it reflect? Based on the last part of the epicurve, some colleagues suggest that incidence is going down and the outbreak may be under control. What is your opinion on this?

  • Note that these results are highly dependent on the estimated serial interval: a higher mean SI will lead to higher R estimates (see the sensitivity check below).
  • The results will also be sensitive to the number of data points discarded towards the end of the available data.
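As a quick sensitivity check (an illustrative sketch, not part of the original analysis), you could re-estimate $R$ using the West African EVD serial interval estimates quoted earlier (mean 15.3 days, sd 9.3 days) and compare with your estimates; config_who and R_who_si are throwaway names:

# re-estimate R with a longer serial interval: expect a higher estimate
config_who <- make_config(mean_si = 15.3,
                          std_si = 9.3,
                          t_start = 2,
                          t_end = length(i_daily_trunc$counts))
R_who_si <- estimate_R(i_daily_trunc, method = "parametric_si",
                       config = config_who)
R_who_si$R$`Median(R)` # compare with R_median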

Note that you could have estimated R0 directly from the growth rate and the serial interval, using the formula described in Wallinga and Lipsitch, Proc Biol Sci, 2007: $R_0 = \frac{1}{\int_{s=0}^{+\infty}e^{-rs}w(s)ds}$, and implemented in the function r2R0 of the epitrix package. Although this may seem like a complicated formula, the reasoning behind it is simple and illustrated in the figure below: for an observed exponentially growing incidence curve, if you know the serial interval, you can derive the reproduction number.

Figure: estimating R0 from the growth rate and the serial interval.
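As a quick point estimate, you could also apply r2R0 directly to the estimated growth rate (a minimal sketch; daily_growth_rate and si were created above, and the serial interval distribution is truncated at 100 days as in the sampling approach below):

# point estimate of R0 from the growth rate and the serial interval
R0_point <- r2R0(daily_growth_rate, si$d(1:100))
R0_point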

Unlike in the idealised figure above, in practice there is uncertainty in the growth rate r, and the serial interval has a full distribution rather than a single value. This can be accounted for when estimating R as follows:

# generate a sample of R0 estimates from the growth rate and serial interval we estimated above 
R_sample_from_growth_rate <- lm2R0_sample(f$model, # log-linear model which contains our estimates of the growth rate r
                                          si$d(1:100), # serial interval distribution (truncated after 100 days)
                                          n = 1000) # desired sample size
# plot these:
hist(R_sample_from_growth_rate)

# what is the median?
R_median_from_growth_rate <- median(R_sample_from_growth_rate)
R_median_from_growth_rate # compare with R_median
## [1] 1.416094
# what is the 95%CI?
R_CI_from_growth_rate <- quantile(R_sample_from_growth_rate, c(0.025, 0.975))
R_CI_from_growth_rate # compare with R_CrI
##     2.5%    97.5% 
## 1.284975 1.578711

Note that the above estimates are slightly different from those obtained using the branching process model. There are a few reasons for this. First, you used more detailed data (daily vs weekly incidence) for the branching process (EpiEstim) estimate. Furthermore, the log-linear model puts the same weight on all data points, whereas the branching process model puts a different weight on each data point (depending on the incidence observed at each time step). This may lead to slightly different R estimates.

Short-term incidence forecasting

The function project from the package projections can be used to simulate plausible epidemic trajectories by simulating daily incidence using the same branching process as the one used to estimate $R$ in EpiEstim. All that is needed is one or several values of $R$, and a serial interval distribution stored as a distcrete object.

Here, we first illustrate how we can simulate 5 random trajectories using a fixed value of $R$ = 1.28, the median estimate of R from above.
Use the same daily truncated incidence object as above to simulate future incidence.

#?project
small_proj <- project(i_daily_trunc,# incidence object 
                      R = R_median, # R estimate to use
                      si = si,      # serial interval distribution
                      n_sim = 5,    # simulate 5 trajectories
                      n_days = 10,  # over 10 days
                      R_fix_within = TRUE) # keep the same value of R every day

# look at each projected trajectory (as columns):
as.matrix(small_proj)
##            [,1] [,2] [,3] [,4] [,5]
## 2014-06-18    4    0    4    1    2
## 2014-06-19    3    6    4    4    2
## 2014-06-20    4    4    5    2    3
## 2014-06-21    6    5    8    7    1
## 2014-06-22    9    2    5    6    5
## 2014-06-23    5    5    7    6    5
## 2014-06-24    7    4    2    7    2
## 2014-06-25    7    4    2    3    5
## 2014-06-26    2    6    3    4    3
## 2014-06-27    8    6    8    4    3
  • You can either use a single value of R for an entire trajectory (R_fix_within = TRUE) or resample R at each time step (R_fix_within = FALSE).
  • R_fix_within = TRUE means that each trajectory is associated with a single R value, which makes trajectories easier to interpret.
  • It also retains extreme values of R across whole trajectories, resulting in more conservative (wider) projections.

Using the same principle, generate 1,000 trajectories for the next 2 weeks, using a range of plausible values of $R$.
The posterior distribution of R is gamma distributed (see Cori et al. AJE 2013) so you can use the rgamma function to randomly draw values from that distribution. You will also need to use the function gamma_mucv2shapescale from the epitrix package as shown below.

sample_R <- function(R, n_sim = 1000)
{
  mu <- R$R$`Mean(R)`
  sigma <- R$R$`Std(R)`
  Rshapescale <- gamma_mucv2shapescale(mu = mu, cv = sigma / mu)
  R_sample <- rgamma(n_sim, shape = Rshapescale$shape, scale = Rshapescale$scale)
  return(R_sample)
}

R_sample <- sample_R(R, 1000) # sample 1000 values of R from the posterior distribution
hist(R_sample, col = "grey")  # plot histogram of sample
abline(v = R_median, col = "red") # show the median estimated R as red solid vertical line
abline(v = R_CrI, col = "red", lty = 2) # show the 95%CrI of R as red dashed vertical lines

Store the results of your new projections in an object called proj.

proj <- project(x = i_daily_trunc, 
                R = R_sample, # now using a sample of R values
                si = si, 
                n_sim = 1000, 
                n_days = 14, # project over 2 weeks
                R_fix_within = TRUE)

You can visualise the projections as follows:

plot(i_daily_trunc) %>% add_projections(proj, c(0.025, 0.5, 0.975))

How would you interpret the following summary of the projections?

apply(proj, 1, summary)
##         2014-06-18 2014-06-19 2014-06-20 2014-06-21 2014-06-22 2014-06-23
## Min.         0.000      0.000      0.000       0.00      0.000      0.000
## 1st Qu.      2.000      3.000      3.000       3.00      3.000      3.000
## Median       4.000      4.000      4.000       4.00      4.000      4.000
## Mean         3.829      4.167      4.121       4.31      4.534      4.602
## 3rd Qu.      5.000      5.250      5.000       6.00      6.000      6.000
## Max.        12.000     12.000     13.000      14.00     13.000     13.000
##         2014-06-24 2014-06-25 2014-06-26 2014-06-27 2014-06-28 2014-06-29
## Min.         0.000      0.000      0.000      0.000      0.000      0.000
## 1st Qu.      3.000      3.000      3.000      3.000      3.000      4.000
## Median       5.000      5.000      5.000      5.000      5.000      5.000
## Mean         4.716      4.921      5.052      5.266      5.287      5.696
## 3rd Qu.      6.000      6.000      7.000      7.000      7.000      7.000
## Max.        13.000     16.000     15.000     17.000     17.000     18.000
##         2014-06-30 2014-07-01
## Min.         0.000      0.000
## 1st Qu.      4.000      4.000
## Median       5.000      6.000
## Mean         5.709      5.987
## 3rd Qu.      7.000      8.000
## Max.        29.000     18.000
apply(proj, 1, function(x) mean(x > 0)) # proportion of trajectories with at least one case on each given day
## 2014-06-18 2014-06-19 2014-06-20 2014-06-21 2014-06-22 2014-06-23 
##      0.976      0.986      0.981      0.984      0.987      0.987 
## 2014-06-24 2014-06-25 2014-06-26 2014-06-27 2014-06-28 2014-06-29 
##      0.986      0.994      0.991      0.991      0.986      0.986 
## 2014-06-30 2014-07-01 
##      0.992      0.993

apply(proj, 1, mean) # mean daily number of cases 
## 2014-06-18 2014-06-19 2014-06-20 2014-06-21 2014-06-22 2014-06-23 
##      3.829      4.167      4.121      4.310      4.534      4.602 
## 2014-06-24 2014-06-25 2014-06-26 2014-06-27 2014-06-28 2014-06-29 
##      4.716      4.921      5.052      5.266      5.287      5.696 
## 2014-06-30 2014-07-01 
##      5.709      5.987
apply(apply(proj, 2, cumsum), 1, summary) # projected cumulative number of cases over the next two weeks
##         2014-06-18 2014-06-19 2014-06-20 2014-06-21 2014-06-22 2014-06-23
## Min.         0.000      1.000      2.000      5.000      6.000      9.000
## 1st Qu.      2.000      6.000      9.000     13.000     17.000     21.000
## Median       4.000      8.000     12.000     16.000     20.000     25.000
## Mean         3.829      7.996     12.117     16.427     20.961     25.563
## 3rd Qu.      5.000     10.000     15.000     20.000     25.000     30.000
## Max.        12.000     17.000     26.000     33.000     42.000     55.000
##         2014-06-24 2014-06-25 2014-06-26 2014-06-27 2014-06-28 2014-06-29
## Min.        10.000      11.00     14.000     16.000     16.000     17.000
## 1st Qu.     25.000      28.75     32.000     37.000     41.000     46.000
## Median      30.000      34.00     39.000     45.000     50.000     55.000
## Mean        30.279      35.20     40.252     45.518     50.805     56.501
## 3rd Qu.     36.000      41.00     47.000     53.000     59.000     66.000
## Max.        65.000      72.00     86.000    100.000    113.000    129.000
##         2014-06-30 2014-07-01
## Min.         19.00     23.000
## 1st Qu.      50.00     55.000
## Median       60.00     66.000
## Mean         62.21     68.197
## 3rd Qu.      73.00     80.000
## Max.        158.00    173.000
  • apply(proj, 1, summary) shows a summary of incidence on each day
  • apply(proj, 1, function(x) mean(x > 0)) shows the proportion of trajectories with at least one case on each given day
  • apply(proj, 1, mean) shows the mean daily number of cases
  • apply(apply(proj, 2, cumsum), 1, summary) shows the projected additional cumulative number of cases in the next two weeks

According to these results, what are the chances that more cases will appear in the near future? Is this outbreak being brought under control? Would you recommend scaling up / down the response? Is this consistent with your estimate of $R$?

  • the uncertainty is wide, and becomes wider the further we project into the future
  • the central trend suggests increasing number of cases
  • this is based on the assumption that transmissibility has remained constant over the course of the outbreak so far and will remain constant in the future
  • all this relies on our estimated serial interval distribution—a higher mean SI would lead to larger R estimates and therefore more pessimistic incidence projections.

Pause!

Please let a demonstrator know when you have reached this point before proceeding further.

Accounting for uncertainty in the serial interval estimates when estimating the reproduction number

Note that this section is independent of the following ones; please skip it if you don’t have time.

EpiEstim allows you to explicitly account for the uncertainty in the serial interval estimates that arises from the limited sample size of infector/infected pairs. Note that it also allows accounting for uncertainty in the dates of symptom onset for each of these pairs (not needed here).

Use the method = "si_from_data" option in estimate_R. To use this option, you need to create a data frame with 4 columns: EL, ER, SL and SR, for the left (L) and right (R) bounds of the observed times of symptom onset in the infector (E) and infected (S for secondary) cases. Here we derive this from si_obs as follows:

si_data <- data.frame(EL = rep(0L, length(si_obs)), 
                      ER = rep(1L, length(si_obs)), 
                      SL = si_obs, SR = si_obs + 1L)

We can then feed this into estimate_R (but this will take some time to run, as it estimates the serial interval distribution using MCMC and fully accounts for the uncertainty in the serial interval when estimating the reproduction number).

config <- make_config(t_start = 2, 
                      t_end = length(i_daily_trunc$counts))
R_variableSI <- estimate_R(i_daily_trunc, method = "si_from_data", 
                           si_data = si_data,
                           config = config)
## Running 8000 MCMC iterations 
## MCMCmetrop1R iteration 1 of 8000 
## function value = -186.93263
## theta = 
##    0.61048
##    1.53754
## Metropolis acceptance rate = 1.00000
## 
## MCMCmetrop1R iteration 1001 of 8000 
## function value = -187.66955
## theta = 
##    0.60839
##    1.72039
## Metropolis acceptance rate = 0.55145
## 
## MCMCmetrop1R iteration 2001 of 8000 
## function value = -186.33760
## theta = 
##    0.84673
##    1.39242
## Metropolis acceptance rate = 0.56322
## 
## MCMCmetrop1R iteration 3001 of 8000 
## function value = -186.84716
## theta = 
##    0.95161
##    1.19723
## Metropolis acceptance rate = 0.56448
## 
## MCMCmetrop1R iteration 4001 of 8000 
## function value = -187.56424
## theta = 
##    0.86940
##    1.46982
## Metropolis acceptance rate = 0.55711
## 
## MCMCmetrop1R iteration 5001 of 8000 
## function value = -186.49130
## theta = 
##    0.76777
##    1.49876
## Metropolis acceptance rate = 0.55449
## 
## MCMCmetrop1R iteration 6001 of 8000 
## function value = -186.49278
## theta = 
##    0.86552
##    1.39227
## Metropolis acceptance rate = 0.55257
## 
## MCMCmetrop1R iteration 7001 of 8000 
## function value = -186.98037
## theta = 
##    1.00900
##    1.19664
## Metropolis acceptance rate = 0.55092
## 
## 
## 
## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
## The Metropolis acceptance rate was 0.55012
## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
## 
## Gelman-Rubin MCMC convergence diagnostic was successful.
## 
## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 
## Estimating the reproduction number for these serial interval estimates...
##  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# checking that the MCMC converged: 
R_variableSI$MCMC_converged
## [1] TRUE
# and plotting the output:
plot(R_variableSI)

## TableGrob (3 x 1) "arrange": 3 grobs
##       z     cells    name           grob
## incid 1 (1-1,1-1) arrange gtable[layout]
## R     2 (2-2,1-1) arrange gtable[layout]
## SI    3 (3-3,1-1) arrange gtable[layout]

Look at the new median estimated R and 95%CrI: how different are they from your previous estimates? Do you think the size of the contacts dataset has had an impact on your results?

R_variableSI_median <- R_variableSI$R$`Median(R)`
R_variableSI_CrI <- c(R_variableSI$R$`Quantile.0.025(R)`, R_variableSI$R$`Quantile.0.975(R)`)
R_variableSI_median
## [1] 1.296516
R_variableSI_CrI
## [1] 1.078988 1.544788

Estimating time-varying transmissibility

When the assumption that $R$ is constant over time becomes untenable, an alternative is to estimate time-varying transmissibility using the instantaneous reproduction number $R_t$. This approach, introduced by Cori et al. (2013), is also implemented in the package EpiEstim. It estimates $R_t$ over custom time windows (the default is a succession of sliding time windows), using the same Poisson likelihood described above. In the following, we estimate transmissibility over 1-week sliding time windows (the default of estimate_R):

config <- make_config(list(mean_si = si_fit$mu, std_si = si_fit$sd))
# t_start and t_end are automatically set to estimate R over 1-week sliding windows by default. 
Rt <-         # use estimate_R using method = "parametric_si"
  
# look at the most recent Rt estimates:
tail(Rt$R[, c("t_start", "t_end", "Median(R)", 
             "Quantile.0.025(R)", "Quantile.0.975(R)")])
Rt <- estimate_R(incid = i_daily_trunc,      # incidence object
                 method = "parametric_si",   # use parametric serial interval
                 config = config)            # config specified above
## Default config will estimate R on weekly sliding windows.
##     To change this change the t_start and t_end arguments.
# look at the most recent Rt estimates:
tail(Rt$R[, c("t_start", "t_end", "Median(R)", 
             "Quantile.0.025(R)", "Quantile.0.975(R)")])
##    t_start t_end Median(R) Quantile.0.025(R) Quantile.0.975(R)
## 60      61    67 1.2417304         0.8144152          1.797603
## 61      62    68 1.0045473         0.6318309          1.501326
## 62      63    69 0.8404935         0.5074962          1.294848
## 63      64    70 1.0276438         0.6538993          1.522464
## 64      65    71 1.0335607         0.6576643          1.531230
## 65      66    72 1.0337804         0.6578041          1.531556

Plot the estimate of $R$ over time:

plot(Rt, legend = FALSE)
## Warning: Removed 1 rows containing missing values (geom_path).

## TableGrob (3 x 1) "arrange": 3 grobs
##       z     cells    name           grob
## incid 1 (1-1,1-1) arrange gtable[layout]
## R     2 (2-2,1-1) arrange gtable[layout]
## SI    3 (3-3,1-1) arrange gtable[layout]

How would you interpret this result? What is the caveat of this representation?

What would you have concluded if, instead of using i_daily_trunc as above, you had used the entire epidemic curve, i.e. i_daily?

Rt_whole_incid <-             # use estimate_R using method = "parametric_si", 
                              # the same config as above but i_daily instead of i_daily_trunc
  
# look at the most recent Rt estimates:
tail(Rt_whole_incid$R[, c("t_start", "t_end", 
                         "Median(R)", "Quantile.0.025(R)", "Quantile.0.975(R)")])  
Rt_whole_incid <- estimate_R(incid = i_daily, 
                             method = "parametric_si", 
                             config = config) 
## Default config will estimate R on weekly sliding windows.
##     To change this change the t_start and t_end arguments.

## Warning in estimate_R_func(incid = incid, method = method, si_sample = si_sample, : You're estimating R too early in the epidemic to get the desired
##             posterior CV.
tail(Rt_whole_incid$R[, c("t_start", "t_end", 
                         "Median(R)", "Quantile.0.025(R)", "Quantile.0.975(R)")])  
##    t_start t_end Median(R) Quantile.0.025(R) Quantile.0.975(R)
## 74      75    81 1.2330741         0.8412787         1.7310657
## 75      76    82 1.0008292         0.6564151         1.4488601
## 76      77    83 0.9432201         0.6128291         1.3753773
## 77      78    84 0.8202251         0.5158976         1.2258508
## 78      79    85 0.7452526         0.4566772         1.1356909
## 79      80    86 0.5146158         0.2811874         0.8515131
# the above incorrectly infers that the recent transmissibility is <1

  • the estimates above assume a constant R within each sliding time window (see the plot below for a visual comparison)
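To see this visually, you can plot the estimates obtained from the full (untruncated) incidence and compare with the plot of Rt above; the apparent decline in $R_t$ at the end is driven by recent cases that have not yet been reported:

plot(Rt_whole_incid, legend = FALSE)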

Save data and outputs

This is the end of part 2 of the practical. Before going on to part 3, you’ll need to save the following objects:

saveRDS(linelist, here("data/clean/linelist.rds"))
saveRDS(linelist_clean, here("data/clean/linelist_clean.rds"))
saveRDS(epi_contacts, here("data/clean/epi_contacts.rds"))
saveRDS(si, here("data/clean/si.rds"))