COVID19.Analytics

Introduction

The “covid19.analytics” R package allows users to obtain live* worldwide data on the novel Coronavirus Disease originally reported in 2019, CoViD-19, as published by the JHU CSSE repository [1], and provides basic analysis tools and functions to investigate these datasets.

The goal of this package is to make the latest data promptly available to researchers and the scientific community.

Data Accessibility

The covid19.data() function allows users to obtain real-time data about the reported CoViD-19 cases from the JHU CSSE repository, in the following modalities:
* “aggregated” data for the latest day, with fine geographical granularity (i.e. cities, provinces, states, countries)
* “time series” data for larger accumulated geographical regions (provinces/countries)

The datasets also include information about the different categories (status) “confirmed”/“deaths”/“recovered” of the cases reported daily per country/region/city.

This data-acquisition function will first attempt to retrieve the data, with the latest updates, directly from the JHU repository. If for whatever reason this fails (e.g. problems with the connection), the package will load a preserved “image” of the data. This image is not the latest one, but it still allows the user to explore the older dataset. In this way, the package offers a more robust and resilient approach to the quite dynamic situation regarding data availability and integrity.
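The "try live data first, then fall back to a preserved copy" behaviour described above can be sketched in base R as follows. This is only an illustration of the pattern, not the package's actual implementation: the helper read.with.fallback and the toy preserved data frame are hypothetical, and a non-existent local file stands in for a failed connection.

```r
# Hypothetical helper (NOT part of covid19.analytics): try a live source first,
# then fall back to a locally preserved copy if reading the live source fails.
read.with.fallback <- function(live.source, preserved.copy) {
  tryCatch(
    {
      live <- suppressWarnings(read.csv(live.source, stringsAsFactors = FALSE))
      list(data = live, origin = "live")
    },
    error = function(e) {
      message("Live retrieval failed (", conditionMessage(e), "); using preserved copy")
      list(data = preserved.copy, origin = "preserved")
    }
  )
}

# toy "preserved image" of the data (values taken from the sample report in this document)
preserved <- data.frame(Country.Region = c("Italy", "Spain"),
                        Confirmed = c(124632, 126168))

# a path that does not exist stands in for a failed connection
result <- read.with.fallback(file.path(tempdir(), "no-such-file.csv"), preserved)
print(result$origin)
```

Returning the origin together with the data lets the caller tell whether it is looking at the latest records or at the older preserved copy.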

Data retrieval options

argument              description
--------------------  ----------------------------------------------------------------------
aggregated            latest number of cases aggregated by country
Time Series data
  ts-confirmed        time series data of confirmed cases
  ts-deaths           time series data of fatal cases
  ts-recovered        time series data of recovered cases
  ts-ALL              all time series data combined
Deprecated data formats
  ts-dep-confirmed    time series data of confirmed cases as originally reported (deprecated)
  ts-dep-deaths       time series data of deaths as originally reported (deprecated)
  ts-dep-recovered    time series data of recovered cases as originally reported (deprecated)
Combined
  ALL                 all of the above

covid19-Sequencing data

The covid19.genomic.data() function allows users to obtain the covid19’s genomic sequencing data from NCBI [2].

Analytical & Graphical Indicators

In addition to the access and retrieval of the data, the package includes some basic functions to estimate totals per region/country/city, growth rates and daily changes in the reported number of cases.

Overview of the Main Functions from the “covid19.analytics” Package

Function: description. Output: main type of output.
Data Acquisition
* covid19.data: obtains live* worldwide data for the covid19 virus from the JHU CSSE repository [1]. Output: data frames/list with the collected data.
* covid19.genomic.data: obtains covid19’s genomic sequencing data from NCBI [2]. Output: list, with the RNA seq data in the “$NC_045512.2” entry.
Analysis
* report.summary: summarizes the current situation; downloads the latest data and summarizes different quantities. Output: on-screen table and static plots (pie and bar plots) with the reported information; can also output the tables into a text file.
* tots.per.location: computes totals per region and plots time series for that specific region/country. Output: static plots with data + models (exp/linear, Poisson, Gamma); mosaic and histograms when more than one location is selected.
* growth.rate: computes changes and growth rates per region and plots time series for that specific region/country. Output: static plots with data + models (linear, Poisson, exp); mosaic and histograms when more than one location is selected.
Graphics and Visualization
* totals.plt: plots the total number of cases per day in static and interactive plots; the user can specify multiple locations or global totals. Output: static and interactive plot.
* live.map: generates an interactive map displaying cases around the world. Output: static and interactive plot.
Modelling
* generate.SIR.model: generates a SIR (Susceptible-Infected-Recovered) model. Output: list containing the fits for the SIR model.
* plt.SIR.model: plots the results from the SIR model. Output: static and interactive plots.

Details and Specifications of the Analytical & Visualization Functions

Reports

The report.summary() function generates an overall report summarizing the different datasets. It can summarize the “Time Series” data (cases.to.process="TS"), the “aggregated” data (cases.to.process="AGG") or both (cases.to.process="ALL"). It will display the top 10 entries in each category, or the number indicated in the Nentries argument. To display all the records, set Nentries=0.

In each case (“TS” and/or “AGG”) the function will present tables ordered by the different cases included, i.e. confirmed infected, deaths, recovered and active cases.

The date when the report is generated and the date of the recorded data will be included at the beginning of each table.

It will also compute the totals, averages, standard deviations and percentages of various quantities:
* it will determine the number of unique locations processed within the dataset
* it will compute the total number of cases per case type

Typical structure of a report.summary() output for the Time Series data:

############################################################################### 
  ##### TS-CONFIRMED Cases  -- Data dated:  2020-04-04  ::  2020-04-05 17:27:17 
################################################################################ 
  Number of Countries/Regions reported:  181 
  Number of Cities/Provinces reported:  82 
  Unique number of geographical locations combined: 259 
-------------------------------------------------------------------------------- 
  Worldwide  ts-confirmed  Totals: 1197405 
-------------------------------------------------------------------------------- 
    Country.Region Province.State Totals GlobalPerc LastDayChange
1             US                308850      25.79         33264
2          Spain                126168      10.54          6969
3          Italy                124632      10.41          4805
4        Germany                 96092       8.03          4933
-------------------------------------------------------------------------------- 
  Global Perc. Average:  0.39 (sd: 2.02) 
  Global Perc. Average in top  10 :  7.98 (sd: 7) 
-------------------------------------------------------------------------------- 
.
.
.
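As a worked check of the sample report above, the GlobalPerc column appears consistent with each location's total divided by the worldwide total, expressed as a percentage. The snippet below reproduces that arithmetic with the numbers copied from the sample; the variable names are illustrative and this is not the package's internal code. The last lines rest on an assumption: that the "Global Perc. Average" is taken over the 259 unique locations reported.

```r
# totals copied from the sample TS-CONFIRMED report above
totals <- c(US = 308850, Spain = 126168, Italy = 124632, Germany = 96092)
worldwide.total <- 1197405

# percentage of the worldwide total, rounded as in the report
global.perc <- round(100 * totals / worldwide.total, 2)
print(global.perc)

# ASSUMPTION: the average percentage is taken over the 259 unique locations;
# percentages over all locations add up to 100, so their mean is 100/259
n.locations <- 259
avg.perc <- round(100 / n.locations, 2)
print(avg.perc)
```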

Typical structure of a report.summary() output for the Aggregated data:

########################################################################################################################## 
  ##### AGGREGATED Data  -- ORDERED BY  CONFIRMED Cases  -- Data dated:  2020-04-04  ::  2020-04-05 17:27:19 
########################################################################################################################## 
  Number of Countries/Regions reported: 181 
  Number of Cities/Provinces reported: 137 
  Unique number of geographical locations combined: 316 
-------------------------------------------------------------------------------------------------------------------------- 
     Country_Region Province_State Confirmed Perc.Confirmed Deaths Perc.Deaths Recovered Perc.Recovered Active Perc.Active
1          Spain                   126168          10.54  11947        9.47     34219          27.12  80002       63.41
2          Italy                   124632          10.41  15362       12.33     20996          16.85  88274       70.83
3        Germany                    96092           8.03   1444        1.50     26400          27.47  68248       71.02
.
.
.

A full example of this report for today can be seen here (updated twice a day).

In addition to this, the function will also generate some graphical outputs, including pie and bar charts representing the top regions in each category.

Totals per Location & Growth Rate

It is possible to dive deeper into a particular location by using the tots.per.location() and growth.rate() functions. These functions are capable of processing different types of data, as long as these are “Time Series” data. They can either focus on one category (e.g. “TS-confirmed”, “TS-recovered”, “TS-deaths”) or all of them (“TS-all”). When these functions detect different types of categories, each category will be processed separately. Similarly, the functions can take multiple locations, i.e. just one, several or even “all” the locations within the data. The locations can be countries, regions, provinces or cities. If a specified location includes multiple entries, e.g. a country that has several cities reported, the functions will group them and process all these regions as the location requested.
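The grouping of multiple entries under one requested location can be pictured with a small base R sketch. The table below is a made-up toy whose column names merely imitate a time series layout, and totals.for is a hypothetical helper, not a function of the package.

```r
# toy time series table; values are made up for illustration
ts.toy <- data.frame(
  Province.State = c("Ontario", "Quebec", "", ""),
  Country.Region = c("Canada", "Canada", "Italy", "Uruguay"),
  X2020.04.03 = c(10, 20, 100, 1),
  X2020.04.04 = c(15, 30, 120, 2),
  stringsAsFactors = FALSE
)

# hypothetical helper: add up every entry whose country or province matches the location
totals.for <- function(ts, location) {
  rows <- ts$Country.Region == location | ts$Province.State == location
  date.cols <- grep("^X", names(ts))          # columns holding the daily counts
  colSums(ts[rows, date.cols, drop = FALSE])
}

print(totals.for(ts.toy, "Canada"))    # both provinces are summed
print(totals.for(ts.toy, "Ontario"))   # a single entry is returned unchanged
```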

Totals per Location

This function will plot the number of cases as a function of time for the given locations and types of categories, in two plots: a log-scale scatter plot and a linear-scale bar plot.

When the function is run with multiple locations or all the locations, the figures will be adjusted to display multiple plots in one figure in a mosaic type layout.

Additionally, the function will attempt to generate different fits to match the data:
* an exponential model using a Linear Regression method
* a Poisson model using a Generalized Linear Model
* a Gamma model using a Generalized Linear Model

The function will add the values of the model coefficients to the plots and display a summary of the results on screen.
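The three kinds of fit listed above can be illustrated with base R on synthetic data. This is a minimal sketch under stated assumptions, not the code used inside tots.per.location: the data are generated from a known exponential law (rate 0.2 per day), and the log link for the Poisson and Gamma families is an assumption made here so that the three slopes are comparable.

```r
# synthetic cumulative counts following roughly 10 * exp(0.2 * day)
days  <- 1:20
cases <- round(10 * exp(0.2 * days))

# exponential model via linear regression on the log-transformed counts
exp.fit   <- lm(log(cases) ~ days)

# Poisson and Gamma models via generalized linear models (log link assumed)
pois.fit  <- glm(cases ~ days, family = poisson(link = "log"))
gamma.fit <- glm(cases ~ days, family = Gamma(link = "log"))

# estimated rate per day from each model
slopes <- c(exponential = unname(coef(exp.fit)["days"]),
            poisson     = unname(coef(pois.fit)["days"]),
            gamma       = unname(coef(gamma.fit)["days"]))
print(round(slopes, 3))

# corresponding multiplicative factor per day
print(round(exp(slopes), 3))
```

On clean synthetic data the three estimates nearly coincide; on real, noisy counts they can differ, because each model weighs large and small values differently.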

It is possible to instruct the function to draw a “confidence band” based on a moving average, so that the trend is displayed together with a region of higher confidence. The band is based on the mean value and standard deviation computed over a time interval obtained by dividing the total range of time into 10 equally spaced intervals.
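One possible reading of this confidence band is a moving mean plus and minus a moving standard deviation, with a window one tenth of the total time range. The sketch below follows that reading on made-up data; the trailing window and the variable names are assumptions, and the package may compute the band differently.

```r
# made-up cumulative totals for 40 days
totals <- cumsum(c(1:20, 20:1))
n <- length(totals)

# ASSUMPTION: window length = total time range divided into 10 equal intervals
window <- max(2, floor(n / 10))

# trailing moving mean and standard deviation over the window
idx     <- seq(window, n)
mov.avg <- sapply(idx, function(i) mean(totals[(i - window + 1):i]))
mov.sd  <- sapply(idx, function(i) sd(totals[(i - window + 1):i]))

# band = moving mean +/- one moving standard deviation
band <- data.frame(day   = idx,
                   lower = mov.avg - mov.sd,
                   mean  = mov.avg,
                   upper = mov.avg + mov.sd)
print(head(band))
```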

The function will return a list combining the results for the totals for the different locations as a function of time.

Growth Rate

The growth.rate() function computes daily changes and the growth rate, defined as the ratio of the daily changes between two consecutive dates.

The growth.rate() function shares all the features of the tots.per.location() function, i.e. it can process the different types of cases and multiple locations.
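As a small worked example of this definition, the daily changes are the differences between consecutive cumulative totals, and the growth rate is the ratio of each daily change to the previous one. The numbers below are made up and the code only illustrates the arithmetic, not the package's implementation.

```r
# made-up cumulative totals on five consecutive dates
totals <- c(10, 15, 25, 40, 70)

# daily changes between consecutive dates
changes <- diff(totals)
print(changes)       # 5 10 15 30

# growth rate: ratio of each daily change to the previous daily change
growth <- changes[-1] / changes[-length(changes)]
print(growth)        # 2.0 1.5 2.0
```

Note that a day with no change in the denominator yields Inf or NaN in R, so real data may need such days handled explicitly.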

The graphical output will display two plots per location:
* a scatter plot with the number of changes between consecutive dates as a function of time, both in linear scale (left vertical axis) and log scale (right vertical axis) combined
* a bar plot displaying the growth rate for the particular region as a function of time

When the function is run with multiple locations or all the locations, the figures will be adjusted to display multiple plots in one figure in a mosaic type layout. In addition to that, when there is more than one location the function will also generate two different styles of heatmaps comparing the changes per day and growth rate among the different locations (vertical axis) and time (horizontal axis).

The function will return a list combining the results for the “changes per day” and the “growth rate” as a function of time.

Plotting Totals

The function totals.plt() will generate plots of the total number of cases as a function of time. It can be used for the total data or for one or multiple specific locations. The function can generate static and/or interactive plots, as well as linear and/or semi-log plots.

Plotting Cases in the World

The function live.map() will display the different cases in each corresponding location in an interactive map of the world. It can be used with time series data or aggregated data. Aggregated data offers much more detailed information about the geographical distribution.

Experimental: Modelling the evolution of the Virus spread

We are working on the development of modelling capabilities. A preliminary prototype has been included and can be accessed using the generate.SIR.model function, which implements a simple SIR (Susceptible-Infected-Recovered) ODE model using the actual data of the virus.

This function will try to identify the data points where the onset of the epidemic began and use the subsequent data points to generate a proper guess for the two parameters describing the SIR ODE system. After that, it will solve the equations and provide details about the solutions, as well as plot them in static and interactive plots.
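For readers unfamiliar with the SIR system, the standard textbook form is dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I, where beta and gamma are the two parameters. The sketch below integrates it with a simple forward Euler scheme in base R. It is an illustration only: sir.euler is a hypothetical helper, the parameter values are made up, Euler stepping is used here instead of a proper ODE solver, and the exact formulation inside generate.SIR.model may differ.

```r
# Hypothetical helper (NOT part of covid19.analytics): forward Euler integration
# of the textbook SIR equations with transmission rate beta and recovery rate gamma.
sir.euler <- function(beta, gamma, S0, I0, R0, days, dt = 0.1) {
  N <- S0 + I0 + R0                       # total population, kept constant
  steps <- round(days / dt)
  out <- matrix(NA_real_, nrow = steps + 1, ncol = 4,
                dimnames = list(NULL, c("time", "S", "I", "R")))
  S <- S0; I <- I0; R <- R0
  out[1, ] <- c(0, S, I, R)
  for (k in seq_len(steps)) {
    new.inf <- beta * S * I / N * dt      # newly infected during this step
    new.rec <- gamma * I * dt             # newly recovered during this step
    S <- S - new.inf
    I <- I + new.inf - new.rec
    R <- R + new.rec
    out[k + 1, ] <- c(k * dt, S, I, R)
  }
  as.data.frame(out)
}

# made-up parameters: population of 10000 with 10 initial infections
sol <- sir.euler(beta = 0.5, gamma = 0.1, S0 = 9990, I0 = 10, R0 = 0, days = 60)
print(tail(sol, 1))
```

In a fit to real data, beta and gamma would be the quantities estimated from the reported cases, rather than fixed by hand as they are here.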

Further Features

We will continue working on adding and developing new features to the package, in particular modelling and predictive capabilities.

Installation

To use the “covid19.analytics” package, you first need to install it.

The stable version can be downloaded from the CRAN repository:

install.packages("covid19.analytics")

The development version can be installed from the GitHub repository, i.e.

# need devtools for installing from the github repo
install.packages("devtools")

# install covid19.analytics from its GitHub repository
devtools::install_github("mponce0/covid19.analytics")

To use the package, either the stable or the development version, just load it using the library function:

# load "covid19.analytics"
library(covid19.analytics)

Examples

Reading data

# obtain all the records combined for "confirmed", "deaths" and "recovered" cases -- *aggregated* data
 covid19.data.ALLcases <- covid19.data()

# obtain time series data for "confirmed" cases
 covid19.confirmed.cases <- covid19.data("ts-confirmed")

# reads all possible datasets, returning a list
 covid19.all.datasets <- covid19.data("ALL")

# reads the latest aggregated data
 covid19.ALL.agg.cases <- covid19.data("aggregated")

# reads time series data for casualties
 covid19.TS.deaths <- covid19.data("ts-deaths")

Read covid19’s genomic data

# obtain covid19's genomic data
 covid19.gen.seq <- covid19.genomic.data()

# display the actual RNA seq
 covid19.gen.seq$NC_045512.2

Some basic analysis

Summary Report

# a quick function to overview top cases per region for time series and aggregated records
report.summary()

# save the tables into a text file named 'covid19-SummaryReport_CURRENTDATE.txt'
# where *CURRENTDATE* is the actual date
report.summary(saveReport=TRUE)

E.g. today’s report is available here

Totals per Country/Region/Province

# totals for confirmed cases for "Ontario"
tots.per.location(covid19.confirmed.cases,geo.loc="Ontario")

# total for confirmed cases for "Canada"
tots.per.location(covid19.confirmed.cases,geo.loc="Canada")

# total nbr of deaths for "China"
tots.per.location(covid19.TS.deaths,geo.loc="China")

# total nbr of confirmed cases in Hubei including a confidence band based on moving average
tots.per.location(covid19.confirmed.cases,geo.loc="Hubei", confBnd=TRUE)

Images available here

The figures show the total number of cases for different cities (provinces/regions) and countries: the upper panel uses a log scale with a linear fit to an exponential law, and the bottom panel uses a linear scale. Details about the models are included in the plot, in particular the growth rate, which in several cases appears to be around 1.2 or higher, as predicted by some models. Notice that in the case of Hubei the value is closer to 1, as the spread of the virus has reached its logistic asymptote, while in other cases (e.g. Germany and Italy, for the presented dates) it is still well above 1, indicating exponential growth.

IMPORTANT: Please note that the “linear exponential” modelling function implements a simple (naive) and straightforward linear regression model, which is not optimal for exponential fits. The reason is that the errors for large values of the dependent variable weigh much more than those for small values when the exponential function is applied to go back to the original model. Nevertheless, this is acceptable for a quick interpretation, but one should bear in mind the implications of this simplification.

We also provide two additional models, as shown in the figures above, using the Generalized Linear Model glm() function with the Poisson and Gamma families. In particular, the tots.per.location function will determine when it is possible to generate each model automatically, and will display the information in the plot as well as details of the models in the console.

# read the time series data for all the cases
all.data <- covid19.data('ts-ALL')

# run on all the cases
tots.per.location(all.data,"Japan")

It is also possible to run the tots.per.location (and growth.rate) functions on the whole data set, in which case a quite large but complete mosaic figure will be generated, e.g.

# total for death cases for "ALL" the regions
tots.per.location(covid19.TS.deaths)

# or just
tots.per.location(covid19.data("ts-confirmed"))

Growth Rate

# read time series data for confirmed cases
TS.data <- covid19.data("ts-confirmed")

# compute changes and growth rates per location for all the countries
growth.rate(TS.data)

# compute changes and growth rates per location for 'Italy'
growth.rate(TS.data,geo.loc="Italy")

# compute changes and growth rates per location for 'Italy' and 'Germany'
growth.rate(TS.data,geo.loc=c("Italy","Germany"))

The previous figures show on the upper panel the number of changes on a daily basis in linear scale (thin line, left y-axis) and log scale (thicker line, right y-axis), while the bottom panel displays the growth rate for the given country/region/city.

Combining multiple geographical locations:

# obtain Time Series data
TSconfirmed <- covid19.data("ts-confirmed")

# explore different combinations of regions/cities/countries
# when combining different locations, heatmaps will also be generated comparing the trends among these locations
growth.rate(TSconfirmed,geo.loc=c("Italy","Canada","Ontario","Quebec","Uruguay"))

growth.rate(TSconfirmed,geo.loc=c("Hubei","Italy","Spain","United States","Canada","Ontario","Quebec","Uruguay"))

growth.rate(TSconfirmed,geo.loc=c("Hubei","Italy","Spain","US","Canada","Ontario","Quebec","Uruguay"))

Visualization Tools

# retrieve time series data
TS.data <- covid19.data("ts-ALL")

# static and interactive plot 
totals.plt(TS.data)
# totals for Ontario and Canada, without displaying totals and one plot per page
totals.plt(TS.data, c("Canada","Ontario"), with.totals=FALSE,one.plt.per.page=TRUE)

# totals for Ontario, Canada, Italy and Uruguay; including global totals with the linear and semi-log plots arranged one next to the other
totals.plt(TS.data, c("Canada","Ontario","Italy","Uruguay"), with.totals=TRUE,one.plt.per.page=FALSE)

# totals for all the locations reported on the dataset, interactive plot will be saved as "totals-all.html"
totals.plt(TS.data, "ALL", fileName="totals-all")
# retrieve aggregated data
data <- covid19.data("aggregated")

# interactive map of aggregated cases -- with more spatial resolution
live.map(data)

# or
live.map()

# interactive map of the time series data of the confirmed cases with less spatial resolution, ie. aggregated by country
live.map(covid19.data("ts-confirmed"))

Interactive examples can be seen at https://mponce0.github.io/covid19.analytics/

Simulating the Virus spread

# read time series data for confirmed cases
data <- covid19.data("ts-confirmed")

# run a SIR model for a given geographical location
generate.SIR.model(data,"Hubei", t0=1,t1=15)
generate.SIR.model(data,"Germany",tot.population=83149300)
generate.SIR.model(data,"Uruguay", tot.population=3500000)
generate.SIR.model(data,"Ontario",tot.population=14570000)

# the function will aggregate data for a geographical location, like a country with multiple entries
generate.SIR.model(data,"Canada",tot.population=37590000)

# modelling the spread for the whole world, storing the model and generating an interactive visualization
world.SIR.model <- generate.SIR.model(data,"ALL", t0=1,t1=15, tot.population=7.8e9, staticPlt=FALSE)
# plotting and visualizing the model
plt.SIR.model(world.SIR.model,"World",interactiveFig=TRUE,fileName="world.SIR.model")

References

(*) Data can be delayed by up to 24 hours with respect to the latest updates.

[1] 2019 Novel CoronaVirus CoViD-19 (2019-nCoV) Data Repository by Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) https://github.com/CSSEGISandData/COVID-19

[2] Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome NCBI Reference Sequence: NC_045512.2 https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2

Further Resources


Source-Credit: CDC/ Alissa Eckert, MS; Dan Higgins, MAMS

More R Resources

Dashboards