orderly is a package designed to help make analysis more reproducible. Its principal aim is to automate a series of basic steps in the process of writing analyses, making it easy to:
orderly we have two main hopes:
orderly requires a few conventions around organisation of a project, and after that tries to keep out of your way. However, these requirements are designed to make collaborative development with git easier by minimising conflicts and making backup easier by using an append-only storage system.
One often-touted advantage of R over point-and-click analysis packages is that a scripted analysis is more reproducible. However, essentially all analyses depend on external resources (packages, data, code, and R itself), and any change in these external resources might change the results. Preventing such changes is not always possible, but tracking them should be straightforward: all we need to know is what is being used.
For example, while reproducible research has become synonymous with literate programming, this approach often increases the number of external resources. A typical
knitr document will depend on:
orderly package helps by
The core problem is that analyses have no general interface. Consider in contrast the role that functions take in programming. All functions have a set of arguments (inputs) and a return value (outputs). With
orderly, we borrow this idea, and each piece of analysis will require that the user describes what is needed and what will be produced.
The user describes the inputs of their analysis, including:
The user also provides a list of “artefacts” (file-based results) that they will produce.
orderly then stores metadata alongside the analysis, including md5 hashes of all inputs and outputs, copies of data extracted from the database, a record of all R packages loaded at the end of the session, and (if using git) information about the git state (hash, branch and status).
Then if one of the dependencies of a report changes (the used data, code, etc), we have metadata that can be queried to identify the likely source of the change.
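As a rough illustration of how recorded hashes can point at the source of a change, the sketch below hashes a set of input files on two occasions and reports which ones differ. The helpers `hash_inputs` and `changed_inputs` are hypothetical names used only for this example; they are not part of orderly, whose metadata is richer than a plain vector of hashes.

```r
# Hypothetical helper: record md5 hashes for a set of input files,
# keyed by file name.
hash_inputs <- function(paths) {
  hashes <- tools::md5sum(paths)
  names(hashes) <- basename(paths)
  hashes
}

# Hypothetical helper: names of files whose hash differs between two runs.
changed_inputs <- function(old, new) {
  common <- intersect(names(old), names(new))
  common[old[common] != new[common]]
}

# Two temporary files stand in for the inputs of a report.
dir <- tempfile()
dir.create(dir)
script_file <- file.path(dir, "script.R")
data_file <- file.path(dir, "data.csv")
writeLines("plot(1:10)", script_file)
writeLines(c("x,y", "1,2"), data_file)
first <- hash_inputs(c(script_file, data_file))

# The data changes between runs; the script does not.
writeLines(c("x,y", "1,3"), data_file)
second <- hash_inputs(c(script_file, data_file))

changed <- changed_inputs(first, second)
print(changed)
```

Only the file whose contents changed is reported, which is the kind of query the stored metadata makes possible.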
To illustrate, we will start with a minimal example (you can use
orderly::orderly_init to create a similar structure directly), and we will build it up to demonstrate
orderly features. In the most minimal example, we want to run a script that creates a graph. It uses no external resources.
.
├── orderly_config.yml
└── src
    └── example
        ├── orderly.yml
        └── script.R
In this example, the orderly_config.yml file is completely empty, but serves to mark the root of the orderly project. We have one report, called example, and its configuration is within src/example/orderly.yml:
script: script.R
artefacts:
  - staticgraph:
      description: A graph of things
      filenames: mygraph.png
There are two keys here:
script: the path of the script to run,
artefacts: a description of the artefacts (files) that will be produced by running this script. In this case it is a graph with the filename mygraph.png.
The script is plain R code:
png("mygraph.png")
plot(1:10)
dev.off()
The R code can be as long or as short as needed and can use whatever packages it needs.
orderly does not do anything with the script apart from run it so
it can be formatted freely (there are no magic comments, etc). There are no restrictions on what can be done except that it must produce the artefacts listed in
orderly.yml. If not, an error will be thrown describing what was missing.
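The artefact check described above can be pictured with a small sketch. The function `check_artefacts` is a hypothetical stand-in written for this example, and the error message is invented; orderly's own check and wording will differ.

```r
# Hypothetical check: stop with an informative error if any declared
# artefact was not produced in the working directory `workdir`.
check_artefacts <- function(expected, workdir = ".") {
  absent <- expected[!file.exists(file.path(workdir, expected))]
  if (length(absent) > 0) {
    stop("Script did not produce expected artefacts: ",
         paste(absent, collapse = ", "))
  }
  invisible(TRUE)
}

# An empty file stands in for the graph that the script would create.
workdir <- tempfile()
dir.create(workdir)
file.create(file.path(workdir, "mygraph.png"))

# All declared artefacts exist, so this passes silently.
ok <- check_artefacts("mygraph.png", workdir)

# Declaring an artefact that was never written triggers the error.
result <- tryCatch(
  check_artefacts(c("mygraph.png", "table.csv"), workdir),
  error = function(e) conditionMessage(e))
print(result)
```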
To run the report, use
orderly::orderly_run (typically one would be in the
orderly root and so the
root directory could be omitted, but within this vignette we use a temporary directory):
id <- orderly::orderly_run("example", root = path)
## [ info ] Writing initial orderly archive version as 0.7.15
## [ name ] example
## [ id ] 20200112-130722-50ed8348
## [ start ] 2020-01-12 13:07:22
##
## > png("mygraph.png")
##
## > plot(1:10)
##
## > dev.off()
## png
##   2
## [ end ] 2020-01-12 13:07:22
## [ elapsed ] Ran report in 0.02177477 secs
## [ artefact ] mygraph.png: 7439da1b07de39a3f71b6f40bf9016ac
The return value is the id of the report (also printed on the
third line of log output) and is always in the format
YYYYMMDD-HHMMSS-abcdef01 where the last 8 characters are hex
digits (i.e., 4 random bytes). This means reports will
automatically sort nicely but we'll have some collision resistance.
## [1] "20200112-130722-50ed8348"
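To make the id format concrete, here is a minimal sketch that builds an id of the same shape and checks that ids sort chronologically. The function `new_report_id` is hypothetical and written only for this illustration; how orderly actually draws its random bytes is not stated here, so treat the details as assumptions.

```r
# Sketch of an id generator in the documented format
# YYYYMMDD-HHMMSS-abcdef01; orderly's own implementation may differ.
new_report_id <- function(time = Sys.time()) {
  stamp <- format(time, "%Y%m%d-%H%M%S", tz = "UTC")
  bytes <- sample(0:255, 4, replace = TRUE)        # 4 random bytes
  suffix <- paste(sprintf("%02x", bytes), collapse = "")
  paste(stamp, suffix, sep = "-")
}

# Pattern describing the documented format.
id_pattern <- "^[0-9]{8}-[0-9]{6}-[0-9a-f]{8}$"

early <- new_report_id(as.POSIXct("2020-01-12 13:07:22", tz = "UTC"))
late <- new_report_id(as.POSIXct("2020-01-13 09:00:00", tz = "UTC"))

# Both ids match the pattern, and plain sorting recovers time order.
matches <- grepl(id_pattern, c(early, late))
in_order <- identical(sort(c(late, early)), c(early, late))
print(matches)
print(in_order)
```

Because the timestamp comes first, sorting the ids as strings is enough to order runs made at different times; the random suffix only matters for runs started within the same second.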
Having run the report, the directory layout now looks like:
.
├── archive
├── data
│   ├── csv
│   └── rds
├── draft
│   └── example
│       └── 20200112-130722-50ed8348
│           ├── mygraph.png
│           ├── orderly.yml
│           ├── orderly_run.rds
│           └── script.R
├── orderly_config.yml
└── src
    └── example
        ├── orderly.yml
        └── script.R
Within draft, the directory example/20200112-130722-50ed8348 has been created, which contains the result of running the report. In here there are the files:
orderly.yml: this is an exact copy of the input file
script.R: this is an exact copy of the script used for the analysis
mygraph.png: the artefact created by the report
orderly_run.rds: this is metadata about the run and includes hashes of input files, of the data used, and of the output etc, along with details about the packages used and the state of git. It is stored in R's internal data format.
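For readers unfamiliar with R's internal data format, the sketch below writes a small metadata list with `saveRDS` and reads it back with `readRDS`. The list here is entirely hypothetical: the real contents of orderly_run.rds are richer and structured differently.

```r
# Hypothetical metadata for illustration only; the values are taken from
# the log output above, the structure is invented.
meta <- list(
  name = "example",
  id = "20200112-130722-50ed8348",
  artefact_hashes = c("mygraph.png" = "7439da1b07de39a3f71b6f40bf9016ac"),
  packages = loadedNamespaces())

# Round-trip through an .rds file.
path_rds <- tempfile(fileext = ".rds")
saveRDS(meta, path_rds)
restored <- readRDS(path_rds)

print(restored$id)
print(restored$artefact_hashes[["mygraph.png"]])
```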
Every time a report is run it will create a new directory at this
level with a new id. Running the report again now might create the
We store the copies of files as run by
orderly so that even if the
input files change we can still easily get back to previous
versions of the inputs, alongside the outputs, and these are safe
from any changes to the underlying source.
You can see the list of draft reports like so:
orderly::orderly_list_drafts(root = path)
##      name                       id
## 1 example 20200112-130722-50ed8348
Once you're happy with a report, then “commit” it with
orderly::orderly_commit(id, root = path)
## [ commit ] example/20200112-130722-50ed8348
## [ copy ]
## [ import ] example:20200112-130722-50ed8348
## [ success ] :)
## [1] "/tmp/Rtmpe9qF7y/file2f6942f02d9/archive/example/20200112-130722-50ed8348"
After this step our directory structure looks like:
.
├── archive
│   └── example
│       └── 20200112-130722-50ed8348
│           ├── mygraph.png
│           ├── orderly.yml
│           ├── orderly_run.rds
│           └── script.R
├── data
│   ├── csv
│   └── rds
├── draft
│   └── example
├── orderly.sqlite
├── orderly_config.yml
└── src
    └── example
        ├── orderly.yml
        └── script.R
This looks very similar to the previous layout, but the files have been moved from draft to archive. The other difference is that the index
orderly.sqlite has been created. This is a machine-readable index to all the
orderly metadata that can be used to build applications around
orderly (for example OrderlyWeb, a web portal for
orderly - see the “remotes” vignette). The documentation for the database format is available on the
To create a new report, use orderly::orderly_new to create a directory within src. The name is important and should not contain spaces (nor should it change, as this will change the key report id and you'll lose a chain of history). Then edit the file orderly.yml within that directory.
orderly::orderly_new("new", root = path)
## Created report at '/tmp/Rtmpe9qF7y/file2f6942f02d9/src/new'
## Edit the file 'orderly.yml' within this directory
which results in a directory structure like:
.
├── archive
│   └── example
│       └── 20200112-130722-50ed8348
│           ├── mygraph.png
│           ├── orderly.yml
│           ├── orderly_run.rds
│           └── script.R
├── data
│   ├── csv
│   └── rds
├── draft
│   └── example
├── orderly.sqlite
├── orderly_config.yml
└── src
    ├── example
    │   ├── orderly.yml
    │   └── script.R
    └── new
        └── orderly.yml
Resources to a report are expected to be read-only files that are used by the script to produce the report. Examples of the sort of files that should be used as resources are:
“Resources” cannot be modified by the report; if
orderly detects that a
resource has been changed an error will be thrown.
orderly will automatically detect any files named
README.md in a
report's source directory and copy them to the new directory too.
resources:
  - years.csv
  - data_dictionary.xlsx
  - report.Rmd
  - code_documentation.md
“Sources” are files containing R code that will be sourced (via R's source() function) before the main script is run. Often these files contain functions or variables used by the main script. All of the copying and sourcing will be handled by orderly itself, so there is no need to explicitly source the files in the main script.
“Artefacts” are the output of the report. At least one artefact must be listed, and any files created while the script runs must either be included as artefacts or deleted before the script finishes; otherwise an error will be thrown.
Examples of artefacts fields in
artefacts:
  - report:
      filenames: report.html
      description: a simple report

artefacts:
  - report:
      filenames: report.html
      description: a simple report
  - data:
      description:
        - associated data sets
      filenames:
        - data_one.csv
        - data_two.csv
        - data_three.csv
        - data_four.csv
When declaring an artefact we have to specify what format the artefact is. Currently supported formats include staticgraph, report, data and interactivehtml. These tags reflect the intended use of the file; they have no special meaning within orderly.
It is often the case that we would like to write a report that depends
on an earlier report, e.g. one report produces a large dataset and a
later report produces a high level summary.
orderly allows a report to
directly copy an artefact file from an existing report without having
to manually copy it into the report source directory. This is handled by the
depends block of the report's orderly.yml.
To use a file as a dependency it must be explicitly listed as an artefact.
A simple example might look like:
depends:
  - big-data-report:
      id: 20190425-163691-b8451bbf
      use:
        data.rds: huge-data-set.rds
      draft: false
This will copy the file huge-data-set.rds from version 20190425-163691-b8451bbf of the report big-data-report and rename it data.rds. This file can then be used by the report as if it were in the source directory. The field draft: false tells orderly to only use completed reports in the archive, as opposed to drafts. Setting this to true allows uncommitted reports in draft to be used. This can be useful when developing a chain of related reports.
If we want a report to always use the latest version of the report big-data-report, we can set the id field to latest:
depends:
  - big-data-report:
      id: latest
      use:
        data.rds: huge-data-set.rds
      draft: false
This will find the most recent version of the report
and copy files from that directory.
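Because ids begin with a timestamp, picking the most recent version can be sketched as a sort. The helper `latest_id` below is hypothetical and only illustrates the idea; how orderly resolves `latest` internally is not described here. The two ids are the ones used as examples in this document.

```r
# Hypothetical helper: choose the most recent id from a set of ids.
# method = "radix" sorts in the C locale, so the result does not depend
# on the collation rules of the current session.
latest_id <- function(ids) {
  if (length(ids) == 0) {
    stop("No versions of this report found")
  }
  ordered <- sort(ids, method = "radix")
  ordered[[length(ordered)]]
}

ids <- c("20181225-172991-34c91ef1", "20190425-163691-b8451bbf")
chosen <- latest_id(ids)
print(chosen)
```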
To use multiple artefacts from a single report add the files into the
use block e.g.:
depends:
  - big-data-report:
      id: latest
      use:
        data.rds: huge-data-set.rds
        pop.csv: population_data.csv
      draft: false
To use artefacts from multiple reports we add multiple entries to the
depends field e.g.:
depends:
  - big-data-report:
      id: latest
      use:
        data.rds: huge-data-set.rds
        pop.csv: population_data.csv
      draft: false
  - report_two:
      id: latest
      use:
        data_b.rds: filename.rds
      draft: false
We can also use the same artefact from different versions of the same report. This might come up if we want to write a report that compares the output from different versions of another report. The yaml pattern for this is:
depends:
  - big-data-report:
      id: 20190425-163691-b8451bbf
      draft: false
      use:
        data_latest.rds: huge-data-set.rds
  - big-data-report:
      id: 20181225-172991-34c91ef1
      draft: false
      use:
        data_old: huge-data-set.rds
The important feature in this example is the dashes before the report name. When all the report names are different these dashes can be omitted, but they are necessary when the report depends on different versions of the same report. Since including the dashes will never cause a problem but omitting them might, we advise that they should always be included.
Sometimes it can be useful to control how a report runs with a parameter. This could range from the name of a country that an analysis applies to (though we hope to develop a better interface for this soon) to the number of iterations that an analysis runs for. Parameters are declared in the orderly.yml file:
parameters:
  a:
    default: 1
  b: ~
This would declare that a report takes two parameters
a (with a default of 1), and
b (with no default). Running the report would then look like:
orderly::orderly_run("reportname", list(a = 10, b = 100))
These parameters are then present in the environment of the report, so the code can use values
The parameters will also be interpolated into any SQL queries before they are run, so if the
data:
  cars:
    query: SELECT * FROM mtcars WHERE cyl > ?a
then this will be evaluated on the SQL server with
a substituted in where the query says
?a (this is done with
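The way declared defaults and supplied values combine can be sketched as follows, using the declaration of `a` (default 1) and `b` (no default) from this section. The function `resolve_parameters` is hypothetical, and the assumption that a missing value for a parameter without a default is an error is ours; check orderly's behaviour before relying on it.

```r
# Hypothetical helper: combine declared parameters with supplied values.
# `declared` mirrors the YAML: NULL stands in for "~" (no default).
resolve_parameters <- function(declared, supplied = list()) {
  unknown <- setdiff(names(supplied), names(declared))
  if (length(unknown) > 0) {
    stop("Unknown parameters: ", paste(unknown, collapse = ", "))
  }
  # Start from the defaults, then overwrite with anything supplied.
  values <- lapply(declared, function(p) p$default)
  for (nm in names(supplied)) {
    values[[nm]] <- supplied[[nm]]
  }
  # Assumption: a parameter with neither default nor value is an error.
  unset <- names(values)[vapply(values, is.null, logical(1))]
  if (length(unset) > 0) {
    stop("Missing parameters: ", paste(unset, collapse = ", "))
  }
  values
}

declared <- list(a = list(default = 1), b = NULL)

both <- resolve_parameters(declared, list(a = 10, b = 100))
only_b <- resolve_parameters(declared, list(b = 100))
str(both)
str(only_b)
```

Supplying only `b` leaves `a` at its declared default, while omitting `b` stops with an error in this sketch.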
There might be files that are used in (almost) every report. Examples
of these sorts of files might be document templates or organisation
logos. To set up a global resource create a directory
<root> and the following to the
Then to use any file in
your_global_dir in your report add a
global_resources field to that report's
global_resources:
  logo.jpg: org_logo.jpg
  latex_class.cls: org_latex_class.cls
  styles.css: org_styles.css
Currently, R source code cannot be sourced from the global resources directory, so utility functions common to multiple reports must be included in each report directory separately. The functionality to include global source code may be added in future versions.
One of the original aims of
orderly was to provide a set of tools for use of SQL databases within reproducible reporting. Because the SQL database is an external global resource it is difficult to work with any concept of “versioning” from R (there is no git history, no way of easily rolling back to previous versions etc). If using a central SQL server, there is configuration that should be kept out of any analysis, particularly things like passwords. Configuration problems multiply when using both “production” and “staging” systems as we would like to be able to switch between different configurations.
The orderly_config.yml configuration specifies the locations of databases (there can be any number), for example:
database:
  source:
    driver: RPostgres::Postgres
    args:
      host: dbhost.example.org
      port: 5432
      user: myusername
      password: s3cret
      dbname: mydb
This database will be referred to elsewhere as
source and it will be connected with the
RPostgres::Postgres driver (from the RPostgres package). Arguments within the
args block will be passed to the driver, in this case being the equivalent of:
DBI::dbConnect(RPostgres::Postgres(),
               host = "dbhost.example.org",
               port = 5432,
               user = "myusername",
               password = "s3cret",
               dbname = "mydb")
The values used in the args blocks can be environment variables (e.g., password: $DB_PASSWORD), in which case they will be resolved from the environment before connecting. This is useful for keeping secrets out of source control.
For SQLite databases, the
args block will typically contain only
dbname which is the path to the database file.
A report configuration (orderly.yml) can contain a data block, which contains SQL queries, such as:
data:
  cars:
    query: SELECT * FROM mtcars WHERE cyl = 4
    database: source
In this case, the query
SELECT * FROM mtcars WHERE cyl = 4 will be run against the
source database to create an object
cars in the report environment. The actual report code can use that object without having ever created the database connection or evaluating the query.
Further, the data returned by the query will be captured in the data directory, and hashes of the data will be stored alongside the results. This means that even if the data in the database is a constantly moving target, we can still detect whether changes to the data are responsible for changes in the result of a report.
If you need to perform complicated SQL queries, then you can export the database connection directly by adding a block:
connection:
  con: source
which will save the connection to the
source database as the R object
con. We have used this where a report requires running queries in a loop that depend on the results of a previous query or additional data loaded into a report, or where the result of the query will be very large and we do not want to save it to disk.
Note that this reduces the amount of tracking that
orderly can do, as we have no way of knowing what is done with the connection once passed to the script.
The contents of
orderly_config.yml may contain things like secrets
(passwords) or hostnames that vary depending on deployment (e.g.,
testing locally vs running on a remote system). To customise this,
you can use environment variables within the configuration. So
rather than writing
database:
  source:
    driver: RPostgres::Postgres
    args:
      host: localhost
      port: 5432
      user: myuser
      dbname: databasename
      password: p4ssw0rd
you might write
database:
  source:
    driver: RPostgres::Postgres
    args:
      host: $MY_DBHOST
      port: $MY_DBPORT
      user: $MY_DBUSER
      dbname: $MY_DBNAME
      password: $MY_PASSWORD
Environment variables used this way must begin with a dollar sign and consist only of uppercase letters, numbers and the underscore character. You can then set the environment variables in an .Renviron file (either within the project or in your home directory) or in your .profile file. Alternatively, you can
create a file
orderly_envir.yml in the same directory as
orderly_config.yml with key-value pairs, such as
MY_DBHOST: localhost
MY_DBPORT: 5432
MY_DBUSER: myuser
MY_DBNAME: databasename
MY_PASSWORD: p4ssw0rd
This will be read every time that orderly_config.yml is read (in contrast with .Renviron, which is read only at the start of a session). This will likely be more pleasant to work with.
The advantage of using environment variables is that you can add the orderly_envir.yml file to your .gitignore and avoid committing system-dependent data to the central repository.
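The substitution of `$NAME` values can be pictured with a short base R sketch. The function `resolve_env` is hypothetical and is not orderly's implementation; it only applies the naming rule given above (a dollar sign followed by uppercase letters, numbers and underscores) and looks the name up with `Sys.getenv`.

```r
# Hypothetical resolver: replace values of the form $NAME with the value
# of the environment variable NAME; leave all other values untouched.
resolve_env <- function(args) {
  lapply(args, function(value) {
    is_ref <- is.character(value) && length(value) == 1 &&
      grepl("^\\$[A-Z0-9_]+$", value)
    if (!is_ref) {
      return(value)
    }
    name <- substring(value, 2)
    resolved <- Sys.getenv(name, unset = NA)
    if (is.na(resolved)) {
      stop("Environment variable not set: ", name)
    }
    resolved
  })
}

# Set two variables for the demonstration, as .Renviron or
# orderly_envir.yml would do in practice.
Sys.setenv(MY_DBHOST = "localhost", MY_DBUSER = "myuser")

args <- list(host = "$MY_DBHOST", port = 5432, user = "$MY_DBUSER")
resolved <- resolve_env(args)
str(resolved)
```

Note that values read from the environment are always strings, so a port supplied as `$MY_DBPORT` would arrive as text in this sketch.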
To avoid leaving passwords in plain text, you can use
vault (along with the R client
To do this, you should include the address of your vault server in the
Then, for values that you want to retrieve from the vault, set the value of the field to
<path> is the name of a vault
secret path (probably beginning with
field is the name of the field at that path. So, for example:
would look up the field
password at the path
/secret/users/database_user. This can be stored in
orderly_config.yml, in the contents of an environment variable or
orderly_envir.yml (currently this only uses the vault version 1 key-value storage)
As a report becomes more complex, the function
orderly::orderly_test_start will become useful; this function
creates the isolated environment that
orderly uses to run a report,
but then leaves you to interactively work with your report.