# The isatabr package

The isatabr package is developed as an easy-to-use package for reading, modifying and writing files in the Investigation/Study/Assay (ISA) Abstract Model of the metadata framework using the ISA tab-delimited (TAB) format.

ISA is a metadata framework to manage an increasingly diverse set of life science, environmental and biomedical experiments that employ one or a combination of technologies. Built around the Investigation (the project context), Study (a unit of research) and Assay (analytical measurements) concepts, ISA helps you to provide rich descriptions of experimental metadata (i.e. sample characteristics, technology and measurement types, sample-to-data relationships) so that the resulting data and discoveries are reproducible and reusable.

# 1 The ISA-Tab structure

The ISA-Tab structure is described in full detail on the ISA-Tab website. The description below is mostly taken from there and slightly condensed where appropriate.

ISA-Tab uses three types of file to capture the experimental metadata:

• Investigation file
• Study file
• Assay file (with associated data files)

The Investigation file contains all the information needed to understand the overall goals and means used in an experiment; experimental steps (or sequences of events) are described in the Study and in the Assay file(s). For each Investigation file there may be one or more Studies defined with a corresponding Study file; for each Study there may be one or more Assays defined with corresponding Assay files.

In order to facilitate identification of ISA-Tab component files, specific naming patterns should be followed:

• i_*.txt for identifying the Investigation file, e.g. i_investigation.txt
• s_*.txt for identifying Study file(s), e.g. s_gene_survey.txt
• a_*.txt for identifying Assay file(s), e.g. a_transcription.txt
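The naming patterns above lend themselves to simple pattern matching. The sketch below, in base R, sorts a set of file names into investigation, study and assay files. The helper name `classifyIsaFiles` is hypothetical and is not part of isatabr, whose own detection logic may differ.

```r
## Hypothetical helper: classify file names by the ISA-Tab naming patterns.
classifyIsaFiles <- function(fileNames) {
  kind <- rep(NA_character_, length(fileNames))
  kind[grepl("^i_.*\\.txt$", fileNames)] <- "investigation"
  kind[grepl("^s_.*\\.txt$", fileNames)] <- "study"
  kind[grepl("^a_.*\\.txt$", fileNames)] <- "assay"
  ## Files matching none of the patterns have kind NA and are dropped by split().
  split(fileNames,
        factor(kind, levels = c("investigation", "study", "assay")))
}

## Example file names taken from the list above, plus two non-ISA-Tab files.
exampleFiles <- c("i_investigation.txt", "s_gene_survey.txt",
                  "a_transcription.txt", "d_data.txt", "notes.md")
isaFiles <- classifyIsaFiles(exampleFiles)
isaFiles
```

Anchoring the patterns with `^` and `$` keeps data files such as d_data.txt out of all three groups.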

## 1.1 The Investigation file

The Investigation file fulfills four needs:

1. to declare key entities, such as factors, protocols, which may be referenced in the other files;
2. to track provenance of the terminologies (controlled vocabularies or ontologies) that are used, where applicable;
3. to relate each Study file to an Investigation (this only becomes necessary when two or more Study files need to be grouped);
4. to relate Assay files to Studies.

An Investigation file is structured as a table with vertical headings along the first column, and corresponding values in the subsequent columns. The following section headings must appear in the Investigation file (in order), and the study block (headings from STUDY to STUDY CONTACTS) can be repeated, one block per study associated with the investigation.

• ONTOLOGY SOURCE REFERENCE
• INVESTIGATION
• INVESTIGATION PUBLICATIONS
• INVESTIGATION CONTACTS
• STUDY
• STUDY DESIGN DESCRIPTORS
• STUDY PUBLICATIONS
• STUDY FACTORS
• STUDY ASSAYS
• STUDY PROTOCOLS
• STUDY CONTACTS

For a full description of all sections see the aforementioned site.
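To make this layout concrete, the following base R sketch splits a tiny, made-up investigation file into its sections and turns one section into a data.frame. It assumes that a section heading is a line consisting only of capital letters and spaces, and that field names and values are tab-separated. The helper `parseSection` and the example content are illustrative only and do not reflect how isatabr parses the file.

```r
## Made-up investigation file content: headings in the first column,
## values in the subsequent (tab-separated) columns.
invLines <- c(
  "ONTOLOGY SOURCE REFERENCE",
  "Term Source Name\tOBI\tEFO",
  "Term Source Version\t23\t118",
  "INVESTIGATION",
  "Investigation Identifier\tinv1",
  "STUDY",
  "Study Identifier\tstudy1",
  "STUDY",
  "Study Identifier\tstudy2"
)

## Assumption: section headings contain only capital letters and spaces.
isHeading <- grepl("^[A-Z][A-Z ]*$", invLines)
sectionId <- cumsum(isHeading)

## Group the non-heading lines by the section they belong to.
sections <- split(invLines[!isHeading], sectionId[!isHeading])
names(sections) <- invLines[isHeading][as.integer(names(sections))]

## Hypothetical helper: turn the vertical layout of one section into a
## data.frame with one column per field and one row per value column.
parseSection <- function(sectionLines) {
  fields <- strsplit(sectionLines, "\t", fixed = TRUE)
  nVals <- max(lengths(fields)) - 1
  cols <- lapply(fields, function(f) {
    vals <- f[-1]
    c(vals, rep(NA_character_, nVals - length(vals)))
  })
  names(cols) <- vapply(fields, `[`, character(1), 1)
  data.frame(cols, check.names = FALSE, stringsAsFactors = FALSE)
}

osrSection <- parseSection(sections[["ONTOLOGY SOURCE REFERENCE"]])
osrSection
```

Because the STUDY heading appears once per study, `sections` contains two elements named STUDY here, which mirrors the repeated study block described above.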

## 1.2 The Study file

The Study file contains contextualizing information for one or more assays, for example: the subjects studied, their source(s), the sampling methodology, their characteristics, and any treatments or manipulations performed to prepare the specimens.

## 1.3 The Assay file

The Assay file represents a portion of the experimental graph (i.e., one part of the overall structure of the workflow); each Assay file must contain assays of the same type, defined by the type of measurement (e.g. gene expression) and the technology employed (e.g. DNA microarray). Assay-related information includes protocols, additional information relating to the execution of those protocols and references to data files (whether raw or derived).

## 1.4 Example data

As an example for working with the isatabr package we will use the data set that accompanies Atwell et al. (2010). The associated files are included in the package.

# 2 Reading files in the ISA-Tab format

ISA-Tab files can be stored in two different ways: either as separate files in a directory, or as a .zip file containing the files. The example data is included in the package in both forms. Both can be read into R using the readISATab function.

When reading ISA-Tab files from a directory, only the name of the directory in which the ISA-Tab files are located needs to be specified.

## Read ISA-Tab files from directory.
isaObject1 <- readISATab(path = file.path(system.file("extdata/Atwell", package = "isatabr")))

When reading zipped files, both the directory in which the zip file is located and the name of the file need to be specified.

## Read ISA-Tab files from zipped archive.
isaObject2 <- readISATab(path = file.path(system.file("extdata", package = "isatabr")),
                         zipfile = "Atwell.zip")

In both cases readISATab will automatically detect the Investigation, Study and Assay files, assuming the naming conventions described in the previous section are followed. If this is not the case, the function will give an error indicating the problem. The imported ISA-Tab files are stored in an object of the S4 class ISA. Since the resulting objects are almost identical for files read from a directory and from a zipped archive, the following sections show the examples for the files read from a directory only.

# 3 Accessing and updating ISA objects

All information from the ISA-Tab files is stored within slots in the ISA object. The table below gives an overview of the different slots and a brief description of the information stored in the slot. For a more exhaustive description see help("ISA"). Note that an investigation may have multiple studies. Therefore, data concerning studies is stored in a list object, where one element in the list corresponds to one study. Likewise a study may consist of multiple assays and assay data is stored in a list object, where one element in the list corresponds to one assay.

| Slot | Type | Description |
|------|------|-------------|
| path | character | path to the ISA-Tab files |
| iFileName | character | name of the investigation file |
| oSR | data.frame | ONTOLOGY SOURCE REFERENCE section of investigation file |
| invest | data.frame | INVESTIGATION section of investigation file |
| iPubs | data.frame | INVESTIGATION PUBLICATIONS section of investigation file |
| iContacts | data.frame | INVESTIGATION CONTACTS section of investigation file |
| study | list of data.frames | STUDY sections of investigation file |
| sDD | list of data.frames | STUDY DESIGN DESCRIPTORS sections of investigation file |
| sPubs | list of data.frames | STUDY PUBLICATIONS sections of investigation file |
| sFacts | list of data.frames | STUDY FACTORS sections of investigation file |
| sAssays | list of data.frames | STUDY ASSAYS sections of investigation file |
| sProts | list of data.frames | STUDY PROTOCOLS sections of investigation file |
| sContacts | list of data.frames | STUDY CONTACTS sections of investigation file |
| sFiles | list of data.frames | content of study files |
| aFiles | list of data.frames | content of assay files |

All slots have corresponding functions for accessing and modifying information. The names of these access functions are the same as the slots they refer to, e.g. the iFileName slot in an ISA object can be accessed using the iFileName() function. There is one exception: to prevent clashes with the path() function, which already exists in several other packages, the path slot in an ISA object should be accessed using the isaPath() function.

## Access path for isaObjects
isaPath(isaObject1)
#> [1] "C:\\Users\\rossu027\\AppData\\Local\\Temp\\RtmpygYYN5\\Rinst2394d6a49a1\\isatabr\\extdata\\Atwell"
isaPath(isaObject2)
#> [1] "C:\\Users\\rossu027\\AppData\\Local\\Temp\\Rtmp6ZQbzc"

The path for isaObject1 shows the directory from which the files were read. As isaObject2 was read directly from a zipped archive, the files were first extracted into a temporary folder and subsequently read from there. This temporary folder is shown as the path.

The other slots are accessible in a similar way. Some more examples are shown below.

## Access studies.
isaStudies <- study(isaObject1)

## Print study names.
names(isaStudies)
#> [1] "GMI_Atwell_study"

## Access study descriptors.
isaSDD <- sDD(isaObject1)

## Shows study descriptors for study GMI_Atwell_study.
isaSDD$GMI_Atwell_study
#>                                                                                Study Design Type
#> 1 GWAS of 107 phenotypes in Arabidopsis thaliana inbred lines using ~250k SNPs in 199 accessions
#>   Study Design Type Term Accession Number Study Design Type Term Source REF
#> 1                                    <NA>                              <NA>

It is not only possible to access the different slots in an ISA object; the slots can also be updated. As with the access functions, the update functions have the same names as the slots they refer to. As an example, let’s assume an error sneaked into the ONTOLOGY SOURCE REFERENCE section and we want to update one of the source versions. First have a look at the current content of the ONTOLOGY SOURCE REFERENCE section.

(isaOSR <- oSR(isaObject1))
#>   Term Source Name                                 Term Source File Term Source Version
#> 1              OBI       http://data.bioontology.org/ontologies/OBI                  23
#> 2              EFO       http://data.bioontology.org/ontologies/EFO                 118
#> 3               UO                http://purl.obolibrary.org/obo/UO                <NA>
#> 4        NCBITaxon http://data.bioontology.org/ontologies/NCBITAXON                   6
#> 5               PO        http://data.bioontology.org/ontologies/PO                  10
#> 6              GMI                      http://gwas.gmi.oeaw.ac.at/                <NA>
#>                                                          Term Source Description
#> 1                                         Ontology for Biomedical Investigations
#> 2                                                   Experimental Factor Ontology
#> 3                                                                  Unit Ontology
#> 4 National Center for Biotechnology Information (NCBI) Organismal Classification
#> 5                                                                 Plant Ontology
#> 6                                     Cataloque of Arabidopsis accessions at GMI

Now we update the version of the OBI ontology source from 23 to 24. Then we update the modified ontology source data.frame in the ISA object.

## Update version number.
isaOSR[1, "Term Source Version"] <- 24

## Update oSR in ISA object.
oSR(isaObject1) <- isaOSR

## Check the updated oSR.
oSR(isaObject1)
#>   Term Source Name                                 Term Source File Term Source Version
#> 1              OBI       http://data.bioontology.org/ontologies/OBI                  24
#> 2              EFO       http://data.bioontology.org/ontologies/EFO                 118
#> 3               UO                http://purl.obolibrary.org/obo/UO                <NA>
#> 4        NCBITaxon http://data.bioontology.org/ontologies/NCBITAXON                   6
#> 5               PO        http://data.bioontology.org/ontologies/PO                  10
#> 6              GMI                      http://gwas.gmi.oeaw.ac.at/                <NA>
#>                                                          Term Source Description
#> 1                                         Ontology for Biomedical Investigations
#> 2                                                   Experimental Factor Ontology
#> 3                                                                  Unit Ontology
#> 4 National Center for Biotechnology Information (NCBI) Organismal Classification
#> 5                                                                 Plant Ontology
#> 6                                     Cataloque of Arabidopsis accessions at GMI

In a similar way all slots in an ISA object can be accessed and updated.

# 4 Processing assay files

The assay files may contain information about the files used to store the actual data for the assay. Per assay file two types of data files may be referred to: 1) the file(s) containing the raw data, and 2) the file(s) containing derived data. Looking at the assay tab file in our example data, we see that the Raw Data File column is empty, so no raw data files are available. However, the Derived Data File column shows the file d_data.txt.

## Inspect assay tab.
isaAFile <- aFiles(isaObject1)
head(isaAFile$a_study1.txt)
#>   Sample Name Protocol REF Parameter Value[Organism part] Term Source REF Term Accession Number
#> 1     sample1  Phenotyping                             NA              NA                    NA
#> 2     sample2  Phenotyping                             NA              NA                    NA
#> 3     sample3  Phenotyping                             NA              NA                    NA
#> 4     sample4  Phenotyping                             NA              NA                    NA
#> 5     sample5  Phenotyping                             NA              NA                    NA
#> 6     sample6  Phenotyping                             NA              NA                    NA
#>   Parameter Value[Trait Definition File] Assay Name Raw Data File        Protocol REF
#> 1                                tdf.txt  assay1020            NA Data transformation
#> 2                                tdf.txt     assay1            NA Data transformation
#> 3                                tdf.txt  assay1131            NA Data transformation
#> 4                                tdf.txt   assay569            NA Data transformation
#> 5                                tdf.txt   assay293            NA Data transformation
#> 6                                tdf.txt   assay388            NA Data transformation
#>   Derived Data File
#> 1        d_data.txt
#> 2        d_data.txt
#> 3        d_data.txt
#> 4        d_data.txt
#> 5        d_data.txt
#> 6        d_data.txt

To read the contents of the data files, either raw or derived, referred to in the assay tab file, we can use the processAssay() function. The exact working of this function depends on the technology type of the assay. For most technology types the data files are read as plain .txt files assuming a tab-delimited format. Only mass spectrometry and microarray data files are read differently (see the sections below). The assay file in the example does not belong to either of these two types, so its data file is read as a tab-delimited file.
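As a rough illustration of what reading a plain tab-delimited data file amounts to, the base R sketch below writes a small toy file and reads it back with read.delim(). The file content is a shortened, illustrative stand-in for a derived data file, and this is not necessarily how processAssay() reads files internally.

```r
## Write a small toy tab-delimited file to a temporary location.
tmpFile <- tempfile(fileext = ".txt")
writeLines(c("Assay Name\tLD\tLDV",
             "assay152\t6.84105\t32.6",
             "assay279\tNA\tNA"),
           tmpFile)

## check.names = FALSE keeps column names with spaces, e.g. "Assay Name".
toyDerived <- read.delim(tmpFile, check.names = FALSE,
                         stringsAsFactors = FALSE)
str(toyDerived)
```

Keeping the original column names makes it easier to match columns such as Assay Name against the assay tab file later on.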

Before being able to process the assay file, i.e. read the data, we first have to extract the assay tabs using the getAssayTabs() function. This function extracts all the assay files from an ISA object and stores them as assayTab objects. These assayTab objects contain not only the content of the assay tab file, but also extra information, e.g. technology type.

## Get assay tabs for isaObject1.
aTabObjects <- getAssayTabs(isaObject1)

## Process assay data.
isaDat <- processAssay(isaObject = isaObject1,
                       aTabObject = aTabObjects$s_study1.txt$a_study1.txt,
                       type = "derived")

## Display first rows and columns.
head(isaDat[, 1:10])
#>   Assay Name      LD  LDV      SD     SDV FT10 FT16 FT22 Seed Dormancy Emco5
#> 1   assay152 6.84105 32.6 93.0417 4.97494   74   87   89            NA    NA
#> 2   assay279      NA   NA      NA      NA   NA   NA   NA            NA    NA
#> 3   assay211      NA   NA      NA      NA   NA   NA   NA            NA    NA
#> 4   assay256      NA   NA      NA      NA   NA   NA   NA            NA    NA
#> 5   assay907      NA   NA      NA      NA   NA   NA   NA            NA    NA
#> 6   assay948      NA   NA      NA      NA   NA   NA   NA            NA    NA

The data is now stored in isaDat and can be used for further analysis within R.
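One possible next step is to link the measurements back to the samples. The base R sketch below joins a toy assay tab to a toy derived data table on their shared Assay Name column. Both data frames are small stand-ins: the sample and assay names follow the output shown above, while the LD values are made up for illustration. It assumes the processed data is a data.frame with an Assay Name column, as the printed output suggests.

```r
## Toy stand-in for the assay tab (sample-to-assay mapping).
toyAssayTab <- data.frame("Sample Name" = c("sample1", "sample2"),
                          "Assay Name" = c("assay1020", "assay1"),
                          check.names = FALSE, stringsAsFactors = FALSE)

## Toy stand-in for the derived data; the LD values are made up.
toyData <- data.frame("Assay Name" = c("assay1", "assay1020"),
                      LD = c(1.5, 2.5),
                      check.names = FALSE, stringsAsFactors = FALSE)

## Join on the shared key column.
toyMerged <- merge(toyAssayTab, toyData, by = "Assay Name")
toyMerged
```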

## 4.1 Mass spectrometry assay files

Mass spectrometry data is often stored in Network Common Data Form (NetCDF) files, i.e. in .CDF files. Assay data containing these data will be processed in a different way than regular assay data. To be able to do this the xcms package is required. This package is available from Bioconductor.

As an example for the processing of mass spectrometry files we will use a subset of the quantitated LC/MS peaks from the spinal cords of 6 wild-type and 6 fatty acid amide hydrolase (FAAH) knockout mice described in Saghatelian et al. (2004). A more extensive version of this data set is available in the faahKO data package on Bioconductor.

## Read ISA-Tab files for faahKO.
isaObject3 <- readISATab(path = file.path(system.file("extdata/faahKO", package = "isatabr")))

After reading the ISA-Tab files, we can now process the mass spectrometry assay data. In this example the raw data is available, so when processing the assay we specify type = "raw". The rest of the code is similar to the previous section.

## Get assay tabs for isaObject3.
aTabObjects3 <- getAssayTabs(isaObject3)

## Process assay data.
isaDat3 <- processAssay(isaObject = isaObject3,
                        aTabObject = aTabObjects3$s_Proteomic_profiling_of_yeast.txt$a_metabolite.txt,
                        type = "raw")

## Display output.
isaDat3
#> An "xcmsSet" object with 1 samples
#>
#> Time range: 2506.1-4132.1 seconds (41.8-68.9 minutes)
#> Mass range: 200.1-599.3129 m/z
#> Peaks: 470 (about 470 per sample)
#> Peak Groups: 0
#> Sample classes:
#>
#> Feature detection:
#>  o Peak picking performed on MS1.
#> Profile settings: method = bin
#>                   step = 0.1
#>
#> Memory usage: 0.0804 MB

As the output shows, processing the mass spectrometry data gives an object of class xcmsSet from the xcms package. This object contains all available information from the .CDF file that was read and can be used for further analysis.

## 4.2 Microarray assay files

Microarray data is often stored in Affymetrix Probe Results files. These .CEL files contain the intensity values of the probe sets, where each probe set represents a gene. Assay data referring to such files will be processed in a different way than regular assay data. To be able to do this the affy package is required. This package is available from Bioconductor.

Processing microarray data is done in much the same way as processing mass spectrometry data, as described in the previous section. The main difference is that the resulting object will in this case be an object of class ExpressionSet, which is used as input in many Bioconductor packages.

# 5 Writing files in the ISA-Tab format

After updating an ISA object, it can be written back to a directory using the writeISAtab() function. All content of the ISA object will be written to investigation, study and assay files following the ISA-Tab standard for file specification. By default the files are written to the current working directory, but the directory can be specified using the path argument.

## Write content of ISA object to a temporary directory.
writeISAtab(isaObject = isaObject1,
            path = tempdir())

Note that existing files are always overwritten. Therefore, writing files to the directory from which the original files were read will result in the original files being overwritten.
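One way to guard against accidental overwriting is to check the target directory before writing. The base R sketch below is one possible approach; the helper `checkNoOverwrite` is hypothetical and not part of isatabr, and it assumes the ISA-Tab naming patterns described in section 1.

```r
## Hypothetical helper: stop if the directory already holds ISA-Tab files.
checkNoOverwrite <- function(path) {
  existing <- list.files(path, pattern = "^[isa]_.*\\.txt$")
  if (length(existing) > 0) {
    stop("Directory already contains ISA-Tab files: ",
         paste(existing, collapse = ", "))
  }
  invisible(path)
}

## Create a fresh, empty output directory and check it.
outDir <- tempfile("isa_out_")
dir.create(outDir)
checkNoOverwrite(outDir)

## The checked directory could then be passed on, e.g.:
## writeISAtab(isaObject = isaObject1, path = outDir)
```

Writing to a freshly created directory, as done here, avoids the problem altogether; the check is mainly useful when the output directory is chosen by the user.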

Besides writing the full ISA object, it is also possible to write only the investigation file, one or more study files, or one or more assay files.

## Write investigation file.
writeInvestigationFile(isaObject = isaObject1,
                       path = tempdir())

## Write study file.
writeStudyFiles(isaObject = isaObject1,
                studyFilenames = "s_study1.txt",
                path = tempdir())

## Write assay file.
writeAssayFiles(isaObject = isaObject1,
                assayFilenames = "a_study1.txt",
                path = tempdir())

## References

Atwell, Susanna, Yu S. Huang, Bjarni J. Vilhjálmsson, Glenda Willems, Matthew Horton, Yan Li, Dazhe Meng, et al. 2010. “Genome-Wide Association Study of 107 Phenotypes in Arabidopsis Thaliana Inbred Lines.” Nature 465 (7298): 627–31. https://doi.org/10.1038/nature08800.
Saghatelian, Alan, Sunia A. Trauger, Elizabeth J. Want, Edward G. Hawkins, Gary Siuzdak, and Benjamin F. Cravatt. 2004. “Assignment of Endogenous Substrates to Enzymes by Global Metabolite Profiling.” Biochemistry 43 (45): 14332–39. https://doi.org/10.1021/bi0480335.