Getting started with wrProteo

Wolfgang Raffelsberger



This package contains a collection of various tools for Proteomics used at the proteomics platform of the IGBMC. To get started, we need to load the packages “wrMisc” and this package (wrProteo), both are available from CRAN. The packages wrGraph and RColorBrewer get used internally by some of the functions from this package for (optional/improved) figures. Furthermore, the Bioconductor package limma will be used internally for statistical testing

## if you need to install the packages 'wrMisc','wrProteo' and 'wrGraph' from CRAN :
## The package 'wrGraph' is not obligatory, but it allows making better graphs
# This is wrProteo version no :
#> [1] '1.3.0'

Calculating Molecular Masses From Composition Formulas

Please note that molecular masses may be given in two flavours : Monoisotopic mass and average mass. Please refer eg to Wikipedia: monoisotopic mass for details. Monoisotopic masses commonly are used in mass-spectrometry and will be used by default in this package.

Molecular (mono-isotopic) masses of the atomes used here were taken from Unimod. They can be easily updated, if in the future, (mono-isotopic) molecular masses will be determined with higher precision (ie more digits).

Molecular masses based on (summed) chemical formulas

At this level (summed) atomic compositions are evaluated. Here, the number of atoms has to be written before the atom. Thus, ‘2C’ means two atoms of carbon. Empty or invalid entries will be by default returned as mass=0.0, a message will comment such issues.

The mass of an electron can be assigned using ‘e’.

massDeFormula(c("12H12O", "HO", " 2H 1 Se, 6C 2N", "HSeCN", " ", "e"))
#>  -> massDeFormula :  can't find ' ' .. setting to 0 mass
#>       12H12O         1H1O  2H1Se1e6C2N  1H1Se1e1C1N                        1e 
#> 2.040329e+02 1.700274e+01 1.819389e+02 1.069280e+02 0.000000e+00 5.485799e-04
# Note, that empty/invalid entries will be returned as a mass of 0.0 .

# Ignore empty/invalid entries
massDeFormula(c("12H12O", "HO", " 2H 1 Se, 6C 2N", "HSeCN"), rmEmpty=TRUE)
#>      12H12O        1H1O 2H1Se1e6C2N 1H1Se1e1C1N 
#>   204.03288    17.00274   181.93887   106.92797

Using the argument massTy one can switch this and similar functions from this package from default monoisotopic mass to average mass mode :

massDeFormula(c("12H12O", "HO", " 2H 1 Se, 6C 2N", "HSeCN"), massTy="aver")
#>      12H12O        1H1O 2H1Se1e6C2N 1H1Se1e1C1N 
#>   204.08808    17.00734   181.05403   105.98589

Molecular masses based on amino-acid sequence

The mass of these amino-acids can be used:

#>         A         R         N         D         C         E         Q         G 
#>  71.03711 156.10111 114.04293 115.02694 103.00918 129.04259 128.05858  57.02146 
#>         H         I         L         K         M         F         P         S 
#> 137.05891 113.08406 113.08406 128.09496 131.04048 147.06841  97.05276  87.03203 
#>         T         W         Y         V 
#> 101.04768 186.07931 163.06333  99.06841

Here the one-letter amino-acid code is used to descibre peptides or proteins.

## mass of peptide (or protein)
pep1 <- c(aa="AAAA",de="DEFDEF")
convAASeq2mass(pep1, seqN=FALSE)
#>       aa       de 
#> 302.1590 800.2865

Reading Fasta Files (from Uniprot)

This package contains a parser for Fasta-files allowing to separate different fields of meta-data like IDs, Name and Species. Here we will read a tiny example fasta-file obtained from a collection of typical contaminants in proteomics.

path1 <- system.file('extdata', package='wrProteo')
fiNa <-  "conta1.fasta"
## basic reading of Fasta
fasta1 <- readFasta2(file.path(path1,fiNa))
#>  - attr(*, "names")= chr [1:35] "TRYP_PIG TRYPSIN PRECURSOR" "TRY1_BOVIN Cationic trypsin (Fragment) - Bos taurus (Bovine)" "CTRA_BOVIN Chymotrypsinogen A - Bos taurus (Bovine)" "CTRB_BOVIN Chymotrypsinogen B - Bos taurus (Bovine)" ...

## now let's read and further separate details in annotation-fields
fasta1det <- readFasta2(file.path(path1,fiNa), tableOut=TRUE)
#>  chr [1:35, 1:9] "generic" "generic" "generic" "generic" "generic" ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:9] "database" "uniqueIdentifier" "entryName" "proteinName" ...


In most ‘Omics’ acitivities getting additional annotation may get a bit tricky. In Proteomics most mass-spectrometry software will use the informaton provided in the Fasta-file as annotation (as provided from UniProt). But this lacks for example chromosomal location information. There are are many repositories with genome-, gene- and protein-annotation and most of them are linked, but sometimes the links get broken when data-base updates are not done everywhere or are not followed by new re-matching. The Fasta-files used initially for mass-spectrometry peak-identification may be (slightly) not up to date (sometimes gene-or protein IDs do change or may even disappear) and thus will contribute to a certain percentage of entries hard to link.

In the context of adding chromosomal annotation to a list of proteins here the following concept is developed : Annotation-tables from UCSC are available for a good number of species and can be downloaded for conventient off-line search. Howwever, in the context of less common species we realized that the UniProt tables from UCSC had many times low yield in final matching. For this reason we propose the slightly more complicated route that provided finally a much higher success-rate to find chromosomal locations for a list of UniProt IDs. First one needs to download from UCSC the table corresponding to the species of question fields ‘clade’,‘genome’,‘assembly’). For ‘group’ choose ‘Genes and Gene Predictions’ and for ‘track’ choose ‘Ensembl Genes’, as table choose ‘ensGene’. In addition, it is possible to select either the entire genome-annotation or user-specified regions. In terms of ‘output-format’ one may choose ‘GTF’ (slightly more condensed, no headers) or ‘all filds from selected table’.

The strategy presented here :

Locate & download organism annotation from UCSC, read into R (readUCSCtable()) -> from R export (non-redundant) ‘enst’-IDs (still readUCSCtable() ), get corresponding UniProt-IDs at UniProt site, save as file and import result into R (readUniProtExport() ) -> (in R) combine with initial UCSC table (readUniProtExport() )

The function readUCSCtable() is able to read such files downloaded from UCSC, compressed .gz files can be read, too. In the example below we’ll just look at chromosome 11 of the human genome - to keep files small.

path1 <- system.file("extdata", package="wrProteo")
gtfFi <- file.path(path1, "UCSC_hg38_chr11extr.gtf")
UcscAnnot1 <- readUCSCtable(gtfFi)
#>  -> readUCSCtable :  File 'UCSC_hg38_chr11extr.gtf' is gtf : TRUE
#>  -> readUCSCtable :  simplifed from 2000 to 223 non-redundant gene_id

# The Ensemble transcript identifyers and their chromosomal locations :
#>      gene_id              chr     start    end      strand frame
#> [1,] "ENST00000270115.8"  "chr11" "537527" "554912" "+"    "."  
#> [2,] "ENST00000308020.6"  "chr11" "450279" "491399" "+"    "."  
#> [3,] "ENST00000311189.8"  "chr11" "532242" "535576" "-"    "."  
#> [4,] "ENST00000312165.5"  "chr11" "278570" "285304" "+"    "."  
#> [5,] "ENST00000325113.8"  "chr11" "196738" "200258" "+"    "."  
#> [6,] "ENST00000325147.13" "chr11" "202924" "207382" "-"    "."

However, this annotation does not provide protein IDs. In order to obtain the corresponding protein IDs an additional step is required : Here we will use the batch search/conversion tool from UniProt. In order to do so, we can export directly from readUCSCtable() a small text-file which can be fed into the UniProt batch-search tool.

# Here we'll redo reading the UCSC table, plus immediatley write the file for UniProt conversion 
#  (in this vignette we write to tempdir() to keep things tidy)
expFi <- file.path(tempdir(),"deUcscForUniProt2.txt")
UcscAnnot1 <- readUCSCtable(gtfFi, exportFileNa=expFi)
#>  -> readUCSCtable :  File 'UCSC_hg38_chr11extr.gtf' is gtf : TRUE
#>  -> readUCSCtable :  simplifed from 2000 to 223 non-redundant gene_id
#>  -> readUCSCtable :  Write to file : ENST00000270115, ENST00000308020, ENST00000311189, ENST00000312165 ...
#>  -> readUCSCtable :  Exporting file  'C:/Users/wraff/AppData/Local/Temp/RtmpQnVbmI/deUcscForUniProt2.txt'  for conversion on

Now everything is ready to go to UniProt for retrieving the corresponding UniProt-IDs. Since we exported Ensemble transcript IDs (ENSTxxx), select converting from ‘Ensembl Transcript’ to ‘UniProtKB’. Then, when downloading the conversion results, choose tab-separated file format (compression is recommended), this may take several seconds (depending on the size).

It is suggested to rename the downloaded file so one can easily understand its content. Note, that the function readUniProtExport() can also read .gz compressed files. To continue this vignette we’ll use a result which has been downloaded from UniProt and renamed to ‘’. One may also optionally define a specific genomic region of interest using the argument ‘targRegion’, here the entire chromosome 11 was chosen.

deUniProtFi <- file.path(path1, "")
deUniPr1 <- readUniProtExport(UniP=deUniProtFi, deUcsc=UcscAnnot1, targRegion="chr11:1-135,086,622")
#>  -> readUniProtExport :  low yield matching 'enst00000410108', 'enst00000342878' and 'enst00000325113,enst00000525282' and 'ENST00000270115', 'ENST00000308020' and 'ENST00000311189' convert all to lower case and remove version numbers ('xxx.2') for better matching
#>  -> readUniProtExport :  intergrating genomic information for 88 entries (14 not found)
#>  -> readUniProtExport :  88 IDs in output
#> 'data.frame':    88 obs. of  14 variables:
#>  $ EnsTraID     : chr  "enst00000410108" "enst00000342878" "enst00000325113,enst00000525282" "enst00000342593" ...
#>  $ UniProtID    : chr  "B8ZZS0" "Q8TD33" "Q96PU9" "F8W6Z3" ...
#>  $   : chr  "B8ZZS0_HUMAN" "SG1C1_HUMAN" "ODF3A_HUMAN" "F8W6Z3_HUMAN" ...
#>  $ Status       : chr  "unreviewed" "reviewed" "reviewed" "unreviewed" ...
#>  $ Protein.names: chr  "BET1-like protein" "Secretoglobin family 1C member 1 (Secretoglobin RYD5)" "Outer dense fiber protein 3 (Outer dense fiber of sperm tails protein 3) (Sperm tail protein SHIPPO 1) (Transcr"| __truncated__ "Outer dense fiber protein 3" ...
#>  $ Gene.names   : chr  "BET1L" "SCGB1C1" "ODF3 SHIPPO1 TISP50" "ODF3" ...
#>  $ Length       : int  152 95 254 127 102 111 92 58 88 148 ...
#>  $ chr          : chr  "chr11" "chr11" NA "chr11" ...
#>  $ start        : num  167784 193078 NA 197303 202924 ...
#>  $ end          : num  207383 194575 NA 200033 207382 ...
#>  $ strand       : chr  "-" "+" NA "+" ...
#>  $ frame        : chr  "." "." NA "." ...
#>  $ avPos        : num  187584 193826 NA 198668 205153 ...
#>  $ inTarg       : logi  FALSE FALSE NA FALSE FALSE FALSE ...

The resulting data.frame (ie the column ‘UniProtID’) may be used to complement protein annotation after importing mass-spectrometry peak- and protein-identification results. Obviously, using recent Fasta-files from UniProt for protein-identification will typically give better matching at the end.

You may note that sometimes Ensemble transcript IDs are written as ‘enst00000410108’ whereas at other places it may be written as ‘ENST00000410108.5’. The function readUniProtExport() switches to a more flexible search mode stripping of version-numbers and reading all as lower-caps, if initial direct matching reveals less than 4 hits.

Finally, it should be added, that of course other ways of retrieving annotation exist, in particular using the annotation-packages provided by Bioconductor.

Importing and Treating Data From Quantitative Proteomics

This package provides support for importing quantitation results from Proteome Discoverer, MaxQuant and Proline. All quantitation import functions offer special features for separating species for optional different treatment. Normalization aims to eliminate/reduce variability introduced to the data not linked to the original biological question. Of course, a well-designed experiment with sufficient replicates is crucial for performing efficient normalization.
Technical replicates are very frequently found in proteomics, they allow to assess the influence of repeated injection of the same material. Biological replicates allow interpreting experiments in a more general way. Multiple options for normalization are available in the package wrMisc and can be called when importing data. Global median or global mean normalization have proved to provide robust results in numerous settings. However, they depend on the hypothesis that there are no global changes. This implies that the number of proteins changing abundance in a strong way should be small.

Typical contaminants in proteomics like keratins are very abundant and thus may influence global mean normalization. The import functions from this package lets the user to choose if all proteins or a subset should be considered when determining normalization factors. For example, you may want to exclude contaminants from normalization since their dynamics may reflect other cofactors than the biological question. In the case of benchmark-tests it is very common to different species, like the human spike-in set UPS-1.
In such cases you may orient normalization on a primary species to avoid having high concentrations of spike-in standards influencing the normalization.

All import functions generate lists separating (selected) annotation data (annot) from normalized log2-quantitation data (quant) and initial quantitation (raw).

Read data from Proteome Discoverer

Proteome Discoverer is commercial software from ThermoFisher ( In Proteome Discoverer quantitation data on level of consensus-proteins should be exported to tabulated text files, which can be treated by this function. Note, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate.

path1 <- system.file("extdata", package="wrProteo")
fiNaPd <- "exampleProtDiscov1.txt"
dataPD <- readPDExport(file=fiNaPd, path=path1, normalizeMeth="median")
#>  -> readPDExport : Can't find column(s) 'Contaminant'
#>  -> readPDExport :  Note: 299 (out of 299) unrecognized species

## a summary of the quantitation data
#>      S1rep1          S1rep2          S1rep3          S2rep1     
#>  Min.   :17.63   Min.   :17.49   Min.   :18.37   Min.   :17.15  
#>  1st Qu.:20.18   1st Qu.:20.23   1st Qu.:20.03   1st Qu.:20.27  
#>  Median :21.72   Median :21.72   Median :21.72   Median :21.72  
#>  Mean   :21.75   Mean   :21.83   Mean   :21.71   Mean   :21.90  
#>  3rd Qu.:23.06   3rd Qu.:23.19   3rd Qu.:23.02   3rd Qu.:23.29  
#>  Max.   :28.16   Max.   :28.29   Max.   :28.16   Max.   :28.70  
#>  NA's   :1       NA's   :3       NA's   :1       NA's   :2      
#>      S2rep2          S2rep3     
#>  Min.   :17.58   Min.   :17.14  
#>  1st Qu.:20.23   1st Qu.:20.09  
#>  Median :21.72   Median :21.72  
#>  Mean   :21.86   Mean   :21.78  
#>  3rd Qu.:23.26   3rd Qu.:23.13  
#>  Max.   :28.65   Max.   :28.58  
#>                  NA's   :2

Read data from MaxQuant

MaxQuant is free software provided by the Max-Planck-Insutute, see Tyanova et al 2016 (doi: 10.1093/bioinformatics/btaa118). Typically MaxQuant exports by default quantitation data on level of consensus-proteins as a folder called txt with a file called proteinGroups.txt . So in a standard case one needs only to provide the path to this file. However, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate, and thus it has gotten a different name.

path1 <- system.file("extdata", package="wrProteo")
fiNaMa <- "proteinGroupsMaxQuantUps1.txt"
specPref1 <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="YEAST", spike="HUMAN_UPS")
dataMQ <- readMaxQuantFile(path1, file=fiNaMa, specPref=specPref1, normalizeMeth="median")
#>  -> readMaxQuantFile :  Note: 6 proteins with unknown species
#>    by species : conta: 3  mainSpe: 245  species2: 48
#>  -> readMaxQuantFile : Found 6 composite accession-numbers, truncating (eg CON__P01966)
#>  -> readMaxQuantFile : Use column 'Accession' (has fewest, ie 0 duplicated entries) as rownames

## a summary of the quantitation data
#>    25000am.1       25000am.2       25000am.3        2500am.1    
#>  Min.   :20.95   Min.   :21.22   Min.   :20.96   Min.   :19.52  
#>  1st Qu.:27.01   1st Qu.:27.01   1st Qu.:26.97   1st Qu.:26.96  
#>  Median :27.66   Median :27.66   Median :27.66   Median :27.66  
#>  Mean   :27.74   Mean   :27.74   Mean   :27.74   Mean   :27.33  
#>  3rd Qu.:28.44   3rd Qu.:28.44   3rd Qu.:28.48   3rd Qu.:28.48  
#>  Max.   :33.20   Max.   :33.19   Max.   :33.11   Max.   :33.10  
#>  NA's   :2       NA's   :2       NA's   :3       NA's   :10     
#>     2500am.2        2500am.3        250am.1         250am.2     
#>  Min.   :18.69   Min.   :19.53   Min.   :18.64   Min.   :17.98  
#>  1st Qu.:26.99   1st Qu.:26.91   1st Qu.:27.08   1st Qu.:27.08  
#>  Median :27.66   Median :27.66   Median :27.66   Median :27.66  
#>  Mean   :27.32   Mean   :27.29   Mean   :27.70   Mean   :27.66  
#>  3rd Qu.:28.49   3rd Qu.:28.48   3rd Qu.:28.53   3rd Qu.:28.54  
#>  Max.   :33.22   Max.   :33.16   Max.   :33.11   Max.   :33.13  
#>  NA's   :9       NA's   :10      NA's   :39      NA's   :38     
#>     250am.3     
#>  Min.   :18.75  
#>  1st Qu.:27.06  
#>  Median :27.66  
#>  Mean   :27.71  
#>  3rd Qu.:28.47  
#>  Max.   :33.08  
#>  NA's   :41

Read data from Proline

Proline is free software provided by the Profi-consortium, see Bouyssié et al 2020 (doi: 10.1038/nprot.2016.136). In Proline quantitation data on level of consensus-proteins can be exported to csv or tabulated text files, which can be treated by this function. Note, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate.

normalizeMeth <- NULL
path1 <- system.file("extdata", package="wrProteo")
fiNaPl <- "exampleProlineABC.csv"
dataPL <- readProlineFile(file.path(path1,fiNaPl) )

## a summary of the quantitation data
#>  A_01.raw.mzDB.t.xml A_02.raw.mzDB.t.xml A_03.raw.mzDB.t.xml
#>  Min.   :14.26       Min.   :14.09       Min.   :14.10      
#>  1st Qu.:19.26       1st Qu.:19.26       1st Qu.:19.22      
#>  Median :20.80       Median :20.67       Median :20.72      
#>  Mean   :20.98       Mean   :20.95       Mean   :20.93      
#>  3rd Qu.:22.50       3rd Qu.:22.46       3rd Qu.:22.54      
#>  Max.   :28.66       Max.   :28.62       Max.   :28.61      
#>  NA's   :29          NA's   :27          NA's   :21         
#>  A_04.raw.mzDB.t.xml B_01.raw.mzDB.t.xml B_02.raw.mzDB.t.xml
#>  Min.   :14.35       Min.   :15.78       Min.   :12.15      
#>  1st Qu.:19.37       1st Qu.:18.83       1st Qu.:18.90      
#>  Median :20.65       Median :20.30       Median :20.22      
#>  Mean   :21.00       Mean   :20.43       Mean   :20.42      
#>  3rd Qu.:22.48       3rd Qu.:22.03       3rd Qu.:22.08      
#>  Max.   :28.73       Max.   :27.61       Max.   :27.74      
#>  NA's   :27          NA's   :77          NA's   :77         
#>  B_03.raw.mzDB.t.xml B_04.raw.mzDB.t.xml C_01.raw.mzDB.t.xml
#>  Min.   :15.32       Min.   :13.37       Min.   :14.44      
#>  1st Qu.:18.78       1st Qu.:18.61       1st Qu.:19.31      
#>  Median :20.26       Median :20.17       Median :20.92      
#>  Mean   :20.42       Mean   :20.33       Mean   :21.08      
#>  3rd Qu.:22.00       3rd Qu.:21.94       3rd Qu.:22.65      
#>  Max.   :27.75       Max.   :27.75       Max.   :28.85      
#>  NA's   :76          NA's   :73          NA's   :22         
#>  C_02.raw.mzDB.t.xml C_03.raw.mzDB.t.xml C_04.raw.mzDB.t.xml
#>  Min.   :14.35       Min.   :14.69       Min.   :13.84      
#>  1st Qu.:19.32       1st Qu.:19.18       1st Qu.:19.41      
#>  Median :20.85       Median :20.77       Median :20.80      
#>  Mean   :21.06       Mean   :21.03       Mean   :21.10      
#>  3rd Qu.:22.63       3rd Qu.:22.66       3rd Qu.:22.65      
#>  Max.   :28.84       Max.   :28.83       Max.   :28.83      
#>  NA's   :18          NA's   :20          NA's   :21


As mentioned, the aim of normalization is to remove bias in data not linked to the original (biological) question. The import functions presented above do already by default global median normalization. In overall it is important to inspect results from normalization, graphical display of histograms, boxplots or violin-plots usually work well to compare distributions. Multiple options exist for normalizing data, for example the function normalizeThis from the package wrMisc can be used to run additional normalization.

Different normalization procedures intervene with different ‘aggressiveness’, ie capacity of deforming the initial data. In general we suggest to start normalizing using ‘milder’ procedures, like global median and switch to more intervening methods if initial results seem not satisfactory. Beware, heavy normalization procedures may also alter the main information you want to analyze. Ie, some biologically true positive changes may start to fade or disappear when inappropriate normalization gets performed. Please note, that normalization should be performed before NA-imputation to avoid introducing new bias in the group of imputed values.

Imputation of NA-values

In Proteomics the quantitation of very low abundances is very challenging and typically absent or very low abundances appear in the results as 0 or NA. Typically this may be linked to the fact that no peak is detected in a MS-1 chromatogram (for a given LC elution-time) where other samples had a strong peak with successful MS-2 identification. Data treated with MaxQuant have typically a higher degree of NA values. To simplify the treatment all 0 values are transformed to NA, anyway they would not allow log2 transformation either.

Before replacing NA-values it is important to verify that such values may be associated to absent or very low abundances. To do so, we suggest to inspect groups of replicate-measurements. In particular with multiple technical replicates of the same sample it is supposed that any variability observed is not linked to the sample itself. So for each NA that occurs in the data we suggest to look what was reported for the same protein with the other (technical) replicates. This brings us to the term of ‘NA-neighbours’ (quantifications for the same protein in replicates). When drawing histograms of NA-neighbours one can visually inspect and verify that NA-neighbours are typically low abundance values, however, but not necessarily the lowest values observed in the entire data-set.

This package proposes two related strategies for NA-imputation. First, the classical imputation of NA-values using Normal distributed random data is presented. Then, a more elaborate version based on repeated implementations to obtain more robust results is presented.

## Let's inspect NA values as graphic
matrixNAinspect(dataPD$quant, gr=gl(2,3), tit="Tiny example data from ProteomeDiscoverer") 

## Let's inspect NA values as graphic
matrixNAinspect(dataMQ$quant, gr=gl(3,3), tit="Tiny example data from MaxQuant") 

## Let's inspect NA values as graphic
matrixNAinspect(dataPL$quant, gr=as.factor(substr(colnames(dataPL$quant),1,1)),
  tit="Tiny example data from Proline") 

So only if the hypothesis of NA-neighbours as typically low abundance values gets confirmed by visual inspection of the histograms, one may proceed to replacing them by low random values. If one uses a unique (very) low value for NA-replacements, this will easily pose a problem when t-tests should be employed to look for proteins changing abundance between two or more groups of samples. Therefore it is common practice to draw random values from a Normal distribution representing this lower end of abundance values. Nevertheless, the choice of the parameters of this Normal distribution is very delicate.

The functions proposed here offer automatic selection of these parameters, which have been tested in a number of different projects. However, this choice should be checked by critically inspecting the histograms of ‘NA-neighbours’ (ie successful quantitation in other replicate samples of the same protein) and the final resulting distribution. Initially all NA-neighbours are extracted. It is also worth mentioning that in the majority of data-sets we encountered, such NA-neighbours form skewed distributions. Occasionally one may also find instances where 2 out of 3 (or more replicates) are NA. The successful quantitation of such instances with 2 NA-values per group are considered even more representative, but of course much less observed values remain. Thus a primary choice is made: If the selection of (min) 2 NA-values per group has more than 300 values, this distribution will be used as base to model the distribution for drawing random values. In this case, by default the 0.18 quantile of the 2 NA-neighbour distribution will be used as mean for the new Normal distribution used for NA-replacements. If the number of 2 NA-neighbours is >= 300, (by default) the 0.1 quantile all NA-neighbour values will used. Of course, the user has also the possibility to use custom choices for these parameters.

The final replacement is done on all NA values. This also includes proteins with are all NA in a given condition as well a instances of mixed successful quantitation and NA values.

## ProteomeDiscoverer
dataPD$NArepl <- matrixNAneighbourImpute(dataPD$quant,gr=gl(2,3),
  tit="Tiny example from ProteomeDiscoverer") 
#>  -> matrixNAneighbourImpute :  n.woNA= 1785  n.NA = 9
#>     model 10 %-tile of (min 1 NA/grp) 10 NA-neighbour values
#>     imputation: mean= 18.1   sd= 0.56

## MaxQuant
dataMQ$NArepl <- matrixNAneighbourImpute(dataMQ$quant,gr=gl(3,3),tit="Tiny example from MaxQuant") 
#>  -> matrixNAneighbourImpute :  n.woNA= 2564  n.NA = 154
#>     model 10 %-tile of (min 1 NA/grp) 36 NA-neighbour values
#>     imputation: mean= 19   sd= 0.74

## Proline
dataPL$NArepl <- matrixNAneighbourImpute(dataPL$quant,gr=gl(3,4),tit="Tiny example from Proline") 
#>  -> matrixNAneighbourImpute :  n.woNA= 4012  n.NA = 488
#>     model 10 %-tile of (min 1 NA/grp) 285 NA-neighbour values
#>     imputation: mean= 16.4   sd= 0.62

However, imputing using normal distributed random data also brings the risk of occasional extreme values. In the extreme case it may happen that a given protein is all NA in one group, and by chance the random values turn out be rather high. Then the final group mean of imputed values may be even higher than the mean of another group with successful quantitations. Of course in this case it would be a bad interpretation to consider the protein in question upregulated in the sample where initially all values were NA. To circumvent this problem there are 2 options : Firstly, one may use special filtering schemes to exclude such constellations from final results or secondly, one could repeat replacement of NA-values numerous times.

This type of filtering can be performed using presenceFilt (package wrMisc).

The function testRobustToNAimputation() allows also such repeated replacement of NA-values. See also the following section.

Statistical Testing

Statistical testing in the context of proteomics data poses challenges similar to transcriptomics : Many times the number of replicate-samples is fairly low and the inter-measurement variability quite high. In some unfortunate cases proteins with rather constant quantities may appear as false positives when searching for proteins who’s abundance changes between two groups of samples : If the apparent variability is by chance too low, the respective standard-deviations will be low and a plain t-test may give very enthusiastic p-values.
Next to stringent filtering (previous section of this vignette) the shrinkage when estimating the intra-group/replicate variance from the Bioconductor package limma turns out very helpful, see Ritchie et al 2015 (doi: 10.1093/nar/gkv007). In this package the function eBayes() has been used an it’s use adopted to proteomics.

The function testRobustToNAimputation() performs NA-imputation and statistical testing (after repeated imputation) between all groups of samples the same time (as it would be inefficient to separate these two tasks). The tests underneath apply shrinkage from the empirical Bayes procedure from the bioconductor package limma. In addition, various formats of multiple test correction can be directly added to the results : Benjamini-Hochberg FDR, local false discovery rate (lfdr, using the package fdrtool, see Strimmer 2008 doi: 10.1093/bioinformatics/btn209), or modified testing by ROTS, etc …

The fact that a single round of NA-imputation may provoke false positives as well as false negatives, made it necessary to combine this (iterative) process of NA-imputation and subsequent testing in one single function.

## Impute NA-values repeatedly and run statistical testing after each round of imputations
testPL <- testRobustToNAimputation(dataPL$quant, gr=gl(3,4), lfdrInclude=FALSE) 
#> -> testRobustToNAimputation -> matrixNAneighbourImpute :  n.woNA= 4012  n.NA = 488
#>     model 10 %-tile of (min 1 NA/grp) 285 NA-neighbour values
#>     imputation: mean= 16.4   sd= 0.62
#> -> testRobustToNAimputation -> combineMultFilterNAimput :     at presenceFilt:   352 360 357   out of  375 
#> -> testRobustToNAimputation -> combineMultFilterNAimput :     at abundanceFilt:  346 355 353
#> -> testRobustToNAimputation -> combineMultFilterNAimput :    at NA> mean:   284, 338 and 288
## Note, this example is too small for reliable lfdr estimation (would give warning)

## test results: classical BH FDR
#>                 1-2       1-3          2-3
#> P0ACB7 1.277109e-01 0.1607831 2.065986e-01
#> P15078           NA 0.8787038           NA
#> P02925 1.358713e-04 0.9537559 1.132843e-04
#> P42588 2.051206e-05 0.6742809 7.391326e-06
#> P0AGE9 3.871227e-06 0.1925214 4.751582e-07
#> P0A799 3.886196e-07 0.3789325 4.504443e-07
sum(testPL$BH[,1] < 0.05, na.rm=TRUE)
#> [1] 242

## the data after repeated NA-imputation
#>        A_01.raw.mzDB.t.xml A_02.raw.mzDB.t.xml A_03.raw.mzDB.t.xml
#> P0ACB7            22.56945            22.19699            22.39452
#> P15078            17.61153            18.00776            16.35186
#> P02925            25.40560            25.28552            25.29061
#> P42588            19.35777            19.39723            19.73350
#> P0AGE9            23.43934            23.55885            23.47963
#> P0A799            25.36381            25.32356            25.17127
#>        A_04.raw.mzDB.t.xml B_01.raw.mzDB.t.xml B_02.raw.mzDB.t.xml
#> P0ACB7            22.42658            22.48411            22.66792
#> P15078            17.15051            16.29294            16.39084
#> P02925            25.33628            24.08922            24.01434
#> P42588            19.06219            17.77645            17.89538
#> P0AGE9            23.28007            22.72747            22.61432
#> P0A799            25.20114            24.21076            24.23250
#>        B_03.raw.mzDB.t.xml B_04.raw.mzDB.t.xml C_01.raw.mzDB.t.xml
#> P0ACB7            22.41511            22.90290            22.97558
#> P15078            16.34952            16.30732            17.26139
#> P02925            23.94017            24.86961            25.32381
#> P42588            18.11154            17.76764            19.64330
#> P0AGE9            22.56039            22.50835            23.64638
#> P0A799            24.10540            24.12083            25.22358
#>        C_02.raw.mzDB.t.xml C_03.raw.mzDB.t.xml C_04.raw.mzDB.t.xml
#> P0ACB7            22.85138            22.70813            22.64852
#> P15078            17.44857            17.38586            17.40649
#> P02925            25.35219            25.33581            25.38618
#> P42588            19.08661            19.58276            19.89625
#> P0AGE9            23.70737            23.60144            23.63291
#> P0A799            25.07744            25.14947            25.09119


The author wants to acknowledge the support by the IGBMC (CNRS UMR 7104, Inserm U 1258), CNRS, IGBMC, Université de Strasbourg and Inserm. My collegues from the proteomics platform at the IGBMC work very commited to provide high quality mass-spectrometry data (including some of those used here). Furthermore, many very fruitful discussions with colleages on national and international level have helped to improve and disseminate the tools presented here.


#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> Matrix products: default
#> locale:
#> [1] LC_COLLATE=C                   LC_CTYPE=French_France.1252   
#> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=French_France.1252    
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> other attached packages:
#> [1] rmarkdown_2.4  knitr_1.30     wrProteo_1.3.0 wrMisc_1.4.2  
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.25      R.methodsS3_1.8.1  magrittr_1.5       evaluate_0.14     
#>  [5] sm_2.2-5.6         rlang_0.4.8        stringi_1.5.3      wrGraph_1.1.0     
#>  [9] limma_3.44.3       R.oo_1.24.0        R.utils_2.10.1     RColorBrewer_1.1-2
#> [13] tools_4.0.3        stringr_1.4.0      xfun_0.18          yaml_2.2.1        
#> [17] compiler_4.0.3     tcltk_4.0.3        htmltools_0.5.0