This package contains a collection of various tools for Proteomics used at the proteomics platform of the IGBMC. To get started, we need to load the packages “wrMisc” and this package (wrProteo), both are available from CRAN. The packages wrGraph and RColorBrewer get used internally by some of the functions from this package for (optional/improved) figures. Furthermore, the Bioconductor package limma will be used internally for statistical testing
## if you need to install the packages 'wrMisc','wrProteo' and 'wrGraph' from CRAN : install.packages("wrMisc") install.packages("wrProteo") ## The package 'wrGraph' is not obligatory, but it allows making better graphs install.packages("wrGraph")
Please note that molecular masses may be given in two flavours : Monoisotopic mass and average mass. Please refer eg to Wikipedia: monoisotopic mass for details. Monoisotopic masses commonly are used in mass-spectrometry and will be used by default in this package.
Molecular (mono-isotopic) masses of the atomes used here were taken from Unimod. They can be easily updated, if in the future, (mono-isotopic) molecular masses will be determined with higher precision (ie more digits).
At this level (summed) atomic compositions are evaluated. Here, the number of atoms has to be written before the atom. Thus, ‘2C’ means two atoms of carbon. Empty or invalid entries will be by default returned as mass=0.0, a message will comment such issues.
The mass of an electron can be assigned using ‘e’.
massDeFormula(c("12H12O", "HO", " 2H 1 Se, 6C 2N", "HSeCN", " ", "e")) #> -> massDeFormula : can't find ' ' .. setting to 0 mass #> 12H12O 1H1O 2H1Se1e6C2N 1H1Se1e1C1N 1e #> 2.040329e+02 1.700274e+01 1.819389e+02 1.069280e+02 0.000000e+00 5.485799e-04 # Note, that empty/invalid entries will be returned as a mass of 0.0 . # Ignore empty/invalid entries massDeFormula(c("12H12O", "HO", " 2H 1 Se, 6C 2N", "HSeCN"), rmEmpty=TRUE) #> 12H12O 1H1O 2H1Se1e6C2N 1H1Se1e1C1N #> 204.03288 17.00274 181.93887 106.92797
Using the argument massTy one can switch this and similar functions from this package from default monoisotopic mass to average mass mode :
The mass of these amino-acids can be used:
AAmass() #> A R N D C E Q G #> 71.03711 156.10111 114.04293 115.02694 103.00918 129.04259 128.05858 57.02146 #> H I L K M F P S #> 137.05891 113.08406 113.08406 128.09496 131.04048 147.06841 97.05276 87.03203 #> T W Y V #> 101.04768 186.07931 163.06333 99.06841
Here the one-letter amino-acid code is used to descibre peptides or proteins.
This package contains a parser for Fasta-files allowing to separate different fields of meta-data like IDs, Name and Species. Here we will read a tiny example fasta-file obtained from a collection of typical contaminants in proteomics.
path1 <- system.file('extdata', package='wrProteo') fiNa <- "conta1.fasta" ## basic reading of Fasta fasta1 <- readFasta2(file.path(path1,fiNa)) str(fasta1) #> Named chr [1:35] "FPTDDDDKIVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYKSRIQVRLGEHNIDVLEGNEQFINAAKIITHPNFNGNTLDNDIMLIKLSSPATLNSRVATV"| __truncated__ ... #> - attr(*, "names")= chr [1:35] "TRYP_PIG TRYPSIN PRECURSOR" "TRY1_BOVIN Cationic trypsin (Fragment) - Bos taurus (Bovine)" "CTRA_BOVIN Chymotrypsinogen A - Bos taurus (Bovine)" "CTRB_BOVIN Chymotrypsinogen B - Bos taurus (Bovine)" ... ## now let's read and further separate details in annotation-fields fasta1det <- readFasta2(file.path(path1,fiNa), tableOut=TRUE) str(fasta1det) #> chr [1:35, 1:9] "generic" "generic" "generic" "generic" "generic" ... #> - attr(*, "dimnames")=List of 2 #> ..$ : NULL #> ..$ : chr [1:9] "database" "uniqueIdentifier" "entryName" "proteinName" ...
In most ‘Omics’ acitivities getting additional annotation may get a bit tricky. In Proteomics most mass-spectrometry software will use the informaton provided in the Fasta-file as annotation (as provided from UniProt). But this lacks for example chromosomal location information. There are are many repositories with genome-, gene- and protein-annotation and most of them are linked, but sometimes the links get broken when data-base updates are not done everywhere or are not followed by new re-matching. The Fasta-files used initially for mass-spectrometry peak-identification may be (slightly) not up to date (sometimes gene-or protein IDs do change or may even disappear) and thus will contribute to a certain percentage of entries hard to link.
In the context of adding chromosomal annotation to a list of proteins here the following concept is developed : Annotation-tables from UCSC are available for a good number of species and can be downloaded for conventient off-line search. Howwever, in the context of less common species we realized that the UniProt tables from UCSC had many times low yield in final matching. For this reason we propose the slightly more complicated route that provided finally a much higher success-rate to find chromosomal locations for a list of UniProt IDs. First one needs to download from UCSC the table corresponding to the species of question fields ‘clade’,‘genome’,‘assembly’). For ‘group’ choose ‘Genes and Gene Predictions’ and for ‘track’ choose ‘Ensembl Genes’, as table choose ‘ensGene’. In addition, it is possible to select either the entire genome-annotation or user-specified regions. In terms of ‘output-format’ one may choose ‘GTF’ (slightly more condensed, no headers) or ‘all filds from selected table’.
The strategy presented here :
Locate & download organism annotation from UCSC, read into R (readUCSCtable()) -> from R export (non-redundant) ‘enst’-IDs (still readUCSCtable() ), get corresponding UniProt-IDs at UniProt site, save as file and import result into R (readUniProtExport() ) -> (in R) combine with initial UCSC table (readUniProtExport() )
The function readUCSCtable() is able to read such files downloaded from UCSC, compressed .gz files can be read, too. In the example below we’ll just look at chromosome 11 of the human genome - to keep files small.
path1 <- system.file("extdata", package="wrProteo") gtfFi <- file.path(path1, "UCSC_hg38_chr11extr.gtf") UcscAnnot1 <- readUCSCtable(gtfFi) #> -> readUCSCtable : File 'UCSC_hg38_chr11extr.gtf' is gtf : TRUE #> -> readUCSCtable : simplifed from 2000 to 223 non-redundant gene_id # The Ensemble transcript identifyers and their chromosomal locations : head(UcscAnnot1) #> gene_id chr start end strand frame #> [1,] "ENST00000270115.8" "chr11" "537527" "554912" "+" "." #> [2,] "ENST00000308020.6" "chr11" "450279" "491399" "+" "." #> [3,] "ENST00000311189.8" "chr11" "532242" "535576" "-" "." #> [4,] "ENST00000312165.5" "chr11" "278570" "285304" "+" "." #> [5,] "ENST00000325113.8" "chr11" "196738" "200258" "+" "." #> [6,] "ENST00000325147.13" "chr11" "202924" "207382" "-" "."
However, this annotation does not provide protein IDs. In order to obtain the corresponding protein IDs an additional step is required : Here we will use the batch search/conversion tool from UniProt. In order to do so, we can export directly from readUCSCtable() a small text-file which can be fed into the UniProt batch-search tool.
# Here we'll redo reading the UCSC table, plus immediatley write the file for UniProt conversion # (in this vignette we write to tempdir() to keep things tidy) expFi <- file.path(tempdir(),"deUcscForUniProt2.txt") UcscAnnot1 <- readUCSCtable(gtfFi, exportFileNa=expFi) #> -> readUCSCtable : File 'UCSC_hg38_chr11extr.gtf' is gtf : TRUE #> -> readUCSCtable : simplifed from 2000 to 223 non-redundant gene_id #> -> readUCSCtable : Write to file : ENST00000270115, ENST00000308020, ENST00000311189, ENST00000312165 ... #> -> readUCSCtable : Exporting file 'C:/Users/wraff/AppData/Local/Temp/RtmpQnVbmI/deUcscForUniProt2.txt' for conversion on https://www.uniprot.org/uploadlists
Now everything is ready to go to UniProt for retrieving the corresponding UniProt-IDs. Since we exported Ensemble transcript IDs (ENSTxxx), select converting from ‘Ensembl Transcript’ to ‘UniProtKB’. Then, when downloading the conversion results, choose tab-separated file format (compression is recommended), this may take several seconds (depending on the size).
It is suggested to rename the downloaded file so one can easily understand its content. Note, that the function readUniProtExport() can also read .gz compressed files. To continue this vignette we’ll use a result which has been downloaded from UniProt and renamed to ‘deUniProt_hg38chr11extr.tab’. One may also optionally define a specific genomic region of interest using the argument ‘targRegion’, here the entire chromosome 11 was chosen.
deUniProtFi <- file.path(path1, "deUniProt_hg38chr11extr.tab") deUniPr1 <- readUniProtExport(UniP=deUniProtFi, deUcsc=UcscAnnot1, targRegion="chr11:1-135,086,622") #> -> readUniProtExport : low yield matching 'enst00000410108', 'enst00000342878' and 'enst00000325113,enst00000525282' and 'ENST00000270115', 'ENST00000308020' and 'ENST00000311189' convert all to lower case and remove version numbers ('xxx.2') for better matching #> -> readUniProtExport : intergrating genomic information for 88 entries (14 not found) #> -> readUniProtExport : 88 IDs in output str(deUniPr1) #> 'data.frame': 88 obs. of 14 variables: #> $ EnsTraID : chr "enst00000410108" "enst00000342878" "enst00000325113,enst00000525282" "enst00000342593" ... #> $ UniProtID : chr "B8ZZS0" "Q8TD33" "Q96PU9" "F8W6Z3" ... #> $ Entry.name : chr "B8ZZS0_HUMAN" "SG1C1_HUMAN" "ODF3A_HUMAN" "F8W6Z3_HUMAN" ... #> $ Status : chr "unreviewed" "reviewed" "reviewed" "unreviewed" ... #> $ Protein.names: chr "BET1-like protein" "Secretoglobin family 1C member 1 (Secretoglobin RYD5)" "Outer dense fiber protein 3 (Outer dense fiber of sperm tails protein 3) (Sperm tail protein SHIPPO 1) (Transcr"| __truncated__ "Outer dense fiber protein 3" ... #> $ Gene.names : chr "BET1L" "SCGB1C1" "ODF3 SHIPPO1 TISP50" "ODF3" ... #> $ Length : int 152 95 254 127 102 111 92 58 88 148 ... #> $ chr : chr "chr11" "chr11" NA "chr11" ... #> $ start : num 167784 193078 NA 197303 202924 ... #> $ end : num 207383 194575 NA 200033 207382 ... #> $ strand : chr "-" "+" NA "+" ... #> $ frame : chr "." "." NA "." ... #> $ avPos : num 187584 193826 NA 198668 205153 ... #> $ inTarg : logi FALSE FALSE NA FALSE FALSE FALSE ...
The resulting data.frame (ie the column ‘UniProtID’) may be used to complement protein annotation after importing mass-spectrometry peak- and protein-identification results. Obviously, using recent Fasta-files from UniProt for protein-identification will typically give better matching at the end.
You may note that sometimes Ensemble transcript IDs are written as ‘enst00000410108’ whereas at other places it may be written as ‘ENST00000410108.5’. The function readUniProtExport() switches to a more flexible search mode stripping of version-numbers and reading all as lower-caps, if initial direct matching reveals less than 4 hits.
Finally, it should be added, that of course other ways of retrieving annotation exist, in particular using the annotation-packages provided by Bioconductor.
This package provides support for importing quantitation results from Proteome Discoverer, MaxQuant and Proline. All quantitation import functions offer special features for separating species for optional different treatment. Normalization aims to eliminate/reduce variability introduced to the data not linked to the original biological question. Of course, a well-designed experiment with sufficient replicates is crucial for performing efficient normalization.
Technical replicates are very frequently found in proteomics, they allow to assess the influence of repeated injection of the same material. Biological replicates allow interpreting experiments in a more general way. Multiple options for normalization are available in the package wrMisc and can be called when importing data. Global median or global mean normalization have proved to provide robust results in numerous settings. However, they depend on the hypothesis that there are no global changes. This implies that the number of proteins changing abundance in a strong way should be small.
Typical contaminants in proteomics like keratins are very abundant and thus may influence global mean normalization. The import functions from this package lets the user to choose if all proteins or a subset should be considered when determining normalization factors. For example, you may want to exclude contaminants from normalization since their dynamics may reflect other cofactors than the biological question. In the case of benchmark-tests it is very common to different species, like the human spike-in set UPS-1.
In such cases you may orient normalization on a primary species to avoid having high concentrations of spike-in standards influencing the normalization.
All import functions generate lists separating (selected) annotation data (annot) from normalized log2-quantitation data (quant) and initial quantitation (raw).
Proteome Discoverer is commercial software from ThermoFisher (www.thermofisher.com). In Proteome Discoverer quantitation data on level of consensus-proteins should be exported to tabulated text files, which can be treated by this function. Note, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate.
path1 <- system.file("extdata", package="wrProteo") fiNaPd <- "exampleProtDiscov1.txt" dataPD <- readPDExport(file=fiNaPd, path=path1, normalizeMeth="median") #> -> readPDExport : Can't find column(s) 'Contaminant' #> -> readPDExport : Note: 299 (out of 299) unrecognized species
## a summary of the quantitation data summary(dataPD$quant) #> S1rep1 S1rep2 S1rep3 S2rep1 #> Min. :17.63 Min. :17.49 Min. :18.37 Min. :17.15 #> 1st Qu.:20.18 1st Qu.:20.23 1st Qu.:20.03 1st Qu.:20.27 #> Median :21.72 Median :21.72 Median :21.72 Median :21.72 #> Mean :21.75 Mean :21.83 Mean :21.71 Mean :21.90 #> 3rd Qu.:23.06 3rd Qu.:23.19 3rd Qu.:23.02 3rd Qu.:23.29 #> Max. :28.16 Max. :28.29 Max. :28.16 Max. :28.70 #> NA's :1 NA's :3 NA's :1 NA's :2 #> S2rep2 S2rep3 #> Min. :17.58 Min. :17.14 #> 1st Qu.:20.23 1st Qu.:20.09 #> Median :21.72 Median :21.72 #> Mean :21.86 Mean :21.78 #> 3rd Qu.:23.26 3rd Qu.:23.13 #> Max. :28.65 Max. :28.58 #> NA's :2
MaxQuant is free software provided by the Max-Planck-Insutute, see Tyanova et al 2016 (doi: 10.1093/bioinformatics/btaa118). Typically MaxQuant exports by default quantitation data on level of consensus-proteins as a folder called txt with a file called proteinGroups.txt . So in a standard case one needs only to provide the path to this file. However, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate, and thus it has gotten a different name.
path1 <- system.file("extdata", package="wrProteo") fiNaMa <- "proteinGroupsMaxQuantUps1.txt" specPref1 <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="YEAST", spike="HUMAN_UPS") dataMQ <- readMaxQuantFile(path1, file=fiNaMa, specPref=specPref1, normalizeMeth="median") #> -> readMaxQuantFile : Note: 6 proteins with unknown species #> by species : conta: 3 mainSpe: 245 species2: 48 #> -> readMaxQuantFile : Found 6 composite accession-numbers, truncating (eg CON__P01966) #> -> readMaxQuantFile : Use column 'Accession' (has fewest, ie 0 duplicated entries) as rownames
## a summary of the quantitation data summary(dataMQ$quant) #> 25000am.1 25000am.2 25000am.3 2500am.1 #> Min. :20.95 Min. :21.22 Min. :20.96 Min. :19.52 #> 1st Qu.:27.01 1st Qu.:27.01 1st Qu.:26.97 1st Qu.:26.96 #> Median :27.66 Median :27.66 Median :27.66 Median :27.66 #> Mean :27.74 Mean :27.74 Mean :27.74 Mean :27.33 #> 3rd Qu.:28.44 3rd Qu.:28.44 3rd Qu.:28.48 3rd Qu.:28.48 #> Max. :33.20 Max. :33.19 Max. :33.11 Max. :33.10 #> NA's :2 NA's :2 NA's :3 NA's :10 #> 2500am.2 2500am.3 250am.1 250am.2 #> Min. :18.69 Min. :19.53 Min. :18.64 Min. :17.98 #> 1st Qu.:26.99 1st Qu.:26.91 1st Qu.:27.08 1st Qu.:27.08 #> Median :27.66 Median :27.66 Median :27.66 Median :27.66 #> Mean :27.32 Mean :27.29 Mean :27.70 Mean :27.66 #> 3rd Qu.:28.49 3rd Qu.:28.48 3rd Qu.:28.53 3rd Qu.:28.54 #> Max. :33.22 Max. :33.16 Max. :33.11 Max. :33.13 #> NA's :9 NA's :10 NA's :39 NA's :38 #> 250am.3 #> Min. :18.75 #> 1st Qu.:27.06 #> Median :27.66 #> Mean :27.71 #> 3rd Qu.:28.47 #> Max. :33.08 #> NA's :41
Proline is free software provided by the Profi-consortium, see Bouyssié et al 2020 (doi: 10.1038/nprot.2016.136). In Proline quantitation data on level of consensus-proteins can be exported to csv or tabulated text files, which can be treated by this function. Note, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate.
normalizeMeth <- NULL path1 <- system.file("extdata", package="wrProteo") fiNaPl <- "exampleProlineABC.csv" dataPL <- readProlineFile(file.path(path1,fiNaPl) )
## a summary of the quantitation data summary(dataPL$quant) #> A_01.raw.mzDB.t.xml A_02.raw.mzDB.t.xml A_03.raw.mzDB.t.xml #> Min. :14.26 Min. :14.09 Min. :14.10 #> 1st Qu.:19.26 1st Qu.:19.26 1st Qu.:19.22 #> Median :20.80 Median :20.67 Median :20.72 #> Mean :20.98 Mean :20.95 Mean :20.93 #> 3rd Qu.:22.50 3rd Qu.:22.46 3rd Qu.:22.54 #> Max. :28.66 Max. :28.62 Max. :28.61 #> NA's :29 NA's :27 NA's :21 #> A_04.raw.mzDB.t.xml B_01.raw.mzDB.t.xml B_02.raw.mzDB.t.xml #> Min. :14.35 Min. :15.78 Min. :12.15 #> 1st Qu.:19.37 1st Qu.:18.83 1st Qu.:18.90 #> Median :20.65 Median :20.30 Median :20.22 #> Mean :21.00 Mean :20.43 Mean :20.42 #> 3rd Qu.:22.48 3rd Qu.:22.03 3rd Qu.:22.08 #> Max. :28.73 Max. :27.61 Max. :27.74 #> NA's :27 NA's :77 NA's :77 #> B_03.raw.mzDB.t.xml B_04.raw.mzDB.t.xml C_01.raw.mzDB.t.xml #> Min. :15.32 Min. :13.37 Min. :14.44 #> 1st Qu.:18.78 1st Qu.:18.61 1st Qu.:19.31 #> Median :20.26 Median :20.17 Median :20.92 #> Mean :20.42 Mean :20.33 Mean :21.08 #> 3rd Qu.:22.00 3rd Qu.:21.94 3rd Qu.:22.65 #> Max. :27.75 Max. :27.75 Max. :28.85 #> NA's :76 NA's :73 NA's :22 #> C_02.raw.mzDB.t.xml C_03.raw.mzDB.t.xml C_04.raw.mzDB.t.xml #> Min. :14.35 Min. :14.69 Min. :13.84 #> 1st Qu.:19.32 1st Qu.:19.18 1st Qu.:19.41 #> Median :20.85 Median :20.77 Median :20.80 #> Mean :21.06 Mean :21.03 Mean :21.10 #> 3rd Qu.:22.63 3rd Qu.:22.66 3rd Qu.:22.65 #> Max. :28.84 Max. :28.83 Max. :28.83 #> NA's :18 NA's :20 NA's :21
As mentioned, the aim of normalization is to remove bias in data not linked to the original (biological) question. The import functions presented above do already by default global median normalization. In overall it is important to inspect results from normalization, graphical display of histograms, boxplots or violin-plots usually work well to compare distributions. Multiple options exist for normalizing data, for example the function normalizeThis from the package wrMisc can be used to run additional normalization.
Different normalization procedures intervene with different ‘aggressiveness’, ie capacity of deforming the initial data. In general we suggest to start normalizing using ‘milder’ procedures, like global median and switch to more intervening methods if initial results seem not satisfactory. Beware, heavy normalization procedures may also alter the main information you want to analyze. Ie, some biologically true positive changes may start to fade or disappear when inappropriate normalization gets performed. Please note, that normalization should be performed before NA-imputation to avoid introducing new bias in the group of imputed values.
In Proteomics the quantitation of very low abundances is very challenging and typically absent or very low abundances appear in the results as 0 or NA. Typically this may be linked to the fact that no peak is detected in a MS-1 chromatogram (for a given LC elution-time) where other samples had a strong peak with successful MS-2 identification. Data treated with MaxQuant have typically a higher degree of NA values. To simplify the treatment all 0 values are transformed to NA, anyway they would not allow log2 transformation either.
Before replacing NA-values it is important to verify that such values may be associated to absent or very low abundances. To do so, we suggest to inspect groups of replicate-measurements. In particular with multiple technical replicates of the same sample it is supposed that any variability observed is not linked to the sample itself. So for each NA that occurs in the data we suggest to look what was reported for the same protein with the other (technical) replicates. This brings us to the term of ‘NA-neighbours’ (quantifications for the same protein in replicates). When drawing histograms of NA-neighbours one can visually inspect and verify that NA-neighbours are typically low abundance values, however, but not necessarily the lowest values observed in the entire data-set.
This package proposes two related strategies for NA-imputation. First, the classical imputation of NA-values using Normal distributed random data is presented. Then, a more elaborate version based on repeated implementations to obtain more robust results is presented.
## Let's inspect NA values as graphic matrixNAinspect(dataPD$quant, gr=gl(2,3), tit="Tiny example data from ProteomeDiscoverer")
## Let's inspect NA values as graphic matrixNAinspect(dataMQ$quant, gr=gl(3,3), tit="Tiny example data from MaxQuant")
## Let's inspect NA values as graphic matrixNAinspect(dataPL$quant, gr=as.factor(substr(colnames(dataPL$quant),1,1)), tit="Tiny example data from Proline")
So only if the hypothesis of NA-neighbours as typically low abundance values gets confirmed by visual inspection of the histograms, one may proceed to replacing them by low random values. If one uses a unique (very) low value for NA-replacements, this will easily pose a problem when t-tests should be employed to look for proteins changing abundance between two or more groups of samples. Therefore it is common practice to draw random values from a Normal distribution representing this lower end of abundance values. Nevertheless, the choice of the parameters of this Normal distribution is very delicate.
The functions proposed here offer automatic selection of these parameters, which have been tested in a number of different projects. However, this choice should be checked by critically inspecting the histograms of ‘NA-neighbours’ (ie successful quantitation in other replicate samples of the same protein) and the final resulting distribution. Initially all NA-neighbours are extracted. It is also worth mentioning that in the majority of data-sets we encountered, such NA-neighbours form skewed distributions. Occasionally one may also find instances where 2 out of 3 (or more replicates) are NA. The successful quantitation of such instances with 2 NA-values per group are considered even more representative, but of course much less observed values remain. Thus a primary choice is made: If the selection of (min) 2 NA-values per group has more than 300 values, this distribution will be used as base to model the distribution for drawing random values. In this case, by default the 0.18 quantile of the 2 NA-neighbour distribution will be used as mean for the new Normal distribution used for NA-replacements. If the number of 2 NA-neighbours is >= 300, (by default) the 0.1 quantile all NA-neighbour values will used. Of course, the user has also the possibility to use custom choices for these parameters.
The final replacement is done on all NA values. This also includes proteins with are all NA in a given condition as well a instances of mixed successful quantitation and NA values.
## ProteomeDiscoverer dataPD$NArepl <- matrixNAneighbourImpute(dataPD$quant,gr=gl(2,3), tit="Tiny example from ProteomeDiscoverer") #> -> matrixNAneighbourImpute : n.woNA= 1785 n.NA = 9 #> model 10 %-tile of (min 1 NA/grp) 10 NA-neighbour values #> imputation: mean= 18.1 sd= 0.56
## MaxQuant dataMQ$NArepl <- matrixNAneighbourImpute(dataMQ$quant,gr=gl(3,3),tit="Tiny example from MaxQuant") #> -> matrixNAneighbourImpute : n.woNA= 2564 n.NA = 154 #> model 10 %-tile of (min 1 NA/grp) 36 NA-neighbour values #> imputation: mean= 19 sd= 0.74
## Proline dataPL$NArepl <- matrixNAneighbourImpute(dataPL$quant,gr=gl(3,4),tit="Tiny example from Proline") #> -> matrixNAneighbourImpute : n.woNA= 4012 n.NA = 488 #> model 10 %-tile of (min 1 NA/grp) 285 NA-neighbour values #> imputation: mean= 16.4 sd= 0.62
However, imputing using normal distributed random data also brings the risk of occasional extreme values. In the extreme case it may happen that a given protein is all NA in one group, and by chance the random values turn out be rather high. Then the final group mean of imputed values may be even higher than the mean of another group with successful quantitations. Of course in this case it would be a bad interpretation to consider the protein in question upregulated in the sample where initially all values were NA. To circumvent this problem there are 2 options : Firstly, one may use special filtering schemes to exclude such constellations from final results or secondly, one could repeat replacement of NA-values numerous times.
This type of filtering can be performed using presenceFilt (package wrMisc).
The function testRobustToNAimputation() allows also such repeated replacement of NA-values. See also the following section.
Statistical testing in the context of proteomics data poses challenges similar to transcriptomics : Many times the number of replicate-samples is fairly low and the inter-measurement variability quite high. In some unfortunate cases proteins with rather constant quantities may appear as false positives when searching for proteins who’s abundance changes between two groups of samples : If the apparent variability is by chance too low, the respective standard-deviations will be low and a plain t-test may give very enthusiastic p-values.
Next to stringent filtering (previous section of this vignette) the shrinkage when estimating the intra-group/replicate variance from the Bioconductor package limma turns out very helpful, see Ritchie et al 2015 (doi: 10.1093/nar/gkv007). In this package the function eBayes() has been used an it’s use adopted to proteomics.
The function testRobustToNAimputation() performs NA-imputation and statistical testing (after repeated imputation) between all groups of samples the same time (as it would be inefficient to separate these two tasks). The tests underneath apply shrinkage from the empirical Bayes procedure from the bioconductor package limma. In addition, various formats of multiple test correction can be directly added to the results : Benjamini-Hochberg FDR, local false discovery rate (lfdr, using the package fdrtool, see Strimmer 2008 doi: 10.1093/bioinformatics/btn209), or modified testing by ROTS, etc …
The fact that a single round of NA-imputation may provoke false positives as well as false negatives, made it necessary to combine this (iterative) process of NA-imputation and subsequent testing in one single function.
## Impute NA-values repeatedly and run statistical testing after each round of imputations testPL <- testRobustToNAimputation(dataPL$quant, gr=gl(3,4), lfdrInclude=FALSE) #> -> testRobustToNAimputation -> matrixNAneighbourImpute : n.woNA= 4012 n.NA = 488 #> model 10 %-tile of (min 1 NA/grp) 285 NA-neighbour values #> imputation: mean= 16.4 sd= 0.62 #> -> testRobustToNAimputation -> combineMultFilterNAimput : at presenceFilt: 352 360 357 out of 375 #> -> testRobustToNAimputation -> combineMultFilterNAimput : at abundanceFilt: 346 355 353 #> -> testRobustToNAimputation -> combineMultFilterNAimput : at NA> mean: 284, 338 and 288 ## Note, this example is too small for reliable lfdr estimation (would give warning) ## test results: classical BH FDR head(testPL$BH) #> 1-2 1-3 2-3 #> P0ACB7 1.277109e-01 0.1607831 2.065986e-01 #> P15078 NA 0.8787038 NA #> P02925 1.358713e-04 0.9537559 1.132843e-04 #> P42588 2.051206e-05 0.6742809 7.391326e-06 #> P0AGE9 3.871227e-06 0.1925214 4.751582e-07 #> P0A799 3.886196e-07 0.3789325 4.504443e-07 sum(testPL$BH[,1] < 0.05, na.rm=TRUE) #>  242 ## the data after repeated NA-imputation head(testPL$datImp) #> A_01.raw.mzDB.t.xml A_02.raw.mzDB.t.xml A_03.raw.mzDB.t.xml #> P0ACB7 22.56945 22.19699 22.39452 #> P15078 17.61153 18.00776 16.35186 #> P02925 25.40560 25.28552 25.29061 #> P42588 19.35777 19.39723 19.73350 #> P0AGE9 23.43934 23.55885 23.47963 #> P0A799 25.36381 25.32356 25.17127 #> A_04.raw.mzDB.t.xml B_01.raw.mzDB.t.xml B_02.raw.mzDB.t.xml #> P0ACB7 22.42658 22.48411 22.66792 #> P15078 17.15051 16.29294 16.39084 #> P02925 25.33628 24.08922 24.01434 #> P42588 19.06219 17.77645 17.89538 #> P0AGE9 23.28007 22.72747 22.61432 #> P0A799 25.20114 24.21076 24.23250 #> B_03.raw.mzDB.t.xml B_04.raw.mzDB.t.xml C_01.raw.mzDB.t.xml #> P0ACB7 22.41511 22.90290 22.97558 #> P15078 16.34952 16.30732 17.26139 #> P02925 23.94017 24.86961 25.32381 #> P42588 18.11154 17.76764 19.64330 #> P0AGE9 22.56039 22.50835 23.64638 #> P0A799 24.10540 24.12083 25.22358 #> C_02.raw.mzDB.t.xml C_03.raw.mzDB.t.xml C_04.raw.mzDB.t.xml #> P0ACB7 22.85138 22.70813 22.64852 #> P15078 17.44857 17.38586 17.40649 #> P02925 25.35219 25.33581 25.38618 #> P42588 19.08661 19.58276 19.89625 #> P0AGE9 23.70737 23.60144 23.63291 #> P0A799 25.07744 25.14947 25.09119
The author wants to acknowledge the support by the IGBMC (CNRS UMR 7104, Inserm U 1258), CNRS, IGBMC, Université de Strasbourg and Inserm. My collegues from the proteomics platform at the IGBMC work very commited to provide high quality mass-spectrometry data (including some of those used here). Furthermore, many very fruitful discussions with colleages on national and international level have helped to improve and disseminate the tools presented here.
#> R version 4.0.3 (2020-10-10) #> Platform: x86_64-w64-mingw32/x64 (64-bit) #> Running under: Windows 10 x64 (build 18363) #> #> Matrix products: default #> #> locale: #>  LC_COLLATE=C LC_CTYPE=French_France.1252 #>  LC_MONETARY=French_France.1252 LC_NUMERIC=C #>  LC_TIME=French_France.1252 #> #> attached base packages: #>  stats graphics grDevices utils datasets methods base #> #> other attached packages: #>  rmarkdown_2.4 knitr_1.30 wrProteo_1.3.0 wrMisc_1.4.2 #> #> loaded via a namespace (and not attached): #>  digest_0.6.25 R.methodsS3_1.8.1 magrittr_1.5 evaluate_0.14 #>  sm_2.2-5.6 rlang_0.4.8 stringi_1.5.3 wrGraph_1.1.0 #>  limma_3.44.3 R.oo_1.24.0 R.utils_2.10.1 RColorBrewer_1.1-2 #>  tools_4.0.3 stringr_1.4.0 xfun_0.18 yaml_2.2.1 #>  compiler_4.0.3 tcltk_4.0.3 htmltools_0.5.0