pmd
package use Paired Mass Distance (PMD) relationship to analysis the GC/LC-MS based non-targeted data. PMD means the distance between two masses or mass to charge ratios. In mass spectrometry, PMD would keep the same between two masses or two mass to charge ratios(m/z). There are twe kinds of PMD involved in this package: PMD within same retention time group and PMD from different retention time groups.
In GC/LC-MS based non-targeted analysis, peaks could be seperated by chromatograph. We could build retention time(RT) bins to assign peaks into different RT groups by retention time hierarchical clustering analysis. For each RT group, the peaks should come from same compounds or co-elutes. If certain PMD appeared in multiple RT groups, it would be related to the relationship about adducts, neutral loss, isotopologues or commen fragments ions.
The peaks from different retention time groups would like to be different compounds seperated by chromatograph. The PMD would reflect the relationship about homologous series or chemical reactions.
GlobalStd algorithm use the PMD within same RT group to find independent peaks among certain dataset. Structure/reaction directed analysis use PMD from different RT groups to screen important compounds or reactions.
The input data should be a list
object with at least two elements from a peaks list:
mz
, high resolution mass spectrometry is requiredrt
However, I suggested to add intensity and group information to the list for validation of PMD analysis.
In this package, a dataset from in vivo solid phase micro-extraction(SPME) was attached. This dataset contain 9 samples from 3 fish with triplicates samples for each fish. Here is the data strcture:
library(pmd)
data("spmeinvivo")
str(spmeinvivo)
#> List of 4
#> $ data : num [1:1459, 1:9] 1095 10439 10154 2797 90211 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:1459] "100.1/170" "100.5/86" "101/85" "103.1/348" ...
#> .. ..$ : chr [1:9] "1405_Fish1_F1" "1405_Fish1_F2" "1405_Fish1_F3" "1405_Fish2_F1" ...
#> $ group: chr [1:9] "fish1" "fish1" "fish1" "fish2" ...
#> $ mz : num [1:1459] 100 101 101 103 104 ...
#> $ rt : num [1:1459] 170.2 86.3 84.9 348.1 48.8 ...
You could build this list
object from the xcms
objects via enviGCMS
package. When you have a xcmsSet
object or XCMSnExp
object named xset
, you could use enviGCMS::getmzrt(xset)
or enviGCMS::getmzrt2(xset)
to get such list. Of course you could build such list by yourself.
GlobalStd algorithm try to find independent peaks among certain peaks list. The first step is retention time hierarchical clustering analysis. The second step is to find the relationship among adducts, neutral loss, isotopologues and commen fragments ions. The third step is to screen the independent peaks.
pmd <- getpaired(spmeinvivo, rtcutoff = 10, ng = 10)
#> 75 retention time cluster found.
#> 376 paired masses found
#> 9 unique within RT clusters high frequency PMD(s) used for further investigation.
#> 649 isotopologue(s) related paired mass found.
#> 685 multi-charger(s) related paired mass found.
plotrtg(pmd)
This plot would show the distribution of RT groups. The rtcutoff
in function getpaired
could be used to set the cutoff of the distances in retention time hierarchical clustering analysis.
The ng
in function getpaired
could be used to set cutoff of global PMD’s retention time group numbers. If ng
is 10, at least 10 of the retention time groups should contain the shown PMD relationship. You could use plotpaired
to show the distribution.
You could also show the distribution of PMD relationship by index:
# show the unique PMD found by getpaired function
for(i in 1:length(unique(pmd$paired$diff2))){
diff <- unique(pmd$paired$diff2)[i]
index <- pmd$paired$diff2 == diff
plotpaired(pmd,index)
}
You could use getstd
function to get the independent peaks.
std <- getstd(pmd)
#> 8 retention group(s) have single peaks. 14 23 32 33 54 55 56 75
#> 11 group(s) with multiple peaks while no isotope/paired relationship 4 5 7 8 11 41 42 49 68 72 73
#> 9 group(s) with multiple peaks with isotope without paired relationship 2 9 22 26 52 62 64 66 70
#> 4 group(s) with paired relationship without isotope 1 10 15 18
#> 43 group(s) with paired relationship and isotope 3 6 12 13 16 17 19 20 21 24 25 27 28 29 30 31 34 35 36 37 38 39 40 43 44 45 46 47 48 50 51 53 57 58 59 60 61 63 65 67 69 71 74
#> 270 std mass found.
Here you could plot the peaks by plotstd
function to show the distribution of independent peaks:
You could also plot the peaks distribution by assign a retention time group via plotstdrt
:
par(mfrow = c(2,3))
plotstdrt(std,rtcluster = 23,main = 'Retention time group 23')
plotstdrt(std,rtcluster = 9,main = 'Retention time group 9')
plotstdrt(std,rtcluster = 18,main = 'Retention time group 18')
plotstdrt(std,rtcluster = 67,main = 'Retention time group 67')
plotstdrt(std,rtcluster = 49,main = 'Retention time group 49')
plotstdrt(std,rtcluster = 6,main = 'Retention time group 6')
Independent peaks are supporsing generated from different compounds. We could use those peaks for MS/MS analysis instead of DIA or DDA. Here we need multiple injections for one sample since it might be impossible to get all ions’ fragmental ions in one injection with good sensitivity. You could use gettarget
to generate the index for the injections and output the peaks for each run.
# you need retention time for independent peaks
index <- gettarget(std$rt[std$stdmassindex])
#> You need 7 injections!
# output the ions for each injection
table(index)
#> index
#> 1 2 3 4 5 6 7
#> 24 39 42 52 36 33 44
# show the ions for the first injection
std$mz[index==1]
#> [1] 100.0763 120.0812 138.1031 147.9900 156.9622 157.4607 165.0787
#> [8] 166.0867 192.0132 198.1852 198.1853 198.1854 205.0872 206.0898
#> [15] 220.9350 226.1822 226.9523 239.1628 240.2335 242.2863 242.2863
#> [22] 242.2863 244.1300 250.1779 252.1237 261.2591 269.2687 270.3185
#> [29] 272.0665 272.3234 280.1630 280.2641 302.1454 308.0919 308.2953
#> [36] 308.2954 309.9372 310.9132 320.8698 325.3294 327.0091 337.3313
#> [43] 337.8946 338.3435 338.6338 340.3593 341.0937 348.0713 350.9869
#> [50] 359.3151 368.3169 374.3041 383.2804 383.3671 392.2873 392.3131
#> [57] 416.2862 424.0815 424.8970 425.2153 429.0892 429.3733 439.2698
#> [64] 442.8446 445.2767 462.3940 463.1472 467.1031 468.1028 477.2934
#> [71] 479.2613 493.3138 494.8112 507.3303 524.1178 534.4711 541.3942
#> [78] 542.1221 555.2922 556.4416 581.2431 594.1590 594.3765 595.1563
#> [85] 600.4401 601.3844 620.4358 624.8488 628.1949 649.6613 650.8519
#> [92] 652.8473 654.2092 663.4529 664.4627 669.8727 676.6791 691.6443
#> [99] 704.6384 710.2184 727.4611 730.3384 739.6479 740.8400 784.4919
#> [106] 794.6305 794.8123 795.5448 800.6217 802.5006 825.8384 832.3212
#> [113] 835.8303 859.8318 861.5014 863.3158 863.8189 869.8223 871.8053
#> [120] 875.4962 881.8224 904.8150 925.4477 941.3198 952.7628 962.3132
#> [127] 982.7763 982.8002
std$rt[index==1]
#> [1] 170.1810 348.1340 874.6950 145.0880 145.5380 144.3705 511.3690
#> [8] 511.2940 217.1820 612.2700 639.3130 415.9985 583.5530 583.3410
#> [15] 213.8030 416.1050 1029.1800 506.7940 639.1010 880.3730 538.2950
#> [22] 611.4110 504.6540 452.9630 591.4830 639.1010 639.1010 699.3160
#> [29] 141.0400 742.6450 583.7700 576.6950 583.9810 175.3150 601.1260
#> [36] 569.4105 145.4960 215.9350 218.2260 629.1980 509.5330 622.7700
#> [43] 145.4960 639.1010 639.1000 658.6000 574.0185 146.3950 508.2950
#> [50] 612.4820 570.0560 582.4840 547.0180 699.7440 665.0280 567.4840
#> [57] 169.8320 583.9830 213.7130 601.3390 717.1020 704.6740 639.3130
#> [64] 643.3840 421.8915 582.3755 717.0775 717.1030 717.1010 705.3160
#> [71] 557.4110 540.6520 890.7680 554.8390 690.5290 639.0990 439.6770
#> [78] 762.5760 512.7070 546.9100 630.9130 819.4070 561.2680 819.4050
#> [85] 455.1500 213.5920 531.0080 215.9880 642.7435 637.2775 215.6410
#> [92] 215.5645 819.4060 800.2900 527.7950 213.5285 638.8870 594.0560
#> [99] 639.0980 883.9070 538.8305 213.3840 639.1005 213.7855 667.1730
#> [106] 613.5560 215.7855 492.8580 642.9965 519.6610 213.4215 213.5140
#> [113] 213.4430 213.3840 213.3395 639.1010 639.1000 213.5020 639.1000
#> [120] 213.0800 214.4090 213.3595 476.5790 213.5480 654.3150 213.6560
#> [127] 640.2780 214.4360
You need to check the GlobalStd algorithm’s results by principal components analysis(PCA).
library(enviGCMS)
par(mfrow = c(1,2),mar = c(4,4,2,1)+0.1)
plotpca(std$data,lv = as.numeric(as.factor(std$group)),main = substitute(paste(italic('in vivo'), " SPME samples(all peaks)")))
plotpca(std$data[std$stdmassindex,],lv = as.numeric(as.factor(std$group)),main = substitute(paste(italic('in vivo'), " SPME samples(selected peaks)")))
GlobalStd algorithm in pmd
package could be treated as a way to extract pseudospectra. You could use getcluster
to get peaks groups information for all GlobalStd peaks. Then you could choose export peaks with the highest intensities in each GlobalStd peaks groups.
stdcluster <- getcluster(std)
# extract pseudospectra for std peak 71
plot(stdcluster$cluster$mz[stdcluster$cluster$i==71],stdcluster$cluster$ins[stdcluster$cluster$i==71],type = 'h',xlab = 'm/z',ylab = 'intensity',main = 'pseudospectra for GlobalStd peak 711')
# export peaks with the highest intensities in each GlobalStd peaks groups.
data <- stdcluster$data[stdcluster$stdmassindex2,]
You could also use getcorcluster
to find peaks groups by correlation analysis only.
corcluster <- getcorcluster(spmeinvivo)
#> 75 retention time cluster found.
par(mfrow = c(1,3),mar = c(4,4,2,1)+0.1)
plotpca(std$data,lv = as.numeric(as.factor(std$group)),main = substitute(paste(italic('in vivo'), " SPME samples(all peaks)")))
plotpca(std$data[std$stdmassindex,],lv = as.numeric(as.factor(std$group)),main = substitute(paste(italic('in vivo'), " SPME samples(selected peaks)")))
plotpca(std$data[corcluster$stdmassindex,],lv = as.numeric(as.factor(std$group)),main = substitute(paste(italic('in vivo'), " SPME samples(selected peaks by correlationship)")))
GlobalStd algorithm is designed to analysis data without intensity data. However, if you have intensity data, the independant peaks could be selected with more confindence. You could set up cutoff of Pearson Correlation Coefficient between peaks to refine the peaks selected by GlobalStd within same retention time groups.
std2 <- getstd(pmd,corcutoff = 0.9)
#> 8 retention group(s) have single peaks. 14 23 32 33 54 55 56 75
#> 24 group(s) with multiple peaks while no isotope/paired relationship 2 4 5 7 8 10 11 15 18 22 26 35 39 41 42 49 50 59 62 68 69 70 72 73
#> 13 group(s) with multiple peaks with isotope without paired relationship 9 12 24 27 28 34 51 52 57 60 64 66 71
#> 3 group(s) with paired relationship without isotope 1 53 74
#> 27 group(s) with paired relationship and isotope 3 6 13 16 17 19 20 21 25 29 30 31 36 37 38 40 43 44 45 46 47 48 58 61 63 65 67
#> 118 std mass found.
par(mfrow = c(1,3),mar = c(4,4,2,1)+0.1)
plotpca(std2$data,lv = as.numeric(as.factor(std2$group)),main = substitute(paste(italic('in vivo'), " SPME samples(all peaks)")))
plotpca(std$data[std$stdmassindex,],lv = as.numeric(as.factor(std$group)),main = substitute(paste(italic('in vivo'), " SPME samples(selected peaks)")))
plotpca(std2$data[std2$stdmassindex,],lv = as.numeric(as.factor(std2$group)),main = substitute(paste(italic('in vivo'), " SPME samples(selected peaks)")))
getsda
function could be used to perform Structure/reaction directed analysis. freqcutoff
could be used to filter the PMD with high frequncy.
sda <- getsda(std, freqcutoff = 10)
#> Top 50 high frequency PMD groups were remained.
#> 20 groups were found as high frequency PMD group.
#> 0 were found as high frequency PMD.
#> 1.98 were found as high frequency PMD.
#> 2.02 were found as high frequency PMD.
#> 13.98 were found as high frequency PMD.
#> 14.02 were found as high frequency PMD.
#> 14.05 were found as high frequency PMD.
#> 15.99 were found as high frequency PMD.
#> 16.03 were found as high frequency PMD.
#> 28.03 were found as high frequency PMD.
#> 30.05 were found as high frequency PMD.
#> 42.05 were found as high frequency PMD.
#> 49.02 were found as high frequency PMD.
#> 58.04 were found as high frequency PMD.
#> 66.05 were found as high frequency PMD.
#> 68.06 were found as high frequency PMD.
#> 74.02 were found as high frequency PMD.
#> 82.08 were found as high frequency PMD.
#> 88.05 were found as high frequency PMD.
#> 116.08 were found as high frequency PMD.
#> 126.14 were found as high frequency PMD.
You could use plotstdsda
to show the distribution of the selected paired peaks.
You could also use index to show the distribution of certain PMDs.
par(mfrow = c(2,3),mar = c(4,4,2,1)+0.1)
plotstdsda(sda,sda$sda$diff2 == 0)
plotstdsda(sda,sda$sda$diff2 == 13.98)
plotstdsda(sda,sda$sda$diff2 == 15.99)
plotstdsda(sda,sda$sda$diff2 == 14.02)
plotstdsda(sda,sda$sda$diff2 == 28.03)
plotstdsda(sda,sda$sda$diff2 == 58.04)
Structure/reaction directed analysis could be directily performed on all the peaks, which is slow to process:
sdaall <- getsda(spmeinvivo)
par(mfrow = c(2,3),mar = c(4,4,2,1)+0.1)
plotstdsda(sdaall,sdaall$sda$diff2 == 0)
plotstdsda(sdaall,sdaall$sda$diff2 == 13.98)
plotstdsda(sdaall,sdaall$sda$diff2 == 15.99)
plotstdsda(sdaall,sdaall$sda$diff2 == 14.02)
plotstdsda(sdaall,sdaall$sda$diff2 == 28.03)
plotstdsda(sdaall,sdaall$sda$diff2 == 58.04)
When you only have data of peaks without retention time or compounds list, structure/reaction directed analysis could also be done by getrda
function.
One peak or compounds could be involved in multiple reactions. You could construct a network by such relationship.
If you have a specific compound and want to check the metabolites of certain PMD, you could use getchain
to extract the network of that compounds
library(igraph)
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
# check metabolites of C18H39NO
chain <- getchain(spmeinvivo,diff = c(2.02,14.02,15.99,58.04,13.98),mass = 286.3101)
# show as network
net <- graph_from_data_frame(chain$sdac,directed = F)
plot(net,vertex.label=NA,vertex.size = as.numeric((names(V(net))))/25,edge.width = E(net)$diff2/20+0.00001)
If you want to see all the independant peaks’ high frequency PMDs as a network, the following code will help
sda <- getsda(std, freqcutoff = 10)
#> Top 50 high frequency PMD groups were remained.
#> 20 groups were found as high frequency PMD group.
#> 0 were found as high frequency PMD.
#> 1.98 were found as high frequency PMD.
#> 2.02 were found as high frequency PMD.
#> 13.98 were found as high frequency PMD.
#> 14.02 were found as high frequency PMD.
#> 14.05 were found as high frequency PMD.
#> 15.99 were found as high frequency PMD.
#> 16.03 were found as high frequency PMD.
#> 28.03 were found as high frequency PMD.
#> 30.05 were found as high frequency PMD.
#> 42.05 were found as high frequency PMD.
#> 49.02 were found as high frequency PMD.
#> 58.04 were found as high frequency PMD.
#> 66.05 were found as high frequency PMD.
#> 68.06 were found as high frequency PMD.
#> 74.02 were found as high frequency PMD.
#> 82.08 were found as high frequency PMD.
#> 88.05 were found as high frequency PMD.
#> 116.08 were found as high frequency PMD.
#> 126.14 were found as high frequency PMD.
df <- sda$sda
net <- graph_from_data_frame(df,directed = F)
plot(net,vertex.label=NA,vertex.size = as.numeric((names(V(net))))/25,edge.width = E(net)$diff2/20+0.00001)
# Check the degree of the nodes
# Show the degree distribution of the vertices
deg <- degree(net, mode="all")
degree_distribution(net)
#> [1] 0.000000000 0.286764706 0.257352941 0.095588235 0.088235294
#> [6] 0.058823529 0.036764706 0.058823529 0.036764706 0.044117647
#> [11] 0.007352941 0.000000000 0.007352941 0.000000000 0.000000000
#> [16] 0.000000000 0.014705882 0.000000000 0.000000000 0.000000000
#> [21] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> [26] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> [31] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> [36] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> [41] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> [46] 0.000000000 0.000000000 0.000000000 0.007352941
plot(net, vertex.size=deg/2,vertex.label=NA,vertex.size = as.numeric((names(V(net))))/25, edge.width = E(net)$diff2/20+0.00001)
# network community structure detection
ceb <- cluster_edge_betweenness(net,weights = abs(E(net)$cor), directed = F)
#> Warning in cluster_edge_betweenness(net, weights = abs(E(net)$cor),
#> directed = F): At community.c:460 :Membership vector will be selected based
#> on the lowest modularity score.
#> Warning in cluster_edge_betweenness(net, weights = abs(E(net)$cor),
#> directed = F): At community.c:467 :Modularity calculation with weighted
#> edge betweenness community detection might not make sense -- modularity
#> treats edge weights as similarities while edge betwenness treats them as
#> distances
plot(ceb, net,vertex.label=NA,)
Retention time cluster cutoff should fit the peak picking algorithm. For HPLC, 10 is suggested and 5 could be used for UPLC.
Global PMD’s retention time group numbers should be around 20 percent of the retention time cluster numbers. For example, if you find 100 retention time clusters, I suggested you use 20 as the empirical global PMD’s retention time group numbers.
As for the cutoff of the frequency of PMDs, you could change the frequency until you find certain PMD which you’re sure that it should appear in your dataset. For example, 16 should be considered as a good start. Any pmd with the frequency larger than PMD 16 could be further discussed.
Another important hint is that pre-filter your peak list by black samples or other quality control samples. Otherwise the running time would be long and lots of pmd relationship would be just from noise.
globalstd
function is a wrap funtion to process GlobalStd algorithm and structure/reaction directed analysis in one line. All the plot function could be directly used on the list
objects from globalstd
function.
result <- globalstd(spmeinvivo)
#> 75 retention time cluster found.
#> 376 paired masses found
#> 9 unique within RT clusters high frequency PMD(s) used for further investigation.
#> 649 isotopologue(s) related paired mass found.
#> 685 multi-charger(s) related paired mass found.
#> 8 retention group(s) have single peaks. 14 23 32 33 54 55 56 75
#> 11 group(s) with multiple peaks while no isotope/paired relationship 4 5 7 8 11 41 42 49 68 72 73
#> 9 group(s) with multiple peaks with isotope without paired relationship 2 9 22 26 52 62 64 66 70
#> 4 group(s) with paired relationship without isotope 1 10 15 18
#> 43 group(s) with paired relationship and isotope 3 6 12 13 16 17 19 20 21 24 25 27 28 29 30 31 34 35 36 37 38 39 40 43 44 45 46 47 48 50 51 53 57 58 59 60 61 63 65 67 69 71 74
#> 270 std mass found.
#> Top 50 high frequency PMD groups were remained.
#> 20 groups were found as high frequency PMD group.
#> 0 were found as high frequency PMD.
#> 1.98 were found as high frequency PMD.
#> 2.02 were found as high frequency PMD.
#> 13.98 were found as high frequency PMD.
#> 14.02 were found as high frequency PMD.
#> 14.05 were found as high frequency PMD.
#> 15.99 were found as high frequency PMD.
#> 16.03 were found as high frequency PMD.
#> 28.03 were found as high frequency PMD.
#> 30.05 were found as high frequency PMD.
#> 42.05 were found as high frequency PMD.
#> 49.02 were found as high frequency PMD.
#> 58.04 were found as high frequency PMD.
#> 66.05 were found as high frequency PMD.
#> 68.06 were found as high frequency PMD.
#> 74.02 were found as high frequency PMD.
#> 82.08 were found as high frequency PMD.
#> 88.05 were found as high frequency PMD.
#> 116.08 were found as high frequency PMD.
#> 126.14 were found as high frequency PMD.
An interactive document has been included in this package to show PMD analysis. You could run runPMD()
to start the Graphical user interface(GUI) for GlobalStd algorithm and structure/reaction directed analysis. You need to prepare a csv file with m/z and retention time of peaks. Such csv file could be generated by run enviGCMS::getmzrtcsv()
on the list
object from enviGCMS::getmzrt(xset)
or enviGCMS::getmzrt2(xset)
function. You could also generate the csv file by enviGCMS::getmzrt(xset,name = 'test')
or enviGCMS::getmzrt2(xset, name = 'test')
. You will find the csv file in the working dictionary named test.csv
.
pmd
package could be used to reduce the redundancy peaks for GC/LC-MS based research and perform structure/reaction directed analysis to screen known and unknown important compounds or reactions.