Paired Mass Distance(PMD) analysis for GC/LC-MS based non-targeted analysis

Miao Yu

2019-08-22

Introduction of Paired Mass Distance analysis

pmd package use Paired Mass Distance (PMD) relationship to analysis the GC/LC-MS based non-targeted data. PMD means the distance between two masses or mass to charge ratios. In mass spectrometry, PMD would keep the same between two masses or two mass to charge ratios(m/z). There are twe kinds of PMD involved in this package: PMD within same retention time group and PMD from different retention time groups.

PMD within same retention time group

In GC/LC-MS based non-targeted analysis, peaks could be seperated by chromatograph. We could build retention time(RT) bins to assign peaks into different RT groups by retention time hierarchical clustering analysis. For each RT group, the peaks should come from same compounds or co-elutes. If certain PMD appeared in multiple RT groups, it would be related to the relationship about adducts, neutral loss, isotopologues or commen fragments ions.

PMD from different retention time groups

The peaks from different retention time groups would like to be different compounds seperated by chromatograph. The PMD would reflect the relationship about homologous series or chemical reactions.

GlobalStd algorithm use the PMD within same RT group to find independent peaks among certain dataset. Structure/reaction directed analysis use PMD from different RT groups to screen important compounds or reactions.

Data format

The input data should be a list object with at least two elements from a peaks list:

However, I suggested to add intensity and group information to the list for validation of PMD analysis.

In this package, a dataset from in vivo solid phase micro-extraction(SPME) was attached. This dataset contain 9 samples from 3 fish with triplicates samples for each fish. Here is the data strcture:

library(pmd)
data("spmeinvivo")
str(spmeinvivo)
#> List of 4
#>  $ data : num [1:1459, 1:9] 1095 10439 10154 2797 90211 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:1459] "100.1/170" "100.5/86" "101/85" "103.1/348" ...
#>   .. ..$ : chr [1:9] "1405_Fish1_F1" "1405_Fish1_F2" "1405_Fish1_F3" "1405_Fish2_F1" ...
#>  $ group: chr [1:9] "fish1" "fish1" "fish1" "fish2" ...
#>  $ mz   : num [1:1459] 100 101 101 103 104 ...
#>  $ rt   : num [1:1459] 170.2 86.3 84.9 348.1 48.8 ...

You could build this list object from the xcms objects via enviGCMS package. When you have a xcmsSet object or XCMSnExp object named xset, you could use enviGCMS::getmzrt(xset) or enviGCMS::getmzrt2(xset) to get such list. Of course you could build such list by yourself.

GlobalStd algorithm

GlobalStd algorithm try to find independent peaks among certain peaks list. The first step is retention time hierarchical clustering analysis. The second step is to find the relationship among adducts, neutral loss, isotopologues and commen fragments ions. The third step is to screen the independent peaks.

Retention time hierarchical clustering

This plot would show the distribution of RT groups. The rtcutoff in function getpaired could be used to set the cutoff of the distances in retention time hierarchical clustering analysis.

Relationship among adducts, neutral loss, isotopologues and commen fragments ions

The ng in function getpaired could be used to set cutoff of global PMD’s retention time group numbers. If ng is 10, at least 10 of the retention time groups should contain the shown PMD relationship. You could use plotpaired to show the distribution.

You could also show the distribution of PMD relationship by index:

Screen the independent peaks

You could use getstd function to get the independent peaks.

Here you could plot the peaks by plotstd function to show the distribution of independent peaks:

You could also plot the peaks distribution by assign a retention time group via plotstdrt:

Use independent peaks for MS/MS validation

Independent peaks are supporsing generated from different compounds. We could use those peaks for MS/MS analysis instead of DIA or DDA. Here we need multiple injections for one sample since it might be impossible to get all ions’ fragmental ions in one injection with good sensitivity. You could use gettarget to generate the index for the injections and output the peaks for each run.

# you need retention time for independent peaks
index <- gettarget(std$rt[std$stdmassindex])
#> You need 7 injections!
# output the ions for each injection
table(index)
#> index
#>  1  2  3  4  5  6  7 
#> 24 39 42 52 36 33 44
# show the ions for the first injection
std$mz[index==1]
#>   [1] 100.0763 120.0812 138.1031 147.9900 156.9622 157.4607 165.0787
#>   [8] 166.0867 192.0132 198.1852 198.1853 198.1854 205.0872 206.0898
#>  [15] 220.9350 226.1822 226.9523 239.1628 240.2335 242.2863 242.2863
#>  [22] 242.2863 244.1300 250.1779 252.1237 261.2591 269.2687 270.3185
#>  [29] 272.0665 272.3234 280.1630 280.2641 302.1454 308.0919 308.2953
#>  [36] 308.2954 309.9372 310.9132 320.8698 325.3294 327.0091 337.3313
#>  [43] 337.8946 338.3435 338.6338 340.3593 341.0937 348.0713 350.9869
#>  [50] 359.3151 368.3169 374.3041 383.2804 383.3671 392.2873 392.3131
#>  [57] 416.2862 424.0815 424.8970 425.2153 429.0892 429.3733 439.2698
#>  [64] 442.8446 445.2767 462.3940 463.1472 467.1031 468.1028 477.2934
#>  [71] 479.2613 493.3138 494.8112 507.3303 524.1178 534.4711 541.3942
#>  [78] 542.1221 555.2922 556.4416 581.2431 594.1590 594.3765 595.1563
#>  [85] 600.4401 601.3844 620.4358 624.8488 628.1949 649.6613 650.8519
#>  [92] 652.8473 654.2092 663.4529 664.4627 669.8727 676.6791 691.6443
#>  [99] 704.6384 710.2184 727.4611 730.3384 739.6479 740.8400 784.4919
#> [106] 794.6305 794.8123 795.5448 800.6217 802.5006 825.8384 832.3212
#> [113] 835.8303 859.8318 861.5014 863.3158 863.8189 869.8223 871.8053
#> [120] 875.4962 881.8224 904.8150 925.4477 941.3198 952.7628 962.3132
#> [127] 982.7763 982.8002
std$rt[index==1]
#>   [1]  170.1810  348.1340  874.6950  145.0880  145.5380  144.3705  511.3690
#>   [8]  511.2940  217.1820  612.2700  639.3130  415.9985  583.5530  583.3410
#>  [15]  213.8030  416.1050 1029.1800  506.7940  639.1010  880.3730  538.2950
#>  [22]  611.4110  504.6540  452.9630  591.4830  639.1010  639.1010  699.3160
#>  [29]  141.0400  742.6450  583.7700  576.6950  583.9810  175.3150  601.1260
#>  [36]  569.4105  145.4960  215.9350  218.2260  629.1980  509.5330  622.7700
#>  [43]  145.4960  639.1010  639.1000  658.6000  574.0185  146.3950  508.2950
#>  [50]  612.4820  570.0560  582.4840  547.0180  699.7440  665.0280  567.4840
#>  [57]  169.8320  583.9830  213.7130  601.3390  717.1020  704.6740  639.3130
#>  [64]  643.3840  421.8915  582.3755  717.0775  717.1030  717.1010  705.3160
#>  [71]  557.4110  540.6520  890.7680  554.8390  690.5290  639.0990  439.6770
#>  [78]  762.5760  512.7070  546.9100  630.9130  819.4070  561.2680  819.4050
#>  [85]  455.1500  213.5920  531.0080  215.9880  642.7435  637.2775  215.6410
#>  [92]  215.5645  819.4060  800.2900  527.7950  213.5285  638.8870  594.0560
#>  [99]  639.0980  883.9070  538.8305  213.3840  639.1005  213.7855  667.1730
#> [106]  613.5560  215.7855  492.8580  642.9965  519.6610  213.4215  213.5140
#> [113]  213.4430  213.3840  213.3395  639.1010  639.1000  213.5020  639.1000
#> [120]  213.0800  214.4090  213.3595  476.5790  213.5480  654.3150  213.6560
#> [127]  640.2780  214.4360

Validation by principal components analysis(PCA)

You need to check the GlobalStd algorithm’s results by principal components analysis(PCA).

Comparision with other packages

GlobalStd algorithm in pmd package could be treated as a way to extract pseudospectra. You could use getcluster to get peaks groups information for all GlobalStd peaks. Then you could choose export peaks with the highest intensities in each GlobalStd peaks groups.

You could also use getcorcluster to find peaks groups by correlation analysis only.

GlobalStd algorithm with intensity data

GlobalStd algorithm is designed to analysis data without intensity data. However, if you have intensity data, the independant peaks could be selected with more confindence. You could set up cutoff of Pearson Correlation Coefficient between peaks to refine the peaks selected by GlobalStd within same retention time groups.

Structure/Reaction directed analysis

getsda function could be used to perform Structure/reaction directed analysis. freqcutoff could be used to filter the PMD with high frequncy.

sda <- getsda(std, freqcutoff = 10)
#> Top 50 high frequency PMD groups were remained.
#> 20 groups were found as high frequency PMD group.
#> 0 were found as high frequency PMD. 
#> 1.98 were found as high frequency PMD. 
#> 2.02 were found as high frequency PMD. 
#> 13.98 were found as high frequency PMD. 
#> 14.02 were found as high frequency PMD. 
#> 14.05 were found as high frequency PMD. 
#> 15.99 were found as high frequency PMD. 
#> 16.03 were found as high frequency PMD. 
#> 28.03 were found as high frequency PMD. 
#> 30.05 were found as high frequency PMD. 
#> 42.05 were found as high frequency PMD. 
#> 49.02 were found as high frequency PMD. 
#> 58.04 were found as high frequency PMD. 
#> 66.05 were found as high frequency PMD. 
#> 68.06 were found as high frequency PMD. 
#> 74.02 were found as high frequency PMD. 
#> 82.08 were found as high frequency PMD. 
#> 88.05 were found as high frequency PMD. 
#> 116.08 were found as high frequency PMD. 
#> 126.14 were found as high frequency PMD.

You could use plotstdsda to show the distribution of the selected paired peaks.

plotstdsda(sda)

You could also use index to show the distribution of certain PMDs.

par(mfrow = c(2,3),mar = c(4,4,2,1)+0.1)
plotstdsda(sda,sda$sda$diff2 == 0)
plotstdsda(sda,sda$sda$diff2 == 13.98)
plotstdsda(sda,sda$sda$diff2 == 15.99)
plotstdsda(sda,sda$sda$diff2 == 14.02)
plotstdsda(sda,sda$sda$diff2 == 28.03)
plotstdsda(sda,sda$sda$diff2 == 58.04)

Structure/reaction directed analysis could be directily performed on all the peaks, which is slow to process:

sdaall <- getsda(spmeinvivo)
par(mfrow = c(2,3),mar = c(4,4,2,1)+0.1)
plotstdsda(sdaall,sdaall$sda$diff2 == 0)
plotstdsda(sdaall,sdaall$sda$diff2 == 13.98)
plotstdsda(sdaall,sdaall$sda$diff2 == 15.99)
plotstdsda(sdaall,sdaall$sda$diff2 == 14.02)
plotstdsda(sdaall,sdaall$sda$diff2 == 28.03)
plotstdsda(sdaall,sdaall$sda$diff2 == 58.04)

Structure/reaction directed analysis for peaks/compounds only

When you only have data of peaks without retention time or compounds list, structure/reaction directed analysis could also be done by getrda function.

Structure/Reaction Network

One peak or compounds could be involved in multiple reactions. You could construct a network by such relationship.

If you have a specific compound and want to check the metabolites of certain PMD, you could use getchain to extract the network of that compounds

If you want to see all the independant peaks’ high frequency PMDs as a network, the following code will help

Parameters selection

Retention time cluster cutoff should fit the peak picking algorithm. For HPLC, 10 is suggested and 5 could be used for UPLC.

Global PMD’s retention time group numbers should be around 20 percent of the retention time cluster numbers. For example, if you find 100 retention time clusters, I suggested you use 20 as the empirical global PMD’s retention time group numbers.

As for the cutoff of the frequency of PMDs, you could change the frequency until you find certain PMD which you’re sure that it should appear in your dataset. For example, 16 should be considered as a good start. Any pmd with the frequency larger than PMD 16 could be further discussed.

Another important hint is that pre-filter your peak list by black samples or other quality control samples. Otherwise the running time would be long and lots of pmd relationship would be just from noise.

Wrap function

globalstd function is a wrap funtion to process GlobalStd algorithm and structure/reaction directed analysis in one line. All the plot function could be directly used on the list objects from globalstd function.

result <- globalstd(spmeinvivo)
#> 75 retention time cluster found.
#> 376 paired masses found
#> 9 unique within RT clusters high frequency PMD(s) used for further investigation.
#> 649 isotopologue(s) related paired mass found.
#> 685 multi-charger(s) related paired mass found.
#> 8 retention group(s) have single peaks. 14 23 32 33 54 55 56 75
#> 11 group(s) with multiple peaks while no isotope/paired relationship 4 5 7 8 11 41 42 49 68 72 73
#> 9 group(s) with multiple peaks with isotope without paired relationship 2 9 22 26 52 62 64 66 70
#> 4 group(s) with paired relationship without isotope 1 10 15 18
#> 43 group(s) with paired relationship and isotope 3 6 12 13 16 17 19 20 21 24 25 27 28 29 30 31 34 35 36 37 38 39 40 43 44 45 46 47 48 50 51 53 57 58 59 60 61 63 65 67 69 71 74
#> 270 std mass found.
#> Top 50 high frequency PMD groups were remained.
#> 20 groups were found as high frequency PMD group.
#> 0 were found as high frequency PMD. 
#> 1.98 were found as high frequency PMD. 
#> 2.02 were found as high frequency PMD. 
#> 13.98 were found as high frequency PMD. 
#> 14.02 were found as high frequency PMD. 
#> 14.05 were found as high frequency PMD. 
#> 15.99 were found as high frequency PMD. 
#> 16.03 were found as high frequency PMD. 
#> 28.03 were found as high frequency PMD. 
#> 30.05 were found as high frequency PMD. 
#> 42.05 were found as high frequency PMD. 
#> 49.02 were found as high frequency PMD. 
#> 58.04 were found as high frequency PMD. 
#> 66.05 were found as high frequency PMD. 
#> 68.06 were found as high frequency PMD. 
#> 74.02 were found as high frequency PMD. 
#> 82.08 were found as high frequency PMD. 
#> 88.05 were found as high frequency PMD. 
#> 116.08 were found as high frequency PMD. 
#> 126.14 were found as high frequency PMD.

Shiny application

An interactive document has been included in this package to show PMD analysis. You could run runPMD() to start the Graphical user interface(GUI) for GlobalStd algorithm and structure/reaction directed analysis. You need to prepare a csv file with m/z and retention time of peaks. Such csv file could be generated by run enviGCMS::getmzrtcsv() on the list object from enviGCMS::getmzrt(xset) or enviGCMS::getmzrt2(xset) function. You could also generate the csv file by enviGCMS::getmzrt(xset,name = 'test') or enviGCMS::getmzrt2(xset, name = 'test'). You will find the csv file in the working dictionary named test.csv.

Conclusion

pmd package could be used to reduce the redundancy peaks for GC/LC-MS based research and perform structure/reaction directed analysis to screen known and unknown important compounds or reactions.