(1) Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
(2) Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
(3) Department of Statis-tics, Allameh Tabataba'i University, Tehran, Iran
This package has developed a tool for performing a novel pathway enrichment analysis based on Bayesian network (BNrich) to investigate the topology features of the pathways. This algorithm as a biologically intuitive, method, analyzes of most structural data acquired from signaling pathways such as causal relationships between genes using the property of Bayesian networks and also infer finalized networks more conveniently by simplifying networks in the early stages and using Least Absolute Shrinkage Selector Operator (LASSO). impacted pathways are ultimately prioritized the by Fisher’s Exact Test on significant parameters. Here, we provide an instance code that applies BNrich in all of the fields described above.
This document offers an introductory overview of how to use the package. The BNrich tool uses Bayesian Network (BN) in a new topology-based pathway analysis (TPA) method. The BN has been demonstrated as a beneficial technique for integrating and modeling biological data into causal relationships (1–4). The proposed method utilizes BN to model variations in downstream components (children) as a consequence of the change in upstream components (parents). For this purpose, The method employs 187 KEGG human non-metabolic pathways (5–7) which their cycles were eliminated manually by a biological intuitive, as BN structures and gene expression data to estimate its parameters (8,9). The cycles of inferred networks were eliminated on the basis of biologically intuitive rules instead of using computing algorithms (10). The inferred networks are simplified in two steps; unifying genes and LASSO. Similarly, the originally continuous gene expression data is used to BN parameters learning, rather than discretized data (8). The algorithm estimates regression coefficients by continuous data based on the parameter learning techniques in the BN (11,12). The final impacted pathways are gained by Fisher’s exact test. This method can represent effective genes and biological relations in impacted pathways based on a significant level.
install.packages("BNrich_0.1.0.tar.gz", type="source", repos=NULL) library("BNrich")
At first, we can load all the 187 preprocessed KEGG pathways which their cycles were removed, the data frame includes information about the pathways and vector of pathway ID.
destfile = tempfile("files", fileext = ".rda") files <- fetch_data_file() load(destfile)
Note that it's better to use (for example:)
destfile = "./R/BNrich-start.rda" to save essential files permanently.
The input data should be as two data frames in states disease and (healthy) control. The row names of any data frame are KEGG geneID and the number of subjects in any of them should not be less than 20, otherwise the user may encounters error in
LASSO step. Initially, we can load dataset example.
The example data extracted from a part of
GSE47756 dataset, the gene expression data from colorectal cancer study (13).
Data <- system.file("extdata", "Test_DATA.RData", package = "BNrich", mustWork = TRUE) load(Data) head(dataH)
Initially, we need to unify gene products based on 187 imported signaling pathways (
mapkG list) in two states disease (dataD) and control (dataH). This is the first simplification step, unifying nodes in signaling pathways with genes those exist in gene expression data.
unify_results <- unify_path(dataH, dataD, mapkG, pathway.id)
unify_path function performs the following processes:
• Split datasets into KEGG pathways
• Delete all gene expression data are not in pathways
• Removes all gene products in pathways are not in dataset platforms
• Remove any pathways with the number of edges is less than 5
This function returns a list contain
data_d are lists contain data frames related to control and disease objects unified for any signaling pathways. The
mapkG1 is a list contains unified signaling pathways and
pathway.id1 is new pathway ID vector based on remained pathways.
In the example dataset, the number of edges in the one pathway becomes less than 5 and are removed:
mapkG1 <- unify_results$mapkG1 length(mapkG1)
As well, the number of edges reduces in the remaining pathways. In first pathway
hsa:01521 the number of edges from 230 reduces to 204:
A graphNEL graph with directed edges Number of Nodes = 79 Number of Edges = 230
pathway.id1 <- unify_results$pathway.id1 pathway.id1
A graphNEL graph with directed edges Number of Nodes = 71 Number of Edges = 204
Now we can construct BN structures based on unified signaling pathways and consequently need the results of
BN <- BN_struct(unify_results$mapkG1)
BN_struct function returns a list contains BNs structures reconstructed from all
Given that the data used is continuous, each node is modeled as a regression line on its parents (11,14). Thus, on some of these regression lines, the number of these independent variables is high, so in order to avoid the collinearity problem, we need to use the Lasso regression (15,16).
We perform this function for any node with more than one parent, in all BNs achieved by
BN_struct function, based on control and disease data obtained by
data_h <- unify_results$data_h data_d <- unify_results$data_d LASSO_results <- LASSO_BN(BN, data_h, data_d)
The LASSO_BN function returns a list contains two lists BN_H and BN_D are simplified BNs structures based on LASSO regression related to healthy and disease objects. This function lead to reduce number of edges too:
Now we can estimate (learn) parameters for any BNs based on healthy and disease data lists.
BN_H <- LASSO_results$BN_H BN_D <- LASSO_results$BN_D esti_results <- esti_par(BN_H, BN_D, data_h, data_d)
esti_par function returns a list contains four lists. The
BN_d, are lists of BNs which their parameters learned by control and disease objects data. The
coef_d are lists of parameters of
As you can see in below, node
hsa:1978 in the first BN has one parent. The coefficient in control (healthy) data is
0.6958609 and in disease data is
Parameters of node hsa:1978 (Gaussian distribution) Conditional density: hsa:1978 | hsa:2475 Coefficients: (Intercept) hsa:2475 2.8841264 0.6958609 Standard deviation of the residuals: 0.3489612
Parameters of node hsa:1978 (Gaussian distribution) Conditional density: hsa:1978 | hsa:2475 Coefficients: (Intercept) hsa:2475 0.9046357 1.1870730 Standard deviation of the residuals: 0.2713789
We require the variance of the BNs parameters to perform the T-test between the corresponding parameters.
BN_h <- esti_results$BNs_h BN_d <- esti_results$BNs_d coef_h <- esti_results$coef_h coef_d <- esti_results$coef_d var_mat_results<- var_mat (data_h, coef_h, BN_h, data_d, coef_d, BN_d)
var_mat function returns a list contains two lists
var_mat_Bd which are the variance-covariance matrixes for any parameters of
BN_d. The variance-covariance matrixes for the fifth node,
hsa:1978, in first BN in two states control and disease is as follow:
[,1] [,2] [1,] 10.177073 -3.630152 [2,] -3.630152 1.296990
[,1] [,2] [1,] 3.549338 -1.0392040 [2,] -1.039204 0.3053785
T-test perfoms between any corresponding parameters between each pair of learned BNs,
BN_d, in disease and control states. Assumptions are unequal sample sizes and unequal variances for all samples.
var_mat_Bh <- var_mat_results $var_mat_Bh var_mat_Bd <- var_mat_results $var_mat_Bd Ttest_results <- parm_Ttest(data_h, coef_h, BN_h, data_d, coef_d, BN_d, var_mat_Bh, var_mat_Bd, pathway.id1) head(Ttest_results)
|From||To||pathway.number||pathwayID||Pval||coefficient in disease||coefficient in control||fdr|
This function returns a data frame contains T-test results for all parameters in all final BNs. The row that is intercept in
From variable, shows significance level for gene product that is shown in
To variable. The rest of the data frame rows shows significance level for any edge of networks.
In the last step we can determine enriched pathways by own threshold on
fdr. Hence we run the Fisher's exact test for any final pathways. As stated above, the
Ttest_results is a data frame contains T-test results for all parameters in final BNs achieved by
parm_Ttest function and
fdr.value A numeric threshold to determine significant parameters (default is 0.05).
BNrich_results <- BNrich(Ttest_results, pathway.id1, PathName_final, fdr.value = 0.05) head(BNrich_results)
|hsa:05202||1.64E-17||2.47E-15||156||Transcriptional misregulation in cancer|
The following package and versions were used in the production of this vignette.
R version 3.6.1 (2019-07-05) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 Matrix products: default locale:  LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252  LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C  LC_TIME=English_United Kingdom.1252 attached base packages:  stats graphics grDevices utils datasets methods base other attached packages:  BNrich_0.1.0 loaded via a namespace (and not attached):  Rcpp_1.0.2 codetools_0.2-16 lattice_0.20-38 corpcor_1.6.9  foreach_1.4.7 glmnet_2.0-18 digest_0.6.20 grid_3.6.1  stats4_3.6.1 evaluate_0.14 graph_1.63.0 Matrix_1.2-17  rmarkdown_1.14 bnlearn_4.5 iterators_1.0.12 tools_3.6.1  parallel_3.6.1 xfun_0.8 yaml_2.2.0 rsconnect_0.8.15  compiler_3.6.1 BiocGenerics_0.31.5 htmltools_0.3.6 knitr_1.24