Requirements of the package will be installed when loading the library. The package requirements/dependencies are included in the
DESCRIPTION file of the package. The complete source code can be accessed through .
Since the pipeline partly uses a Python script for computing the the
molnet/inst/requirements.txt found in the molnet repository can be used via pip install or using the
The package also supports installation with conda except for one dependency (Ray) which has to be installed with pip or the function above.
The main purpose of the pipeline is to easily and efficiently generate, reduce, and, combine molecular networks from different patient groups to compute a differential drug interaction score based on drug targets. This allows for improved predictions of the effect of cancer drugs on patient groups with different characteristics. The following exemplary pipeline use case showcases the usage of molecular breast cancer data with ER+ (Estrogen receptor-positive) being group 1 and ER- (Estrogen receptor-negative) being group 2 patient samples. The sample data is included within the package.
[image created with biorender.com]
The breast cancer data by Krug et al., 2020 used for this tutorial is already preprocessed and only includes samples with tumor purity >0.5 and known ER status. Metabolite data was sampled randomly to generate distributions similar to those reported e.g. in (Terunuma et al., 2014). Krug et al., 2020 published data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC). The dataset contains observations from:
|Number of genes||Preprocessing||Identifier|
|mRNA||13915||RNA expression quantified,log2-transformed FPKM values, NAs set to -11 , non-reduced mRNA dataset with all 13195 genes||Ncbi ID, Gene name|
|Protein||5809 (ER+) 5845 (ER-)||Normalized, standardized||Gene name, Ncbi RefSeq ID, String|
|Phosphoprotein||10272 (ER+) 11318 (ER-)||Normalized, remove phosphosites with >50% NAs||Gene name, Ncbi RefSeq ID|
|Metabolite||275 from 33 (ER+) 34 (ER-) samples||metabolites with >50% NAs||Pubchem, Metabolon ID|
First you load the preprocessed data. This data is included in the package and does not need to be manually loaded but can be directly accessed once
library(molnet) is called.
Then you can use formatting functions to bring the data into the required input format:
In the beginning individual layer objects are created using . The user supplies data stratified over two groups and identifiers as shown in the code chunk below. The identifiers need to be in the same order as the components in the data. Additionally, the layer is named by the
name argument. All individual layers are passed in a single list. Naming of the columns in the respective identifier data frames need to be consistent i.e. if the mRNA layer and the protein layers should be connected both need to contain a column with the same identifier name. In this example mRNA and Proteins both contain
gene_name. The data frame
drug_gene_interactions should contain identifiers with consistent column names as well as drug names and IDs.
head(protein_data$group1$identifiers) #> refSeq gene_name STRING_id #> 1 NP_001254479.2 TTN 9606.ENSP00000467141 #> 2 NP_001243779.1 TTN 9606.ENSP00000467141 #> 3 NP_596870.2 TTN 9606.ENSP00000467141 #> 4 NP_569827.2 MADD 9606.ENSP00000310933 #> 5 NP_036555.1 RPL13A 9606.ENSP00000375730 #> 6 NP_001122305.1 ZBTB4 9606.ENSP00000307858 head(mrna_data$group1$identifiers) #> gene_name #> 1 A1BG #> 2 A1BG-AS1 #> 3 A1CF #> 4 A2M #> 5 A2ML1 #> 6 A2MP1
For computational reasons we use a subset of 100 genes for this vignette. Below we create individual layers.
number_of_genes <- 100 # set for subsetting # Create individual layers mrna_layer <- make_layer(name="mrna", data_group1=mrna_data$group1$data[,1:number_of_genes], data_group2=mrna_data$group2$data[,1:number_of_genes], identifier_group1=data.frame(gene_name=mrna_data$group1$identifiers[1:number_of_genes,]), identifier_group2=data.frame(gene_name=mrna_data$group2$identifiers[1:number_of_genes,]) ) protein_layer <- make_layer(name="protein", data_group1=protein_data$group1$data[,1:number_of_genes], data_group2=protein_data$group2$data[,1:number_of_genes], identifier_group1=protein_data$group1$identifiers[1:number_of_genes,], identifier_group2=protein_data$group2$identifiers[1:number_of_genes,] )
The pipeline requires a list containing all individual layers. Below we are creating the list.
Then you need to supply the inter-layer connections.
to need to match with a name in the previously created
layers by . The established connection will result in an undirected graph. It is possible to create connections based on mutual identifiers present in both connected layers or based on an interaction table supplied that you need to supply. In case connections are matched on identifiers
connect_on specifies the shared identifier and the edge weight is passed to
weight. The default weight is 1. In case connections are established using an interaction table it has to be passed to
weight specifies the name of the column in the interaction table which is used as edge weight.
inter_layer_connections = list( make_connection(from = 'mrna', to = 'protein', connect_on = 'gene_name', weight = 1), make_connection(from = 'protein', to = 'phosphoprotein', connect_on = 'gene_name', weight = 1), make_connection(from = 'protein', to = 'metabolite', connect_on = metabolite_protein_interaction, weight = 'combined_score') )
For running the pipeline
drug-target interactions are required. The the function generates the required format. The argument
interaction_table maps drugs to targets. The argument
match_on specifies the column of the
interaction_table used to match drugs to targets.
When the input layers, connection and drug targets are created they are checked automatically for validity. The function below checks for a variety of possible input formatting errors.
The pipeline can be run as a whole or in individual steps. To set global pipeline options the
settings list needs to be created using the
molnet_settings function. This function contains default parameters that can be modified as shown in the code chunk below. Please be aware of the python executable used the pipeline part implemented in python and the dependencies installed (i.e. python/python3). The intermediate pipeline output is deactivated for this example (see below) but especially for large data files consider turning it on (the default). Specify the location of files with the
saving_path parameter. If not specified all files will only be written to a temporary directory. You can save individual graphs, combined graphs, drug targets and correlation matrices for individual graphs. Combined graphs without annotations and interaction score graphs are always saved as part of the handover between R and Python. For a detailed explanation of the possible settings please refer to the function documentation
settings <- molnet_settings( handling_missing_data = list( default = "pairwise.complete.obs", mrna = "all.obs" ), save_individual_graphs = FALSE, save_combined_graphs = FALSE, save_drug_targets = FALSE, python_executable = "python3" ) # disable multi-threading for example run; # not recommended for actual data processing WGCNA::disableWGCNAThreads() #>
To run the whole pipeline from beginning-to-end the
start_pipeline function should be used.
The pipeline can also be used modular. Modules refer to the different steps:
[image created with biorender.com]
In step one individual graphs are generated by specifying the layers as described above. A list of layers and the settings list are passed. In this step edges are established based on correlation computation and reduced by the specified reduction method. Reduction can be done based on significance of the correlation (
p_value) or WGCNA::pickHardThreshold (or our alternative implementation that has less overhead).
In this step individual graphs are combined to a single combined graph per group based on the inter-layer connections. The function creates the disjunct union of the individual graphs and adds inter-layer edges with the specified weight.
combined_graphs <- generate_combined_graphs(individual_graphs[["graphs"]], individual_graphs[["annotations"]], inter_layer_connections, settings) #> Connecting mrna and protein #> by id. #> group1: #> generating inter-layer edgelist... #> done. #> group2: #> generating inter-layer edgelist... #> done. #> Connecting protein and phosphoprotein #> by id. #> group1: #> generating inter-layer edgelist... #> done. #> group2: #> generating inter-layer edgelist... #> done. #> Connecting protein and metabolite #> by table. #> group1: #> generating inter-layer edgelist... #> done. #> group2: #> generating inter-layer edgelist... #> done. #> Combining graphs. #> group1: #> done. #> group2: #> done.
In this step filtering of the drug-targets takes place. The function finds node IDs in the combined graph that are targeted by the drugs and maps drugs to their target nodes. Additionally, edgelists are returned containing the incident edges of drug target nodes that have to be considered in interaction score calculation.
In this step, the previously computed combined graph containing the iGraph object/annotations is used in combination with the drug target edge list to calculate an interaction score. The interaction score is computed with python. The function writes the input data (combined graphs for both groups and lists of edges adjacent to drug targets for both groups) to files and calls a python script for calculating the score. Output files written by the python script are two graphs in .gml format containing the interaction score as weight. These are loaded and returned in a named list.
ATTENTION: Data exchange via files is mandatory and takes long for large data. Interaction score computation is slow because it involves finding all simple paths up to a certain length between source and target node of the drug target edges. Don’t set
max_path_length in settings to a large value and only consider this step if your graphs have up to approximately 2 million edges. Computation is initiated by
calculate_interaction_score. The python script is parallelized using Ray. Use the setting
int_score_mode for sequential computation. Refer to the Ray documentation if you encounter problems with running the python script in parallel.
In this step the absolute difference of interaction scores between the two groups is computed. A single differential graph with the
differential_score as only edge attribute is returned.
In the last step the differential drug response score is calculated based on the
differential_score_graph. The score of a drug is the median of all differential edge scores adjacent to one of its target nodes. For drugs that do not have a target
NA is returned.
The head of the results is shown below. The drug response score is an indirect measure of how the strength of connectivity differs between the groups for the drug targets of the particular drug. It indicates which drugs could be interesting in the application of stratified medicine.
The mRNA, proteomics and phosphoproteomics data used in this vignette stems from Krug et. al, 2020, which contains data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Full citation:
Metabolite data was sampled randomly to generate distributions similar to those reported previously (e.g., in Terunuma et al., 2014).