NB: Running on a cluster usually requires some scripting and coding skills; however, with the VPN graphical connections, it’s becoming easier for non-programmers to run any software. Below, we provide some example scripts that can usually be copied and used with small modifications on many clusters. If in doubt, check with your administrator and/or write to us!
To run Haplin on a cluster, you will need an MPI implementation and the Rmpi package, both installed manually before installing the Haplin package. How to install extra R packages can vary from cluster to cluster, so check the manual!
To run a job on a cluster, usually one needs to submit a script to a job queue. The submission method varies depending on the queue system used, so check the help pages of your cluster. Here, we present the quite popular SLURM queueing system.
Below is an example script that sets up a SLURM job:

    #!/bin/bash
    #SBATCH --job-name=haplin_cluster_run
    #SBATCH --output=haplin_cluster_run.out
    #SBATCH --nodes=3
    #SBATCH --ntasks-per-node=8
    #SBATCH --time=8:00:00
    #SBATCH --mem-per-cpu=100
    #SBATCH --mail-user=user@domain.com
    #SBATCH --mail-type=ALL

    module load R
    module load openmpi

    echo "nodes: $SLURM_JOB_NODELIST"
    # file listing the cores available to this job (see the explanation below)
    myhostfile="cur_nodes.dat"

    echo "----STARTING THE JOB----"
    date
    echo "------------------------"

    mpiexec --hostfile $myhostfile -n 1 R --save < haplin_cluster_run.r >& mpi_run.out
    exit_status=$?

    echo "----JOB EXITED WITH STATUS---: $exit_status"
    echo "----DONE----"
    exit $exit_status
Here, the important part is the mpiexec line, where the R session is launched to run in parallel on several cores. To achieve this with the Rmpi package, one needs to provide a list of the cores currently available to the user, which is done through the --hostfile $myhostfile part. This means that the given file should hold a list of cores. If this is not available automatically on the cluster, one can extract it from the $SLURM_JOB_NODELIST variable (see the submit_haplin_cluster_rmpi.sh script in this folder).
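As an illustration of extracting the list from $SLURM_JOB_NODELIST, here is a minimal sketch; it is not the content of submit_haplin_cluster_rmpi.sh. It assumes Open MPI's hostfile format (one `hostname slots=N` line per node) and that SLURM's `scontrol show hostnames` is available on the cluster. The helper name `write_hostfile` is hypothetical, and using `$SLURM_NTASKS_PER_NODE` as the slot count is an assumption that matches the `--ntasks-per-node` setting above.

```shell
# Hypothetical helper: read one hostname per line on stdin and
# print Open MPI hostfile lines of the form "hostname slots=N".
# $1 = number of slots (cores) to advertise per host.
write_hostfile() {
    slots="$1"
    while IFS= read -r host; do
        # skip empty lines
        [ -n "$host" ] || continue
        printf '%s slots=%s\n' "$host" "$slots"
    done
}

# Inside a SLURM job: expand the compact node list (e.g. "node[01-03]")
# into one hostname per line and write the file passed to --hostfile.
# Outside a SLURM job (no scontrol or no node list) this step is skipped.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    scontrol show hostnames "$SLURM_JOB_NODELIST" \
        | write_hostfile "${SLURM_NTASKS_PER_NODE:-1}" > cur_nodes.dat
fi
```

If used, these lines would go in the job script before the mpiexec call, so that cur_nodes.dat exists when mpiexec reads it. Some clusters generate the hostfile automatically, in which case this step is unnecessary.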
For a more detailed explanation of the #SBATCH commands, see e.g., the official documentation.
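Once the job script is saved, it is handed to the queue with `sbatch`. The sketch below shows one way to submit it and look up the job afterwards. It assumes the script above was saved as haplin_cluster_run.sh (a hypothetical file name) and that your cluster exposes the standard SLURM commands `sbatch` and `squeue`; the helper `submit_job` is likewise only illustrative.

```shell
# Hypothetical helper: submit a job script and print only its job ID.
# "sbatch --parsable" prints "jobid" or "jobid;clustername".
submit_job() {
    job_id=$(sbatch --parsable "$1") || return 1
    # drop the optional ";clustername" suffix
    job_id=${job_id%%;*}
    echo "$job_id"
}

# Only attempt a submission where SLURM is actually installed.
if command -v sbatch >/dev/null 2>&1; then
    # haplin_cluster_run.sh is an assumed file name for the script above
    id=$(submit_job haplin_cluster_run.sh) && squeue -j "$id"
fi
```

The printed job ID can later be passed to `scancel` to cancel the job if something was misconfigured.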
The most effective way of using Haplin on a cluster is to run haplinSlide on a large GWAS dataset. Preparing the data and calling haplinSlide is the same as for a single run (see the section above). However, before calling any parallel function, one needs to set up the cluster with the initParallelRun() function.
This will make use of the maximum number of available cores. If one wants to limit the run to a specific number of CPUs, the cpus argument needs to be specified.
Then, when invoking the analysis, one needs to specify that the Rmpi package will be used:

    haplinSlide(
      trial.data2.prep,
      use.missing = TRUE,
      ccvar = 2,
      design = "cc.triad",
      reference = "ref.cat",
      response = "mult",
      para.env = "Rmpi"
    )
Finally, right before the script finishes, we need to close all the threads created by initParallelRun, which is done by calling finishParallelRun().
CAUTION: If the user forgets to call this function before exiting R, all the work will still be saved; however, mpirun will end with an error.
To sum up, an example R script to run on a cluster would look like this:

    library( Haplin )

    initParallelRun()

    chosen.markers <- 3:55
    data.in <- genDataLoad( filename = "mynicedata" )

    # analysis without maternal risks calculated
    results1 <- haplinSlide( data = data.in, markers = chosen.markers,
      winlength = 2, design = "triad", use.missing = TRUE,
      maternal = FALSE, response = "free", cpus = 2,
      verbose = FALSE, printout = FALSE, para.env = "Rmpi" )

    # analysis with maternal risks calculated
    results2 <- haplinSlide( data = data.in, markers = chosen.markers,
      winlength = 2, design = "triad", use.missing = TRUE,
      maternal = TRUE, response = "mult", cpus = 2,
      verbose = FALSE, printout = FALSE, para.env = "Rmpi" )

    finishParallelRun()
IMPORTANT: To run in parallel, we need to specify both the cpus and para.env arguments; however, the true number of CPUs used will be set within initParallelRun and not by the