As I was exploring Stampede2 at TACC for more RNA-seq data analysis, I needed to map the RNA-seq reads of 68 samples to the Daphnia genome. On Stampede2, even a single serial job using one CPU is allocated the resources of an entire node, so I decided to use Launcher, a tool built into Stampede2, to bundle multiple serial jobs and run the 68 mapping jobs in parallel.
The RNA-seq aligner I used was STAR, which can be loaded as a module on Stampede2. I first created a mapping script for each paired-end RNA-seq sample; an example is below. For the specific meaning of each option, please refer to the STAR manual.
cd /your/working/directory
STAR --genomeDir /directory/to/your/STARindex \
--outFileNamePrefix SRR2062534 \
--outSAMstrandField intronMotif \
--quantMode GeneCounts \
--twopassMode Basic \
--runThreadN 2 \
--readFilesIn SRR2062534_1.fastq SRR2062534_2.fastq \
--outFilterMultimapNmax 1 \
--outReadsUnmapped Fastx \
--outFilterMatchNminOverLread 0.1 \
--outFilterScoreMinOverLread 0.1 \
--outSAMtype BAM SortedByCoordinate
These scripts are named 1.sh, 2.sh, 3.sh, … 68.sh and are all placed in a directory named mapping_scripts.
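Rather than writing 68 scripts by hand, they can be generated with a loop like the sketch below. The samples.txt file (one SRR accession per line) is a hypothetical input you would provide yourself; the paths are the same placeholders used above.

# Sketch: generate 1.sh, 2.sh, ... 68.sh from a list of SRR accessions
# (samples.txt is a hypothetical file with one accession per line)
i=1
while read -r ACC; do
  cat > mapping_scripts/${i}.sh <<EOF
cd /your/working/directory
STAR --genomeDir /directory/to/your/STARindex \\
--outFileNamePrefix ${ACC} \\
--outSAMstrandField intronMotif \\
--quantMode GeneCounts \\
--twopassMode Basic \\
--runThreadN 2 \\
--readFilesIn ${ACC}_1.fastq ${ACC}_2.fastq \\
--outFilterMultimapNmax 1 \\
--outReadsUnmapped Fastx \\
--outFilterMatchNminOverLread 0.1 \\
--outFilterScoreMinOverLread 0.1 \\
--outSAMtype BAM SortedByCoordinate
EOF
  i=$((i + 1))
done < samples.txt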
To load the launcher, use the command: module load launcher
Copy the file $LAUNCHER_DIR/extras/batch-scripts/launcher.slurm into the mapping_scripts directory.
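For example (the destination path is a placeholder for your own mapping_scripts directory):

cp $LAUNCHER_DIR/extras/batch-scripts/launcher.slurm /path/to/your/mapping_scripts/

The content of launcher.slurm is listed below; the places to change are noted in the comments.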
#! /bin/bash
# Simple SLURM script for submitting multiple serial
# jobs (e.g. parametric studies) using a script wrapper
# to launch the jobs.
#
# To use, build the launcher executable and your
# serial application(s) and place them in your WORKDIR
# directory. Then, edit the CONTROL_FILE to specify
# each executable per process.
#------------------------------------------------------
#------------------------------------------------------
#
#  <------ Setup Parameters ------>
#
#SBATCH -J STAR #name of the job
#SBATCH -N 1 #how many nodes you need
#SBATCH -n 7 #how many jobs to run in parallel on this node
#SBATCH -p normal #the Stampede2 queue to use
#SBATCH -o STAR.o%j #change according to your job name
#SBATCH -e STAR.e%j #change according to your job name
#SBATCH -t 48:00:00 #number of hours for the job to run. 48hr is the maximum for normal queue.
#SBATCH --mail-user=youremail@gmail.com #email address to send notification
#SBATCH --mail-type=all # Send email at begin and end of job
#  <------ Account String ------>
#  <--- (Use this ONLY if you have MULTIPLE accounts) --->
##SBATCH -A
#------------------------------------------------------
export LAUNCHER_PLUGIN_DIR=$LAUNCHER_DIR/plugins
export LAUNCHER_RMI=SLURM
export LAUNCHER_JOB_FILE=/directory/to/your/jobfile #change the path here to point to your job file
$LAUNCHER_DIR/paramrun
Then I created a jobfile that asks the system to execute 7 mapping scripts; an example is below. I chose 7 because of the memory requirement of the Daphnia genome (about 200 Mb): one node only has enough memory for about 7 of these mapping jobs. I experimented with different numbers of jobs to make sure all of them could run without a problem; if a few jobs do not get enough memory, they quit while the others keep running to completion. We can use the development queue to test for the optimal number (see the sketch after the jobfile example below).
bash /full/path/to/your/1.sh
bash /full/path/to/your/2.sh
...
bash /full/path/to/your/7.sh
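To find the optimal number before committing to the normal queue, the same launcher.slurm header can be pointed at the development queue. This is only a sketch; the queue name and time limit below are what I would expect on Stampede2 and should be checked against the current user guide.

#SBATCH -J STAR-test #a short test run
#SBATCH -N 1 #one node, same as the real run
#SBATCH -n 7 #vary this number to see how many mapping jobs fit in memory
#SBATCH -p development #development queue for quick tests
#SBATCH -t 02:00:00 #development jobs are limited to short wall times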
Once you have a jobfile like this, a few more jobfiles need to be created to cover the remaining mapping jobs. Similarly, we need to create a set of files that are identical to the launcher.slurm file above, except that the export LAUNCHER_JOB_FILE=/directory/to/your/jobfile line and the job names are changed to point to the corresponding jobfiles. Let's call these files launcher-1.slurm, launcher-2.slurm, and so on: launcher-1.slurm runs the first 7 mapping tasks, launcher-2.slurm runs the second batch of 7 mapping tasks, etc. A sketch for generating these files automatically is shown below.
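Rather than editing each copy by hand, the jobfiles and launcher-N.slurm files can be generated with a small loop like this sketch. The batch size, the jobfile names, and the sed edits are my assumptions; adjust the paths to your own setup.

# Sketch: split 68 mapping scripts into batches of 7 and create one
# jobfile plus one launcher-N.slurm per batch (file names are illustrative)
BATCH=7
TOTAL=68
b=1
for start in $(seq 1 $BATCH $TOTAL); do
  end=$((start + BATCH - 1))
  [ $end -gt $TOTAL ] && end=$TOTAL
  # jobfile-$b lists the mapping scripts for this batch
  for i in $(seq $start $end); do
    echo "bash /full/path/to/your/${i}.sh"
  done > jobfile-$b
  # launcher-$b.slurm is a copy of launcher.slurm that points at jobfile-$b
  sed -e "s|^export LAUNCHER_JOB_FILE=.*|export LAUNCHER_JOB_FILE=$PWD/jobfile-$b|" \
      -e "s|^#SBATCH -J .*|#SBATCH -J STAR-$b|" \
      launcher.slurm > launcher-$b.slurm
  b=$((b + 1))
done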
After this is all done, we can submit each of the launcher-N.slurm files, for example: sbatch launcher-1.slurm
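If there are many batches, they can all be submitted with a short loop (a sketch assuming the launcher-N.slurm naming above):

for f in launcher-*.slurm; do
  sbatch "$f"
done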
After submission, each launcher script will spread its jobs onto a node and run them separately. The status of the jobs can be monitored using the command: showq -u yourUserName