I have a two-node cluster running Slurm, and I'm being asked about allowing many jobs (hundreds) to run simultaneously. Below is the scheduling section of my slurm.conf, which I changed to allow multiple jobs to run on each node:

# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core

For testing purposes, I'm running this job:

#!/bin/bash
#SBATCH --job-name=whatever
#SBATCH --output=slurmBatchLists_Aug19.out
#SBATCH --error=slurmBatchLists_Aug19.err
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --array=70-100
#SBATCH --cpus-per-task=5
matlab -nodisplay -nojvm -r 'sampleSlurm($SLURM_ARRAY_TASK_ID);'

...which gives me the following squeue output:

[mhohmeis@odin ~]$ squeue
JOBID          PARTITION  NAME      USER      ST  TIME  NODES  NODELIST(REASON)
1742_[82-100]  debug      whatever  mhohmeis  PD  0:00  1      (Resources)
1755_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1756_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1757_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1758_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1759_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1760_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1761_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1762_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1763_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1742_70        debug      whatever  mhohmeis  R   0:03  1      odin
1742_71        debug      whatever  mhohmeis  R   0:03  1      odin
1742_72        debug      whatever  mhohmeis  R   0:03  1      odin
1742_73        debug      whatever  mhohmeis  R   0:03  1      odin
1742_74        debug      whatever  mhohmeis  R   0:03  1      odin
1742_75        debug      whatever  mhohmeis  R   0:03  1      odin
1742_76        debug      whatever  mhohmeis  R   0:03  1      thor
1742_77        debug      whatever  mhohmeis  R   0:03  1      thor
1742_78        debug      whatever  mhohmeis  R   0:03  1      thor
1742_79        debug      whatever  mhohmeis  R   0:03  1      thor
1742_80        debug      whatever  mhohmeis  R   0:03  1      thor
1742_81        debug      whatever  mhohmeis  R   0:03  1      thor

They're interested in allowing *all* these jobs to run simultaneously.

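
As a rough sanity check on the six-tasks-per-node pattern in the squeue output, here is a minimal sketch of the arithmetic, assuming that with CR_Core each requested CPU consumes one core and that memory is not a limiting factor. The function name concurrent_tasks and the 32-core figure are hypothetical, chosen only for illustration; the actual core count of odin and thor is not stated here.

```shell
#!/bin/bash
# Estimate how many single-node array tasks can run at once.
# Assumption: one schedulable CPU per core, no memory limits in play.

concurrent_tasks() {
    local cores_per_node=$1   # hypothetical: cores available on each node
    local cpus_per_task=$2    # value passed to --cpus-per-task
    local nodes=$3            # number of nodes in the partition
    # With --nodes=1 every task must fit on one node, so leftover
    # cores on different nodes cannot be combined into another task.
    echo $(( (cores_per_node / cpus_per_task) * nodes ))
}

concurrent_tasks 32 5 2   # prints 12
concurrent_tasks 32 1 2   # prints 64
```

Under these assumptions, any per-node core count from 30 to 34 yields six tasks per node at --cpus-per-task=5, so the number of tasks running at once is bounded by cores rather than by the array size.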
Also, when they add #SBATCH --ntasks=30 to the above .sbatch file and submit it, the job stays pending with reason PartitionConfig:

[mhohmeis@odin ~]$ squeue
JOBID          PARTITION  NAME      USER      ST  TIME  NODES  NODELIST(REASON)
2052_[70-100]  debug      whatever  mhohmeis  PD  0:00  4      (PartitionConfig)

Any thoughts? Thanks!

Matt Hohmeister
Systems and Network Administrator
Department of Psychology
Florida State University
PO Box 3064301
Tallahassee, FL 32306-4301
Phone: +1 850 645 1902
Fax: +1 850 644 7739
Pronouns: he/him/his