Hi Matt,

Check out the "OverSubscribe" partition parameter. Try setting your partition to OverSubscribe=YES and then submitting the jobs with the --oversubscribe option (or OverSubscribe=FORCE if you want this to happen for all jobs submitted to the partition). Either OverSubscribe setting (YES or FORCE) can be followed by a colon and the maximum number of jobs that can be assigned to a resource (if I recall correctly it defaults to 4, so you may want to increase it to cover the number of jobs you need; that is, the maximum number of jobs you need to run simultaneously divided by the number of cores available in the partition).
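For example (a minimal sketch only; the count of 20 and the script name below are placeholders, and the partition/node names are taken from your squeue output), the partition definition in slurm.conf might look like:

    PartitionName=debug Nodes=odin,thor OverSubscribe=FORCE:20 State=UP

With OverSubscribe=YES instead of FORCE, each job would also need to opt in at submission time:

    sbatch --oversubscribe myjob.sbatch

So if, say, you need roughly 200 tasks running at once on a partition with roughly 40 cores total, a count of about 5 (200 / 40) after the colon would be the starting point.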
Matt Jay
HPC Systems Engineer - Hyak
Research Computing
University of Washington Information Technology

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Matt Hohmeister
Sent: Thursday, September 26, 2019 9:14 AM
To: slurm-us...@schedmd.com
Subject: [slurm-users] Running multiple jobs simultaneously

I have a two-node cluster running Slurm, and I'm being asked about allowing multiple jobs (hundreds of jobs) to run simultaneously. Following is the scheduling part of my slurm.conf, which I changed to allow multiple jobs to run on each node:

# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core

For testing purposes, I'm running this job:

#!/bin/bash
#SBATCH --job-name=whatever
#SBATCH --output=slurmBatchLists_Aug19.out
#SBATCH --error=slurmBatchLists_Aug19.err
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --array=70-100
#SBATCH --cpus-per-task=5

matlab -nodisplay -nojvm -r 'sampleSlurm($SLURM_ARRAY_TASK_ID);'

...which gives me the following squeue output:

[mhohmeis@odin ~]$ squeue
            JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    1742_[82-100]     debug whatever mhohmeis PD  0:00     1 (Resources)
    1755_[70-100]     debug whatever mhohmeis PD  0:00     1 (Priority)
    1756_[70-100]     debug whatever mhohmeis PD  0:00     1 (Priority)
    1757_[70-100]     debug whatever mhohmeis PD  0:00     1 (Priority)
    1758_[70-100]     debug whatever mhohmeis PD  0:00     1 (Priority)
    1759_[70-100]     debug whatever mhohmeis PD  0:00     1 (Priority)
    1760_[70-100]     debug whatever mhohmeis PD  0:00     1 (Priority)
    1761_[70-100]     debug whatever mhohmeis PD  0:00     1 (Priority)
    1762_[70-100]     debug whatever mhohmeis PD  0:00     1 (Priority)
    1763_[70-100]     debug whatever mhohmeis PD  0:00     1 (Priority)
          1742_70     debug whatever mhohmeis  R  0:03     1 odin
          1742_71     debug whatever mhohmeis  R  0:03     1 odin
          1742_72     debug whatever mhohmeis  R  0:03     1 odin
          1742_73     debug whatever mhohmeis  R  0:03     1 odin
          1742_74     debug whatever mhohmeis  R  0:03     1 odin
          1742_75     debug whatever mhohmeis  R  0:03     1 odin
          1742_76     debug whatever mhohmeis  R  0:03     1 thor
          1742_77     debug whatever mhohmeis  R  0:03     1 thor
          1742_78     debug whatever mhohmeis  R  0:03     1 thor
          1742_79     debug whatever mhohmeis  R  0:03     1 thor
          1742_80     debug whatever mhohmeis  R  0:03     1 thor
          1742_81     debug whatever mhohmeis  R  0:03     1 thor

They're interested in allowing *all* these jobs to run simultaneously. Also, when they add #SBATCH --ntasks=30 to the above .sbatch file, this happens when they try to run it:

[mhohmeis@odin ~]$ squeue
            JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    2052_[70-100]     debug whatever mhohmeis PD  0:00     4 (PartitionConfig)

Any thoughts? Thanks!

Matt Hohmeister
Systems and Network Administrator
Department of Psychology
Florida State University
PO Box 3064301
Tallahassee, FL 32306-4301
Phone: +1 850 645 1902
Fax: +1 850 644 7739
Pronouns: he/him/his
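For reference, a minimal sketch of the quoted test script with the opt-in flag added (this assumes the debug partition has been switched to OverSubscribe=YES as suggested above and the controller has re-read its configuration; the --oversubscribe directive is the only change to the quoted script):

    #!/bin/bash
    #SBATCH --job-name=whatever
    #SBATCH --output=slurmBatchLists_Aug19.out
    #SBATCH --error=slurmBatchLists_Aug19.err
    #SBATCH --partition=debug
    #SBATCH --nodes=1
    #SBATCH --array=70-100
    #SBATCH --cpus-per-task=5
    #SBATCH --oversubscribe

    matlab -nodisplay -nojvm -r 'sampleSlurm($SLURM_ARRAY_TASK_ID);'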