Slurm: 17.11.4

I want to run an interactive job on a compute node. I know that I'm going to need to run an MPI app, so I request a bunch of tasks up front:

    srun -n 16 --gres=gpu:4 --pty $SHELL

This creates a job with 4 nodes.

    ...
    SLURM_CPUS_ON_NODE=4
    SLURM_DISTRIBUTION=block
    SLURM_JOB_CPUS_PER_NODE=4(x4)
    SLURM_JOB_NODELIST=n1,n2,n3,n4
    SLURM_JOB_NUM_NODES=16
    SLURM_NNODES=4
    SLURM_NPROCS=16
    SLURM_NTASKS=16
    SLURM_STEP_NODELIST=n1,n2,n3,n4
    SLURM_STEP_NUM_NODES=4
    SLURM_STEP_NUM_TASKS=16
    SLURM_STEP_TASKS_PER_NODE=4(x4)
    SLURM_TASKS_PER_NODE=4(x4)
    ...

Once on the compute node, I run an application that makes a system call to a shell script. Of my 16 cores, I want to make use of 10. That shell script contains the following call:

    srun -l -n 10 a.out

However, Slurm comes back with:

    srun: Warning: can't run 10 processes on 16 nodes, setting nnodes to 10
    srun: error: Only allocated 4 nodes asked for 10
    Exiting with code: 1

If I instead run

    srun -l -n 16 a.out

the app hangs, and the debug output shows:

    srun: jobid 1234: nodes(4):`n1,n2,n3,n4', cpu counts: 4(x4)
    srun: error: SLURM_NNODES environment variable conflicts with allocated node count (16 != 4).
    srun: debug: requesting job 1234, user 5678, nodes 4 including ((null))
    srun: debug: cpus 16, tasks 16, name a.out, relative 65534
    srun: Job 1234 step creation temporarily disabled, retrying
    srun: debug: Got signal 2
    srun: Cancelled pending job step with signal 2
    srun: error: Unable to create step for job 1234: Job/step already completing or completed

I think there are two issues:

1. I'm asking for a gres that is being consumed by the outer srun (but my inner srun is going to need the GPUs, so I need to ensure I'm asking for them up front).
2. Even without the gres, Slurm still doesn't seem to like srun calling srun.

How do I make this work?

Thanks,
Raymond
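
P.S. In case it helps anyone reproduce the environment mismatch, here is a minimal sketch of a check that could be dropped into the shell script before the inner srun. It only compares the two node-count variables from the listing above (SLURM_JOB_NUM_NODES and SLURM_NNODES) and reports whether they agree. The function name check_nnodes is hypothetical and not part of Slurm, and I am only assuming that this disagreement is related to the "conflicts with allocated node count" error; I have not confirmed that.

```shell
#!/bin/sh
# Diagnostic sketch only: compares two Slurm node-count variables.
# check_nnodes is a hypothetical helper name, not a Slurm command.
check_nnodes() {
    # Fall back to the literal string "unset" when a variable is missing.
    job_nodes="${SLURM_JOB_NUM_NODES:-unset}"
    nnodes="${SLURM_NNODES:-unset}"
    echo "SLURM_JOB_NUM_NODES=$job_nodes SLURM_NNODES=$nnodes"
    if [ "$job_nodes" != "$nnodes" ]; then
        echo "mismatch"
        return 1
    fi
    echo "consistent"
}

# Run the check without aborting the surrounding script on a mismatch.
check_nnodes || true
```

With the values from my session (16 and 4), the last line printed would be "mismatch".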