Hello, I am running SLURM 17.11 and have a user with a complicated workflow. The user wants 250 cores for 2 weeks to do some work semi-interactively. I'm not going to give the user a reservation for this work, because the whole point of having a scheduler is to minimize human intervention in job scheduling.
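For scale, the umbrella allocation I have in mind would look roughly like the following; the task count and walltime come from the request above, and any partition/account flags are site-specific and omitted:

```
[headnode01] $ salloc --ntasks=250 --time=14-00:00:00   # 250 cores for 2 weeks
```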
The code uses MPI (openmpi-1.8 with gcc-4.9.2). The process I originally envisioned was to allocate an interactive job (a new shell gets spawned) and then run `mpirun`, with SLURM dispatching the work to the allocation, i.e.

```
[headnode01] $ salloc --ntasks=2 --nodes=2
(SLURM grants allocation on node[01,02] and a new shell spawns)
[headnode01] $ mpirun -np 2 ./executable   # SLURM dispatches work to node[01,02]
```

This doesn't work in the user's situation. Their workflow involves a master job that automatically spawns daughter MPI jobs (5 cores per job, for a total of 50 jobs), which get dispatched using `sbatch`. It would be impractical to manage 50 interactive shells. I was imagining doing something like the following (a rough sketch is included after the `srun` tests below):

1. Get an interactive allocation using `salloc`.
2. Submit a batch job that, within it, uses `srun --jobid=XXXX` to use the resources allocated in step 1.

I created a simple code, `tmp.c`, to test this process.

`tmp.c`:
```
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char * argv[])
{
    int taskID = -1;
    int ntasks = -1;
    int nSteps = 100;          // Number of steps in the master loop
    int step   = 0;            // Current step
    char hostname[250];
    hostname[249] = '\0';

    /* MPI Initializations */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskID);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    gethostname(hostname, sizeof(hostname) - 1);   /* stay within the buffer */
    printf("Hello from Task %i on %s\n", taskID, hostname);

    /* Master Loop */
    for(step = 0; step < nSteps; step++){
        printf("%i : task %i hostname %s\n", step, taskID, hostname);
        usleep(1000000);
        fflush(stdout);
        MPI_Barrier(MPI_COMM_WORLD);   // Ensure every task completes the step
    }

    MPI_Finalize();
    return 0;
}
```

I compile, allocate resources, and then try to use `srun` to utilize those resources, i.e.

```
[headnode01] $ mpicc tmp.c -o tmp
[headnode01] $ salloc --ntasks=3 --nodes=3
(SLURM grants allocation on node[14-16] and a new shell spawns)
[headnode01] $ srun --jobid=XXXX --ntasks=1 --pty ./tmp   # Done from a different shell, not the new shell...
Hello from Task 0 on node14.cluster
0 : task 0 hostname node14.cluster
1 : task 0 hostname node14.cluster
2 : task 0 hostname node14.cluster
```

Ok, this is expected: 1 MPI task running on 1 node with 1 core. If I do

```
[headnode01] $ srun --jobid=XXXX --ntasks=2 --pty ./tmp   # Done from a different shell, not the new shell...
Hello from Task 0 on node14.cluster
0 : task 0 hostname node14.cluster
1 : task 0 hostname node14.cluster
2 : task 0 hostname node14.cluster
```

this is unexpected. I would expect task 0 and task 1 to be on node[14,15], because I have 3 cores/tasks allocated across 3 nodes. Instead, if I look at node[14,15], I see that both nodes have a process `tmp` running, but I only catch the stdout from node14. Why is that?

If I instead try it without `--pty`,

```
srun --jobid=2440814 --ntasks=2 --mpi=openmpi ./tmp
Hello from Task 0 on node14.cluster
0 : task 0 hostname node14.cluster
Hello from Task 0 on node15.cluster
0 : task 0 hostname node15.cluster
1 : task 0 hostname node14.cluster
1 : task 0 hostname node15.cluster
```

this is also not what I want. I don't want two separate instances of `tmp` running on two separate nodes; I want the program `tmp` to utilize two cores on two different nodes.
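For concreteness, here is a rough sketch of the master-side workflow I described in steps 1-2 above. The job ID XXXX, the script name, the daughter executable, and its input files are all placeholders; this is the idea I'm aiming for, not something that currently works:

```
#!/bin/bash
# master.sh -- hypothetical sketch of the intended workflow
#
# Step 1 is done by hand on the head node and grants allocation XXXX:
#   [headnode01] $ salloc --ntasks=250
#
# Step 2: this script is submitted with sbatch; it spawns the 50
# daughter MPI runs, each of which should use 5 cores out of the
# resources already held by allocation XXXX.
for i in $(seq 1 50); do
    srun --jobid=XXXX --ntasks=5 ./daughter input_${i}.dat &
done
wait
```

The idea would be to `sbatch master.sh` while holding the `salloc` allocation, but as the `srun` tests above show, I can't get a single `srun --jobid` launch to spread one MPI run across the allocated nodes the way I expect.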
Going back to the `--ntasks=2` tests, the output I would instead expect is:

```
Hello from Task 0 on node14.cluster
0 : task 0 hostname node14.cluster
Hello from Task 1 on node15.cluster
0 : task 1 hostname node15.cluster
1 : task 0 hostname node14.cluster
1 : task 1 hostname node15.cluster
```

I can achieve this expected output if I run

```
sbatch --ntasks=2 --nodes=2 --wrap="mpirun -np 2 ./tmp"
```

but I'd like to do this interactively.

QUESTION: How do I create an allocation and then utilize parts and pieces of that single allocation using `srun` with MPI processes? I'd like an MPI process launched via `srun` to be able to utilize multiple cores spread across multiple nodes.

Best,

======================================
Ali Snedden, Ph.D.
HPC Scientific Programmer
The High Performance Computing Facility
Nationwide Children’s Hospital Research Institute