Very interesting! I see the problem: we have never encountered SLURM_TASKS_PER_NODE in that format, while SLURM_JOB_CPUS_PER_NODE indicates that we have indeed been allocated two processors on each of the nodes! So when you just do mpirun without specifying the number of processes, we launch 4 processes (2 on each node), since that is what SLURM told us we have been given.
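For anyone following along, the mismatch comes from SLURM's compact count syntax: SLURM_TASKS_PER_NODE=2,1 expands to 2+1 = 3 tasks, while SLURM_JOB_CPUS_PER_NODE=2(x2) expands to 2*2 = 4 CPUs. The snippet below is only a rough sketch of that expansion for illustration; it is not Open MPI's actual parsing code, and the function name expand_slurm_counts is made up for this example.

#include <stdio.h>
#include <stdlib.h>

/* Expand SLURM's compact per-node count syntax, e.g. "2,1" -> 3
 * or "2(x2)" -> 4.  Illustrative sketch only, not Open MPI source. */
static int expand_slurm_counts(const char *spec)
{
    int total = 0;

    while (*spec) {
        char *end;
        long count  = strtol(spec, &end, 10);  /* count for this entry   */
        long repeat = 1;

        if (*end == '(') {                     /* optional "(xN)" repeat */
            repeat = strtol(end + 2, &end, 10);
            if (*end == ')')
                end++;
        }
        total += (int)(count * repeat);

        if (*end == ',')                       /* skip the separator     */
            end++;
        spec = end;
    }
    return total;
}

int main(void)
{
    /* Values taken from the printenv output quoted below in this thread */
    printf("SLURM_TASKS_PER_NODE=2,1      -> %d tasks\n",
           expand_slurm_counts("2,1"));        /* prints 3 */
    printf("SLURM_JOB_CPUS_PER_NODE=2(x2) -> %d cpus\n",
           expand_slurm_counts("2(x2)"));      /* prints 4 */
    return 0;
}

With those two totals disagreeing, trusting the CPU count explains why mpirun started 4 processes even though only 3 were requested.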
Interesting configuration you have there. I can add some logic that tests for internal consistency between the two and compensates for the discrepancy.

Can you get a slightly bigger allocation, one that covers several nodes? For example, "salloc -n7"? And then send the output again from "printenv | grep SLURM"? I need to see whether your configuration uses a regex to describe SLURM_TASKS_PER_NODE, and what it looks like.

Thanks
Ralph

On Mon, Aug 24, 2009 at 1:55 PM, <matthew.pi...@ndsu.edu> wrote:
> Hello,
>
> Hopefully the information below is helpful.
>
> SLURM Version: 1.3.15
>
> node64-test ~>salloc -n3
> salloc: Granted job allocation 826
>
> node64-test ~>srun hostname
> node64-24.xxxx.xxxx.xxxx.xxxx
> node64-25.xxxx.xxxx.xxxx.xxxx
> node64-24.xxxx.xxxx.xxxx.xxxx
>
> node64-test ~>printenv | grep SLURM
> SLURM_NODELIST=node64-[24-25]
> SLURM_NNODES=2
> SLURM_JOBID=826
> SLURM_TASKS_PER_NODE=2,1
> SLURM_JOB_ID=826
> SLURM_NPROCS=3
> SLURM_JOB_NODELIST=node64-[24-25]
> SLURM_JOB_CPUS_PER_NODE=2(x2)
> SLURM_JOB_NUM_NODES=2
>
> node64-test ~>mpirun --display-allocation hostname
>
> ======================   ALLOCATED NODES   ======================
>
>  Data for node: Name: node64-test.xxxx.xxxx.xxxx.xxxx  Num slots: 0  Max slots: 0
>  Data for node: Name: node64-24  Num slots: 2  Max slots: 0
>  Data for node: Name: node64-25  Num slots: 2  Max slots: 0
>
> =================================================================
> node64-24.xxxx.xxxx.xxxx.xxxx
> node64-24.xxxx.xxxx.xxxx.xxxx
> node64-25.xxxx.xxxx.xxxx.xxxx
> node64-25.xxxx.xxxx.xxxx.xxxx
>
> Thanks,
> Matt
>
>
> > Haven't seen that before on any of our machines.
> >
> > Could you do "printenv | grep SLURM" after the salloc and send the
> > results?
> >
> > What version of SLURM is this?
> >
> > Please run "mpirun --display-allocation hostname" and send the results.
> >
> > Thanks
> > Ralph
> >
> > On Mon, Aug 24, 2009 at 11:30 AM, <matthew.pi...@ndsu.edu> wrote:
> >
> >> Hello,
> >>
> >> I seem to have run into an interesting problem with openMPI. After
> >> allocating 3 processors and confirming that the 3 processors are
> >> allocated, mpirun on a simple mpitest program seems to run on 4
> >> processors. We have 2 processors per node. I can repeat this case with
> >> any odd number of nodes; openMPI seems to take any remaining processors
> >> on the box. We are running openMPI v1.3.3.
> >> Here is an example of what happens:
> >>
> >> node64-test ~>salloc -n3
> >> salloc: Granted job allocation 825
> >>
> >> node64-test ~>srun hostname
> >> node64-28.xxxx.xxxx.xxxx.xxxx
> >> node64-28.xxxx.xxxx.xxxx.xxxx
> >> node64-29.xxxx.xxxx.xxxx.xxxx
> >>
> >> node64-test ~>MX_RCACHE=0
> >> LD_LIBRARY_PATH="/hurd/mpi/openmpi/lib:/usr/local/mx/lib" mpirun
> >> mpi_pgms/mpitest
> >> MPI domain size: 4
> >> I am rank 000 - node64-28.xxxx.xxxx.xxxx.xxxx
> >> I am rank 003 - node64-29.xxxx.xxxx.xxxx.xxxx
> >> I am rank 001 - node64-28.xxxx.xxxx.xxxx.xxxx
> >> I am rank 002 - node64-29.xxxx.xxxx.xxxx.xxxx
> >>
> >>
> >> For those who may be curious here is the program:
> >>
> >> #include <stdio.h>
> >> #include <stdlib.h>
> >> #include <mpi.h>
> >>
> >> extern int main(int argc, char *argv[]);
> >>
> >> extern int main(int argc, char *argv[])
> >> {
> >>     auto int rank,
> >>              size,
> >>              namelen;
> >>
> >>     MPI_Status status;
> >>
> >>     static char processor_name[MPI_MAX_PROCESSOR_NAME];
> >>
> >>     MPI_Init(&argc, &argv);
> >>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>
> >>     if ( rank == 0 )
> >>     {
> >>         MPI_Get_processor_name(processor_name, &namelen);
> >>         fprintf(stdout,"My name is: %s\n",processor_name);
> >>         fprintf(stdout,"Cluster size is: %d\n", size);
> >>     }
> >>     else
> >>     {
> >>         MPI_Get_processor_name(processor_name, &namelen);
> >>         fprintf(stdout,"My name is: %s\n",processor_name);
> >>     }
> >>
> >>     MPI_Finalize();
> >>     return(0);
> >> }
> >>
> >> I'm curious if this is a bug in the way openMPI interprets SLURM
> >> environment variables. If you have any ideas or need any more
> >> information let me know.
> >>
> >> Thanks.
> >> Matt
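A minimal sketch of the kind of internal-consistency test Ralph describes above, assuming the environment variables behave as shown in this thread: expand both compact counts, compare them against SLURM_NPROCS, and fall back to the requested task count when they disagree. This is not Open MPI code, and the choice of falling back to SLURM_NPROCS is an assumption about how the discrepancy might be compensated for.

#include <stdio.h>
#include <stdlib.h>

/* Same toy expander as in the earlier sketch: "2,1" -> 3, "2(x2)" -> 4. */
static int expand_slurm_counts(const char *spec)
{
    int total = 0;
    while (*spec) {
        char *end;
        long count  = strtol(spec, &end, 10);
        long repeat = 1;
        if (*end == '(') {                 /* "(xN)" repetition suffix */
            repeat = strtol(end + 2, &end, 10);
            if (*end == ')')
                end++;
        }
        total += (int)(count * repeat);
        if (*end == ',')
            end++;
        spec = end;
    }
    return total;
}

int main(void)
{
    const char *tasks  = getenv("SLURM_TASKS_PER_NODE");
    const char *cpus   = getenv("SLURM_JOB_CPUS_PER_NODE");
    const char *nprocs = getenv("SLURM_NPROCS");

    if (!tasks || !cpus || !nprocs) {
        fprintf(stderr, "not running inside a SLURM allocation\n");
        return 1;
    }

    int n_tasks = expand_slurm_counts(tasks);   /* 3 in the example above */
    int n_cpus  = expand_slurm_counts(cpus);    /* 4 in the example above */
    int n_req   = atoi(nprocs);                 /* 3 in the example above */

    if (n_tasks != n_cpus)
        printf("inconsistent allocation: %d tasks vs %d cpus; "
               "falling back to the requested count of %d\n",
               n_tasks, n_cpus, n_req);
    else
        printf("allocation is consistent: %d tasks\n", n_tasks);
    return 0;
}

Under the allocation quoted above this would report 3 tasks vs 4 cpus and settle on 3, which matches what "salloc -n3" actually asked for.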