The 1.4 series is tested on SLURM machines after every modification, and it has been running at LANL (and other SLURM installations) for quite some time, so I doubt that's the core issue. Likewise, nothing in the system depends on the FQDN (or on the hostname at all) - it's just used to print diagnostics.
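
For reference, the "Process N on <host> out of M" lines in the output you quote below come from the test program itself, not from Open MPI. A minimal sketch of that kind of program follows (the actual test source isn't part of this thread, so the file name and details here are assumptions); it just prints the rank, the name returned by MPI_Get_processor_name, and the communicator size:

    /* mpi_hello.c -- hypothetical name; a minimal sketch of the kind of test
     * program that prints the "Process N on <host> out of M" lines quoted
     * below.  The actual test source isn't shown in this thread.            */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank        */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes  */
        MPI_Get_processor_name(name, &len);     /* node hostname (often FQDN) */

        printf("Process %d on %s out of %d\n", rank, name, size);

        MPI_Finalize();
        return 0;
    }

Built with something like "mpicc mpi_hello.c -o mpi", it would produce exactly that style of output when launched with the salloc/mpirun command quoted below, whether or not the hostname resolves to an FQDN.
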
I'm not sure of the issue, and I don't have the ability to test/debug SLURM any more, so I'll have to let Sam continue to look into this for you. It's probably some trivial difference in setup, unfortunately.

I don't know if you said before, but it might help to know which SLURM version you are using. SLURM tends to change a lot between versions (even minor releases), and it is one of the more finicky platforms we support.

On Feb 6, 2011, at 9:12 PM, Michael Curtis wrote:

>
> On 07/02/2011, at 12:36 PM, Michael Curtis wrote:
>
>>
>> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>>
>> Hi,
>>
>>> I just tried to reproduce the problem that you are experiencing and was
>>> unable to.
>>>
>>> SLURM 2.1.15
>>> Open MPI 1.4.3 configured with:
>>> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>>
>> I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same
>> platform file (the only change was to re-enable btl-tcp).
>>
>> Unfortunately, the result is the same:
>
> To reply to my own post again (sorry!), I tried OpenMPI 1.5.1. This works
> fine:
> salloc -n16 ~/../openmpi/bin/mpirun --display-map mpi
> salloc: Granted job allocation 151
>
> ========================   JOB MAP   ========================
>
> Data for node: ipc3    Num procs: 8
>     Process OMPI jobid: [3365,1] Process rank: 0
>     Process OMPI jobid: [3365,1] Process rank: 1
>     Process OMPI jobid: [3365,1] Process rank: 2
>     Process OMPI jobid: [3365,1] Process rank: 3
>     Process OMPI jobid: [3365,1] Process rank: 4
>     Process OMPI jobid: [3365,1] Process rank: 5
>     Process OMPI jobid: [3365,1] Process rank: 6
>     Process OMPI jobid: [3365,1] Process rank: 7
>
> Data for node: ipc4    Num procs: 8
>     Process OMPI jobid: [3365,1] Process rank: 8
>     Process OMPI jobid: [3365,1] Process rank: 9
>     Process OMPI jobid: [3365,1] Process rank: 10
>     Process OMPI jobid: [3365,1] Process rank: 11
>     Process OMPI jobid: [3365,1] Process rank: 12
>     Process OMPI jobid: [3365,1] Process rank: 13
>     Process OMPI jobid: [3365,1] Process rank: 14
>     Process OMPI jobid: [3365,1] Process rank: 15
>
> =============================================================
> Process 2 on eng-ipc3.{FQDN} out of 16
> Process 4 on eng-ipc3.{FQDN} out of 16
> Process 5 on eng-ipc3.{FQDN} out of 16
> Process 0 on eng-ipc3.{FQDN} out of 16
> Process 1 on eng-ipc3.{FQDN} out of 16
> Process 6 on eng-ipc3.{FQDN} out of 16
> Process 3 on eng-ipc3.{FQDN} out of 16
> Process 7 on eng-ipc3.{FQDN} out of 16
> Process 8 on eng-ipc4.{FQDN} out of 16
> Process 11 on eng-ipc4.{FQDN} out of 16
> Process 12 on eng-ipc4.{FQDN} out of 16
> Process 14 on eng-ipc4.{FQDN} out of 16
> Process 15 on eng-ipc4.{FQDN} out of 16
> Process 10 on eng-ipc4.{FQDN} out of 16
> Process 9 on eng-ipc4.{FQDN} out of 16
> Process 13 on eng-ipc4.{FQDN} out of 16
> salloc: Relinquishing job allocation 151
>
> It does seem very much like there is a bug of some sort in 1.4.3?
>
> Michael
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users