The 1.4 series is regularly tested on SLURM machines after every modification, 
and has been running at LANL (and other SLURM installations) for quite some 
time, so I doubt the SLURM support itself is the core issue. Likewise, nothing 
in the system depends upon the FQDN (or anything else about the hostname) - 
it's just used to print diagnostics.
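
FWIW, the "Process N on eng-ipc3.{FQDN} out of 16" lines in your output come 
from your test program itself, not from Open MPI - presumably something along 
the lines of the minimal example below (an assumption on my part, since I 
haven't seen your source). MPI_Get_processor_name() just hands back whatever 
the node reports as its hostname, so seeing the FQDN there is expected and 
harmless:

    #include <stdio.h>
    #include <mpi.h>

    /* Minimal "who am I" test - a guess at what the "mpi" binary being
     * launched above does. */
    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);  /* node hostname as reported by the OS */
        printf("Process %d on %s out of %d\n", rank, name, size);
        MPI_Finalize();
        return 0;
    }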

I'm not sure what the issue is, and I no longer have the ability to test/debug 
SLURM myself, so I'll have to let Sam continue to look into this for you. It's 
probably some trivial difference in setup, unfortunately. I don't recall if you 
mentioned it before, but it would help to know which SLURM version you are 
using. SLURM tends to change a lot between versions (even minor releases), and 
it is one of the more finicky platforms we support.


On Feb 6, 2011, at 9:12 PM, Michael Curtis wrote:

> 
> On 07/02/2011, at 12:36 PM, Michael Curtis wrote:
> 
>> 
>> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>> 
>>> Hi,
>> 
>>> I just tried to reproduce the problem that you are experiencing and was 
>>> unable to.
>>> 
>>> SLURM 2.1.15
>>> Open MPI 1.4.3 configured with: 
>>> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>> 
>> I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same 
>> platform file (the only change was to re-enable btl-tcp).
>> 
>> Unfortunately, the result is the same:
> 
> To reply to my own post again (sorry!), I tried OpenMPI 1.5.1.  This works 
> fine:
> salloc -n16 ~/../openmpi/bin/mpirun --display-map mpi
> salloc: Granted job allocation 151
> 
> ========================   JOB MAP   ========================
> 
> Data for node: ipc3   Num procs: 8
>       Process OMPI jobid: [3365,1] Process rank: 0
>       Process OMPI jobid: [3365,1] Process rank: 1
>       Process OMPI jobid: [3365,1] Process rank: 2
>       Process OMPI jobid: [3365,1] Process rank: 3
>       Process OMPI jobid: [3365,1] Process rank: 4
>       Process OMPI jobid: [3365,1] Process rank: 5
>       Process OMPI jobid: [3365,1] Process rank: 6
>       Process OMPI jobid: [3365,1] Process rank: 7
> 
> Data for node: ipc4   Num procs: 8
>       Process OMPI jobid: [3365,1] Process rank: 8
>       Process OMPI jobid: [3365,1] Process rank: 9
>       Process OMPI jobid: [3365,1] Process rank: 10
>       Process OMPI jobid: [3365,1] Process rank: 11
>       Process OMPI jobid: [3365,1] Process rank: 12
>       Process OMPI jobid: [3365,1] Process rank: 13
>       Process OMPI jobid: [3365,1] Process rank: 14
>       Process OMPI jobid: [3365,1] Process rank: 15
> 
> =============================================================
> Process 2 on eng-ipc3.{FQDN} out of 16
> Process 4 on eng-ipc3.{FQDN} out of 16
> Process 5 on eng-ipc3.{FQDN} out of 16
> Process 0 on eng-ipc3.{FQDN} out of 16
> Process 1 on eng-ipc3.{FQDN} out of 16
> Process 6 on eng-ipc3.{FQDN} out of 16
> Process 3 on eng-ipc3.{FQDN} out of 16
> Process 7 on eng-ipc3.{FQDN} out of 16
> Process 8 on eng-ipc4.{FQDN} out of 16
> Process 11 on eng-ipc4.{FQDN} out of 16
> Process 12 on eng-ipc4.{FQDN} out of 16
> Process 14 on eng-ipc4.{FQDN} out of 16
> Process 15 on eng-ipc4.{FQDN} out of 16
> Process 10 on eng-ipc4.{FQDN} out of 16
> Process 9 on eng-ipc4.{FQDN} out of 16
> Process 13 on eng-ipc4.{FQDN} out of 16
> salloc: Relinquishing job allocation 151
> 
> It does seem very much like there is a bug of some sort in 1.4.3?
> 
> Michael
> 
> 

