Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-07 Thread Ralph Castain
The 1.4 series is regularly tested on slurm machines after every modification, 
and has been running at LANL (and other slurm installations) for quite some 
time, so I doubt that's the core issue. Likewise, nothing in the system depends 
upon the FQDN (or anything regarding hostname) - it's just used to print 
diagnostics.

Not sure of the issue, and I no longer have the ability to test/debug slurm 
myself, so I'll have to let Sam continue to look into this for you. It's probably 
some trivial difference in setup, unfortunately. I don't know if you said 
before, but it might help to know what slurm version you are using. Slurm tends 
to change a lot between versions (even minor releases), and it is one of the 
more finicky platforms we support.
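For what it's worth, the installed slurm version can be reported with the
standard client tools (a sketch, guarded in case the tools are absent on the
host where it is run):

```shell
# Print the SLURM version if the client tools are installed.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -V                     # prints e.g. "slurm 2.1.15"
else
    echo "SLURM client tools not found on this host"
fi
```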


On Feb 6, 2011, at 9:12 PM, Michael Curtis wrote:

> 
> On 07/02/2011, at 12:36 PM, Michael Curtis wrote:
> 
>> 
>> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>> 
>> Hi,
>> 
>>> I just tried to reproduce the problem that you are experiencing and was 
>>> unable to.
>>> 
>>> SLURM 2.1.15
>>> Open MPI 1.4.3 configured with: 
>>> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>> 
>> I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same 
>> platform file (the only change was to re-enable btl-tcp).
>> 
>> Unfortunately, the result is the same:
> 
> To reply to my own post again (sorry!), I tried OpenMPI 1.5.1.  This works 
> fine:
> salloc -n16 ~/../openmpi/bin/mpirun --display-map mpi
> salloc: Granted job allocation 151
> 
> ========================   JOB MAP   ========================
> 
> Data for node: ipc3   Num procs: 8
>   Process OMPI jobid: [3365,1] Process rank: 0
>   Process OMPI jobid: [3365,1] Process rank: 1
>   Process OMPI jobid: [3365,1] Process rank: 2
>   Process OMPI jobid: [3365,1] Process rank: 3
>   Process OMPI jobid: [3365,1] Process rank: 4
>   Process OMPI jobid: [3365,1] Process rank: 5
>   Process OMPI jobid: [3365,1] Process rank: 6
>   Process OMPI jobid: [3365,1] Process rank: 7
> 
> Data for node: ipc4   Num procs: 8
>   Process OMPI jobid: [3365,1] Process rank: 8
>   Process OMPI jobid: [3365,1] Process rank: 9
>   Process OMPI jobid: [3365,1] Process rank: 10
>   Process OMPI jobid: [3365,1] Process rank: 11
>   Process OMPI jobid: [3365,1] Process rank: 12
>   Process OMPI jobid: [3365,1] Process rank: 13
>   Process OMPI jobid: [3365,1] Process rank: 14
>   Process OMPI jobid: [3365,1] Process rank: 15
> 
> =============================================================
> Process 2 on eng-ipc3.{FQDN} out of 16
> Process 4 on eng-ipc3.{FQDN} out of 16
> Process 5 on eng-ipc3.{FQDN} out of 16
> Process 0 on eng-ipc3.{FQDN} out of 16
> Process 1 on eng-ipc3.{FQDN} out of 16
> Process 6 on eng-ipc3.{FQDN} out of 16
> Process 3 on eng-ipc3.{FQDN} out of 16
> Process 7 on eng-ipc3.{FQDN} out of 16
> Process 8 on eng-ipc4.{FQDN} out of 16
> Process 11 on eng-ipc4.{FQDN} out of 16
> Process 12 on eng-ipc4.{FQDN} out of 16
> Process 14 on eng-ipc4.{FQDN} out of 16
> Process 15 on eng-ipc4.{FQDN} out of 16
> Process 10 on eng-ipc4.{FQDN} out of 16
> Process 9 on eng-ipc4.{FQDN} out of 16
> Process 13 on eng-ipc4.{FQDN} out of 16
> salloc: Relinquishing job allocation 151
> 
> It does very much seem as though there is a bug of some sort in 1.4.3.
> 
> Michael
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-07 Thread Samuel K. Gutierrez

Hi,

A detailed backtrace from a core dump may help us debug this.  Would  
you be willing to provide that information for us?
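In case it helps, a typical way to capture one (a sketch only; the core-file
name and paths vary by system, and gdb is assumed to be available):

```shell
# Allow core files to be written, then re-run the failing job.
ulimit -c unlimited || echo "cannot raise core limit (hard limit?)"

# After the segfault, point gdb at mpirun and the core it left behind.
COREFILE=core                    # may be core.<pid> etc., per core_pattern
if [ -f "$COREFILE" ]; then
    # Batch mode: full backtrace for every thread, no interactive prompt.
    gdb --batch -ex "thread apply all bt full" mpirun "$COREFILE"
else
    echo "no core file found; check 'ulimit -c' and /proc/sys/kernel/core_pattern"
fi
```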


Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:



On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:

Hi,

I just tried to reproduce the problem that you are experiencing and  
was unable to.


SLURM 2.1.15
Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas


I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same  
platform file (the only change was to re-enable btl-tcp).


Unfortunately, the result is the same:
salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
salloc: Granted job allocation 145

========================   JOB MAP   ========================

Data for node: Name: eng-ipc4.{FQDN}   Num procs: 8
Process OMPI jobid: [6932,1] Process rank: 0
Process OMPI jobid: [6932,1] Process rank: 1
Process OMPI jobid: [6932,1] Process rank: 2
Process OMPI jobid: [6932,1] Process rank: 3
Process OMPI jobid: [6932,1] Process rank: 4
Process OMPI jobid: [6932,1] Process rank: 5
Process OMPI jobid: [6932,1] Process rank: 6
Process OMPI jobid: [6932,1] Process rank: 7

Data for node: Name: ipc3   Num procs: 8
Process OMPI jobid: [6932,1] Process rank: 8
Process OMPI jobid: [6932,1] Process rank: 9
Process OMPI jobid: [6932,1] Process rank: 10
Process OMPI jobid: [6932,1] Process rank: 11
Process OMPI jobid: [6932,1] Process rank: 12
Process OMPI jobid: [6932,1] Process rank: 13
Process OMPI jobid: [6932,1] Process rank: 14
Process OMPI jobid: [6932,1] Process rank: 15

=============================================================
[eng-ipc4:31754] *** Process received signal ***
[eng-ipc4:31754] Signal: Segmentation fault (11)
[eng-ipc4:31754] Signal code: Address not mapped (1)
[eng-ipc4:31754] Failing at address: 0x8012eb748
[eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
[eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) [0x7f81cf262869]
[eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) [0x7f81cef93338]
[eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) [0x7f81cef9397e]
[eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f81cef9356f]
[eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f81cef87916]
[eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
[eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) [0x7f81cf267ed7]
[eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
[eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
[eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d]
[eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
[eng-ipc4:31754] *** End of error message ***
salloc: Relinquishing job allocation 145
salloc: Job allocation 145 has been revoked.
zsh: exit 1     salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi


I've anonymised the paths and domain, otherwise pasted verbatim.   
The only odd thing I notice is that the launching machine uses its  
full domain name, whereas the other machine is referred to by the  
short name.  Despite the FQDN, the domain does not exist in the DNS  
(for historical reasons), but does exist in the /etc/hosts file.
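To double-check that both machines resolve their own names the same way,
something like the following on each node may be informative (a sketch;
`getent` assumes a glibc-based system, and the fallbacks are guarded):

```shell
# Compare the name each node reports with what name resolution returns.
hostname                               # name as the node reports it (short or FQDN)
hostname -f 2>/dev/null || hostname    # FQDN as glibc resolves it
# Show which entry answers (from /etc/hosts here, since DNS lacks the domain).
getent hosts "$(hostname)" || echo "no hosts entry for $(hostname)"
```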


Any further clues would be appreciated.  In case it may be relevant,  
core system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32.  One  
other point of difference may be that our environment is TCP  
(ethernet) based, whereas the LANL test environment presumably is not.


Michael

