Whoops; we shouldn't be seg faulting. :-\
The warning is exactly what it implies -- it found the OpenFabrics
network stack but no functioning OpenFabrics-capable hardware. You can
disable it (and the segv) by preventing the openib BTL from running:
mpirun --mca btl ^openib
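If you don't want to type that every time, you can also set it via the
environment or a config file -- assuming a bourne-style shell, something
like:
export OMPI_MCA_btl=^openib
or a line like "btl = ^openib" in $HOME/.openmpi/mca-params.conf.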
But what I don't see is why we're segv'ing when calling
ibv_destroy_srq(). That function is part of the shutdown sequence of the
openib BTL, but it shouldn't be getting called at all given the error
message that you're seeing. Are you getting corefiles, perchance?
Could you get a stack trace with the file and line numbers in OMPI
where this is happening?
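If there is a core file lying around, something along these lines should
show us the offending frame (I'm assuming the core is named core.3997
and that gdb is installed on node66):
gdb ~/hello.x core.3997
(gdb) bt
If the backtrace only shows addresses, rebuilding Open MPI with
--enable-debug (or at least compiling with -g) should get the file and
line numbers to show up.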
Do you have OpenFabrics hardware on your cluster? According to your
error message, node18 is the one that doesn't find any OF-capable
hardware, but node66 is the one that segv's, which is darn weird...
On Mar 5, 2009, at 12:13 AM, Shinta Bonnefoy wrote:
Hi,
I am the admin of a small cluster (server running under SLES 10.1 and
nodes on OSS 10.3), and I have just installed Open MPI 1.3 on it.
I'm trying to get a simple program (like hello world) running, but it
fails all the time on one of the nodes and never on the others.
I don't think it's related to the program, since it's the simplest one
you can write.
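(For reference, it's essentially the textbook MPI hello world -- roughly
the following, from memory rather than the exact file:)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    /* start MPI and find out this process' rank and the total rank count */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello world from process %d of %d\n", rank, size);

    /* the crash reported below happens inside this shutdown call */
    MPI_Finalize();
    return 0;
}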
All the nodes share the Open MPI install directory through NFS and all
have the same profile.
Here is the runtime error I get:
mpirun -machinefile no -np 6 ~/hello.x
--------------------------------------------------------------------------
[[6735,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: node18
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Hello world from process 3 of 6
Hello world from process 1 of 6
Hello world from process 4 of 6
Hello world from process 2 of 6
Hello world from process 5 of 6
Hello world from process 0 of 6
[node66:03997] *** Process received signal ***
[node66:03997] Signal: Segmentation fault (11)
[node66:03997] Signal code: Address not mapped (1)
[node66:03997] Failing at address: (nil)
[node66:03997] [ 0] /lib64/libpthread.so.0 [0x2b5e227a4fb0]
[node66:03997] [ 1] /usr/lib64/libibverbs.so.1(ibv_destroy_srq+0) [0x2b5e24ee0fa0]
[node66:03997] [ 2] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_btl_openib.so [0x2b5e250eb2dd]
[node66:03997] [ 3] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(mca_btl_base_close+0x87) [0x2b5e21aa2a67]
[node66:03997] [ 4] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_bml_r2.so [0x2b5e24cc39d2]
[node66:03997] [ 5] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_pml_ob1.so [0x2b5e24aa2d0e]
[node66:03997] [ 6] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(mca_pml_base_finalize+0x1b) [0x2b5e21aacd2f]
[node66:03997] [ 7] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0 [0x2b5e21a66a7b]
[node66:03997] [ 8] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(MPI_Finalize+0x17) [0x2b5e21a84207]
[node66:03997] [ 9] /home/donald/hello.x(main+0x6d) [0x401bd5]
[node66:03997] [10] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b5e229cfb54]
[node66:03997] [11] /home/donald/hello.x [0x401ad9]
[node66:03997] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 3997 on node node66 exited
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[node72:07895] 4 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[node72:07895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Please advise,
Thanks and regards,
SB
--
Jeff Squyres
Cisco Systems