A bus error usually means that an invalid address was passed as a
pointer somewhere in the code; it is not usually a communications
error.
Without more information, it's rather difficult to speculate on what
happened here. Did you get core files? If so, are there useful
backtraces available?
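
Until core files turn up, the signal lines in the quoted report can at least be untangled per process. Below is a minimal Python sketch of that, not an Open MPI tool: the names summarize, describe and SAMPLE are made up for illustration, SAMPLE is a reordered excerpt of the report, and the code tables assume the Linux si_code values (for SIGBUS, 1 = BUS_ADRALN, 2 = BUS_ADRERR, 3 = BUS_OBJERR; 128 = SI_KERNEL).

```python
import re

# Assumed Linux si_code meanings; other platforms may number these differently.
BUS_CODES = {
    1: "BUS_ADRALN (invalid address alignment)",
    2: "BUS_ADRERR (nonexistent physical address)",
    3: "BUS_OBJERR (object-specific hardware error)",
}
GENERIC_CODES = {128: "SI_KERNEL (sent by the kernel)"}

LINE_RE = re.compile(r"^\[([^:\]]+):(\d+)\]\s+(.*)$")   # [host:pid] body
SIGNAL_RE = re.compile(r"^Signal: (.+) \((\d+)\)$")
CODE_RE = re.compile(r"^Signal code:\s*\((-?\d+)\)$")
ADDR_RE = re.compile(r"^Failing at address: (\S+)$")

# Reordered excerpt of the quoted report, used only as sample input.
SAMPLE = """\
[compute-0-7:25433] *** Process received signal ***
[compute-0-7:25433] Signal: Bus error (7)
[compute-0-7:25433] Signal code: (2)
[compute-0-7:25433] Failing at address: 0x4217b8
[compute-0-7:25431] *** Process received signal ***
[compute-0-7:25431] Signal: Bus error (7)
[compute-0-7:25431] Signal code: (2)
[compute-0-7:25431] Failing at address: 0x4217b8
[compute-0-7:25433] *** Process received signal ***
[compute-0-7:25433] Signal: Segmentation fault (11)
[compute-0-7:25433] Signal code: (128)
[compute-0-7:25433] Failing at address: (nil)
"""


def summarize(log_text):
    """Group signal reports by PID, keeping each PID's own ordering."""
    events = {}  # pid -> list of {"signal", "signo", "code", "address"}
    for raw in log_text.splitlines():
        line = LINE_RE.match(raw.strip())
        if not line:
            continue
        pid, body = int(line.group(2)), line.group(3).strip()
        sig = SIGNAL_RE.match(body)
        if sig:
            # A "Signal:" line starts a new report for this PID.
            events.setdefault(pid, []).append(
                {"signal": sig.group(1), "signo": int(sig.group(2))})
            continue
        if not events.get(pid):
            continue  # code/address line with no preceding "Signal:" line
        code = CODE_RE.match(body)
        addr = ADDR_RE.match(body)
        if code:
            events[pid][-1]["code"] = int(code.group(1))
        elif addr:
            events[pid][-1]["address"] = addr.group(1)
    return events


def describe(event):
    """Render one report with the assumed meaning of its signal code."""
    code = event.get("code")
    meaning = BUS_CODES.get(code) if event["signal"] == "Bus error" else None
    meaning = meaning or GENERIC_CODES.get(code, "unknown")
    return "{} code {}: {}, address {}".format(
        event["signal"], code, meaning, event.get("address"))


if __name__ == "__main__":
    for pid, reports in sorted(summarize(SAMPLE).items()):
        for report in reports:
            print(pid, describe(report))
```

Run over the full report, this makes it easier to see that every rank reports the same failing address for the bus error and that the segmentation faults come afterwards. It does not replace a backtrace from a core file.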
On Oct 1, 2009, at 6:01 AM, Sangamesh B wrote:
Hi,
A Fortran application compiled with ifort 10.1 and
Open MPI 1.3.1 on CentOS 5.2 fails after running for 4 days with the
following error message:
[compute-0-7:25430] *** Process received signal ***
[compute-0-7:25433] *** Process received signal ***
[compute-0-7:25433] Signal: Bus error (7)
[compute-0-7:25433] Signal code: (2)
[compute-0-7:25433] Failing at address: 0x4217b8
[compute-0-7:25431] *** Process received signal ***
[compute-0-7:25431] Signal: Bus error (7)
[compute-0-7:25431] Signal code: (2)
[compute-0-7:25431] Failing at address: 0x4217b8
[compute-0-7:25432] *** Process received signal ***
[compute-0-7:25432] Signal: Bus error (7)
[compute-0-7:25432] Signal code: (2)
[compute-0-7:25432] Failing at address: 0x4217b8
[compute-0-7:25430] Signal: Bus error (7)
[compute-0-7:25430] Signal code: (2)
[compute-0-7:25430] Failing at address: 0x4217b8
[compute-0-7:25431] *** Process received signal ***
[compute-0-7:25431] Signal: Segmentation fault (11)
[compute-0-7:25431] Signal code: (128)
[compute-0-7:25431] Failing at address: (nil)
[compute-0-7:25430] *** Process received signal ***
[compute-0-7:25433] *** Process received signal ***
[compute-0-7:25433] Signal: Segmentation fault (11)
[compute-0-7:25433] Signal code: (128)
[compute-0-7:25433] Failing at address: (nil)
[compute-0-7:25432] *** Process received signal ***
[compute-0-7:25432] Signal: Segmentation fault (11)
[compute-0-7:25432] Signal code: (128)
[compute-0-7:25432] Failing at address: (nil)
[compute-0-7:25430] Signal: Segmentation fault (11)
[compute-0-7:25430] Signal code: (128)
[compute-0-7:25430] Failing at address: (nil)
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 25433 on node
compute-0-7.local exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
This job is run with 4 Open MPI processes on nodes that are
interconnected with Gigabit Ethernet.
The same job runs well on nodes with InfiniBand connectivity.
What could be the reason for this? Could it be due to a loose physical
connection, since it is reporting a bus error?
--
Jeff Squyres
jsquy...@cisco.com