A bus error usually means that an invalid address was passed as a pointer somewhere in the code -- it's not usually a communications error.

Without more information, it's rather difficult to speculate on what happened here. Did you get core files? If so, are there useful backtraces available?
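
In the meantime, the interleaved signal report quoted below can be untangled per process. Here is a minimal sketch in Python, assuming the si_code values that Linux documents in sigaction(2); other platforms may number them differently. The names `summarize` and `SI_CODES` are hypothetical helpers for illustration and are not part of Open MPI.

```python
import re

# Assumption: Linux si_code values as listed in sigaction(2).
SI_CODES = {
    ("Bus error", 1): "BUS_ADRALN (invalid address alignment)",
    ("Bus error", 2): "BUS_ADRERR (nonexistent physical address)",
    ("Bus error", 3): "BUS_OBJERR (object-specific hardware error)",
    ("Segmentation fault", 1): "SEGV_MAPERR (address not mapped to object)",
    ("Segmentation fault", 2): "SEGV_ACCERR (invalid permissions for mapped object)",
    ("Segmentation fault", 128): "SI_KERNEL (sent by the kernel)",
}

LINE = re.compile(r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] (?P<body>.*)$")
SIGNAL = re.compile(r"^Signal: (?P<name>.+) \((?P<num>\d+)\)$")
CODE = re.compile(r"^Signal code:\s*(?P<name>.*?)\s*\((?P<code>\d+)\)$")
ADDR = re.compile(r"^Failing at address: (?P<addr>\S+)$")


def summarize(report):
    """Group the signal events in an Open MPI crash report by PID."""
    events = {}
    for raw in report.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue  # blank lines, separators, mpirun summary
        pid, body = int(m["pid"]), m["body"]
        per_pid = events.get(pid)
        if (s := SIGNAL.match(body)):
            # A "Signal:" line starts a new event for this process.
            events.setdefault(pid, []).append(
                {"signal": s["name"], "number": int(s["num"])}
            )
        elif per_pid and (c := CODE.match(body)):
            ev = per_pid[-1]
            ev["code"] = int(c["code"])
            ev["meaning"] = SI_CODES.get((ev["signal"], ev["code"]), "unknown")
        elif per_pid and (a := ADDR.match(body)):
            per_pid[-1]["address"] = a["addr"]
    return events


# A few lines copied from the report quoted below.
sample = """\
[compute-0-7:25433] *** Process received signal ***
[compute-0-7:25433] Signal: Bus error (7)
[compute-0-7:25433] Signal code:  (2)
[compute-0-7:25433] Failing at address: 0x4217b8
[compute-0-7:25431] *** Process received signal ***
[compute-0-7:25431] Signal: Segmentation fault (11)
[compute-0-7:25431] Signal code:  (128)
[compute-0-7:25431] Failing at address: (nil)
"""

for pid, evs in summarize(sample).items():
    for ev in evs:
        print(pid, ev["signal"], ev.get("code"), ev.get("meaning"), ev.get("address"))
```

Under those assumed Linux definitions, code 2 on the bus error corresponds to BUS_ADRERR and code 128 on the segmentation fault to SI_KERNEL. Neither identifies the root cause by itself, so a backtrace from a core file is still what's needed.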


On Oct 1, 2009, at 6:01 AM, Sangamesh B wrote:

Hi,

A Fortran application compiled with ifort 10.1 and Open MPI 1.3.1 on CentOS 5.2 fails after running for 4 days with the following error message:

[compute-0-7:25430] *** Process received signal ***
[compute-0-7:25433] *** Process received signal ***
[compute-0-7:25433] Signal: Bus error (7)
[compute-0-7:25433] Signal code:  (2)
[compute-0-7:25433] Failing at address: 0x4217b8
[compute-0-7:25431] *** Process received signal ***
[compute-0-7:25431] Signal: Bus error (7)
[compute-0-7:25431] Signal code:  (2)
[compute-0-7:25431] Failing at address: 0x4217b8
[compute-0-7:25432] *** Process received signal ***
[compute-0-7:25432] Signal: Bus error (7)
[compute-0-7:25432] Signal code:  (2)
[compute-0-7:25432] Failing at address: 0x4217b8
[compute-0-7:25430] Signal: Bus error (7)
[compute-0-7:25430] Signal code:  (2)
[compute-0-7:25430] Failing at address: 0x4217b8
[compute-0-7:25431] *** Process received signal ***
[compute-0-7:25431] Signal: Segmentation fault (11)
[compute-0-7:25431] Signal code:  (128)
[compute-0-7:25431] Failing at address: (nil)
[compute-0-7:25430] *** Process received signal ***
[compute-0-7:25433] *** Process received signal ***
[compute-0-7:25433] Signal: Segmentation fault (11)
[compute-0-7:25433] Signal code:  (128)
[compute-0-7:25433] Failing at address: (nil)
[compute-0-7:25432] *** Process received signal ***
[compute-0-7:25432] Signal: Segmentation fault (11)
[compute-0-7:25432] Signal code:  (128)
[compute-0-7:25432] Failing at address: (nil)
[compute-0-7:25430] Signal: Segmentation fault (11)
[compute-0-7:25430] Signal code:  (128)
[compute-0-7:25430] Failing at address: (nil)
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 25433 on node compute-0-7.local exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
This job runs with 4 Open MPI processes on nodes that are interconnected with Gigabit Ethernet.
The same job runs well on the nodes with InfiniBand connectivity.

What could be the reason for this? Since it is reporting a bus error, could it be due to loose physical connections?


--
Jeff Squyres
jsquy...@cisco.com
