When we run one of our codes (Gadget2) over either Myrinet or Gigabit Ethernet, it fails predictably with the following error message. From the backtrace it looks as if the SEGV is in ompi_coll_tuned_reduce_generic.
Have there been similar reports, and is there a fix for this?

Lydia Heck

[m2042:08002] *** Process received signal ***
[m2042:08002] Signal: Segmentation Fault (11)
[m2042:08002] Signal code: Address not mapped (1)
[m2042:08002] Failing at address: 92
/opt/OMPI/ompi-1.2b4r13488/lib/libopen-pal.so.0.0.0:opal_backtrace_print+0x26
/opt/OMPI/ompi-1.2b4r13488/lib/libopen-pal.so.0.0.0:0xc3874
/lib/amd64/libc.so.1:0xcb686
/lib/amd64/libc.so.1:0xc0a52
/opt/OMPI/ompi-1.2b4r13488/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_reduce_generic+0x11b [ Signal 11 (SEGV)]
/opt/OMPI/ompi-1.2b4r13488/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_reduce_intra_binary+0x162
/opt/OMPI/ompi-1.2b4r13488/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_reduce_intra_dec_fixed+0x28d
/opt/OMPI/ompi-1.2b4r13488/lib/libmpi.so.0.0.0:PMPI_Reduce+0x3f6
/data/4/nil/tak_gadget/gadget2/P-Gadget2:gravity_tree+0x146c
/data/4/nil/tak_gadget/gadget2/P-Gadget2:compute_accelerations+0x7e
/data/4/nil/tak_gadget/gadget2/P-Gadget2:run+0xa5
/data/4/nil/tak_gadget/gadget2/P-Gadget2:main+0x22f
/data/4/nil/tak_gadget/gadget2/P-Gadget2:0x7c3c
[m2042:08002] *** End of error message ***
[m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_gridengine_module.c at line 793
[m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
mpirun noticed that job rank 2 with PID 0 on node m2043 exited on signal 11 (Segmentation Fault).
[m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[m2043:07816] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_gridengine_module.c at line 828
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
------------------------------------------
Dr E L Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___________________________________________