Hi,
I've got some code that uses Open MPI, and it sometimes crashes after
printing something like:
[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
mpirun noticed that job rank 1 with PID 9658 on node mac1 exited on signal 6 (Aborted).
2 additional processes aborted (not shown)
[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. Returned
value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
In this case, all processes were running on the same machine, so it's not a
connection problem. Is this a bug, or is something else wrong? Is there a
way to increase the timeout?
Thanks...