Dear mailing list,

We are experiencing run-time failures on a small cluster with openmpi-2.0.2 
and either gcc 6.3 or gcc 5.4.
The job starts normally and performs a lot of communication. After 5-10 
minutes the connection to the hosts is closed and
the following error message is reported:
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
 one or more nodes. Please check your PATH and LD_LIBRARY_PATH
 settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
 Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
 Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
 (e.g., on Cray). Please check your configure cmd line and consider using
 one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
 lack of common network interfaces and/or no route found between
 them. Please check network connectivity (including firewalls
 and network routing requirements).



The issue does not seem to be due to the InfiniBand configuration, because the 
job also crashes when using TCP.

Do you have any idea what could be causing this?


Thanks a lot,

Vincent

