Hi Tim,

You could try passing -mca pls_gridengine_verbose 1 to mpirun to see whether SGE is able to start the ORTE daemons on the remote nodes successfully.

It looks like you are hitting a problem that another user reported previously. You may want to read through this thread and check your ifconfig settings for anything suspicious:
http://www.open-mpi.org/community/lists/users/2007/02/2669.php

My 2 cents...

Tim Campbell wrote:
Greetings,

I am using OpenMPI v1.2.3 via SGE on a network of amd64 workstations. When mpirun tries to start the processes on certain nodes, I get the following error output:

[sr70][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
[sr71][0,1,3][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111

Using perl -e 'die$!=111' I see that the error message is "Connection refused". I am able to connect to both nodes in question via ssh and/or rsh. I changed btl_base_debug to 2, but that did not provide any additional information.
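
For reference, the same errno lookup can be sketched in Python. This is only an illustration: the helper name describe_errno is hypothetical, and the mapping of 111 to "Connection refused" assumes a Linux host, since errno numbers are platform-specific.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform."""
    # errno.errorcode maps numeric values to names such as "ECONNREFUSED".
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the same text the C library would report.
    return name, os.strerror(code)


if __name__ == "__main__":
    # 111 is the value reported in the btl_tcp_endpoint.c output above.
    name, message = describe_errno(111)
    print(f"errno=111 -> {name}: {message}")
```

The symbolic name is usually easier to search for than the bare number when looking through list archives.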

What are some possible issues that might be causing this? What can I do to get more information?

Thanks,
~Tim


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--

- Pak Lui
pak....@sun.com
