Hello OpenMPI list!
I am trying to run GROMACS with Open MPI 1.5, compiled from source
with the Intel compilers, using the Torque/Maui scheduler.
I am getting the following error. The error points to Open MPI,
hence I am posting my query here.

[compute-0-4.local][[19774,1],0][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.6.123 failed: Connection refused (111)

The job hangs (no output for a long time). The strange thing about
this error is that it occurs only on random occasions. Sometimes the
job finishes without any error messages, sometimes this error shows up
in the middle of GROMACS' STDERR stream, and sometimes I only get the
following -

NNODES=4, MYRANK=0, HOSTNAME=compute-0-4.local
NODEID=0 argc=12
[compute-0-4.local][[19774,1],0][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.6.123 failed: Connection refused (111)
NNODES=4, MYRANK=1, HOSTNAME=compute-0-4.local
NNODES=4, MYRANK=2, HOSTNAME=compute-0-130.local
NODEID=2 argc=12
NNODES=4, MYRANK=3, HOSTNAME=compute-0-130.local
NODEID=1 argc=12
NODEID=3 argc=12

I can attach full logs of successful jobs, but they don't contain any
OpenMPI-related messages.

When I searched for
btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect, I found
the following link -
http://docs.notur.no/uit/stallo_documentation/error/mpi-init-failed
which says: "This is probably due to a weakness of the system when the
job is assigned to nodes with and without infiniband at the same time."
However, our system doesn't have any InfiniBand fabric. We do have two
GigE networks, eth0 and eth1, both of which are working fine.
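One thing I am considering trying, since we have two interfaces, is restricting Open MPI's TCP BTL to a single network so it cannot pick an interface that the remote node refuses connections on. A sketch of what that would look like (the interface name eth0 is just an example from our setup, not a verified fix):

```
# On the mpirun command line:
#   mpirun --mca btl_tcp_if_include eth0 ...
#
# Or persistently, in $HOME/.openmpi/mca-params.conf
# (or the system-wide openmpi-mca-params.conf):
btl_tcp_if_include = eth0

# The complementary parameter, btl_tcp_if_exclude, can instead be
# used to rule out an interface (e.g. loopback or a management NIC).
```

Would this be the recommended way to handle a dual-GigE setup, or is the intermittent "Connection refused" likely caused by something else?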

Please help.

Thank you

-
Sudarshan Wadkar
System Administrator
HPCC, IITB

-- 
~$udhi
"Success is getting what you want. Happiness is wanting what you get."
- Dale Carnegie
"It's always our decision who we are"
- Robert Solomon in Waking Life
