Hello OpenMPI list!

I am trying to run GROMACS with Open MPI 1.5 (compiled from source with the Intel compilers) under the Torque/Maui scheduler, and I am getting the following error. The error indicates a problem with Open MPI, hence I am posting my query here.
[compute-0-4.local][[19774,1],0][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.6.123 failed: Connection refused (111)

The job then hangs (no output for a long time). The strange thing about this error is that I get it only on random occasions. Sometimes the job finishes without any error messages, sometimes the error shows up in the middle of GROMACS's STDERR stream, and sometimes I get only the following:

NNODES=4, MYRANK=0, HOSTNAME=compute-0-4.local
NODEID=0 argc=12
[compute-0-4.local][[19774,1],0][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.6.123 failed: Connection refused (111)
NNODES=4, MYRANK=1, HOSTNAME=compute-0-4.local
NNODES=4, MYRANK=2, HOSTNAME=compute-0-130.local
NODEID=2 argc=12
NNODES=4, MYRANK=3, HOSTNAME=compute-0-130.local
NODEID=1 argc=12
NODEID=3 argc=12

I can attach full logs of successful jobs, but they don't contain any Open MPI-related messages.

When I searched for "btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect", I found the following link:

http://docs.notur.no/uit/stallo_documentation/error/mpi-init-failed

which says: "This is probably due to a weakness of the system when the job is assigned to nodes with and without infiniband at the same time." However, our system doesn't have any InfiniBand fabric. We do have two GigE networks, eth0 and eth1, both of which are working fine.

Please help. Thank you.

- Sudarshan Wadkar
System Administrator
HPCC, IITB

--
~$udhi
"Success is getting what you want. Happiness is wanting what you get." - Dale Carnegie
"It's always our decision who we are" - Robert Solomon in Waking Life
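P.S. One thing I am considering trying, based on the Open MPI FAQ's advice for multi-NIC hosts: restricting the TCP BTL to a single interface so that a rank never advertises an address its peers cannot reach. This is only a sketch; the choice of eth0 and the `mdrun_mpi -deffnm topol` command line are placeholders, not our actual job script.

```shell
# Sketch (assumption: eth0 is the cluster-private network reachable from
# every compute node). Pin Open MPI's TCP BTL to that one interface:
mpirun --mca btl_tcp_if_include eth0 -np 4 mdrun_mpi -deffnm topol

# Equivalently, exclude loopback and the suspect second network instead:
mpirun --mca btl_tcp_if_exclude lo,eth1 -np 4 mdrun_mpi -deffnm topol
```

Both `btl_tcp_if_include` and `btl_tcp_if_exclude` are standard MCA parameters (only one of the two should be set at a time).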