Dear all:

In the ongoing investigation into why a particular in-house program does not run in parallel across multiple nodes under OpenMPI, I keep running into the following error when running with "--mca btl self,sm,tcp":

[compute-6-15.local][[8185,1],0][btl_tcp_endpoint.c:653:mca_btl_tcp_endpoint_complete_connect] connect() to 10.7.36.247 failed: Connection timed out (110)

I thought at first it was due to running out of file handles (sockets are considered files), but I have amended limits.d to allow 102400 files (up from the default of 1024), which should be more than enough.
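A minimal sketch for confirming the limit a process actually sees, since entries under limits.d are applied by PAM at session start and may not reach jobs launched some other way. It uses only Python's standard resource module; the helper name nofile_limits is made up for illustration, and it assumes Python 3 is available on the compute nodes.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # A value equal to resource.RLIM_INFINITY means "unlimited".
    print(f"open files: soft={soft} hard={hard}")
```

Running this through the same launcher as the MPI job (rather than from a login shell) would show whether the raised limit is in effect where it matters.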

What is going on? Connections to 4 of the 20 nodes failed with the error above.
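One way to narrow this down outside of MPI is a plain TCP connection test from the failing node to the address in the error. The sketch below is only an illustration: can_connect is a hypothetical helper, and the port in the example is an assumption (the TCP BTL picks its ports dynamically, so a known open service such as sshd on port 22 would have to stand in).

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout seconds."""
    try:
        # create_connection resolves the host and applies the timeout to connect().
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts alike.
        return False


if __name__ == "__main__":
    # 10.7.36.247 is the address from the error above; port 22 is an assumed
    # stand-in for any service known to be listening on that node.
    print(can_connect("10.7.36.247", 22))
```

If this also times out for the four affected nodes, the problem is more likely routing, firewalling, or interface selection than anything inside the program itself.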

My second question involves the notify system for btl openib. What does the syslog notifier require in order to work? I want to see if the errors running the same program with openib are due to dropped IB connections.

--
T. Vince Grimes, Ph.D.
CCC System Administrator

Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061

(806) 834-0813 (voice);     (806) 742-1289 (fax)
