Dear all:
As part of the ongoing investigation into why a particular in-house
program does not run in parallel across multiple nodes under Open MPI,
I have been running it with "--mca btl self,sm,tcp" and keep hitting
the following error:
[compute-6-15.local][[8185,1],0][btl_tcp_endpoint.c:653:mca_btl_tcp_endpoint_complete_connect] connect() to 10.7.36.247 failed: Connection timed out (110)
I thought at first it was due to running out of file handles (sockets
are considered files), but I have amended limits.d to allow 102400 files
(up from the default of 1024), which should be more than enough.
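One thing I have not ruled out is whether the raised limit actually reaches the MPI processes, since limits.d entries are applied at session setup and may not be inherited by processes started some other way. A minimal sketch of the check, assuming Python is available on the compute nodes and that the script is started through the same launcher as the real job (the function name nofile_limits is only illustrative):

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # The soft limit is the one enforced at runtime; the hard limit is its ceiling.
    print(f"open files: soft={soft} hard={hard}")
```

If the soft value printed on a node is still 1024, the limits.d change is not in effect for that process.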
What is going on? The error above appeared for connections to 4 of the
20 nodes.
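To separate a plain network problem from anything specific to Open MPI, a single TCP connection to the failing address can be attempted outside of MPI. The sketch below is only an assumption about how one might test this; the function name probe and the choice of port are hypothetical, and the ports Open MPI itself picks are not known in advance. A timeout, rather than a refusal, would suggest that packets to that address are being dropped or that the address is not reachable from the connecting node.

```python
import errno
import socket


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection and return a short status string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No answer within `timeout` seconds.
        return "timed out"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # The kernel's own connect timeout surfaces as ETIMEDOUT (110 on Linux).
        if exc.errno == errno.ETIMEDOUT:
            return "timed out"
        return f"error: {exc}"
```

For example, probe("10.7.36.247", 22) run from compute-6-15 would test the address from the error message, assuming an SSH daemon listens on port 22 there.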
My second question concerns the notifier support for the openib BTL.
What does the syslog notifier require in order to work? I want to check
whether the errors I get when running the same program with openib are
caused by dropped IB connections.
--
T. Vince Grimes, Ph.D.
CCC System Administrator
Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061
(806) 834-0813 (voice); (806) 742-1289 (fax)