Hmmm....perhaps you didn't notice the mpi_preconnect_all option? It does precisely what you described - it pushes zero-byte messages around a ring to force all the connections open at MPI_Init.
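For example, something like the following should open the full mesh at init time (the rank count and binary here are just placeholders, and the exact parameter spelling can differ between releases -- some versions use mpi_preconnect_mpi instead -- so check ompi_info --param mpi all on your install):

    mpirun --mca mpi_preconnect_all 1 -np 4800 ./IMB-MPI1 -npmin 99999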
On Sep 20, 2011, at 3:06 PM, Henderson, Brent wrote:

> I recently had access to a 200+ node Magny Cours (24 ranks/host) 10G Linux
> cluster. I was able to use OpenMPI v1.5.4 with hello world, IMB and HPCC,
> but there were a couple of issues along the way. After setting some system
> tunables up a little bit on all of the nodes, a hello_world program worked
> just fine – it appears that the TCP connections between most or all of the
> ranks are deferred until they are actually used, so the easy test ran
> reasonably quickly. I then moved to IMB.
>
> I typically don't care about the small rank counts, so I add the -npmin 99999
> option to just run the 'big' number of ranks. This ended with an abort after
> MPI_Init(), but before running any tests. Lots (possibly all) of ranks
> emitted messages that looked like:
>
> '[n112][[13200,1],1858][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 172.23.4.1 failed: Connection timed out (110)'
>
> Where n112 is one of the nodes in the job, and 172.23.4.1 is the first node
> in the job. One of the first things that IMB does before running a test is
> create a communicator for each specific rank count it is testing. Apparently
> this collective operation causes a large number of connections to be made.
> The abort messages (one example shown above) all show the connect failure to
> a single node, so it would appear that a very large number of nodes attempted
> to connect to that one at the same time and overwhelmed it. (Or it was slow
> and everyone ganged up on it as they worked their way around the ring. :) )
> Is there a supported/suggested way to work around this? It was very
> repeatable.
>
> I was able to work around this by providing my own MPI_Init() and
> MPI_Init_thread() that call the 'P' version of the routine, and then having
> each rank send its rank number to the rank one to the right, then two to the
> right, and so on around the ring. I added an MPI_Barrier(MPI_COMM_WORLD)
> call every N messages to keep things at a controlled pace. N was 64 by
> default, but settable via environment variable in case that number didn't
> work well for some reason. This fully connected the mesh (110k socket
> connections per host!) and allowed the tests to run. Not a great solution,
> I know, but I'll throw it out there until I know the right way.
>
> Once I had this in place, I used the workaround with HPCC as well. Without
> it, it would not get very far at all. With it, I was able to make it through
> the entire test.
>
> Looking forward to getting the experts' thoughts about the best way to
> handle big TCP clusters – thanks!
>
> Brent
>
> P.S. v1.5.4 worked *much* better than v1.4.3 on this cluster – not sure why,
> but kudos to those working on changes since then!
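For anyone wanting to reproduce the wrapper Brent describes, a rough sketch might look like the following. This is not his actual code: the use of MPI_Sendrecv, the default batch size, and the PRECONNECT_BATCH environment variable name are all illustrative.

/* Sketch of a PMPI-based MPI_Init wrapper, as described above (illustrative,
 * not the original code).  After the real init, walk the ring so every rank
 * exchanges one small message with every other rank, pausing on a barrier
 * every "batch" steps so no single node is hit by all peers at once. */
#include <mpi.h>
#include <stdlib.h>

static void preconnect_ring(void)
{
    int rank, size, dist;
    int batch = 64;                                   /* default pacing */
    const char *env = getenv("PRECONNECT_BATCH");     /* hypothetical name */
    if (env != NULL && atoi(env) > 0)
        batch = atoi(env);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (dist = 1; dist < size; dist++) {
        int to   = (rank + dist) % size;          /* dist ranks to the right */
        int from = (rank - dist + size) % size;   /* matching sender         */
        int sendval = rank, recvval;

        /* One small exchange in each direction opens the TCP connections. */
        MPI_Sendrecv(&sendval, 1, MPI_INT, to,   0,
                     &recvval, 1, MPI_INT, from, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (dist % batch == 0)
            MPI_Barrier(MPI_COMM_WORLD);          /* keep everyone in step */
    }
}

/* Override the profiling entry points so existing binaries pick this up. */
int MPI_Init(int *argc, char ***argv)
{
    int rc = PMPI_Init(argc, argv);
    if (rc == MPI_SUCCESS)
        preconnect_ring();
    return rc;
}

int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
{
    int rc = PMPI_Init_thread(argc, argv, required, provided);
    if (rc == MPI_SUCCESS)
        preconnect_ring();
    return rc;
}

Linking this into the application (or preloading it as a shared library) lets the profiling interface intercept MPI_Init without changing the benchmark source.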