Hmmm....perhaps you didn't notice the mpi_preconnect_all option? It does 
precisely what you described - it pushes zero-byte messages around a ring to 
force all the connections open at MPI_Init.
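
For example, launching with "mpirun --mca mpi_preconnect_all 1 ..." (or exporting 
OMPI_MCA_mpi_preconnect_all=1 in the environment) should force the connections to 
be set up during MPI_Init; the parameter name can vary a bit between releases, so 
"ompi_info --param mpi all" will show exactly what your build supports.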


On Sep 20, 2011, at 3:06 PM, Henderson, Brent wrote:

> I recently had access to a 200+ node Magny Cours (24 ranks/host) 10G Linux 
> cluster.  I was able to use OpenMPI v1.5.4 with hello world, IMB and HPCC, 
> but there were a couple of issues along the way.  After setting some system 
> tunables up a little bit on all of the nodes, a hello_world program worked 
> just fine – it appears that the TCP connections between most or all of the 
> ranks are deferred until they are actually used so the easy test ran 
> reasonably quickly.  I then moved to IMB. 
>  
> I typically don’t care about the small rank counts, so I add the -npmin 99999 
> option to just run the ‘big’ number of ranks.  This ended with an abort after 
> MPI_Init(), but before running any tests.  Lots (possibly all) of ranks 
> emitted messages that looked like:
>  
>     
>     [n112][[13200,1],1858][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 172.23.4.1 failed: Connection timed out (110)
>  
> Where n112 is one of the nodes in the job, and 172.23.4.1 is the first node 
> in the job.  One of the first things that IMB does before running a test is 
> create a communicator for each specific rank count it is testing.  Apparently 
> this collective operation causes a large number of connections to be made.  
> The abort messages (one example shown above) all show the connect failure to 
> a single node, so it would appear that a very large number of nodes attempted 
> to connect to that one at the same time and overwhelmed it.  (Or it was slow 
> and everyone ganged up on it as they worked their way around the ring. :-) )  
> Is there a supported/suggested way to work around this?  It was very 
> repeatable.
>  
> I was able to work around this by providing my own definitions of MPI_Init() 
> and MPI_Init_thread() that call the ‘P’ (PMPI) version of the routine and 
> then have each rank send its rank number to the rank one to the right, then 
> two to the right, and so on around the ring.  I added an MPI_Barrier( 
> MPI_COMM_WORLD ) call every N messages to keep things at a controlled pace.  
> N was 64 by default, but settable via an environment variable in case that 
> number didn’t work well for some reason.  This fully connected the mesh (110k 
> socket connections per host!) and allowed the tests to run.  Not a great 
> solution, I know, but I’ll throw it out there until I know the right way.
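> 
> In case it helps, the wrapper was roughly along the lines of the sketch 
> below (a simplified sketch, not the exact code; MPI_Sendrecv is used here 
> just to keep the ring exchange deadlock-free, and the PRECONNECT_BATCH 
> variable name is only illustrative):
> 
>     #include <mpi.h>
>     #include <stdlib.h>
> 
>     /* Ring pre-connect: each rank exchanges its rank number with the
>      * ranks 1, 2, ..., size-1 hops away, pausing at a barrier every
>      * 'batch' hops so the whole job stays roughly in step. */
>     static void preconnect_ring(void)
>     {
>         int rank, size, hop;
>         int batch = 64;                               /* default pace */
>         const char *env = getenv("PRECONNECT_BATCH"); /* illustrative name */
> 
>         if (env != NULL && atoi(env) > 0) {
>             batch = atoi(env);
>         }
> 
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>         for (hop = 1; hop < size; hop++) {
>             int to = (rank + hop) % size;
>             int from = (rank - hop + size) % size;
>             int sendbuf = rank, recvbuf;
> 
>             /* Sendrecv opens the connection in both directions without
>              * risking a deadlock on the matching receives. */
>             MPI_Sendrecv(&sendbuf, 1, MPI_INT, to, 0,
>                          &recvbuf, 1, MPI_INT, from, 0,
>                          MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> 
>             if (hop % batch == 0) {
>                 MPI_Barrier(MPI_COMM_WORLD);
>             }
>         }
>     }
> 
>     int MPI_Init(int *argc, char ***argv)
>     {
>         int rc = PMPI_Init(argc, argv);
>         if (rc == MPI_SUCCESS) {
>             preconnect_ring();
>         }
>         return rc;
>     }
> 
>     int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
>     {
>         int rc = PMPI_Init_thread(argc, argv, required, provided);
>         if (rc == MPI_SUCCESS) {
>             preconnect_ring();
>         }
>         return rc;
>     }
> 
> Linked (or LD_PRELOADed) ahead of the application, it runs the ring exchange 
> right after the real init returns, before any test traffic starts.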
>  
> Once I had this in place, I used the workaround with HPCC as well.  Without 
> it, it would not get very far at all.  With it, I was able to make it through 
> the entire test.
>  
> Looking forward to getting the experts’ thoughts about the best way to handle 
> big TCP clusters – thanks!
>  
> Brent
>  
> P.S.  v1.5.4 worked *much* better than v1.4.3 on this cluster – not sure why, 
> but kudos to those working on changes since then!
>  
