The data there would not have helped me too much, I'm afraid.  I'm used to 
working with big IB clusters, but I needed help with the TCP side of the house.

I needed things like the 'mpi_preconnect_all' flag suggestion, sysctl settings 
for the TCP stack, file descriptor limits for the user and the system, how to 
enable jumbo frames, using ethtool to change various options to see what worked 
best (like interrupt coalescing, various timeouts, ...), and how to bind the 
interrupt handlers to cores for the most effective processing of requests from 
the NIC.  The card vendors themselves have documentation on some of these, but 
it is not always easy to find.
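To make that concrete, the kinds of commands I experimented with look roughly 
like the following; every value is only illustrative (and eth2 is just an 
example interface name), so treat this as a sketch rather than a recipe:

    # all values are illustrative -- tune per NIC, kernel, and workload
    sysctl -w net.core.rmem_max=16777216               # max TCP receive buffer
    sysctl -w net.core.wmem_max=16777216               # max TCP send buffer
    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"  # min/default/max receive
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"  # min/default/max send
    sysctl -w net.core.netdev_max_backlog=30000        # queue for incoming packets
    ulimit -n 65536                             # per-process file descriptor limit
    ip link set dev eth2 mtu 9000               # jumbo frames
    ethtool -C eth2 rx-usecs 100                # interrupt coalescing
    echo 2 > /proc/irq/<NIC irq>/smp_affinity   # pin the NIC's interrupts to a core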

I think the key here is to determine which (if any!) of the things I was 
looking for can live in a general FAQ entry.  :)  If you do come up with some 
updates, I would certainly review them for you!

Brent


From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Tuesday, September 20, 2011 6:55 PM
To: Open MPI Users
Subject: Re: [OMPI users] Large TCP cluster timeout issue

Truly am sorry about that - we were just talking today about the need to update 
and improve our FAQ on running on large clusters. Did you by any chance look at 
it? Would appreciate any thoughts on how it should be improved from a user's 
perspective.



On Sep 20, 2011, at 3:28 PM, Henderson, Brent wrote:


Nope - and if I had, it would have saved me about an hour of coding time!

I'm still curious whether it would be beneficial to inject some barriers at 
certain points so that if you had a slow node, not everyone would end up 
connecting to it all at once.  Anyway, if I get access to another large TCP 
cluster, I'll give it a try.

Thanks,

brent

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Tuesday, September 20, 2011 4:15 PM
To: Open MPI Users
Subject: Re: [OMPI users] Large TCP cluster timeout issue

Hmmm....perhaps you didn't notice the mpi_preconnect_all option? It does 
precisely what you described - it pushes zero-byte messages around a ring to 
force all the connections open at MPI_Init.
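In other words, something along these lines on the mpirun command line (the 
process count and executable name are just placeholders), or equivalently by 
exporting OMPI_MCA_mpi_preconnect_all=1 in the environment:

    mpirun -np <nprocs> -mca mpi_preconnect_all 1 ./your_app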


On Sep 20, 2011, at 3:06 PM, Henderson, Brent wrote:



I recently had access to a 200+ node Magny-Cours (24 ranks/host) 10G Linux 
cluster.  I was able to run hello world, IMB, and HPCC with Open MPI v1.5.4, but 
there were a couple of issues along the way.  After bumping up some system 
tunables on all of the nodes, a hello_world program worked just fine - it 
appears that the TCP connections between most or all of the ranks are deferred 
until they are actually used, so the easy test ran reasonably quickly.  I then 
moved on to IMB.

I typically don't care about the small rank counts, so I add the -npmin 99999 
option to run just the 'big' rank count.  This ended with an abort after 
MPI_Init(), but before running any tests.  Many (possibly all) of the ranks 
emitted messages that looked like:

    '[n112][[13200,1],1858][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 172.23.4.1 failed: Connection timed out (110)'

Here n112 is one of the nodes in the job, and 172.23.4.1 is the first node in 
the job.  One of the first things IMB does before running a test is create a 
communicator for each specific rank count it is testing.  Apparently this 
collective operation causes a large number of connections to be made.  The 
abort messages (one example shown above) all show the connect failure going to 
a single node, so it would appear that a very large number of nodes attempted 
to connect to that one at the same time and overwhelmed it.  (Or it was slow 
and everyone ganged up on it as they worked their way around the ring.  :)  Is 
there a supported/suggested way to work around this?  It was very repeatable.
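(For what it's worth, the kernel-side knobs that look most relevant to this 
particular symptom are the listen backlogs on the node being flooded and the 
SYN retry limit on the nodes connecting to it - something along these lines, 
with purely illustrative values:

    sysctl -w net.core.somaxconn=4096            # cap on the listen() accept queue
    sysctl -w net.ipv4.tcp_max_syn_backlog=8192  # pending-SYN queue depth
    sysctl -w net.ipv4.tcp_syn_retries=8         # connecting side: retry SYNs longer before ETIMEDOUT

but I'd still like to know whether there is a supported way to pace the 
connections from the Open MPI side.)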

I was able to work around this by providing my own definitions of MPI_Init() 
and MPI_Init_thread() that call the 'P' (PMPI) version of the routine and then 
have each rank send its rank number to the rank one to the right, then two to 
the right, and so on around the ring.  I added an MPI_Barrier(MPI_COMM_WORLD) 
call every N messages to keep things at a controlled pace.  N was 64 by 
default, but settable via an environment variable in case that number didn't 
work well for some reason.  This fully connected the mesh (110k socket 
connections per host!) and allowed the tests to run.  Not a great solution, I 
know, but I'll throw it out there until I know the right way.
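In case it is useful to anyone else, a stripped-down sketch of that wrapper is 
below.  This is not the exact code I ran - the environment variable name 
(PRECONNECT_STRIDE) is made up for illustration, and I've paired the sends and 
receives with MPI_Sendrecv - but it shows the shape of the workaround:

    /* Sketch of the MPI_Init/MPI_Init_thread wrapper described above.
     * It uses the PMPI profiling interface: the real init runs first,
     * then every rank exchanges one small message with every other rank,
     * walking around the ring and pausing at a barrier every 'stride'
     * steps so that no single node is hit by all connections at once. */
    #include <mpi.h>
    #include <stdlib.h>

    static void preconnect_ring(void)
    {
        int rank, size, d;
        int stride = 64;                                /* barrier every N messages */
        const char *env = getenv("PRECONNECT_STRIDE");  /* name is illustrative */
        if (env != NULL && atoi(env) > 0) {
            stride = atoi(env);
        }

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (d = 1; d < size; d++) {
            int to   = (rank + d) % size;               /* d hops to the right */
            int from = (rank - d + size) % size;        /* d hops to the left  */
            int sendval = rank, recvval;

            /* One tiny exchange per peer forces the TCP connection open. */
            MPI_Sendrecv(&sendval, 1, MPI_INT, to,   0,
                         &recvval, 1, MPI_INT, from, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            if (d % stride == 0) {
                MPI_Barrier(MPI_COMM_WORLD);            /* pace the connection storm */
            }
        }
    }

    int MPI_Init(int *argc, char ***argv)
    {
        int rc = PMPI_Init(argc, argv);
        if (rc == MPI_SUCCESS) preconnect_ring();
        return rc;
    }

    int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
    {
        int rc = PMPI_Init_thread(argc, argv, required, provided);
        if (rc == MPI_SUCCESS) preconnect_ring();
        return rc;
    }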

Once I had this in place, I used the workaround with HPCC as well.  Without it, 
it would not get very far at all.  With it, I was able to make it through the 
entire test.

Looking forward to getting the experts' thoughts about the best way to handle 
big TCP clusters - thanks!

Brent

P.S.  v1.5.4 worked *much* better than v1.4.3 on this cluster - not sure why, 
but kudos to those working on changes since then!

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
