I recently had access to a 200+ node Magny Cours (24 ranks/host) 10G Linux
cluster. I was able to use OpenMPI v1.5.4 with hello world, IMB and HPCC, but
there were a couple of issues along the way. After raising some system
tunables a little on all of the nodes, a hello_world program worked just
fine. It appears that the TCP connections between most or all of the ranks are
deferred until they are actually used, so the easy test ran reasonably quickly.
I then moved to IMB.
I typically don't care about the small rank counts, so I add the -npmin 99999
option to run only the 'big' number of ranks. This ended with an abort after
MPI_Init(), but before running any tests. Many (possibly all) of the ranks
emitted messages that looked like:
'[n112][[13200,1],1858][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 172.23.4.1 failed: Connection timed out (110)'
Here n112 is one of the nodes in the job, and 172.23.4.1 is the first node in
the job. One of the first things that IMB does before running a test is create
a communicator for each specific rank count it is testing. Apparently this
collective operation causes a large number of connections to be made. The
abort messages (one example shown above) all show the connect failure to a
single node, so it would appear that a very large number of nodes attempted to
connect to that one at the same time and overwhelmed it. (Or it was slow and
everyone ganged up on it as they worked their way around the ring. :) Is
there a supported/suggested way to work around this? It was very repeatable.
I was able to work around this by supplying my own definitions of MPI_Init()
and MPI_Init_thread() that call the 'P' version of the routine (PMPI_Init()
and PMPI_Init_thread()), and then having each rank send its rank number to the
rank one to the right, then two to the right, and so on around the ring. I
added an MPI_Barrier( MPI_COMM_WORLD ) call every N messages to keep things at
a controlled pace. N was 64 by
default, but settable via environment variable in case that number didn't work
well for some reason. This fully connected the mesh (110k socket connections
per host!) and allowed the tests to run. Not a great solution, I know, but
I'll throw it out there until I know the right way.
Once I had this in place, I used the workaround with HPCC as well. Without it,
it would not get very far at all. With it, I was able to make it through the
entire test.
Looking forward to getting the experts' thoughts on the best way to handle
big TCP clusters - thanks!
Brent
P.S. v1.5.4 worked *much* better than v1.4.3 on this cluster - not sure why,
but kudos to those working on changes since then!