Hello Carsten, happy New Year to you too.

On Tue, 3 Jan 2006, Carsten Kutzner wrote:

Hi Graham,

Sorry for the long delay, I was on Christmas holidays. I wish you a Happy New
Year!


(Uh, I think the previous email did not arrive in my inbox?) But yes,

I am resending it after this reply.

Also the OMPI tuned all-to-all shows this strange performance behaviour
(i.e. sometimes it's fast, sometimes it's delayed for 0.2 or more
seconds). For message sizes where the delays occur, I am sometimes able to
do better with an alternative all-to-all routine. It sets up the same
communication pattern as the pair-based sendrecv all-to-all, but on the
basis of the nodes rather than the individual CPUs. The core looks like this:

So it's equivalent to a batch-style operation: each CPU does procs_pn*2 operations per step and there are just nnodes steps. (It's the same communication pattern as before, CPU-by-CPU pairwise, except that the final sync is the waitall on the 'set' of posted receives)?


  /* loop over nodes */
  for (i=0; i<nnodes; i++)
  {
    destnode   = (         nodeid + i) % nnodes;  /* send to destination node */
    sourcenode = (nnodes + nodeid - i) % nnodes;  /* receive from source node */
    /* loop over CPUs on each node */
    for (j=0; j<procs_pn; j++)  /* 1 or more processors per node */
    {
      sourcecpu = sourcenode*procs_pn + j; /* source of data */
      destcpu   = destnode  *procs_pn + j; /* destination of data */
      MPI_Irecv(recvbuf + sourcecpu*recvcount, recvcount, recvtype, sourcecpu, 0,
                comm, &recvrequests[j]);
      MPI_Isend(sendbuf + destcpu  *sendcount, sendcount, sendtype, destcpu,   0,
                comm, &sendrequests[j]);
    }
    MPI_Waitall(procs_pn,sendrequests,sendstatuses);
    MPI_Waitall(procs_pn,recvrequests,recvstatuses);
  }

Is it possible to put the send and recv request handles in the same array and then do a waitall on them in a single op? It shouldn't make too much difference, as the recvs are all posted (I hope) before the waitall takes effect, but it would be interesting to see if internally there is an effect from combining them.
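For illustration, the inner loop with a single combined completion could look like the sketch below. The arrays allrequests/allstatuses (of size 2*procs_pn) are my own names, not something in Carsten's code:

    /* loop over CPUs on each node, posting both directions into one array */
    for (j=0; j<procs_pn; j++)
    {
      sourcecpu = sourcenode*procs_pn + j;
      destcpu   = destnode  *procs_pn + j;
      MPI_Irecv(recvbuf + sourcecpu*recvcount, recvcount, recvtype, sourcecpu, 0,
                comm, &allrequests[2*j]);
      MPI_Isend(sendbuf + destcpu  *sendcount, sendcount, sendtype, destcpu,   0,
                comm, &allrequests[2*j+1]);
    }
    /* one completion call for all 2*procs_pn outstanding sends and recvs */
    MPI_Waitall(2*procs_pn, allrequests, allstatuses);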

I tested for message sizes of 4, 8, 16, 32, ... 131072 bytes to be sent
from each CPU to every other, and for 4, 8, 16, 24 and 32 nodes (each
node has 1, 2 or 4 CPUs). While in general the OMPI all-to-all performs
better, the alternative one is faster for the following message sizes:

4-CPU nodes:
  128 CPUs on 32 nodes:  512, 1024 bytes
   96 CPUs on 24 nodes:  512, 1024, 2048, 4096, 16384 bytes
   64 CPUs on 16 nodes:  4096 bytes

2-CPU nodes:
   64 CPUs on 32 nodes:  1024, 2048, 4096, 8192 bytes
   48 CPUs on 24 nodes:  2048, 4096, 8192, 131072 bytes

1-CPU nodes:
   32 CPUs on 32 nodes:  4096, 8192, 16384 bytes
   24 CPUs on 24 nodes:  8192, 16384, 32768, 65536, 131072 bytes

Except for the 128K message on 48 CPUs / 24 nodes there appears to be some well-defined pattern here. It looks more like a buffering side effect than contention: if it were pure contention, then at larger message sizes the 128 CPUs / 32 nodes case would be putting more stress on the switch (more pairs communicating and larger data per pair means a higher chance of contention), yet there the delays show up only at the small sizes.

Do you have any tools such as Vampir (or its Intel equivalent) available to get a timeline graph? (Even a Jumpshot trace of one of the bad cases, such as 128/32 at 256 floats below, would help.)

(GEORGE, can you run a GigE test for 32 nodes using slog etc. and send me the data?)

Here is an example measurement for 128 CPUs on 32 nodes, averages taken
over 25 runs, not counting the first one (a sketch of such a timing loop
follows the tables below). Performance problems are marked with a (!):

OMPI tuned all-to-all:
======================
        msg size   ------- time in seconds --------
#CPUs   (floats)   average   std.dev.    min.      max.
128           1  0.001288  0.000102    0.001077  0.001512
128           2  0.008391  0.000400    0.007861  0.009958
128           4  0.008403  0.000237    0.008095  0.009018
128           8  0.008228  0.000942    0.003801  0.008810
128          16  0.008503  0.000191    0.008233  0.008839
128          32  0.008656  0.000271    0.008084  0.009177
128          64  0.009085  0.000209    0.008757  0.009603
128         128  0.251414  0.073069    0.011547  0.506703 !
128         256  0.385515  0.127661    0.251431  0.578955 !
128         512  0.035111  0.000872    0.033358  0.036262
128        1024  0.046028  0.002116    0.043381  0.052602
128        2048  0.073392  0.007745    0.066432  0.104531
128        4096  0.165052  0.072889    0.124589  0.404213
128        8192  0.341377  0.041815    0.309457  0.530409
128       16384  0.507200  0.050872    0.492307  0.750956
128       32768  1.050291  0.132867    0.954496  1.344978
128       65536  2.213977  0.154987    1.962907  2.492560
128      131072  4.026107  0.147103    3.800191  4.336205

alternative all-to-all (same columns as above):
===============================================
128           1  0.012584  0.000724    0.011073  0.015331
128           2  0.012506  0.000444    0.011707  0.013461
128           4  0.012412  0.000511    0.011157  0.013413
128           8  0.012488  0.000455    0.011767  0.013746
128          16  0.012664  0.000416    0.011745  0.013362
128          32  0.012878  0.000410    0.012157  0.013609
128          64  0.013138  0.000417    0.012452  0.013826
128         128  0.014016  0.000505    0.013195  0.014942 +
128         256  0.015843  0.000521    0.015107  0.016725 +
128         512  0.052240  0.079323    0.027019  0.320653 !
128        1024  0.123884  0.121560    0.038062  0.308929 !
128        2048  0.176877  0.125229    0.074457  0.387276 !
128        4096  0.305030  0.121716    0.176640  0.496375 !
128        8192  0.546405  0.108007    0.415272  0.899858 !
128       16384  0.604844  0.056576    0.558657  0.843943 !
128       32768  1.235298  0.097969    1.094720  1.451241 !
128       65536  2.926902  0.312733    2.458742  3.895563 !
128      131072  6.208087  0.472115    5.354304  7.317153 !

The alternative all-to-all has the same performance problems, but they set
in later ... and last longer ;(  The results for the other cases look
similar.
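For reference, the kind of timing loop described above (one warm-up run discarded, 25 measured repetitions, then average/std.dev./min/max) can be sketched roughly as below. The name alltoall_under_test is a placeholder of mine for whichever implementation is being measured, and the barrier-plus-max-reduction scheme is an assumption, not necessarily what Carsten's benchmark does:

  #include <mpi.h>
  #include <math.h>
  #include <stdio.h>

  #define NREPS 26   /* 1 warm-up run + 25 measured runs */

  /* assumed prototype of the routine being benchmarked */
  void alltoall_under_test(void *sendbuf, void *recvbuf, int count,
                           MPI_Datatype type, MPI_Comm comm);

  void time_alltoall(void *sendbuf, void *recvbuf, int count, MPI_Comm comm)
  {
    double t[NREPS];
    int i, rank;
    MPI_Comm_rank(comm, &rank);

    for (i = 0; i < NREPS; i++)
    {
      MPI_Barrier(comm);                       /* start all ranks together */
      double start   = MPI_Wtime();
      alltoall_under_test(sendbuf, recvbuf, count, MPI_FLOAT, comm);
      double elapsed = MPI_Wtime() - start;
      /* the collective is only finished when the slowest rank finishes */
      MPI_Reduce(&elapsed, &t[i], 1, MPI_DOUBLE, MPI_MAX, 0, comm);
    }

    if (rank == 0)
    {
      double sum = 0.0, sumsq = 0.0, tmin = t[1], tmax = t[1];
      for (i = 1; i < NREPS; i++)              /* skip warm-up run t[0] */
      {
        sum   += t[i];
        sumsq += t[i]*t[i];
        if (t[i] < tmin) tmin = t[i];
        if (t[i] > tmax) tmax = t[i];
      }
      double n   = (double)(NREPS - 1);
      double avg = sum/n;
      double sd  = sqrt(sumsq/n - avg*avg);
      printf("%f  %f  %f  %f\n", avg, sd, tmin, tmax);
    }
  }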

Two things we can do right now: add a new alltoall like yours (adding yours to the code would require legal paperwork, 3rd-party stuff) and correct the decision function. But really we just need to find out what is causing this, as the current tuned collective alltoall looks faster when this effect is not occurring anyway. It could be anything from a hardware/configuration issue to a problem in the BTLs/PTLs.
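To illustrate what "correct the decision function" means: the tuned component picks an algorithm from the message and communicator sizes, roughly in the shape below. The names and threshold values here are invented for discussion, not the actual Open MPI code; the real cut-offs are what would need re-tuning.

  /* Illustrative shape of an alltoall decision function; the cut-offs
   * below are made up, not the ones in the tuned component. */
  enum alltoall_alg { ALG_BRUCK, ALG_PAIRWISE, ALG_LINEAR };

  static enum alltoall_alg choose_alltoall(size_t msg_bytes, int comm_size)
  {
    if (msg_bytes <= 256)
      return ALG_BRUCK;       /* small messages: log(p)-step algorithm */
    if (msg_bytes <= 32768 && comm_size >= 64)
      return ALG_PAIRWISE;    /* medium messages on large communicators */
    return ALG_LINEAR;        /* large messages: simple linear exchange */
  }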

I am currently visiting HLRS/Stuttgart, so I will try and call you in an hour or so; if you're leaving soon, I can call you tomorrow morning.

Thanks,
        Graham.
----------------------------------------------------------------------
Dr Graham E. Fagg       | Distributed, Parallel and Meta-Computing
Innovative Computing Lab. PVM3.4, HARNESS, FT-MPI, SNIPE & Open MPI
Computer Science Dept   | Suite 203, 1122 Volunteer Blvd,
University of Tennessee | Knoxville, Tennessee, USA. TN 37996-3450
Email: f...@cs.utk.edu  | Phone:+1(865)974-5790 | Fax:+1(865)974-8296
Broken complex systems are always derived from working simple systems
----------------------------------------------------------------------
