Hi Peter,

We have HP ProCurve 2848 GigE switches here (48 ports). The problem becomes
more severe the more nodes (= ports) are involved: it starts to show up at
16 ports for a limited range of message sizes and gets really bad for 32
nodes. The switch has a 96 Gbit/s backplane and should therefore be able to
forward the in- and outbound traffic of all 48 ports simultaneously, as long
as no two nodes send to the same receiver at the same time. The ordered
communication pattern takes care of the latter (e.g. by having only disjoint
pairs communicate at any given moment). Maybe the switch runs into problems
when switching from one set of pairs to the next? I will see whether I can
get another switch for testing.
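To make the pattern concrete, here is a minimal sketch of such an ordered,
pairwise exchange (illustration only, not the benchmark code itself; it
assumes a power-of-two number of ranks, and ordered_alltoall is just a
hypothetical name):

/* Ordered (pairwise-exchange) all-to-all: in every step each rank
 * communicates with exactly one partner, so no switch output port
 * ever receives traffic from two senders at the same time.
 * Assumes nprocs is a power of two (XOR pairing). */
#include <mpi.h>
#include <stddef.h>

void ordered_alltoall(const float *sendbuf, float *recvbuf, int count,
                      MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    for (int step = 0; step < nprocs; step++) {
        int partner = rank ^ step;  /* partner's partner is me => disjoint pairs */
        /* step 0 is the local copy (send/receive with self) */
        MPI_Sendrecv(sendbuf + (size_t)partner * count, count, MPI_FLOAT,
                     partner, 0,
                     recvbuf + (size_t)partner * count, count, MPI_FLOAT,
                     partner, 0, comm, MPI_STATUS_IGNORE);
    }
}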
Thanks!
  Carsten

On Wed, 4 Jan 2006, Peter Kjellström wrote:

> Hello Carsten,
>
> Have you considered the possibility that this is the effect of a non-optimal
> ethernet switch? I don't know how many nodes you need to reproduce it on or
> if you even have physical access (and opportunity) but popping in another
> decent 16-port switch for a testrun might be interesting.
>
> just my .02 euros,
>  Peter
>
> On Tuesday 03 January 2006 18:45, Carsten Kutzner wrote:
> > On Tue, 3 Jan 2006, Graham E Fagg wrote:
> > > Do you have any tools such as Vampir (or its Intel equivalent) available
> > > to get a time line graph? (Even a jumpshot of one of the bad cases, such
> > > as the 128/32 for 256 floats below, would help.)
> >
> > Hi Graham,
> >
> > I have attached an slog file of an all-to-all run for 1024 floats (ompi
> > tuned alltoall). I could not get clog files for >32 processes - is this
> > perhaps a limitation of MPE? So I decided to take the case 32 CPUs on
> > 32 nodes which is performance-critical as well. From the run output you
> > can see that 2 of the 5 tries yield a fast execution while the others
> > are slow (see below).
> >
> > Carsten
> >
> >
> > ckutzne@node001:~/mpe> mpirun -hostfile ./bhost1 -np 32 ./phas_mpe.x
> > Alltoall Test on 32 CPUs. 5 repetitions.
> > --- New category (first test not counted) ---
> > MPI: sending 1024 floats (4096 bytes) to 32 processes (1 times) took ... 0.00690 seconds
> > ---------------------------------------------
> > MPI: sending 1024 floats (4096 bytes) to 32 processes (1 times) took ... 0.00320 seconds
> > MPI: sending 1024 floats (4096 bytes) to 32 processes (1 times) took ... 0.26392 seconds !
> > MPI: sending 1024 floats (4096 bytes) to 32 processes (1 times) took ... 0.26868 seconds !
> > MPI: sending 1024 floats (4096 bytes) to 32 processes (1 times) took ... 0.26398 seconds !
> > MPI: sending 1024 floats (4096 bytes) to 32 processes (1 times) took ... 0.00339 seconds
> > Summary (5-run average, timer resolution 0.000001):
> > 1024 floats took 0.160632 (0.143644) seconds. Min: 0.003200 max: 0.268681
> > Writing logfile....
> > Finished writing logfile.
>
> --
> ------------------------------------------------------------
>  Peter Kjellström               |
>  National Supercomputer Centre  |
>  Sweden                         | http://www.nsc.liu.se

---------------------------------------------------
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
eMail ckut...@gwdg.de
http://www.gwdg.de/~ckutzne
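P.S. In case it helps to see what the numbers above actually measure, here is
a minimal, hypothetical timing harness in the spirit of the quoted run output
(it is not the actual phas_mpe.x benchmark; the buffer size, repetition count
and output format are illustrative only):

/* Time a repeated MPI_Alltoall of `count` floats per destination,
 * discard the first (warm-up) round and report per-round times plus
 * min/max/average, similar to the run output quoted above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 1024;   /* floats sent to every other rank */
    const int reps  = 5;      /* timed repetitions after one warm-up */
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    float *sendbuf = calloc((size_t)count * nprocs, sizeof *sendbuf);
    float *recvbuf = calloc((size_t)count * nprocs, sizeof *recvbuf);

    double sum = 0.0, tmin = 1e30, tmax = 0.0;
    for (int i = 0; i <= reps; i++) {          /* i == 0 is the warm-up */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Alltoall(sendbuf, count, MPI_FLOAT,
                     recvbuf, count, MPI_FLOAT, MPI_COMM_WORLD);
        double t = MPI_Wtime() - t0;
        if (i == 0) continue;                  /* first test not counted */
        sum += t;
        if (t < tmin) tmin = t;
        if (t > tmax) tmax = t;
        if (rank == 0)
            printf("MPI: sending %d floats to %d processes took %.5f s\n",
                   count, nprocs, t);
    }
    if (rank == 0)
        printf("Summary: avg %.6f  min %.6f  max %.6f seconds\n",
               sum / reps, tmin, tmax);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}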