Jeff

Thanks for the reply; I realize you guys must be really busy with the recent
release of Open MPI. I tried 1.1 and I don't get error messages any more, but
the code now hangs: no error, no exit. So I am not sure if this is the same
issue or something else. I am enclosing my source code. I compiled with icc
and linked against an icc-compiled version of openmpi-1.1.

My program is a set of network benchmarks (a crude kind of netpipe) that
checks typical message passing patterns in my application codes. 
Typical output is:

 32 CPU's: sync call time = 1003.0        time                              rate (Mbytes/s)                     bandwidth (MBits/s)
     loop   buffers  size     XC       XE       GS       MS         XC       XE       GS       MS         XC       XE       GS       MS
        1       64    16384  2.48e-02 1.99e-02 1.21e+00 3.88e-02   4.23e+01 5.28e+01 8.65e-01 2.70e+01   1.08e+04 1.35e+04 4.43e+02 1.38e+04
        2       64    16384  2.17e-02 2.09e-02 1.21e+00 4.10e-02   4.82e+01 5.02e+01 8.65e-01 2.56e+01   1.23e+04 1.29e+04 4.43e+02 1.31e+04
        3       64    16384  2.20e-02 1.99e-02 1.01e+00 3.95e-02   4.77e+01 5.27e+01 1.04e+00 2.65e+01   1.22e+04 1.35e+04 5.33e+02 1.36e+04
        4       64    16384  2.16e-02 1.96e-02 1.25e+00 4.00e-02   4.85e+01 5.36e+01 8.37e-01 2.62e+01   1.24e+04 1.37e+04 4.28e+02 1.34e+04
        5       64    16384  2.25e-02 2.00e-02 1.25e+00 4.07e-02   4.66e+01 5.24e+01 8.39e-01 2.57e+01   1.19e+04 1.34e+04 4.30e+02 1.32e+04
        6       64    16384  2.19e-02 1.99e-02 1.29e+00 4.05e-02   4.79e+01 5.28e+01 8.14e-01 2.59e+01   1.23e+04 1.35e+04 4.17e+02 1.33e+04
        7       64    16384  2.19e-02 2.06e-02 1.25e+00 4.03e-02   4.79e+01 5.09e+01 8.38e-01 2.60e+01   1.23e+04 1.30e+04 4.29e+02 1.33e+04
        8       64    16384  2.24e-02 2.06e-02 1.25e+00 4.01e-02   4.69e+01 5.09e+01 8.39e-01 2.62e+01   1.20e+04 1.30e+04 4.30e+02 1.34e+04
        9       64    16384  4.29e-01 2.01e-02 6.35e-01 3.98e-02   2.45e+00 5.22e+01 1.65e+00 2.64e+01   6.26e+02 1.34e+04 8.46e+02 1.35e+04
       10       64    16384  2.16e-02 2.06e-02 8.87e-01 4.00e-02   4.85e+01 5.09e+01 1.18e+00 2.62e+01   1.24e+04 1.30e+04 6.05e+02 1.34e+04

Time is total for all 64 buffers. Rate is one way across one link (# of
bytes/time).
1) XC is a bidirectional ring exchange. Each processor sends to the right
and receives from the left.
2) XE is an edge exchange. Pairs of nodes exchange data, with each one
sending and receiving.
3) GS is MPI_Allreduce.
4) MS is my version of MPI_Allreduce (a rough sketch follows below). It splits
the vector into Np blocks (Np is the # of processors); each processor then acts
as the head node for one block. This uses the full bandwidth all the time,
unlike Allreduce, which thins out as it gets to the top of the binary tree. On
a 64-node Infiniband system MS is about 5X faster than GS; in theory it would
be 6X, i.e. log_2(64).
Here it is 25X; not sure why so much. But MS seems to be the cause of the
hangups with messages > 64K. I can run the other benchmarks OK, but this one
seems to hang for large messages. I think the problem is at least partly due
to the switch. All MS is doing is point-to-point communications, but
unfortunately it sometimes requires a high bandwidth between ASICs. At
first it exchanges data between near neighbors in MPI_COMM_WORLD, but it
must progressively span wider gaps between nodes as it goes up the various
binary trees. After a while this requires extensive traffic between ASICs.
This seems to be a problem on both my HP 2724 and the Extreme Networks
Summit400t-48. I am currently working with Extreme to try to resolve the
switch issue. As I say, the code ran great on Infiniband, but I think those
switches have hardware flow control. Finally, I checked the code again under
LAM and it ran OK: slow, but no hangs.
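
In outline, XC and MS look something like the code below. This is not the
actual code in src.tgz (that builds the MS trees from explicit point-to-point
sends and handles arbitrary vector lengths); the function names are just
shorthand, and the MS sketch uses collectives only to keep it short.

#include <mpi.h>
#include <stdlib.h>

/* XC in outline: ring exchange -- each processor sends to the right
   and receives from the left. */
void xc_ring_sketch(double *sendbuf, double *recvbuf, int count, MPI_Comm comm)
{
    int rank, np;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &np);

    int right = (rank + 1) % np;
    int left  = (rank - 1 + np) % np;

    MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, right, 0,
                 recvbuf, count, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);
}

/* MS in outline: split the vector into Np blocks and let each processor
   act as the head node for one block, so all Np reductions run at once
   instead of funnelling through a single binary tree.  Assumes count is
   divisible by Np; the real code uses point-to-point messages instead of
   the collectives shown here. */
void ms_allreduce_sketch(double *vec, int count, MPI_Comm comm)
{
    int np;
    MPI_Comm_size(comm, &np);

    int block = count / np;
    double *tmp = malloc(block * sizeof(double));

    for (int r = 0; r < np; r++)          /* processor r heads block r */
        MPI_Reduce(vec + r * block, tmp, block, MPI_DOUBLE, MPI_SUM, r, comm);

    /* every processor now holds the reduced values for its own block;
       gather the blocks so everyone ends up with the full result */
    MPI_Allgather(tmp, block, MPI_DOUBLE, vec, block, MPI_DOUBLE, comm);

    free(tmp);
}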

To run the code, compile and type:
mpirun -np 32 -machinefile hosts src/netbench 8
The 8 means 2^8 bytes (i.e. 256K). This was enough to hang every time on my
boxes.

You can also edit the header file (header.h). MAX_LOOPS is how many times it
runs each test (currently 10); NUM_BUF is the number of buffers in each test
(must be more than the number of processors); SYNC defines the global sync
frequency (every SYNC buffers); NUM_SYNC is the number of sequential barrier
calls it uses to determine the mean barrier call time. You can also switch
the various tests on and off, which can be useful for debugging.
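
These settings are just compile-time constants in header.h, roughly along the
lines shown below (the copy in src.tgz is the authoritative one; the SYNC and
NUM_SYNC values here are placeholders, not the shipped defaults):

/* header.h (illustrative excerpt) */
#define MAX_LOOPS 10    /* how many times each test is run                    */
#define NUM_BUF   64    /* buffers per test; must exceed the # of processors  */
#define SYNC      8     /* global sync every SYNC buffers (placeholder value) */
#define NUM_SYNC  100   /* barrier calls used to time a barrier (placeholder) */
/* plus switches to turn the individual tests (XC, XE, GS, MS) on and off */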

Tony

-------------------------------
Tony Ladd
Professor, Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005

Tel: 352-392-6509
FAX: 352-392-9513
Email: tl...@che.ufl.edu
Web: http://ladd.che.ufl.edu 

Attachment: src.tgz
Description: application/compressed
