Tony -- My apologies for taking so long to answer. :-(
I was unfortunately unable to replicate your problem. I ran your source code
across 32 machines connected by TCP with no problem:

  mpirun --hostfile ~/mpi/cdc -np 32 -mca btl tcp,self netbench 8

I tried this on two different clusters with the same results -- it didn't
hang. :-(

Can you try again with a recent nightly tarball, or the 1.1.1 beta tarball
that has been posted?

  http://www.open-mpi.org/software/ompi/v1.1/


On 6/30/06 8:35 AM, "Tony Ladd" <l...@che.ufl.edu> wrote:

> Jeff
>
> Thanks for the reply; I realize you guys must be really busy with the
> recent release of Open MPI. I tried 1.1 and I don't get error messages
> any more, but the code now hangs; no error or exit. So I am not sure if
> this is the same issue or something else. I am enclosing my source code.
> I compiled with icc and linked against an icc-compiled version of
> openmpi-1.1.
>
> My program is a set of network benchmarks (a crude kind of netpipe) that
> checks typical message-passing patterns in my application codes. Typical
> output is:
>
> 32 CPUs: sync call time = 1003.0
>
>                              time                               rate (Mbytes/s)                      bandwidth (MBits/s)
> loop buffers  size     XC       XE       GS       MS        XC       XE       GS       MS         XC       XE       GS       MS
>    1      64 16384  2.48e-02 1.99e-02 1.21e+00 3.88e-02   4.23e+01 5.28e+01 8.65e-01 2.70e+01   1.08e+04 1.35e+04 4.43e+02 1.38e+04
>    2      64 16384  2.17e-02 2.09e-02 1.21e+00 4.10e-02   4.82e+01 5.02e+01 8.65e-01 2.56e+01   1.23e+04 1.29e+04 4.43e+02 1.31e+04
>    3      64 16384  2.20e-02 1.99e-02 1.01e+00 3.95e-02   4.77e+01 5.27e+01 1.04e+00 2.65e+01   1.22e+04 1.35e+04 5.33e+02 1.36e+04
>    4      64 16384  2.16e-02 1.96e-02 1.25e+00 4.00e-02   4.85e+01 5.36e+01 8.37e-01 2.62e+01   1.24e+04 1.37e+04 4.28e+02 1.34e+04
>    5      64 16384  2.25e-02 2.00e-02 1.25e+00 4.07e-02   4.66e+01 5.24e+01 8.39e-01 2.57e+01   1.19e+04 1.34e+04 4.30e+02 1.32e+04
>    6      64 16384  2.19e-02 1.99e-02 1.29e+00 4.05e-02   4.79e+01 5.28e+01 8.14e-01 2.59e+01   1.23e+04 1.35e+04 4.17e+02 1.33e+04
>    7      64 16384  2.19e-02 2.06e-02 1.25e+00 4.03e-02   4.79e+01 5.09e+01 8.38e-01 2.60e+01   1.23e+04 1.30e+04 4.29e+02 1.33e+04
>    8      64 16384  2.24e-02 2.06e-02 1.25e+00 4.01e-02   4.69e+01 5.09e+01 8.39e-01 2.62e+01   1.20e+04 1.30e+04 4.30e+02 1.34e+04
>    9      64 16384  4.29e-01 2.01e-02 6.35e-01 3.98e-02   2.45e+00 5.22e+01 1.65e+00 2.64e+01   6.26e+02 1.34e+04 8.46e+02 1.35e+04
>   10      64 16384  2.16e-02 2.06e-02 8.87e-01 4.00e-02   4.85e+01 5.09e+01 1.18e+00 2.62e+01   1.24e+04 1.30e+04 6.05e+02 1.34e+04
>
> Time is the total for all 64 buffers; rate is one way across one link
> (number of bytes / time).
>
> 1) XC is a bidirectional ring exchange. Each processor sends to the
>    right and receives from the left.
> 2) XE is an edge exchange. Pairs of nodes exchange data, with each one
>    sending and receiving.
> 3) GS is MPI_AllReduce.
> 4) MS is my version of MPI_AllReduce. It splits the vector into Np
>    blocks (Np is the number of processors); each processor then acts as
>    a head node for one block. This uses the full bandwidth all the time,
>    unlike AllReduce, which thins out as it gets to the top of the binary
>    tree. On a 64-node Infiniband system MS is about 5X faster than GS;
>    in theory it would be 6X, i.e. log_2(64). Here it is 25X -- not sure
>    why so much.
>
> But MS seems to be the cause of the hangups with messages > 64K. I can
> run the other benchmarks OK, but this one seems to hang for large
> messages. I think the problem is at least partly due to the switch. All
> MS is doing is point-to-point communication, but unfortunately it
> sometimes requires high bandwidth between ASICs.
> It first exchanges data between near neighbors in MPI_COMM_WORLD, but it
> must progressively span wider gaps between nodes as it goes up the
> various binary trees. After a while this requires extensive traffic
> between ASICs. This seems to be a problem on both my HP 2724 and the
> Extreme Networks Summit400t-48. I am currently working with Extreme to
> try to resolve the switch issue. As I say, the code ran great on
> Infiniband, but I think those switches have hardware flow control.
> Finally, I checked the code again under LAM and it ran OK. Slow, but no
> hangs.
>
> To run the code, compile and type:
>
>   mpirun -np 32 -machinefile hosts src/netbench 8
>
> The 8 means 2^8 bytes (i.e. 256K). This was enough to hang every time on
> my boxes.
>
> You can also edit the header file (header.h). MAX_LOOPS is how many
> times it runs each test (currently 10); NUM_BUF is the number of buffers
> in each test (must be more than the number of processors); SYNC defines
> the global sync frequency (every SYNC buffers); NUM_SYNC is the number
> of sequential barrier calls it uses to determine the mean barrier call
> time. You can also switch the various tests on and off, which can be
> useful for debugging.
>
> Tony
>
> -------------------------------
> Tony Ladd
> Professor, Chemical Engineering
> University of Florida
> PO Box 116005
> Gainesville, FL 32611-6005
>
> Tel: 352-392-6509
> FAX: 352-392-9513
> Email: tl...@che.ufl.edu
> Web: http://ladd.che.ufl.edu
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
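
The XC pattern described in the quoted message (each rank sends a buffer to
its right-hand neighbor and receives one from its left) can be reproduced in
a few lines of MPI. The sketch below is illustrative rather than the netbench
source: the buffer size, the double datatype, and the use of MPI_Sendrecv are
assumptions made for brevity.

  /* xc_ring.c -- illustrative sketch of the XC bidirectional ring exchange
     (not taken from netbench; buffer size and datatype are assumptions) */
  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, np, i;
      const int count = 2048;              /* 2048 doubles = 16 KB buffer */
      double *sendbuf, *recvbuf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &np);

      sendbuf = malloc(count * sizeof(double));
      recvbuf = malloc(count * sizeof(double));
      for (i = 0; i < count; i++)
          sendbuf[i] = (double) rank;

      /* send to the right neighbor, receive from the left neighbor */
      MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, (rank + 1) % np, 0,
                   recvbuf, count, MPI_DOUBLE, (rank - 1 + np) % np, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      free(sendbuf);
      free(recvbuf);
      MPI_Finalize();
      return 0;
  }

MPI_Sendrecv pairs the send and receive in one call, so the ring cannot
deadlock the way a ring of plain blocking MPI_Send calls can once messages
exceed the eager threshold.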
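
The MS reduction described above -- split the vector into Np blocks, let each
rank act as head node for one block, then redistribute the reduced blocks --
is essentially a reduce-scatter followed by an allgather. The sketch below
expresses that idea with standard MPI collectives instead of netbench's
hand-rolled point-to-point trees; the function name, the use of MPI_SUM on
doubles, and the requirement that the vector length divide evenly by the
number of ranks are all assumptions.

  /* ms_allreduce.c -- illustrative sketch of a block-wise all-reduce
     (reduce-scatter + allgather); not the netbench implementation */
  #include <mpi.h>
  #include <stdlib.h>

  /* Reduce 'vec' (length n, assumed divisible by the number of ranks)
     across 'comm' so that every rank ends up with the summed vector. */
  void block_allreduce(double *vec, int n, MPI_Comm comm)
  {
      int np, i;
      MPI_Comm_size(comm, &np);

      int blk = n / np;                    /* block owned by each rank */
      int *counts = malloc(np * sizeof(int));
      double *myblock = malloc(blk * sizeof(double));
      for (i = 0; i < np; i++)
          counts[i] = blk;

      /* each rank receives the fully reduced values for its own block */
      MPI_Reduce_scatter(vec, myblock, counts, MPI_DOUBLE, MPI_SUM, comm);

      /* every rank gathers all np reduced blocks -> same result as
         MPI_Allreduce, but the reduction work is spread over all ranks */
      MPI_Allgather(myblock, blk, MPI_DOUBLE, vec, blk, MPI_DOUBLE, comm);

      free(myblock);
      free(counts);
  }

Either formulation ends up exchanging blocks between ranks that sit
progressively farther apart, which is exactly where Tony suspects the
inter-ASIC bandwidth of his edge switches becomes the bottleneck.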