Jeff,

The hardware limitation doesn't allow me to use anything other than TCP...

I think I have a good understanding of what's going on, and may have a
solution. I'll test it out. Thanks to you all.

Best regards,
Zhen

On Fri, May 6, 2016 at 7:13 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> On May 5, 2016, at 10:09 PM, Zhen Wang <tod...@gmail.com> wrote:
> >
> > > It's taking so long because you are sleeping for .1 second between
> > > calling MPI_Test().
> > >
> > > The TCP transport is only sending a few fragments of your message
> > > during each iteration through MPI_Test (because, by definition, it has
> > > to return "immediately").  Other transports do better at handing off
> > > large messages like this to hardware for asynchronous progress.
> >
> > This agrees with what I observed. Larger messages need more calls of
> > MPI_Test. What do you mean by other transports?
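
For concreteness, a minimal sketch of the pattern under discussion: a large
nonblocking transfer that is progressed only by periodic MPI_Test() calls
with a 0.1-second sleep in between. The message size, tag, and executable
name below are illustrative, not taken from the original program.

#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

/* Run with at least 2 ranks, e.g.: mpirun -np 2 ./isend_test */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 100 * 1024 * 1024;    /* a "large" message: 100 MB */
    char *buf = malloc(count);
    MPI_Request req;
    int done = 0;

    if (rank == 0)
        MPI_Isend(buf, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Irecv(buf, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
    else
        done = 1;                            /* other ranks sit this out */

    /* The TCP BTL only pushes a few fragments per MPI_Test() call, so
       sleeping 0.1 s between calls stretches the transfer out. */
    while (!done) {
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        if (!done)
            usleep(100000);                  /* 0.1 second */
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
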
>
> The POSIX sockets API, commonly used with TCP over Ethernet, is great for
> most network-based applications, but it has some inherent constraints that
> limit its performance in HPC types of applications.
>
> That being said, many people just take a bunch of servers and run MPI
> over TCP/Ethernet, and it works well enough for them.  Because of this
> "good enough" performance, and the fact that every server in the world
> supports some type of Ethernet capability, all MPI implementations support
> TCP.
>
> But there are more demanding HPC applications that require higher
> performance from the network in order to get good overall performance.  As
> such, other networking APIs -- most commonly provided by vendors for
> HPC-class networks (Ethernet or otherwise) -- do not have the same
> performance constraints as the POSIX sockets API, and are usually preferred
> by HPC applications.
>
> There are usually two kinds of performance improvements that such networking
> APIs offer (in conjunction with the underlying NIC for the HPC-class
> network):
>
> 1. Improving software API efficiency (e.g., avoiding extra memory copies,
> bypassing the OS and exposing NIC hardware directly into userspace, etc.)
>
> 2. Exploiting NIC hardware capabilities, usually designed for MPI and/or
> general high performance (e.g., polling for progress instead of waiting for
> interrupts, hardware demultiplexing of incoming messages directly to target
> processes, direct data placement at the target, etc.)
>
> Hence, when I say "other transports", I'm referring to these HPC-class
> networks (and associated APIs).
>
> > > Additionally, in the upcoming v2.0.0 release is a non-default option to
> > > enable an asynchronous progress thread for the TCP transport.  We're up
> > > to v2.0.0rc2; you can give that async TCP support a whirl, if you want.
> > > Pass "--mca btl_tcp_progress_thread 1" on the mpirun command line to
> > > enable the TCP progress thread to try it.
> >
> > Does this mean there's an additional thread to transfer data in the
> > background?
>
> Yes.
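
For reference, trying that option would look something like the following
command line; only the MCA parameter comes from the message above, while the
rank count and executable name are placeholders (and an Open MPI v2.0.0rc2
build is assumed):

    mpirun --mca btl_tcp_progress_thread 1 -np 2 ./isend_test
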
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29112.php
>
