Adam -

The btl_tcp_links flag does not currently work (for various reasons) in the 2.x 
and 3.x series.  It’s on my todo list to fix, but I’m not sure it will get done 
before the 3.0.0 release.  Part of the reason it hasn't been a priority is 
that most applications (outside of benchmarks) don't need the full 20 Gbps 
between a single pair of ranks: they are generally talking to multiple peers 
at once, so the aggregate traffic across those connections can already drive 
the full 20 Gbps.  It's definitely on our roadmap, but I can't promise a 
release just yet.

Brian

On Jul 12, 2017, at 11:44 AM, Adam Sylvester <op8...@gmail.com> wrote:

I switched over to X1 instances in AWS, which have 20 Gbps connectivity.  
Using iperf3, I'm seeing 11.1 Gbps between them with just one port.  iperf3 
supports a -P option which connects using multiple parallel streams (ports).  
Setting this to somewhere in the range of 5-20 streams (there's some 
variability from run to run), I can get around 18 Gbps aggregate, which seems 
pretty good for real-world throughput.
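
For concreteness, the kind of iperf3 invocation I mean is something like the 
following (the server address is just a placeholder, and the exact stream 
count varies):

iperf3 -s                        # on one instance (server)
iperf3 -c <server-ip> -P 8       # on the other instance, 8 parallel streams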

Using mpirun with the previously suggested btl_tcp_sndbuf and btl_tcp_rcvbuf 
settings, I'm getting around 10.7 Gbps.  So, pretty close to iperf with just 
one port (it makes sense that there'd be some overhead with MPI).  My 
understanding of the btl_tcp_links flag that Gilles mentioned is that it 
should be analogous to iperf's -P flag: it should connect over multiple ports 
in the hope of improving the aggregate bandwidth.
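
For reference, the full command line in question looks roughly like the 
following, using the same hostfile and application as in my original mail 
(the value 4 for btl_tcp_links is just an example):

mpirun -N 1 --bind-to none --hostfile hosts.txt \
    --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 \
    --mca btl_tcp_links 4 \
    my_app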

If that's what this flag is supposed to do, it does not appear to be working 
properly for me.  With lsof, I can see the expected number of ports show up 
when I run iperf.  However, with MPI I only ever see three connections between 
the two machines - sshd, orted, and my actual application.  No matter what I 
set btl_tcp_links to, I don't see any additional ports show up (or any change 
in performance).
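
The sort of check I mean is roughly the following, run on one of the hosts 
(the remote IP is a placeholder):

# list established TCP connections to the other instance
sudo lsof -nP -iTCP -sTCP:ESTABLISHED | grep <remote-ip>
# with iperf3 -P this shows one connection per stream; with MPI I only ever
# see the sshd, orted, and application connections regardless of btl_tcp_links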

Am I misunderstanding what this flag does or is there a bug here?  If I am 
misunderstanding the flag's intent, is there a different flag that would allow 
Open MPI to use multiple ports similar to what iperf is doing?

Thanks.
-Adam

On Mon, Jul 10, 2017 at 9:31 PM, Adam Sylvester <op8...@gmail.com> wrote:
Thanks again Gilles.  Ahh, better yet - I wasn't familiar with the config file 
way to set these parameters... it'll be easy to bake this into my AMI so that I 
don't have to set them each time while waiting for the next Open MPI release.
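
For anyone else baking this into an image, the file just needs the two lines 
Gilles lists below:

# $HOME/.openmpi/mca-params.conf (or the system-wide openmpi-mca-params.conf)
btl_tcp_sndbuf = 0
btl_tcp_rcvbuf = 0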

Mostly out of laziness, I try to stick to the formal releases rather than 
applying patches myself, but thanks for the link (the commit comments were 
useful for understanding why this improves performance).

-Adam

On Mon, Jul 10, 2017 at 12:04 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
Adam,


Thanks for letting us know your performance issue has been resolved.


Yes, https://www.open-mpi.org/faq/?category=tcp is the best place to look for 
this kind of information.

I will add a reference to these parameters, and I will also ask the folks at 
AWS if they have any additional recommendations.


Note that you have a few options until 2.1.2 (or 3.0.0) is released:

- Update your system-wide config file (/.../etc/openmpi-mca-params.conf) or 
  your user config file ($HOME/.openmpi/mca-params.conf) and add the 
  following lines:

btl_tcp_sndbuf = 0
btl_tcp_rcvbuf = 0

- Add the following environment variables to your environment:

export OMPI_MCA_btl_tcp_sndbuf=0
export OMPI_MCA_btl_tcp_rcvbuf=0

- Use Open MPI 2.0.3.

- Last but not least, you can manually download and apply the patch available at
https://github.com/open-mpi/ompi/commit/b64fedf4f652cadc9bfc7c4693f9c1ef01dfb69f.patch
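
For example, from the top of the Open MPI source tree, something like the 
following should work (assuming curl and patch are available), followed by 
the usual rebuild and install:

curl -LO https://github.com/open-mpi/ompi/commit/b64fedf4f652cadc9bfc7c4693f9c1ef01dfb69f.patch
patch -p1 < b64fedf4f652cadc9bfc7c4693f9c1ef01dfb69f.patch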


Cheers,

Gilles

On 7/9/2017 11:04 PM, Adam Sylvester wrote:
Gilles,

Thanks for the fast response!

The --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 flags you recommended made a 
huge difference - this got me up to 5.7 Gbps!  I wasn't aware of these flags... 
with a little Googling, is https://www.open-mpi.org/faq/?category=tcp the best 
place to look for this kind of information and any other tweaks I may want to 
try (or if there's a better FAQ out there, please let me know)?

There is only eth0 on my machines, so nothing to tweak there (though good to 
know for the future).  I also didn't see any improvement by specifying more 
sockets per instance.  But your initial suggestion had a major impact.

In general I try to stay relatively up to date with my Open MPI version; I'll 
be extra motivated to upgrade to 2.1.2 so that I don't have to remember to set 
these --mca flags on the command line. :o)

-Adam

On Sun, Jul 9, 2017 at 9:26 AM, Gilles Gouaillardet 
<gilles.gouaillar...@gmail.com> wrote:

    Adam,

    First, you need to change the default send and receive socket buffer
    sizes:
    mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 ...
    /* note this will be the default from Open MPI 2.1.2 */

    Hopefully, that will be enough to greatly improve the bandwidth for
    large messages.

    Generally speaking, I recommend you use the latest available version
    (e.g. Open MPI 2.1.1).

    How many interfaces can be used to communicate between hosts?
    If there is more than one (for example a slow and a fast one), you'd
    rather only use the fast one.
    For example, if eth0 is the fast interface, that can be achieved with
    mpirun --mca btl_tcp_if_include eth0 ...

    Also, you might be able to achieve better results by using more than
    one socket on the fast interface.
    For example, if you want to use 4 sockets per interface:
    mpirun --mca btl_tcp_links 4 ...



    Cheers,

    Gilles

    On Sun, Jul 9, 2017 at 10:10 PM, Adam Sylvester <op8...@gmail.com> wrote:
    > I am using Open MPI 2.1.0 on RHEL 7.  My application has one unavoidable
    > pinch point where a large amount of data needs to be transferred (about
    > 8 GB of data needs to be both sent to and received from all other
    > ranks), and I'm seeing worse performance than I would expect; this step
    > has a major impact on my overall runtime.  In the real application, I am
    > using MPI_Alltoall() for this step, but for the purpose of a simple
    > benchmark, I simplified it to a single MPI_Send() / MPI_Recv() of a 2 GB
    > buffer between two ranks.
    >
    > I'm running this in AWS with instances that have 10 Gbps connectivity in
    > the same availability zone (according to tracepath, there are no hops
    > between them) and MTU set to 8801 bytes.  Doing a non-MPI benchmark of
    > sending data directly over TCP between these two instances, I reliably
    > get around 4 Gbps.  Between these same two instances with MPI_Send() /
    > MPI_Recv(), I reliably get around 2.4 Gbps.  This seems like a major
    > performance degradation for a single MPI operation.
    >
    > I compiled Open MPI 2.1.0 with gcc 4.9.1 and default settings.  I'm
    > connecting between instances via ssh and using (I assume) TCP for the
    > actual network transfer (I'm not setting any special command-line or
    > programmatic settings).  The actual command I'm running is:
    > mpirun -N 1 --bind-to none --hostfile hosts.txt my_app
    >
    > Any advice on other things to test or compilation and/or runtime flags
    > to set would be much appreciated!
    > -Adam

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
