Digging up this old thread as it appears there's still an issue with btl_tcp_links.
I'm now using c5.18xlarge instances in AWS, which have 25 Gbps connectivity; using iperf3 with the -P option to drive multiple ports, I achieve over 24 Gbps when communicating between two instances. When I originally asked this question, Gilles suggested I could do the equivalent with Open MPI via the --mca btl_tcp_links flag, but then Brian reported that this flag doesn't work in the 2.x and 3.x series.

I just updated to Open MPI 4.0.0 hoping that this was fixed; according to the FAQ at https://www.open-mpi.org/faq/?category=tcp#tcp-multi-links, it should be working. However, I see no difference in performance: on a simple benchmark which passes 10 GB between two ranks (one rank per host) via MPI_Send() and MPI_Recv(), I see around 9 Gb/s with or without this flag. In particular, I am running with:

mpirun --mca btl_tcp_links 4 -N 1 --bind-to none --hostfile hosts.txt /path/to/my/application

Trying a btl_tcp_links value of 2 or 3 also makes no difference. Is there another flag I need to be using, or is something still broken?

Thanks.
-Adam
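For reference, here is roughly the shape of the benchmark I'm timing - a minimal sketch rather than my exact code (the 1 GB chunking, the loop count, and names like bw are just illustrative). It sends 10 GB from rank 0 to rank 1 via MPI_Send()/MPI_Recv() and reports the achieved rate; I build it with mpicc and launch it with the same mpirun command as above.

/*
 * Minimal MPI_Send()/MPI_Recv() bandwidth sketch (illustrative only).
 * Run with one rank per host, e.g.:
 *   mpirun --mca btl_tcp_links 4 -N 1 --bind-to none --hostfile hosts.txt ./bw
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 10 GB total, sent as 1 GB chunks so each count fits in an int. */
    const size_t CHUNK = 1000UL * 1000UL * 1000UL;
    const int NUM_CHUNKS = 10;

    char* buf = malloc(CHUNK);
    if (!buf) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < NUM_CHUNKS; ++i) {
        if (rank == 0) {
            MPI_Send(buf, (int)CHUNK, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, (int)CHUNK, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 1) {
        /* Report aggregate bandwidth in gigabits per second. */
        double gbits = NUM_CHUNKS * CHUNK * 8.0 / 1e9;
        printf("%.2f Gb in %.2f s => %.2f Gb/s\n", gbits, elapsed, gbits / elapsed);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}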
On Thu, Jul 13, 2017 at 12:05 PM Adam Sylvester <op8...@gmail.com> wrote:

> Bummer - thanks for the info Brian.
>
> As an FYI, I do have a real world use case for this faster connectivity (i.e. beyond just a benchmark). While my application will happily gobble up and run on however many machines it's given, there's a resource manager that lives on top of everything that doles out machines to applications. So there will be cases where my application will only get two machines to run, and I'd still like the big data transfers to happen as quickly as possible. I agree that when there are many ranks all talking to each other, I should hopefully get closer to the full 20 Gbps.
>
> I appreciate that you have a number of other higher priorities, but wanted to make you aware that I do have a use case for it... look forward to using it when it's in place. :o)
>
> On Wed, Jul 12, 2017 at 2:18 PM, Barrett, Brian via users <users@lists.open-mpi.org> wrote:
>
>> Adam -
>>
>> The btl_tcp_links flag does not currently work (for various reasons) in the 2.x and 3.x series. It's on my todo list to fix, but I'm not sure it will get done before the 3.0.0 release. Part of the reason that it hasn't been a priority is that most applications (outside of benchmarks) don't benefit from the 20 Gbps between rank pairs, as they are generally talking to multiple peers at once (and therefore can drive the full 20 Gbps). It's definitely on our roadmap, but can't promise a release just yet.
>>
>> Brian
>>
>> On Jul 12, 2017, at 11:44 AM, Adam Sylvester <op8...@gmail.com> wrote:
>>
>> I switched over to X1 instances in AWS which have 20 Gbps connectivity. Using iperf3, I'm seeing 11.1 Gbps between them with just one port. iperf3 supports a -P option which will connect using multiple ports... Setting this to use in the range of 5-20 ports (there's some variability from run to run), I can get in the range of 18 Gbps aggregate, which for a real world speed seems pretty good.
>>
>> Using mpirun with the previously-suggested btl_tcp_sndbuf and btl_tcp_rcvbuf settings, I'm getting around 10.7 Gbps. So, pretty close to iperf with just one port (makes sense there'd be some overhead with MPI). My understanding of the btl_tcp_links flag that Gilles mentioned is that it should be analogous to iperf's -P flag - it should connect with multiple ports in the hopes of improving the aggregate bandwidth.
>>
>> If that's what this flag is supposed to do, it does not appear to be working properly for me. With lsof, I can see the expected number of ports show up when I run iperf. However, with MPI I only ever see three connections between the two machines - sshd, orted, and my actual application. No matter what I set btl_tcp_links to, I don't see any additional ports show up (or any change in performance).
>>
>> Am I misunderstanding what this flag does, or is there a bug here? If I am misunderstanding the flag's intent, is there a different flag that would allow Open MPI to use multiple ports similar to what iperf is doing?
>>
>> Thanks.
>> -Adam
>>
>> On Mon, Jul 10, 2017 at 9:31 PM, Adam Sylvester <op8...@gmail.com> wrote:
>>
>>> Thanks again Gilles. Ahh, better yet - I wasn't familiar with the config file way to set these parameters... it'll be easy to bake this into my AMI so that I don't have to set them each time while waiting for the next Open MPI release.
>>>
>>> Out of mostly laziness I try to keep to the formal releases rather than applying patches myself, but thanks for the link to it (the commit comments were useful to understand why this improved performance).
>>>
>>> -Adam
>>>
>>> On Mon, Jul 10, 2017 at 12:04 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>>> Adam,
>>>>
>>>> Thanks for letting us know your performance issue has been resolved.
>>>>
>>>> Yes, https://www.open-mpi.org/faq/?category=tcp is the best place to look for this kind of information.
>>>>
>>>> I will add a reference to these parameters. I will also ask folks at AWS if they have additional/other recommendations.
>>>>
>>>> Note you have a few options before 2.1.2 (or 3.0.0) is released:
>>>>
>>>> - update your system wide config file (/.../etc/openmpi-mca-params.conf) or user config file ($HOME/.openmpi/mca-params.conf) and add the following lines
>>>>   btl_tcp_sndbuf = 0
>>>>   btl_tcp_rcvbuf = 0
>>>>
>>>> - add the following environment variables to your environment
>>>>   export OMPI_MCA_btl_tcp_sndbuf=0
>>>>   export OMPI_MCA_btl_tcp_rcvbuf=0
>>>>
>>>> - use Open MPI 2.0.3
>>>>
>>>> - last but not least, you can manually download and apply the patch available at
>>>>   https://github.com/open-mpi/ompi/commit/b64fedf4f652cadc9bfc7c4693f9c1ef01dfb69f.patch
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 7/9/2017 11:04 PM, Adam Sylvester wrote:
>>>>
>>>>> Gilles,
>>>>>
>>>>> Thanks for the fast response!
>>>>>
>>>>> The --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 flags you recommended made a huge difference - this got me up to 5.7 Gb/s! I wasn't aware of these flags... with a little Googling, is https://www.open-mpi.org/faq/?category=tcp the best place to look for this kind of information and any other tweaks I may want to try (or if there's a better FAQ out there, please let me know)?
>>>>>
>>>>> There is only eth0 on my machines so nothing to tweak there (though good to know for the future). I also didn't see any improvement by specifying more sockets per instance. But, your initial suggestion had a major impact.
>>>>>
>>>>> In general I try to stay relatively up to date with my Open MPI version; I'll be extra motivated to upgrade to 2.1.2 so that I don't have to remember to set these --mca flags on the command line. :o)
>>>>> -Adam
>>>>>
>>>>> On Sun, Jul 9, 2017 at 9:26 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>>>
>>>>> Adam,
>>>>>
>>>>> At first, you need to change the default send and receive socket buffers:
>>>>>   mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 ...
>>>>> /* note this will be the default from Open MPI 2.1.2 */
>>>>>
>>>>> Hopefully, that will be enough to greatly improve the bandwidth for large messages.
>>>>>
>>>>> Generally speaking, I recommend you use the latest available version (e.g. Open MPI 2.1.1).
>>>>>
>>>>> How many interfaces can be used to communicate between hosts? If there is more than one (for example, a slow and a fast one), you should use only the fast one. For example, if eth0 is the fast interface, that can be achieved with
>>>>>   mpirun --mca btl_tcp_if_include eth0 ...
>>>>>
>>>>> Also, you might be able to achieve better results by using more than one socket on the fast interface. For example, if you want to use 4 sockets per interface:
>>>>>   mpirun --mca btl_tcp_links 4 ...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On Sun, Jul 9, 2017 at 10:10 PM, Adam Sylvester <op8...@gmail.com> wrote:
>>>>> > I am using Open MPI 2.1.0 on RHEL 7. My application has one unavoidable pinch point where a large amount of data needs to be transferred (about 8 GB of data needs to be both sent to and received from all other ranks), and I'm seeing worse performance than I would expect; this step has a major impact on my overall runtime. In the real application, I am using MPI_Alltoall() for this step, but for the purpose of a simple benchmark, I simplified to a single MPI_Send() / MPI_Recv() of a 2 GB buffer between two ranks.
>>>>> >
>>>>> > I'm running this in AWS with instances that have 10 Gbps connectivity in the same availability zone (according to tracepath, there are no hops between them) and MTU set to 8801 bytes. Doing a non-MPI benchmark of sending data directly over TCP between these two instances, I reliably get around 4 Gbps. Between these same two instances with MPI_Send() / MPI_Recv(), I reliably get around 2.4 Gbps. This seems like a major performance degradation for a single MPI operation.
>>>>> >
>>>>> > I compiled Open MPI 2.1.0 with gcc 4.9.1 and default settings. I'm connecting between instances via ssh and using, I assume, TCP for the actual network transfer (I'm not setting any special command-line or programmatic settings). The actual command I'm running is:
>>>>> >   mpirun -N 1 --bind-to none --hostfile hosts.txt my_app
>>>>> >
>>>>> > Any advice on other things to test or compilation and/or runtime flags to set would be much appreciated!
>>>>> > -Adam
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users