Adam,

FWIW, an EFA adapter is available on this AWS instance, and Open MPI can use it via libfabric (aka OFI). Here is a link to Brian's video: https://insidehpc.com/2018/04/amazon-libfabric-case-study-flexible-hpc-infrastructure/
Cheers,
Gilles

On Sunday, March 24, 2019, Adam Sylvester <op8...@gmail.com> wrote:

> Digging up this old thread as it appears there's still an issue with btl_tcp_links.
>
> I'm now using c5.18xlarge instances in AWS, which have 25 Gbps connectivity; using iperf3 with the -P option to drive multiple ports, I achieve over 24 Gbps when communicating between two instances.
>
> When I originally asked this question, Gilles suggested I could do the equivalent with Open MPI via the --mca btl_tcp_links flag, but then Brian reported that this flag doesn't work in the 2.x and 3.x series. I just updated to Open MPI 4.0.0, hoping that this was fixed; according to the FAQ at https://www.open-mpi.org/faq/?category=tcp#tcp-multi-links, it should be working. However, I see no difference in performance; on a simple benchmark which passes 10 GB between two ranks (one rank per host) via MPI_Send() and MPI_Recv(), I see around 9 Gb/s with or without this flag.
>
> In particular, I am running with:
> mpirun --mca btl_tcp_links 4 -N 1 --bind-to none --hostfile hosts.txt /path/to/my/application
>
> Trying a btl_tcp_links value of 2 or 3 also makes no difference. Is there another flag I need to be using, or is something still broken?
>
> Thanks.
> -Adam
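A minimal sketch of the kind of two-rank bandwidth test described above; the sizes, chunking, and file name are illustrative assumptions, not the actual benchmark code from the thread:

    /* bw_sketch.c - time a large point-to-point transfer between two ranks.
     * Sizes are illustrative; run with one rank per host, e.g.
     *   mpirun -N 1 --bind-to none --hostfile hosts.txt ./bw_sketch
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);
        if (nranks < 2) {
            if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        const size_t total = (size_t)10 << 30;   /* ~10 GiB total, sent in chunks */
        const int chunk = 1 << 30;               /* 1 GiB per call (count must fit in an int) */
        char *buf = malloc((size_t)chunk);       /* contents don't matter for a bandwidth test */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (size_t done = 0; done < total; done += (size_t)chunk) {
            if (rank == 0)
                MPI_Send(buf, chunk, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(buf, chunk, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        double t1 = MPI_Wtime();

        if (rank == 1)
            printf("%.2f Gbit/s\n", 8.0 * total / (t1 - t0) / 1e9);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Comparing the reported rate with and without --mca btl_tcp_links (and against iperf3 -P on the same pair of hosts) is essentially the experiment discussed in this thread.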
> On Thu, Jul 13, 2017 at 12:05 PM Adam Sylvester <op8...@gmail.com> wrote:

>> Bummer - thanks for the info Brian.
>>
>> As an FYI, I do have a real-world use case for this faster connectivity (i.e., beyond just a benchmark). While my application will happily gobble up and run on however many machines it's given, there's a resource manager that lives on top of everything that doles out machines to applications. So there will be cases where my application will only get two machines to run on, and so I'd still like the big data transfers to happen as quickly as possible. I agree that when there are many ranks all talking to each other, I should hopefully get closer to the full 20 Gbps.
>>
>> I appreciate that you have a number of other higher priorities, but wanted to make you aware that I do have a use case for it... look forward to using it when it's in place. :o)

>> On Wed, Jul 12, 2017 at 2:18 PM, Barrett, Brian via users <users@lists.open-mpi.org> wrote:

>>> Adam -
>>>
>>> The btl_tcp_links flag does not currently work (for various reasons) in the 2.x and 3.x series. It's on my todo list to fix, but I'm not sure it will get done before the 3.0.0 release. Part of the reason that it hasn't been a priority is that most applications (outside of benchmarks) don't benefit from the 20 Gbps between rank pairs, as they are generally talking to multiple peers at once (and therefore can drive the full 20 Gbps). It's definitely on our roadmap, but I can't promise a release just yet.
>>>
>>> Brian

>>> On Jul 12, 2017, at 11:44 AM, Adam Sylvester <op8...@gmail.com> wrote:

>>> I switched over to X1 instances in AWS, which have 20 Gbps connectivity. Using iperf3, I'm seeing 11.1 Gbps between them with just one port. iperf3 supports a -P option which will connect using multiple ports... Setting this to use in the range of 5-20 ports (there's some variability from run to run), I can get in the range of 18 Gbps aggregate, which for a real-world speed seems pretty good.
>>>
>>> Using mpirun with the previously suggested btl_tcp_sndbuf and btl_tcp_rcvbuf settings, I'm getting around 10.7 Gbps. So, pretty close to iperf with just one port (it makes sense there'd be some overhead with MPI). My understanding of the btl_tcp_links flag that Gilles mentioned is that it should be analogous to iperf's -P flag - it should connect with multiple ports in the hope of improving the aggregate bandwidth.
>>>
>>> If that's what this flag is supposed to do, it does not appear to be working properly for me. With lsof, I can see the expected number of ports show up when I run iperf. However, with MPI I only ever see three connections between the two machines - sshd, orted, and my actual application. No matter what I set btl_tcp_links to, I don't see any additional ports show up (or any change in performance).
>>>
>>> Am I misunderstanding what this flag does, or is there a bug here? If I am misunderstanding the flag's intent, is there a different flag that would allow Open MPI to use multiple ports similar to what iperf is doing?
>>>
>>> Thanks.
>>> -Adam
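One way to check, from inside the application itself, which btl_tcp values the library actually picked up at runtime is the MPI_T control-variable interface. A minimal sketch follows; the assumption here is that Open MPI exposes its MCA parameters (such as btl_tcp_links) as MPI_T control variables under the same names:

    /* cvar_dump.c - print integer-valued btl_tcp control variables via MPI_T. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int provided, ncvar;
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_Init(&argc, &argv);

        MPI_T_cvar_get_num(&ncvar);
        for (int i = 0; i < ncvar; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope, count, value;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;
            MPI_T_cvar_handle handle;

            if (MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                    &enumtype, desc, &desc_len, &bind,
                                    &scope) != MPI_SUCCESS)
                continue;
            if (strncmp(name, "btl_tcp_", 8) != 0)
                continue;
            /* only read simple, unbound integer variables */
            if ((dtype != MPI_INT && dtype != MPI_UNSIGNED) ||
                bind != MPI_T_BIND_NO_OBJECT)
                continue;

            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            if (count == 1 && MPI_T_cvar_read(handle, &value) == MPI_SUCCESS)
                printf("%s = %d\n", name, value);
            MPI_T_cvar_handle_free(&handle);
        }

        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }

Run under the same mpirun command line as the benchmark, this shows whether a setting such as --mca btl_tcp_links 4 actually reached the TCP BTL, independent of what lsof shows.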
>>> On Mon, Jul 10, 2017 at 9:31 PM, Adam Sylvester <op8...@gmail.com> wrote:

>>>> Thanks again, Gilles. Ahh, better yet - I wasn't familiar with the config file way to set these parameters... it'll be easy to bake this into my AMI so that I don't have to set them each time while waiting for the next Open MPI release.
>>>>
>>>> Out of mostly laziness I try to keep to the formal releases rather than applying patches myself, but thanks for the link to it (the commit comments were useful to understand why this improved performance).
>>>>
>>>> -Adam

>>>> On Mon, Jul 10, 2017 at 12:04 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

>>>>> Adam,
>>>>>
>>>>> Thanks for letting us know your performance issue has been resolved.
>>>>>
>>>>> Yes, https://www.open-mpi.org/faq/?category=tcp is the best place to look for this kind of information.
>>>>>
>>>>> I will add a reference to these parameters. I will also ask folks at AWS if they have additional/other recommendations.
>>>>>
>>>>> Note you have a few options before 2.1.2 (or 3.0.0) is released:
>>>>>
>>>>> - update your system-wide config file (/.../etc/openmpi-mca-params.conf) or user config file ($HOME/.openmpi/mca-params.conf) and add the following lines:
>>>>>   btl_tcp_sndbuf = 0
>>>>>   btl_tcp_rcvbuf = 0
>>>>>
>>>>> - add the following environment variables to your environment:
>>>>>   export OMPI_MCA_btl_tcp_sndbuf=0
>>>>>   export OMPI_MCA_btl_tcp_rcvbuf=0
>>>>>
>>>>> - use Open MPI 2.0.3
>>>>>
>>>>> - last but not least, you can manually download and apply the patch available at https://github.com/open-mpi/ompi/commit/b64fedf4f652cadc9bfc7c4693f9c1ef01dfb69f.patch
>>>>>
>>>>> Cheers,
>>>>> Gilles

>>>>> On 7/9/2017 11:04 PM, Adam Sylvester wrote:

>>>>>> Gilles,
>>>>>>
>>>>>> Thanks for the fast response!
>>>>>>
>>>>>> The --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 flags you recommended made a huge difference - this got me up to 5.7 Gb/s! I wasn't aware of these flags... with a little Googling, is https://www.open-mpi.org/faq/?category=tcp the best place to look for this kind of information and any other tweaks I may want to try (or if there's a better FAQ out there, please let me know)?
>>>>>>
>>>>>> There is only eth0 on my machines, so nothing to tweak there (though good to know for the future). I also didn't see any improvement by specifying more sockets per instance. But your initial suggestion had a major impact.
>>>>>>
>>>>>> In general I try to stay relatively up to date with my Open MPI version; I'll be extra motivated to upgrade to 2.1.2 so that I don't have to remember to set these --mca flags on the command line. :o)
>>>>>>
>>>>>> -Adam

>>>>>> On Sun, Jul 9, 2017 at 9:26 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

>>>>>> Adam,
>>>>>>
>>>>>> First, you need to change the default send and receive socket buffers:
>>>>>> mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 ...
>>>>>> /* note this will be the default from Open MPI 2.1.2 */
>>>>>>
>>>>>> Hopefully, that will be enough to greatly improve the bandwidth for large messages.
>>>>>>
>>>>>> Generally speaking, I recommend you use the latest available version (e.g. Open MPI 2.1.1).
>>>>>>
>>>>>> How many interfaces can be used to communicate between hosts? If there is more than one (for example a slow and a fast one), you'd rather only use the fast one. For example, if eth0 is the fast interface, that can be achieved with
>>>>>> mpirun --mca btl_tcp_if_include eth0 ...
>>>>>>
>>>>>> Also, you might be able to achieve better results by using more than one socket on the fast interface. For example, if you want to use 4 sockets per interface:
>>>>>> mpirun --mca btl_tcp_links 4 ...
>>>>>>
>>>>>> Cheers,
>>>>>> Gilles

>>>>>> On Sun, Jul 9, 2017 at 10:10 PM, Adam Sylvester <op8...@gmail.com> wrote:

>>>>>> > I am using Open MPI 2.1.0 on RHEL 7. My application has one unavoidable pinch point where a large amount of data needs to be transferred (about 8 GB of data needs to be both sent to and received from all other ranks), and I'm seeing worse performance than I would expect; this step has a major impact on my overall runtime. In the real application, I am using MPI_Alltoall() for this step, but for the purpose of a simple benchmark, I simplified it to a single MPI_Send() / MPI_Recv() of a 2 GB buffer between two ranks.
>>>>>> >
>>>>>> > I'm running this in AWS with instances that have 10 Gbps connectivity in the same availability zone (according to tracepath, there are no hops between them) and the MTU set to 8801 bytes. Doing a non-MPI benchmark of sending data directly over TCP between these two instances, I reliably get around 4 Gbps. Between these same two instances with MPI_Send() / MPI_Recv(), I reliably get around 2.4 Gbps. This seems like a major performance degradation for a single MPI operation.
>>>>>> >
>>>>>> > I compiled Open MPI 2.1.0 with gcc 4.9.1 and default settings. I'm connecting between instances via ssh and using, I assume, TCP for the actual network transfer (I'm not setting any special command-line or programmatic settings). The actual command I'm running is:
>>>>>> > mpirun -N 1 --bind-to none --hostfile hosts.txt my_app
>>>>>> >
>>>>>> > Any advice on other things to test or compilation and/or runtime flags to set would be much appreciated!
>>>>>> > -Adam
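For reference, a minimal sketch of the all-to-all exchange pattern the original post describes; the per-peer block size here is illustrative, not the actual ~8 GB working set, and this is not the application's code:

    /* alltoall_sketch.c - each rank exchanges one block with every other rank. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Illustrative block size: 64 MiB sent to (and received from) each peer. */
        const int block = 64 * 1024 * 1024;
        char *sendbuf = malloc((size_t)block * nranks);
        char *recvbuf = malloc((size_t)block * nranks);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Alltoall(sendbuf, block, MPI_BYTE, recvbuf, block, MPI_BYTE, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("alltoall of %d MiB per peer took %.3f s\n", block >> 20, t1 - t0);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

The single MPI_Send() / MPI_Recv() test discussed earlier in the thread is the two-rank simplification of this pattern.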