Thanks Gilles.  Unfortunately, my understanding is that EFA is only
available on C5n instances, not 'regular' C5 instances (
https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-elastic-fabric-adapter/).
I will be using C5n instances in the future but not at this time, so I'm
hoping to get btl_tcp_links or equivalent to work...

Adam

On Sat, Mar 23, 2019, 8:59 PM Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Adam,
>
> FWIW, the EFA adapter is available on this AWS instance type, and Open MPI
> can use it via libfabric (aka OFI).
> Here is a link to Brian’s video:
> https://insidehpc.com/2018/04/amazon-libfabric-case-study-flexible-hpc-infrastructure/
>
> Cheers,
>
> Gilles
>
> On Sunday, March 24, 2019, Adam Sylvester <op8...@gmail.com> wrote:
>
>> Digging up this old thread as it appears there's still an issue with
>> btl_tcp_links.
>>
>> I'm now using c5.18xlarge instances in AWS which have 25 Gbps
>> connectivity; using iperf3 with the -P option to drive multiple ports, I
>> achieve over 24 Gbps when communicating between two instances.
>>
>> When I originally asked this question, Gilles suggested I could do the
>> equivalent with Open MPI via the --mca btl_tcp_links flag, but then Brian
>> reported that this flag doesn't work in the 2.x and 3.x series.  I just
>> updated to Open MPI 4.0.0 hoping that this was fixed; according to the FAQ
>> at https://www.open-mpi.org/faq/?category=tcp#tcp-multi-links, it should
>> be working.  However, I see no difference in performance: on a simple
>> benchmark that passes 10 GB between two ranks (one rank per host) via
>> MPI_Send() and MPI_Recv(), I see around 9 Gbps with or without this flag.
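>>
>> For reference, the benchmark boils down to something like the sketch
>> below; the 1 GB chunk size, the 10-iteration loop, and the timing/printing
>> details are illustrative assumptions rather than my exact code:
>>
>> /* Rank 0 sends ~10 GB to rank 1 in 1 GB chunks; the elapsed time on
>>  * rank 0 gives the achieved bandwidth. */
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char** argv)
>> {
>>     MPI_Init(&argc, &argv);
>>     int rank;
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>     const int CHUNK = 1 << 30;  /* 1 GB per message */
>>     const int ITER = 10;        /* 10 GB total */
>>     char* buf = malloc(CHUNK);
>>
>>     MPI_Barrier(MPI_COMM_WORLD);
>>     double start = MPI_Wtime();
>>     for (int i = 0; i < ITER; ++i)
>>     {
>>         if (rank == 0)
>>             MPI_Send(buf, CHUNK, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
>>         else if (rank == 1)
>>             MPI_Recv(buf, CHUNK, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
>>                      MPI_STATUS_IGNORE);
>>     }
>>     double elapsed = MPI_Wtime() - start;
>>     if (rank == 0)
>>         printf("%.2f Gbps\n", 8.0 * ITER * CHUNK / elapsed / 1e9);
>>
>>     free(buf);
>>     MPI_Finalize();
>>     return 0;
>> }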
>>
>> In particular, I am running with:
>> mpirun --mca btl_tcp_links 4 -N 1 --bind-to none --hostfile hosts.txt
>> /path/to/my/application
>>
>> Trying a btl_tcp_links value of 2 or 3 also makes no difference.  Is
>> there another flag I need to be using or is something still broken?
>>
>> Thanks.
>> -Adam
>>
>> On Thu, Jul 13, 2017 at 12:05 PM Adam Sylvester <op8...@gmail.com> wrote:
>>
>>> Bummer - thanks for the info Brian.
>>>
>>> As an FYI, I do have a real-world use case for this faster connectivity
>>> (i.e. beyond just a benchmark).  While my application will happily gobble
>>> up and run on however many machines it's given, there's a resource manager
>>> that lives on top of everything and doles out machines to applications.
>>> So there will be cases where my application only gets two machines to run
>>> on, and I'd still like the big data transfers to happen as quickly as
>>> possible.  I agree that when there are many ranks all talking to each
>>> other, I should hopefully get closer to the full 20 Gbps.
>>>
>>> I appreciate that you have a number of other higher priorities, but
>>> wanted to make you aware that I do have a use case for it... look forward
>>> to using it when it's in place. :o)
>>>
>>> On Wed, Jul 12, 2017 at 2:18 PM, Barrett, Brian via users <
>>> users@lists.open-mpi.org> wrote:
>>>
>>>> Adam -
>>>>
>>>> The btl_tcp_links flag does not currently work (for various reasons) in
>>>> the 2.x and 3.x series.  It’s on my todo list to fix, but I’m not sure it
>>>> will get done before the 3.0.0 release.  Part of the reason that it hasn’t
>>>> been a priority is that most applications (outside of benchmarks) don’t
>>>> benefit from the 20 Gbps between rank pairs, as they are generally talking
>>>> to multiple peers at once (and therefore can drive the full 20 Gbps).  It’s
>>>> definitely on our roadmap, but I can’t promise a release just yet.
>>>>
>>>> Brian
>>>>
>>>> On Jul 12, 2017, at 11:44 AM, Adam Sylvester <op8...@gmail.com> wrote:
>>>>
>>>> I switched over to X1 instances in AWS, which have 20 Gbps
>>>> connectivity.  Using iperf3, I'm seeing 11.1 Gbps between them with just
>>>> one port.  iperf3 supports a -P option which opens multiple parallel
>>>> streams (and therefore ports)...  Setting this to somewhere in the range
>>>> of 5-20 streams (there's some variability from run to run), I can get
>>>> around 18 Gbps aggregate, which seems pretty good for a real-world speed.
>>>>
>>>> Using mpirun with the previously-suggested btl_tcp_sndbuf and
>>>> btl_tcp_rcvbuf settings, I'm getting around 10.7 Gbps.  So, pretty close to
>>>> iperf with just one port (makes sense there'd be some overhead with MPI).
>>>> My understanding of the btl_tcp_links flag that Gilles mentioned is that it
>>>> should be analogous to iperf's -P flag - it should connect with multiple
>>>> ports in the hopes of improving the aggregate bandwidth.
>>>>
>>>> If that's what this flag is supposed to do, it does not appear to be
>>>> working properly for me.  With lsof, I can see the expected number of ports
>>>> show up when I run iperf.  However, with MPI I only ever see three
>>>> connections between the two machines - sshd, orted, and my actual
>>>> application.  No matter what I set btl_tcp_links to, I don't see any
>>>> additional ports show up (or any change in performance).
>>>>
>>>> Am I misunderstanding what this flag does or is there a bug here?  If I
>>>> am misunderstanding the flag's intent, is there a different flag that would
>>>> allow Open MPI to use multiple ports similar to what iperf is doing?
>>>>
>>>> Thanks.
>>>> -Adam
>>>>
>>>> On Mon, Jul 10, 2017 at 9:31 PM, Adam Sylvester <op8...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks again Gilles.  Ahh, better yet - I wasn't familiar with the
>>>>> config file way to set these parameters... it'll be easy to bake this into
>>>>> my AMI so that I don't have to set them each time while waiting for the
>>>>> next Open MPI release.
>>>>>
>>>>> Mostly out of laziness, I try to stick to the formal releases rather
>>>>> than applying patches myself, but thanks for the link to the patch (the
>>>>> commit comments were useful for understanding why this improved
>>>>> performance).
>>>>>
>>>>> -Adam
>>>>>
>>>>> On Mon, Jul 10, 2017 at 12:04 AM, Gilles Gouaillardet <
>>>>> gil...@rist.or.jp> wrote:
>>>>>
>>>>>> Adam,
>>>>>>
>>>>>>
>>>>>> Thanks for letting us know your performance issue has been resolved.
>>>>>>
>>>>>>
>>>>>> Yes, https://www.open-mpi.org/faq/?category=tcp is the best place to
>>>>>> look for this kind of information.
>>>>>>
>>>>>> I will add a reference to these parameters. I will also ask the folks at
>>>>>> AWS if they have additional/other recommendations.
>>>>>>
>>>>>>
>>>>>> Note you have a few options before 2.1.2 (or 3.0.0) is released:
>>>>>>
>>>>>>
>>>>>> - update your system-wide config file
>>>>>>   (/.../etc/openmpi-mca-params.conf) or your user config file
>>>>>>   ($HOME/.openmpi/mca-params.conf) and add the following lines:
>>>>>>
>>>>>> btl_tcp_sndbuf = 0
>>>>>>
>>>>>> btl_tcp_rcvbuf = 0
>>>>>>
>>>>>>
>>>>>> - add the following environment variable to your environment
>>>>>>
>>>>>> export OMPI_MCA_btl_tcp_sndbuf=0
>>>>>>
>>>>>> export OMPI_MCA_btl_tcp_rcvbuf=0
>>>>>>
>>>>>>
>>>>>> - use Open MPI 2.0.3
>>>>>>
>>>>>>
>>>>>> - last but not least, you can manually download and apply the patch
>>>>>> available at
>>>>>>
>>>>>>
>>>>>> https://github.com/open-mpi/ompi/commit/b64fedf4f652cadc9bfc7c4693f9c1ef01dfb69f.patch
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 7/9/2017 11:04 PM, Adam Sylvester wrote:
>>>>>>
>>>>>>> Gilles,
>>>>>>>
>>>>>>> Thanks for the fast response!
>>>>>>>
>>>>>>> The --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 flags you
>>>>>>> recommended made a huge difference - this got me up to 5.7 Gbps!  I
>>>>>>> wasn't aware of these flags... with a little Googling, is
>>>>>>> https://www.open-mpi.org/faq/?category=tcp the best place to look
>>>>>>> for this kind of information and any other tweaks I may want to try
>>>>>>> (or if there's a better FAQ out there, please let me know)?
>>>>>>>
>>>>>>> There is only eth0 on my machines so nothing to tweak there (though
>>>>>>> good to know for the future).  I also didn't see any improvement by
>>>>>>> specifying more sockets per instance.  But your initial suggestion had
>>>>>>> a major impact.
>>>>>>>
>>>>>>> In general I try to stay relatively up to date with my Open MPI
>>>>>>> version; I'll be extra motivated to upgrade to 2.1.2 so that I don't
>>>>>>> have to remember to set these --mca flags on the command line. :o)
>>>>>>>
>>>>>>> -Adam
>>>>>>>
>>>>>>> On Sun, Jul 9, 2017 at 9:26 AM, Gilles Gouaillardet
>>>>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>>>>>
>>>>>>>     Adam,
>>>>>>>
>>>>>>>     First, you need to change the default send and receive socket
>>>>>>>     buffers:
>>>>>>>     mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 ...
>>>>>>>     /* note this will be the default from Open MPI 2.1.2 */
>>>>>>>
>>>>>>>     Hopefully, that will be enough to greatly improve the bandwidth
>>>>>>>     for large messages.
>>>>>>>
>>>>>>>
>>>>>>>     Generally speaking, I recommend you use the latest available
>>>>>>>     version (e.g. Open MPI 2.1.1).
>>>>>>>
>>>>>>>     How many interfaces can be used to communicate between the hosts?
>>>>>>>     If there is more than one (for example, a slow one and a fast one),
>>>>>>>     you should use only the fast one.
>>>>>>>     For example, if eth0 is the fast interface, that can be achieved
>>>>>>>     with:
>>>>>>>     mpirun --mca btl_tcp_if_include eth0 ...
>>>>>>>
>>>>>>>     Also, you might be able to achieve better results by using more
>>>>>>>     than one socket on the fast interface.
>>>>>>>     For example, to use 4 sockets per interface:
>>>>>>>     mpirun --mca btl_tcp_links 4 ...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>     Cheers,
>>>>>>>
>>>>>>>     Gilles
>>>>>>>
>>>>>>>     On Sun, Jul 9, 2017 at 10:10 PM, Adam Sylvester
>>>>>>>     <op8...@gmail.com> wrote:
>>>>>>>     > I am using Open MPI 2.1.0 on RHEL 7.  My application has one
>>>>>>>     > unavoidable pinch point where a large amount of data needs to be
>>>>>>>     > transferred (about 8 GB of data needs to be both sent to and
>>>>>>>     > received from all other ranks), and I'm seeing worse performance
>>>>>>>     > than I would expect; this step has a major impact on my overall
>>>>>>>     > runtime.  In the real application, I am using MPI_Alltoall() for
>>>>>>>     > this step, but for the purpose of a simple benchmark, I simplified
>>>>>>>     > it to a single MPI_Send() / MPI_Recv() of a 2 GB buffer between
>>>>>>>     > two ranks.
>>>>>>>     >
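>>>>>>>     > To make that pattern concrete, the real step looks roughly like
>>>>>>>     > the sketch below; the 64 MB per-rank block size is an illustrative
>>>>>>>     > assumption (not my actual sizes) and error handling is omitted:
>>>>>>>     >
>>>>>>>     > /* Rough sketch of the real step: every rank exchanges a
>>>>>>>     >  * fixed-size block with every other rank via MPI_Alltoall().
>>>>>>>     >  * The 64 MB per-pair block size is illustrative only. */
>>>>>>>     > #include <mpi.h>
>>>>>>>     > #include <stdlib.h>
>>>>>>>     >
>>>>>>>     > int main(int argc, char** argv)
>>>>>>>     > {
>>>>>>>     >     MPI_Init(&argc, &argv);
>>>>>>>     >     int nranks;
>>>>>>>     >     MPI_Comm_size(MPI_COMM_WORLD, &nranks);
>>>>>>>     >
>>>>>>>     >     const int BLOCK = 64 * 1024 * 1024;  /* bytes per peer */
>>>>>>>     >     char* sendbuf = malloc((size_t)BLOCK * nranks);
>>>>>>>     >     char* recvbuf = malloc((size_t)BLOCK * nranks);
>>>>>>>     >
>>>>>>>     >     /* Each rank sends BLOCK bytes to, and receives BLOCK bytes
>>>>>>>     >      * from, every rank (including itself). */
>>>>>>>     >     MPI_Alltoall(sendbuf, BLOCK, MPI_CHAR,
>>>>>>>     >                  recvbuf, BLOCK, MPI_CHAR, MPI_COMM_WORLD);
>>>>>>>     >
>>>>>>>     >     free(sendbuf);
>>>>>>>     >     free(recvbuf);
>>>>>>>     >     MPI_Finalize();
>>>>>>>     >     return 0;
>>>>>>>     > }
>>>>>>>     >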
>>>>>>>     > I'm running this in AWS with instances that have 10 Gbps
>>>>>>>     > connectivity in the same availability zone (according to
>>>>>>>     > tracepath, there are no hops between them) and MTU set to 8801
>>>>>>>     > bytes.  Doing a non-MPI benchmark of sending data directly over
>>>>>>>     > TCP between these two instances, I reliably get around 4 Gbps.
>>>>>>>     > Between these same two instances with MPI_Send() / MPI_Recv(), I
>>>>>>>     > reliably get around 2.4 Gbps.  This seems like a major performance
>>>>>>>     > degradation for a single MPI operation.
>>>>>>>     >
>>>>>>>     > I compiled Open MPI 2.1.0 with gcc 4.9.1 and default settings.
>>>>>>>     > I'm connecting between instances via ssh and using, I assume, TCP
>>>>>>>     > for the actual network transfer (I'm not setting any special
>>>>>>>     > command-line or programmatic settings).  The actual command I'm
>>>>>>>     > running is:
>>>>>>>     > mpirun -N 1 --bind-to none --hostfile hosts.txt my_app
>>>>>>>     >
>>>>>>>     > Any advice on other things to test or compilation and/or runtime
>>>>>>>     > flags to set would be much appreciated!
>>>>>>>     > -Adam
>>>>>>>     >