Adam,

FWIW, an EFA adapter is available on this AWS instance, and Open MPI can use it via libfabric (aka OFI). Here is a link to Brian's video: https://insidehpc.com/2018/04/amazon-libfabric-case-study-flexible-hpc-infrastructure/
Cheers,
Gilles

On Sunday, March 24, 2019, Adam Sylvester <op8...@gmail.com> wrote:

> Digging up this old thread as it appears there's still an issue with btl_tcp_links.
>
> I'm now using c5.18xlarge instances in AWS, which have 25 Gbps connectivity; using iperf3 with the -P option to drive multiple ports, I achieve over 24 Gbps when communicating between two instances.
>
> When I originally asked this question, Gilles suggested I could do the equivalent with Open MPI via the --mca btl_tcp_links flag, but then Brian reported that this flag doesn't work in the 2.x and 3.x series. I just updated to Open MPI 4.0.0, hoping that this was fixed; according to the FAQ at https://www.open-mpi.org/faq/?category=tcp#tcp-multi-links, it should be working. However, I see no difference in performance; on a simple benchmark which passes 10 GB between two ranks (one rank per host) via MPI_Send() and MPI_Recv(), I see around 9 Gb/s with or without this flag.
>
> In particular, I am running with:
> mpirun --mca btl_tcp_links 4 -N 1 --bind-to none --hostfile hosts.txt /path/to/my/application
>
> Trying a btl_tcp_links value of 2 or 3 also makes no difference. Is there another flag I need to be using, or is something still broken?
>
> Thanks.
> -Adam
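A minimal sketch of the kind of two-rank bandwidth test described above; the sizes, chunking, and file name are illustrative assumptions, not the actual benchmark code from the thread:

    /* bw_sketch.c - time a large point-to-point transfer between two ranks.
     * Sizes are illustrative; run with one rank per host, e.g.
     *   mpirun -N 1 --bind-to none --hostfile hosts.txt ./bw_sketch
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);
        if (nranks < 2) {
            if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        const size_t total = (size_t)10 << 30;   /* ~10 GiB total, sent in chunks */
        const int chunk = 1 << 30;               /* 1 GiB per call (count must fit in an int) */
        char *buf = malloc((size_t)chunk);       /* contents don't matter for a bandwidth test */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (size_t done = 0; done < total; done += (size_t)chunk) {
            if (rank == 0)
                MPI_Send(buf, chunk, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(buf, chunk, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        double t1 = MPI_Wtime();

        if (rank == 1)
            printf("%.2f Gbit/s\n", 8.0 * total / (t1 - t0) / 1e9);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Comparing the reported rate with and without --mca btl_tcp_links (and against iperf3 -P on the same pair of hosts) is essentially the experiment discussed in this thread.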
> On Thu, Jul 13, 2017 at 12:05 PM Adam Sylvester <op8...@gmail.com> wrote:

>> Bummer - thanks for the info Brian.
>>
>> As an FYI, I do have a real-world use case for this faster connectivity (i.e., beyond just a benchmark). While my application will happily gobble up and run on however many machines it's given, there's a resource manager that lives on top of everything that doles out machines to applications. So there will be cases where my application will only get two machines to run on, and so I'd still like the big data transfers to happen as quickly as possible. I agree that when there are many ranks all talking to each other, I should hopefully get closer to the full 20 Gbps.
>>
>> I appreciate that you have a number of other higher priorities, but wanted to make you aware that I do have a use case for it... look forward to using it when it's in place. :o)

>> On Wed, Jul 12, 2017 at 2:18 PM, Barrett, Brian via users <users@lists.open-mpi.org> wrote:

>>> Adam -
>>>
>>> The btl_tcp_links flag does not currently work (for various reasons) in the 2.x and 3.x series. It's on my todo list to fix, but I'm not sure it will get done before the 3.0.0 release. Part of the reason that it hasn't been a priority is that most applications (outside of benchmarks) don't benefit from the 20 Gbps between rank pairs, as they are generally talking to multiple peers at once (and therefore can drive the full 20 Gbps). It's definitely on our roadmap, but I can't promise a release just yet.
>>>
>>> Brian

>>> On Jul 12, 2017, at 11:44 AM, Adam Sylvester <op8...@gmail.com> wrote:

>>> I switched over to X1 instances in AWS, which have 20 Gbps connectivity. Using iperf3, I'm seeing 11.1 Gbps between them with just one port. iperf3 supports a -P option which will connect using multiple ports... Setting this to use in the range of 5-20 ports (there's some variability from run to run), I can get in the range of 18 Gbps aggregate, which for a real-world speed seems pretty good.
>>>
>>> Using mpirun with the previously suggested btl_tcp_sndbuf and btl_tcp_rcvbuf settings, I'm getting around 10.7 Gbps. So, pretty close to iperf with just one port (it makes sense there'd be some overhead with MPI). My understanding of the btl_tcp_links flag that Gilles mentioned is that it should be analogous to iperf's -P flag - it should connect with multiple ports in the hope of improving the aggregate bandwidth.
>>>
>>> If that's what this flag is supposed to do, it does not appear to be working properly for me. With lsof, I can see the expected number of ports show up when I run iperf. However, with MPI I only ever see three connections between the two machines - sshd, orted, and my actual application. No matter what I set btl_tcp_links to, I don't see any additional ports show up (or any change in performance).
>>>
>>> Am I misunderstanding what this flag does, or is there a bug here? If I am misunderstanding the flag's intent, is there a different flag that would allow Open MPI to use multiple ports similar to what iperf is doing?
>>>
>>> Thanks.
>>> -Adam
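One way to check, from inside the application itself, which btl_tcp values the library actually picked up at runtime is the MPI_T control-variable interface. A minimal sketch follows; the assumption here is that Open MPI exposes its MCA parameters (such as btl_tcp_links) as MPI_T control variables under the same names:

    /* cvar_dump.c - print integer-valued btl_tcp control variables via MPI_T. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int provided, ncvar;
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_Init(&argc, &argv);

        MPI_T_cvar_get_num(&ncvar);
        for (int i = 0; i < ncvar; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope, count, value;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;
            MPI_T_cvar_handle handle;

            if (MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                    &enumtype, desc, &desc_len, &bind,
                                    &scope) != MPI_SUCCESS)
                continue;
            if (strncmp(name, "btl_tcp_", 8) != 0)
                continue;
            /* only read simple, unbound integer variables */
            if ((dtype != MPI_INT && dtype != MPI_UNSIGNED) ||
                bind != MPI_T_BIND_NO_OBJECT)
                continue;

            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            if (count == 1 && MPI_T_cvar_read(handle, &value) == MPI_SUCCESS)
                printf("%s = %d\n", name, value);
            MPI_T_cvar_handle_free(&handle);
        }

        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }

Run under the same mpirun command line as the benchmark, this shows whether a setting such as --mca btl_tcp_links 4 actually reached the TCP BTL, independent of what lsof shows.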
>>> On Mon, Jul 10, 2017 at 9:31 PM, Adam Sylvester <op8...@gmail.com> wrote:

>>>> Thanks again, Gilles. Ahh, better yet - I wasn't familiar with the config file way to set these parameters... it'll be easy to bake this into my AMI so that I don't have to set them each time while waiting for the next Open MPI release.
>>>>
>>>> Out of mostly laziness I try to keep to the formal releases rather than applying patches myself, but thanks for the link to it (the commit comments were useful to understand why this improved performance).
>>>>
>>>> -Adam

>>>> On Mon, Jul 10, 2017 at 12:04 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

>>>>> Adam,
>>>>>
>>>>> Thanks for letting us know your performance issue has been resolved.
>>>>>
>>>>> Yes, https://www.open-mpi.org/faq/?category=tcp is the best place to look for this kind of information.
>>>>>
>>>>> I will add a reference to these parameters. I will also ask folks at AWS if they have additional/other recommendations.
>>>>>
>>>>> Note you have a few options before 2.1.2 (or 3.0.0) is released:
>>>>>
>>>>> - update your system-wide config file (/.../etc/openmpi-mca-params.conf) or user config file ($HOME/.openmpi/mca-params.conf) and add the following lines:
>>>>>   btl_tcp_sndbuf = 0
>>>>>   btl_tcp_rcvbuf = 0
>>>>>
>>>>> - add the following environment variables to your environment:
>>>>>   export OMPI_MCA_btl_tcp_sndbuf=0
>>>>>   export OMPI_MCA_btl_tcp_rcvbuf=0
>>>>>
>>>>> - use Open MPI 2.0.3
>>>>>
>>>>> - last but not least, you can manually download and apply the patch available at https://github.com/open-mpi/ompi/commit/b64fedf4f652cadc9bfc7c4693f9c1ef01dfb69f.patch
>>>>>
>>>>> Cheers,
>>>>> Gilles

>>>>> On 7/9/2017 11:04 PM, Adam Sylvester wrote:

>>>>>> Gilles,
>>>>>>
>>>>>> Thanks for the fast response!
>>>>>>
>>>>>> The --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 flags you recommended made a huge difference - this got me up to 5.7 Gb/s! I wasn't aware of these flags... with a little Googling, is https://www.open-mpi.org/faq/?category=tcp the best place to look for this kind of information and any other tweaks I may want to try (or if there's a better FAQ out there, please let me know)?
>>>>>>
>>>>>> There is only eth0 on my machines, so nothing to tweak there (though good to know for the future). I also didn't see any improvement by specifying more sockets per instance. But your initial suggestion had a major impact.
>>>>>>
>>>>>> In general I try to stay relatively up to date with my Open MPI version; I'll be extra motivated to upgrade to 2.1.2 so that I don't have to remember to set these --mca flags on the command line. :o)
>>>>>>
>>>>>> -Adam

>>>>>> On Sun, Jul 9, 2017 at 9:26 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

>>>>>> Adam,
>>>>>>
>>>>>> First, you need to change the default send and receive socket buffers:
>>>>>> mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 ...
>>>>>> /* note this will be the default from Open MPI 2.1.2 */
>>>>>>
>>>>>> Hopefully, that will be enough to greatly improve the bandwidth for large messages.
>>>>>>
>>>>>> Generally speaking, I recommend you use the latest available version (e.g. Open MPI 2.1.1).
>>>>>>
>>>>>> How many interfaces can be used to communicate between hosts? If there is more than one (for example a slow and a fast one), you'd rather only use the fast one. For example, if eth0 is the fast interface, that can be achieved with
>>>>>> mpirun --mca btl_tcp_if_include eth0 ...
>>>>>>
>>>>>> Also, you might be able to achieve better results by using more than one socket on the fast interface. For example, if you want to use 4 sockets per interface:
>>>>>> mpirun --mca btl_tcp_links 4 ...
>>>>>>
>>>>>> Cheers,
>>>>>> Gilles

>>>>>> On Sun, Jul 9, 2017 at 10:10 PM, Adam Sylvester <op8...@gmail.com> wrote:

>>>>>> > I am using Open MPI 2.1.0 on RHEL 7. My application has one unavoidable pinch point where a large amount of data needs to be transferred (about 8 GB of data needs to be both sent to and received from all other ranks), and I'm seeing worse performance than I would expect; this step has a major impact on my overall runtime. In the real application, I am using MPI_Alltoall() for this step, but for the purpose of a simple benchmark, I simplified it to a single MPI_Send() / MPI_Recv() of a 2 GB buffer between two ranks.
>>>>>> >
>>>>>> > I'm running this in AWS with instances that have 10 Gbps connectivity in the same availability zone (according to tracepath, there are no hops between them) and the MTU set to 8801 bytes. Doing a non-MPI benchmark of sending data directly over TCP between these two instances, I reliably get around 4 Gbps. Between these same two instances with MPI_Send() / MPI_Recv(), I reliably get around 2.4 Gbps. This seems like a major performance degradation for a single MPI operation.
>>>>>> >
>>>>>> > I compiled Open MPI 2.1.0 with gcc 4.9.1 and default settings. I'm connecting between instances via ssh and using, I assume, TCP for the actual network transfer (I'm not setting any special command-line or programmatic settings). The actual command I'm running is:
>>>>>> > mpirun -N 1 --bind-to none --hostfile hosts.txt my_app
>>>>>> >
>>>>>> > Any advice on other things to test or compilation and/or runtime flags to set would be much appreciated!
>>>>>> > -Adam
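For reference, a minimal sketch of the all-to-all exchange pattern the original post describes; the per-peer block size here is illustrative, not the actual ~8 GB working set, and this is not the application's code:

    /* alltoall_sketch.c - each rank exchanges one block with every other rank. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Illustrative block size: 64 MiB sent to (and received from) each peer. */
        const int block = 64 * 1024 * 1024;
        char *sendbuf = malloc((size_t)block * nranks);
        char *recvbuf = malloc((size_t)block * nranks);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Alltoall(sendbuf, block, MPI_BYTE, recvbuf, block, MPI_BYTE, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("alltoall of %d MiB per peer took %.3f s\n", block >> 20, t1 - t0);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

The single MPI_Send() / MPI_Recv() test discussed earlier in the thread is the two-rank simplification of this pattern.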