Thanks Gilles. Unfortunately, my understanding is that EFA is only available on C5n instances, not 'regular' C5 instances (https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-elastic-fabric-adapter/). I will be using C5n instances in the future but not at this time, so I'm hoping to get btl_tcp_links or an equivalent to work...
Adam

On Sat, Mar 23, 2019, 8:59 PM Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> Adam,
>
> FWIW, the EFA adapter is available on this AWS instance, and Open MPI can use it via libfabric (aka OFI).
> Here is a link to Brian's video:
> https://insidehpc.com/2018/04/amazon-libfabric-case-study-flexible-hpc-infrastructure/
>
> Cheers,
>
> Gilles
>
> On Sunday, March 24, 2019, Adam Sylvester <op8...@gmail.com> wrote:
>
>> Digging up this old thread as it appears there's still an issue with btl_tcp_links.
>>
>> I'm now using c5.18xlarge instances in AWS which have 25 Gbps connectivity; using iperf3 with the -P option to drive multiple ports, I achieve over 24 Gbps when communicating between two instances.
>>
>> When I originally asked this question, Gilles suggested I could do the equivalent with Open MPI via the --mca btl_tcp_links flag, but then Brian reported that this flag doesn't work in the 2.x and 3.x series. I just updated to Open MPI 4.0.0, hoping that this was fixed; according to the FAQ at https://www.open-mpi.org/faq/?category=tcp#tcp-multi-links, it should be working. However, I see no difference in performance; on a simple benchmark which passes 10 GB between two ranks (one rank per host) via MPI_Send() and MPI_Recv(), I see around 9 Gb/s with or without this flag.
>>
>> In particular, I am running with:
>>
>> mpirun --mca btl_tcp_links 4 -N 1 --bind-to none --hostfile hosts.txt /path/to/my/application
>>
>> Trying a btl_tcp_links value of 2 or 3 also makes no difference. Is there another flag I need to be using, or is something still broken?
>>
>> Thanks.
>> -Adam
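
The benchmark itself wasn't posted in the thread; purely for reference, a minimal sketch of this kind of two-rank MPI_Send()/MPI_Recv() bandwidth test might look like the following. The 64 MB chunk size, the reuse of a single chunk buffer, and the file/binary names are assumptions on my part; chunking just keeps each count below INT_MAX, since MPI_Send takes an int count.

    /* Hypothetical two-rank bandwidth test, loosely based on the benchmark
     * described in the message above (the original source was not posted).
     * Rank 0 sends total_bytes to rank 1 in fixed-size chunks.
     * Build: mpicc -O2 bw_test.c -o bw_test
     * Run:   mpirun --mca btl_tcp_links 4 -N 1 --bind-to none --hostfile hosts.txt ./bw_test
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        const long total_bytes = 10L * 1024 * 1024 * 1024;  /* 10 GB, as in the post */
        const int  chunk_bytes = 64 * 1024 * 1024;          /* assumed 64 MB chunks  */
        int rank, nranks;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);
        if (nranks < 2) {
            if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        char *buf = malloc(chunk_bytes);
        memset(buf, rank, chunk_bytes);  /* touch pages before timing */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (long done = 0; done < total_bytes; done += chunk_bytes) {
            if (rank == 0)
                MPI_Send(buf, chunk_bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(buf, chunk_bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
        MPI_Barrier(MPI_COMM_WORLD);
        double elapsed = MPI_Wtime() - t0;

        if (rank == 0)
            printf("%.2f Gb/s\n", 8.0 * total_bytes / elapsed / 1e9);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Running it with and without --mca btl_tcp_links N and watching lsof -i on either host (as described further down in the quoted thread) is one quick way to see how many TCP connections the BTL actually opened and whether the flag took effect.
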
>> On Thu, Jul 13, 2017 at 12:05 PM Adam Sylvester <op8...@gmail.com> wrote:
>>
>>> Bummer - thanks for the info Brian.
>>>
>>> As an FYI, I do have a real-world use case for this faster connectivity (i.e. beyond just a benchmark). While my application will happily gobble up and run on however many machines it's given, there's a resource manager that lives on top of everything that doles out machines to applications. So there will be cases where my application will only get two machines to run on, and I'd still like the big data transfers to happen as quickly as possible. I agree that when there are many ranks all talking to each other, I should hopefully get closer to the full 20 Gbps.
>>>
>>> I appreciate that you have a number of other higher priorities, but I wanted to make you aware that I do have a use case for it... I look forward to using it when it's in place. :o)
>>>
>>> On Wed, Jul 12, 2017 at 2:18 PM, Barrett, Brian via users <users@lists.open-mpi.org> wrote:
>>>
>>>> Adam -
>>>>
>>>> The btl_tcp_links flag does not currently work (for various reasons) in the 2.x and 3.x series. It’s on my todo list to fix, but I’m not sure it will get done before the 3.0.0 release. Part of the reason that it hasn’t been a priority is that most applications (outside of benchmarks) don’t benefit from the 20 Gbps between rank pairs, as they are generally talking to multiple peers at once (and therefore can drive the full 20 Gbps). It’s definitely on our roadmap, but I can’t promise a release just yet.
>>>>
>>>> Brian
>>>>
>>>> On Jul 12, 2017, at 11:44 AM, Adam Sylvester <op8...@gmail.com> wrote:
>>>>
>>>> I switched over to X1 instances in AWS which have 20 Gbps connectivity. Using iperf3, I'm seeing 11.1 Gbps between them with just one port. iperf3 supports a -P option which will connect using multiple ports... Setting this to use in the range of 5-20 ports (there's some variability from run to run), I can get in the range of 18 Gbps aggregate, which seems pretty good for real-world speed.
>>>>
>>>> Using mpirun with the previously-suggested btl_tcp_sndbuf and btl_tcp_rcvbuf settings, I'm getting around 10.7 Gbps. So, pretty close to iperf with just one port (it makes sense that there'd be some overhead with MPI). My understanding of the btl_tcp_links flag that Gilles mentioned is that it should be analogous to iperf's -P flag - it should connect with multiple ports in the hopes of improving the aggregate bandwidth.
>>>>
>>>> If that's what this flag is supposed to do, it does not appear to be working properly for me. With lsof, I can see the expected number of ports show up when I run iperf. However, with MPI I only ever see three connections between the two machines - sshd, orted, and my actual application. No matter what I set btl_tcp_links to, I don't see any additional ports show up (or any change in performance).
>>>>
>>>> Am I misunderstanding what this flag does, or is there a bug here? If I am misunderstanding the flag's intent, is there a different flag that would allow Open MPI to use multiple ports similar to what iperf is doing?
>>>>
>>>> Thanks.
>>>> -Adam
>>>>
>>>> On Mon, Jul 10, 2017 at 9:31 PM, Adam Sylvester <op8...@gmail.com> wrote:
>>>>
>>>>> Thanks again Gilles. Ahh, better yet - I wasn't familiar with the config file way to set these parameters... it'll be easy to bake this into my AMI so that I don't have to set them each time while waiting for the next Open MPI release.
>>>>>
>>>>> Out of mostly laziness I try to keep to the formal releases rather than applying patches myself, but thanks for the link to it (the commit comments were useful to understand why this improved performance).
>>>>>
>>>>> -Adam
>>>>>
>>>>> On Mon, Jul 10, 2017 at 12:04 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>>
>>>>>> Adam,
>>>>>>
>>>>>> Thanks for letting us know your performance issue has been resolved.
>>>>>>
>>>>>> Yes, https://www.open-mpi.org/faq/?category=tcp is the best place to look for this kind of information. I will add a reference to these parameters. I will also ask folks at AWS if they have additional/other recommendations.
>>>>>>
>>>>>> Note you have a few options before 2.1.2 (or 3.0.0) is released:
>>>>>>
>>>>>> - update your system-wide config file (/.../etc/openmpi-mca-params.conf) or user config file ($HOME/.openmpi/mca-params.conf) and add the following lines:
>>>>>>
>>>>>> btl_tcp_sndbuf = 0
>>>>>> btl_tcp_rcvbuf = 0
>>>>>>
>>>>>> - add the following environment variables to your environment:
>>>>>>
>>>>>> export OMPI_MCA_btl_tcp_sndbuf=0
>>>>>> export OMPI_MCA_btl_tcp_rcvbuf=0
>>>>>>
>>>>>> - use Open MPI 2.0.3
>>>>>>
>>>>>> - last but not least, you can manually download and apply the patch available at
>>>>>> https://github.com/open-mpi/ompi/commit/b64fedf4f652cadc9bfc7c4693f9c1ef01dfb69f.patch
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 7/9/2017 11:04 PM, Adam Sylvester wrote:
>>>>>>
>>>>>>> Gilles,
>>>>>>>
>>>>>>> Thanks for the fast response!
>>>>>>>
>>>>>>> The --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 flags you recommended made a huge difference - this got me up to 5.7 Gb/s! I wasn't aware of these flags... with a little Googling, is https://www.open-mpi.org/faq/?category=tcp the best place to look for this kind of information and any other tweaks I may want to try (or if there's a better FAQ out there, please let me know)?
>>>>>>>
>>>>>>> There is only eth0 on my machines so there is nothing to tweak there (though good to know for the future). I also didn't see any improvement by specifying more sockets per instance. But your initial suggestion had a major impact.
>>>>>>>
>>>>>>> In general I try to stay relatively up to date with my Open MPI version; I'll be extra motivated to upgrade to 2.1.2 so that I don't have to remember to set these --mca flags on the command line. :o)
>>>>>>>
>>>>>>> -Adam
>>>>>>>
>>>>>>> On Sun, Jul 9, 2017 at 9:26 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>>>>>
>>>>>>> Adam,
>>>>>>>
>>>>>>> At first, you need to change the default send and receive socket buffers:
>>>>>>> mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 ...
>>>>>>> /* note this will be the default from Open MPI 2.1.2 */
>>>>>>>
>>>>>>> Hopefully, that will be enough to greatly improve the bandwidth for large messages.
>>>>>>>
>>>>>>> Generally speaking, I recommend you use the latest available version (e.g. Open MPI 2.1.1).
>>>>>>>
>>>>>>> How many interfaces can be used to communicate between hosts? If there is more than one (for example a slow and a fast one), you'd rather only use the fast one. For example, if eth0 is the fast interface, that can be achieved with
>>>>>>> mpirun --mca btl_tcp_if_include eth0 ...
>>>>>>>
>>>>>>> Also, you might be able to achieve better results by using more than one socket on the fast interface. For example, if you want to use 4 sockets per interface:
>>>>>>> mpirun --mca btl_tcp_links 4 ...
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On Sun, Jul 9, 2017 at 10:10 PM, Adam Sylvester <op8...@gmail.com> wrote:
>>>>>>>
>>>>>>> > I am using Open MPI 2.1.0 on RHEL 7. My application has one unavoidable pinch point where a large amount of data needs to be transferred (about 8 GB of data needs to be both sent to and received from all other ranks), and I'm seeing worse performance than I would expect; this step has a major impact on my overall runtime. In the real application, I am using MPI_Alltoall() for this step, but for the purpose of a simple benchmark, I simplified it to a single MPI_Send() / MPI_Recv() of a 2 GB buffer between two ranks.
>>>>>>> >
>>>>>>> > I'm running this in AWS with instances that have 10 Gbps connectivity in the same availability zone (according to tracepath, there are no hops between them) and MTU set to 8801 bytes. Doing a non-MPI benchmark of sending data directly over TCP between these two instances, I reliably get around 4 Gbps.
>>>>>>> > Between these same two instances with MPI_Send() / MPI_Recv(), I reliably get around 2.4 Gbps. This seems like a major performance degradation for a single MPI operation.
>>>>>>> >
>>>>>>> > I compiled Open MPI 2.1.0 with gcc 4.9.1 and default settings. I'm connecting between instances via ssh and, I assume, using TCP for the actual network transfer (I'm not setting any special command-line or programmatic settings). The actual command I'm running is:
>>>>>>> > mpirun -N 1 --bind-to none --hostfile hosts.txt my_app
>>>>>>> >
>>>>>>> > Any advice on other things to test or compilation and/or runtime flags to set would be much appreciated!
>>>>>>> > -Adam
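
The application code behind the MPI_Alltoall() pinch point described above wasn't shared in the thread; a minimal sketch of that kind of exchange might look like the following. The 64 MB per-peer block size and the file name are assumptions (the post only gives the ~8 GB aggregate figure), chosen so the per-peer count stays well below the INT_MAX limit of the int count argument.

    /* Hypothetical sketch of the all-to-all exchange described above
     * (the real application code was not shared in the thread).
     * Each rank contributes bytes_per_peer to every other rank.
     * Build: mpicc -O2 alltoall_test.c -o alltoall_test
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Assumed per-peer block size; must stay below INT_MAX since
         * MPI_Alltoall takes an int count per peer. */
        const size_t bytes_per_peer = 64UL * 1024 * 1024;
        char *sendbuf = malloc(bytes_per_peer * nranks);
        char *recvbuf = malloc(bytes_per_peer * nranks);
        memset(sendbuf, rank, bytes_per_peer * nranks);  /* touch pages */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Alltoall(sendbuf, (int)bytes_per_peer, MPI_BYTE,
                     recvbuf, (int)bytes_per_peer, MPI_BYTE, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("MPI_Alltoall of %zu bytes per peer across %d ranks: %.2f s\n",
                   bytes_per_peer, nranks, t1 - t0);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

The mpirun options discussed in the thread (btl_tcp_sndbuf/btl_tcp_rcvbuf, btl_tcp_if_include, btl_tcp_links) apply unchanged whether the transfer is a point-to-point call or a collective like this.
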