Digging up this old thread as it appears there's still an issue with btl_tcp_links.
I'm now using c5.18xlarge instances in AWS, which have 25 Gbps connectivity; using iperf3 with the -P option to drive multiple ports, I achieve over 24 Gbps when communicating between two instances. When I originally asked this question, Gilles suggested I could do the equivalent with Open MPI via the --mca btl_tcp_links flag, but then Brian reported that this flag doesn't work in the 2.x and 3.x series.

I just updated to Open MPI 4.0.0 hoping that this was fixed; according to the FAQ at https://www.open-mpi.org/faq/?category=tcp#tcp-multi-links, it should be working. However, I see no difference in performance: on a simple benchmark which passes 10 GB between two ranks (one rank per host) via MPI_Send() and MPI_Recv(), I see around 9 Gb/s with or without this flag. In particular, I am running with:

mpirun --mca btl_tcp_links 4 -N 1 --bind-to none --hostfile hosts.txt /path/to/my/application

Trying a btl_tcp_links value of 2 or 3 also makes no difference. Is there another flag I need to be using, or is something still broken?

Thanks.
-Adam
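For reference, here is roughly the shape of the benchmark I'm timing - a minimal sketch rather than my exact code (the 1 GB chunking, the loop count, and names like bw are just illustrative). It sends 10 GB from rank 0 to rank 1 via MPI_Send()/MPI_Recv() and reports the achieved rate; I build it with mpicc and launch it with the same mpirun command as above.

/*
 * Minimal MPI_Send()/MPI_Recv() bandwidth sketch (illustrative only).
 * Run with one rank per host, e.g.:
 *   mpirun --mca btl_tcp_links 4 -N 1 --bind-to none --hostfile hosts.txt ./bw
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 10 GB total, sent as 1 GB chunks so each count fits in an int. */
    const size_t CHUNK = 1000UL * 1000UL * 1000UL;
    const int NUM_CHUNKS = 10;

    char* buf = malloc(CHUNK);
    if (!buf) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < NUM_CHUNKS; ++i) {
        if (rank == 0) {
            MPI_Send(buf, (int)CHUNK, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, (int)CHUNK, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 1) {
        /* Report aggregate bandwidth in gigabits per second. */
        double gbits = NUM_CHUNKS * CHUNK * 8.0 / 1e9;
        printf("%.2f Gb in %.2f s => %.2f Gb/s\n", gbits, elapsed, gbits / elapsed);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}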
On Thu, Jul 13, 2017 at 12:05 PM Adam Sylvester <op8...@gmail.com> wrote:

> Bummer - thanks for the info Brian.
>
> As an FYI, I do have a real world use case for this faster connectivity (i.e. beyond just a benchmark). While my application will happily gobble up and run on however many machines it's given, there's a resource manager that lives on top of everything that doles out machines to applications. So there will be cases where my application will only get two machines to run, and I'd still like the big data transfers to happen as quickly as possible. I agree that when there are many ranks all talking to each other, I should hopefully get closer to the full 20 Gbps.
>
> I appreciate that you have a number of other higher priorities, but wanted to make you aware that I do have a use case for it... look forward to using it when it's in place. :o)
>
> On Wed, Jul 12, 2017 at 2:18 PM, Barrett, Brian via users <users@lists.open-mpi.org> wrote:
>
>> Adam -
>>
>> The btl_tcp_links flag does not currently work (for various reasons) in the 2.x and 3.x series. It's on my todo list to fix, but I'm not sure it will get done before the 3.0.0 release. Part of the reason that it hasn't been a priority is that most applications (outside of benchmarks) don't benefit from the 20 Gbps between rank pairs, as they are generally talking to multiple peers at once (and therefore can drive the full 20 Gbps). It's definitely on our roadmap, but can't promise a release just yet.
>>
>> Brian
>>
>> On Jul 12, 2017, at 11:44 AM, Adam Sylvester <op8...@gmail.com> wrote:
>>
>> I switched over to X1 instances in AWS which have 20 Gbps connectivity. Using iperf3, I'm seeing 11.1 Gbps between them with just one port. iperf3 supports a -P option which will connect using multiple ports... Setting this to use in the range of 5-20 ports (there's some variability from run to run), I can get in the range of 18 Gbps aggregate, which for a real world speed seems pretty good.
>>
>> Using mpirun with the previously-suggested btl_tcp_sndbuf and btl_tcp_rcvbuf settings, I'm getting around 10.7 Gbps. So, pretty close to iperf with just one port (makes sense there'd be some overhead with MPI). My understanding of the btl_tcp_links flag that Gilles mentioned is that it should be analogous to iperf's -P flag - it should connect with multiple ports in the hopes of improving the aggregate bandwidth.
>>
>> If that's what this flag is supposed to do, it does not appear to be working properly for me. With lsof, I can see the expected number of ports show up when I run iperf. However, with MPI I only ever see three connections between the two machines - sshd, orted, and my actual application. No matter what I set btl_tcp_links to, I don't see any additional ports show up (or any change in performance).
>>
>> Am I misunderstanding what this flag does, or is there a bug here? If I am misunderstanding the flag's intent, is there a different flag that would allow Open MPI to use multiple ports similar to what iperf is doing?
>>
>> Thanks.
>> -Adam
>>
>> On Mon, Jul 10, 2017 at 9:31 PM, Adam Sylvester <op8...@gmail.com> wrote:
>>
>>> Thanks again Gilles. Ahh, better yet - I wasn't familiar with the config file way to set these parameters... it'll be easy to bake this into my AMI so that I don't have to set them each time while waiting for the next Open MPI release.
>>>
>>> Out of mostly laziness I try to keep to the formal releases rather than applying patches myself, but thanks for the link to it (the commit comments were useful to understand why this improved performance).
>>>
>>> -Adam
>>>
>>> On Mon, Jul 10, 2017 at 12:04 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>>> Adam,
>>>>
>>>> Thanks for letting us know your performance issue has been resolved.
>>>>
>>>> Yes, https://www.open-mpi.org/faq/?category=tcp is the best place to look for this kind of information.
>>>>
>>>> I will add a reference to these parameters. I will also ask folks at AWS if they have additional/other recommendations.
>>>>
>>>> Note you have a few options before 2.1.2 (or 3.0.0) is released:
>>>>
>>>> - update your system wide config file (/.../etc/openmpi-mca-params.conf) or user config file ($HOME/.openmpi/mca-params.conf) and add the following lines
>>>>   btl_tcp_sndbuf = 0
>>>>   btl_tcp_rcvbuf = 0
>>>>
>>>> - add the following environment variables to your environment
>>>>   export OMPI_MCA_btl_tcp_sndbuf=0
>>>>   export OMPI_MCA_btl_tcp_rcvbuf=0
>>>>
>>>> - use Open MPI 2.0.3
>>>>
>>>> - last but not least, you can manually download and apply the patch available at
>>>>   https://github.com/open-mpi/ompi/commit/b64fedf4f652cadc9bfc7c4693f9c1ef01dfb69f.patch
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 7/9/2017 11:04 PM, Adam Sylvester wrote:
>>>>
>>>>> Gilles,
>>>>>
>>>>> Thanks for the fast response!
>>>>>
>>>>> The --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 flags you recommended made a huge difference - this got me up to 5.7 Gb/s! I wasn't aware of these flags... with a little Googling, is https://www.open-mpi.org/faq/?category=tcp the best place to look for this kind of information and any other tweaks I may want to try (or if there's a better FAQ out there, please let me know)?
>>>>>
>>>>> There is only eth0 on my machines so nothing to tweak there (though good to know for the future). I also didn't see any improvement by specifying more sockets per instance. But, your initial suggestion had a major impact.
>>>>>
>>>>> In general I try to stay relatively up to date with my Open MPI version; I'll be extra motivated to upgrade to 2.1.2 so that I don't have to remember to set these --mca flags on the command line. :o)
>>>>> -Adam
>>>>>
>>>>> On Sun, Jul 9, 2017 at 9:26 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>>>
>>>>> Adam,
>>>>>
>>>>> At first, you need to change the default send and receive socket buffers:
>>>>>   mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 ...
>>>>> /* note this will be the default from Open MPI 2.1.2 */
>>>>>
>>>>> Hopefully, that will be enough to greatly improve the bandwidth for large messages.
>>>>>
>>>>> Generally speaking, I recommend you use the latest available version (e.g. Open MPI 2.1.1).
>>>>>
>>>>> How many interfaces can be used to communicate between hosts? If there is more than one (for example, a slow and a fast one), you should use only the fast one. For example, if eth0 is the fast interface, that can be achieved with
>>>>>   mpirun --mca btl_tcp_if_include eth0 ...
>>>>>
>>>>> Also, you might be able to achieve better results by using more than one socket on the fast interface. For example, if you want to use 4 sockets per interface:
>>>>>   mpirun --mca btl_tcp_links 4 ...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On Sun, Jul 9, 2017 at 10:10 PM, Adam Sylvester <op8...@gmail.com> wrote:
>>>>> > I am using Open MPI 2.1.0 on RHEL 7. My application has one unavoidable pinch point where a large amount of data needs to be transferred (about 8 GB of data needs to be both sent to and received from all other ranks), and I'm seeing worse performance than I would expect; this step has a major impact on my overall runtime. In the real application, I am using MPI_Alltoall() for this step, but for the purpose of a simple benchmark, I simplified to a single MPI_Send() / MPI_Recv() of a 2 GB buffer between two ranks.
>>>>> >
>>>>> > I'm running this in AWS with instances that have 10 Gbps connectivity in the same availability zone (according to tracepath, there are no hops between them) and MTU set to 8801 bytes. Doing a non-MPI benchmark of sending data directly over TCP between these two instances, I reliably get around 4 Gbps. Between these same two instances with MPI_Send() / MPI_Recv(), I reliably get around 2.4 Gbps. This seems like a major performance degradation for a single MPI operation.
>>>>> >
>>>>> > I compiled Open MPI 2.1.0 with gcc 4.9.1 and default settings. I'm connecting between instances via ssh and using, I assume, TCP for the actual network transfer (I'm not setting any special command-line or programmatic settings). The actual command I'm running is:
>>>>> >   mpirun -N 1 --bind-to none --hostfile hosts.txt my_app
>>>>> >
>>>>> > Any advice on other things to test or compilation and/or runtime flags to set would be much appreciated!
>>>>> > -Adam
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users