On a dual E5-2650 machine with FDR cards, I see IMB PingPong throughput drop from 6000 to 5700 MB/s when the memory isn't allocated on the right socket (and latency increases from 0.8 to 1.4 us). Of course that's PingPong only; things will be worse on a machine whose memory is already overloaded. But I don't expect things to be "less worse" if you do an intermediate copy through the memory near the HCA: you would load the QPI link just as much as here, and you would load the CPU even more because of the additional copies.
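For reference, the numbers above come from the IMB PingPong pattern; a minimal stand-in for that measurement looks roughly like the code below (a sketch, not IMB itself). Process and memory binding are assumed to be handled externally, e.g. with numactl, and the message size, node numbers and iteration count are illustrative.

/*
 * pingpong.c -- rough stand-in for the IMB PingPong pattern (sketch only,
 * not the real benchmark). Binding is assumed to be done externally, e.g.:
 *     mpirun -np 2 numactl --membind=0 ./pingpong    (memory on socket 0)
 *     mpirun -np 2 numactl --membind=1 ./pingpong    (memory on socket 1)
 * Node numbers, message size and iteration count are illustrative.
 * Build with: mpicc -O2 pingpong.c -o pingpong
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int len   = 4 * 1024 * 1024;    /* 4 MiB message */
    const int iters = 100;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(len);
    memset(buf, rank, len);               /* touch the pages so they are really allocated */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double oneway = (t1 - t0) / (2.0 * iters);     /* one-way time per message */
        printf("%.1f MB/s, %.2f us one-way\n",
               (double)len / oneway / 1e6, oneway * 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}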
Brice

On 08/07/2013 18:27, Michael Thomadakis wrote:
> People have mentioned that they experience unexpected slowdowns in PCIe gen3 I/O when the pages map to a socket different from the one the HCA connects to. It is speculated that the inter-socket QPI is not provisioned to transfer more than 1 GiB/s of PCIe gen3 traffic. This situation may not be in effect on all SandyBridge or IvyBridge systems.
>
> Have you measured anything like this on your systems as well? That would require using physical memory mapped to the socket without the HCA exclusively for MPI messaging.
>
> Mike
>
> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis <drmichaelt7...@gmail.com> wrote:
>>
>>> The issue is that when you read or write PCIe gen3 data to non-local NUMA memory, SandyBridge will use the inter-socket QPI to get this data across to the other socket. I think there is a considerable limitation on PCIe I/O traffic going over the inter-socket QPI. One way to get around this is for reads to buffer all data into memory space local to the same socket and then transfer it by code across to the other socket's physical memory. For writes the same approach can be used, with an intermediary process copying the data.
>>
>> Sure, you'll cause congestion across the QPI network when you do non-local PCI reads/writes. That's a given.
>>
>> But I'm not aware of a hardware limitation on PCI-requested traffic across QPI (I could be wrong, of course -- I'm a software guy, not a hardware guy). A simple test would be to bind an MPI process to a far NUMA node, run a simple MPI bandwidth test, and see whether you get better/same/worse bandwidth compared to binding the MPI process to the near NUMA socket.
>>
>> But in terms of doing intermediate (pipelined) reads/writes to local NUMA memory before reading/writing to PCI, no, Open MPI does not do this. Unless there is a PCI-QPI bandwidth constraint that we're unaware of, I'm not sure why you would do this -- it would likely add considerable complexity to the code and it would definitely lead to higher overall MPI latency.
>>
>> Don't forget that the MPI paradigm is for the application to provide the send/receive buffer. Meaning: MPI doesn't (always) control where the buffer is located (particularly for large messages).
>>
>>> I was wondering if Open MPI does anything special with memory mapping to work around this.
>>
>> Just what I mentioned in the prior email.
>>
>>> And if with Ivy Bridge (or Haswell) the situation has improved.
>>
>> Open MPI doesn't treat these chips any differently.
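For completeness, the "memory mapped to the socket without the HCA" placement Michael asks about can also be forced from inside the application rather than with numactl. The sketch below assumes libnuma (numa.h, link with -lnuma) is available; node 1 is just an example of the "far" node, and the HCA's actual local node has to be checked on the machine, e.g. with hwloc's lstopo.

/*
 * buf_on_node.c -- place the MPI message buffer on an explicit NUMA node
 * with libnuma, then use it in a bandwidth loop like the ping-pong above.
 * Build with: mpicc -O2 buf_on_node.c -o buf_on_node -lnuma
 */
#include <mpi.h>
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const size_t len  = 4u * 1024 * 1024;  /* 4 MiB message, illustrative */
    const int    node = 1;                 /* assumed "far" node, i.e. not the HCA's socket */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (numa_available() < 0) {
        if (rank == 0) fprintf(stderr, "libnuma not available on this system\n");
        MPI_Finalize();
        return 1;
    }

    /* Allocate the send/receive buffer on the chosen node and touch it so the
     * pages are really placed there. */
    char *buf = numa_alloc_onnode(len, node);
    memset(buf, 0, len);

    /* ... ping-pong loop over buf goes here; compare node 0 vs node 1 results ... */

    numa_free(buf, len);
    MPI_Finalize();
    return 0;
}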