On a dual E5-2650 machine with FDR cards, I see the IMB PingPong
throughput drop from 6000 to 5700 MB/s when the memory isn't allocated on
the right socket (and latency increases from 0.8 to 1.4 us). Of course
that's ping-pong only; things will be worse on a memory-overloaded
machine. But I don't expect things to be any better if you do an
intermediate copy through the memory near the HCA: you would load the
QPI link just as much, and you would load the CPU even more because of
the additional copies.
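
Below, for reference, is a minimal sketch of that kind of comparison. It
is not IMB itself; the message size, iteration count, and the numactl
command lines in the comment are illustrative assumptions. The same
binary is run twice, once bound to the socket with the HCA and once to
the other socket.

/*
 * Minimal ping-pong bandwidth sketch (not IMB).  Run one rank per host and
 * pin each rank and its memory with e.g.
 *     numactl --cpunodebind=0 --membind=0 ./pingpong   # socket with the HCA
 *     numactl --cpunodebind=1 --membind=1 ./pingpong   # remote socket
 * The buffer is touched after allocation, so first-touch places its pages
 * on whichever node --membind selects.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int len = 4 * 1024 * 1024;   /* 4 MiB message, arbitrary choice */
    const int iters = 100;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(len);
    memset(buf, 0, len);               /* first touch places the pages */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * iters);  /* one-way time per message */
    if (rank == 0)
        printf("%.1f MB/s\n", len / t / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}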

Brice



On 08/07/2013 18:27, Michael Thomadakis wrote:
> People have mentioned that they experience unexpected slowdowns in
> PCIe gen3 I/O when the pages map to a socket different from the one
> the HCA connects to. It is speculated that the inter-socket QPI is not
> provisioned to transfer more than 1 GiB/s of PCIe gen3 traffic.
> This situation may not apply to all SandyBridge or IvyBridge
> systems.
>
> Have you measured anything like this on your systems as well? That
> would require using physical memory mapped to the socket without the
> HCA exclusively for MPI messaging.
>
> Mike
>
>
> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres)
> <jsquy...@cisco.com> wrote:
>
>     On Jul 8, 2013, at 11:35 AM, Michael Thomadakis
>     <drmichaelt7...@gmail.com> wrote:
>
>     > The issue is that when you read or write PCIe gen3 data to
>     > non-local NUMA memory, SandyBridge will use the inter-socket QPI
>     > to get the data across to the other socket. I think there is a
>     > considerable limitation on PCIe I/O traffic going over the
>     > inter-socket QPI. One way to get around this for reads is to
>     > buffer all data into memory local to the same socket and then
>     > copy it in software across to the other socket's physical
>     > memory. For writes the same approach can be used, with an
>     > intermediate copy of the data.
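
For illustration, a rough sketch of what such an intermediate copy stage
could look like is below. The helper name and chunk size are invented,
and this is not Open MPI code; note that each memcpy still crosses QPI
once, on top of the DMA into the HCA-local bounce buffer, which is the
extra CPU work mentioned above.

/*
 * Hypothetical sketch of the staging idea: data is received (DMAed) into
 * a bounce buffer whose pages live on the HCA-local NUMA node, then copied
 * in software into the user buffer on the other socket.  The names
 * (drain_bounce, CHUNK) are made up for illustration.
 */
#include <string.h>
#include <stddef.h>

#define CHUNK (512 * 1024)   /* pipeline granularity, arbitrary */

/* Copy 'len' bytes from 'bounce' (HCA-local pages) into the application
 * buffer 'dst' (pages on the far socket), chunk by chunk so the copy could
 * overlap with further incoming DMA in a pipelined implementation. */
static void drain_bounce(char *dst, const char *bounce, size_t len)
{
    for (size_t off = 0; off < len; off += CHUNK) {
        size_t n = (len - off < CHUNK) ? (len - off) : CHUNK;
        memcpy(dst + off, bounce + off, n);
    }
}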
>
>     Sure, you'll cause congestion across the QPI network when you do
>     non-local PCI reads/writes.  That's a given.
>
>     But I'm not aware of a hardware limitation on PCI-requested
>     traffic across QPI (I could be wrong, of course -- I'm a software
>     guy, not a hardware guy).  A simple test would be to bind an MPI
>     process to a far NUMA node, run a simple MPI bandwidth test, and
>     see whether you get better/same/worse bandwidth compared to binding
>     an MPI process to a near NUMA socket.
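
One way to do that binding from inside the benchmark itself (rather than
with numactl or launcher options) is with hwloc. The sketch below is only
an assumption about how one might bind the process and its memory to a
chosen NUMA node before MPI_Init, using a recent hwloc, with no error
handling, and assuming the launcher does not re-bind the process.

/*
 * Sketch: bind this process and its future memory to NUMA node 'idx'
 * before MPI_Init, so the test's buffers end up near or far from the HCA.
 * Which index means "socket with the HCA" depends on the machine.
 */
#include <hwloc.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    unsigned idx = (argc > 1) ? (unsigned)atoi(argv[1]) : 0;

    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, idx);
    if (node) {
        /* Run only on the cores of that node... */
        hwloc_set_cpubind(topo, node->cpuset, HWLOC_CPUBIND_PROCESS);
        /* ...and allocate all new memory there. */
        hwloc_set_membind(topo, node->cpuset, HWLOC_MEMBIND_BIND,
                          HWLOC_MEMBIND_PROCESS);
    }

    MPI_Init(&argc, &argv);
    /* ... run the ping-pong loop from the earlier sketch ... */
    MPI_Finalize();

    hwloc_topology_destroy(topo);
    return 0;
}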
>
>     But in terms of doing intermediate (pipelined) reads/writes to
>     local NUMA memory before reading/writing to PCI, no, Open MPI does
>     not do this.  Unless there is a PCI-QPI bandwidth constraint that
>     we're unaware of, I'm not sure why you would do this -- it would
>     likely add considerable complexity to the code and it would
>     definitely lead to higher overall MPI latency.
>
>     Don't forget that the MPI paradigm is for the application to
>     provide the send/receive buffer.  Meaning: MPI doesn't (always)
>     control where the buffer is located (particularly for large messages).
>
>     > I was wondering if Open MPI does any special memory mapping
>     > to work around this.
>
>     Just what I mentioned in the prior email.
>
>     > And whether the situation has improved with Ivy Bridge (or Haswell).
>
>     Open MPI doesn't treat these chips any differently.
>
>     --
>     Jeff Squyres
>     jsquy...@cisco.com
>     For corporate legal information go to:
>     http://www.cisco.com/web/about/doing_business/legal/cri/