| Remember that the point of IB and other operating-system bypass devices
| is that the driver is not involved in the fast path of sending /
| receiving. One of the side-effects of that design point is that
| userspace does all the allocation of send / receive buffers.
That's a good point. It was not [...]
On Jul 8, 2013, at 2:01 PM, Brice Goglin wrote:
> The driver doesn't allocate much memory here. Maybe some small control
> buffers, but nothing significantly involved in large message transfer
> performance. Everything critical here is allocated by user-space (either MPI
> lib or application), [...]
Cisco hasn't been involved in IB for several years, so I can't comment on that
directly.
That being said, our Cisco VIC devices are PCI gen *2*, but they are x16 (not
x8). We can get full bandwidth out of our 2*10Gb device from remote NUMA nodes
on E5-2690-based machines (Sandy Bridge) for large [...]
The driver doesn't allocate much memory here. Maybe some small control
buffers, but nothing significantly involved in large message transfer
performance. Everything critical here is allocated by user-space (either
MPI lib or application), so we just have to make sure we bind the
process memory properly [...]
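Since everything critical is allocated by user space and placed by first-touch, binding the launched process usually suffices. A minimal sketch of what that binding could look like, assuming numactl is installed and the HCA sits on NUMA node 0 (both assumptions; ./my_mpi_app is a placeholder binary):

```shell
# Assumption: the HCA is local to NUMA node 0 on this machine.
hca_node=0
bind_cmd="numactl --cpunodebind=$hca_node --membind=$hca_node"
# With first-touch allocation, launching under this command keeps the
# pages the MPI library and the application allocate on the HCA's socket:
echo "$bind_cmd ./my_mpi_app"
```

Open MPI's own `--bind-to core` option achieves a similar effect per rank; numactl is just the most explicit way to pin both CPUs and memory at once.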
Hi Brice,
thanks for testing this out.
How did you make sure that the pinned pages used by the I/O adapter mapped
to the "other" socket's memory controller ? Is pining the MPI binary to a
socket sufficient to pin the space used for MPI I/O as well to that socket?
I think this is something done by
On a dual E5 2650 machine with FDR cards, I see the IMB PingPong
throughput drop from 6000 to 5700 MB/s when the memory isn't allocated on
the right socket (and latency increases from 0.8 to 1.4us). Of course
that's pingpong only, things will be worse on a memory-overloaded
machine. But I don't expect [...]
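For scale, the bandwidth penalty in the numbers above works out to about 5% on an otherwise idle machine (the latency jump, 0.8 to 1.4us, is much larger in relative terms). A quick shell check of the bandwidth figure:

```shell
# IMB PingPong figures quoted above: local vs. remote-socket memory.
local_bw=6000    # MB/s, buffers on the HCA's socket
remote_bw=5700   # MB/s, buffers on the other socket
drop_pct=$(( (local_bw - remote_bw) * 100 / local_bw ))
echo "bandwidth penalty: ${drop_pct}%"
```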
People have mentioned that they experience unexpected slowdowns in
PCIe_gen3 I/O when the pages map to a socket different from the one the HCA
connects to. It is speculated that the inter-socket QPI is not provisioned
to transfer more than 1GiB/sec for PCIe_gen 3 traffic. This situation may
not be [...]
On Jul 8, 2013, at 11:35 AM, Michael Thomadakis wrote:
> The issue is that when you read or write PCIe_gen 3 data to a non-local NUMA
> memory, SandyBridge will use the inter-socket QPIs to get this data across to
> the other socket. I think there is considerable limitation in PCIe I/O
> traffic [...]
Hi Jeff,
thanks for the reply.
The issue is that when you read or write PCIe_gen 3 data to a non-local NUMA
memory, SandyBridge will use the inter-socket QPIs to get this data across
to the other socket. I think there is considerable limitation in PCIe I/O
traffic data going over the inter-socket [...]
On Jul 6, 2013, at 4:59 PM, Michael Thomadakis wrote:
> When your stack runs on SandyBridge nodes attached to HCAs over PCIe gen 3, do
> you pay any special attention to the memory buffers according to which
> socket/memory controller their physical memory belongs to?
>
> For instance, if the HCA [...]
Hello OpenMPI,
When your stack runs on SandyBridge nodes attached to HCAs over PCIe *gen 3*,
do you pay any special attention to the memory buffers according to which
socket/memory controller their physical memory belongs to?
For instance, if the HCA is attached to the PCIe gen 3 lanes of Socket 1, do
you [...]
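One way to answer "which socket is the HCA attached to" is the device's sysfs numa_node attribute (hwloc's lstopo shows the same thing graphically). A sketch, where mlx4_0 is a hypothetical device name — substitute whatever ibv_devices reports on your machine:

```shell
# mlx4_0 is a placeholder HCA name; adjust for your hardware.
dev=mlx4_0
node_file="/sys/class/infiniband/$dev/device/numa_node"
if [ -r "$node_file" ]; then
    hca_node=$(cat "$node_file")   # -1 means firmware didn't report locality
else
    hca_node=-1                    # device absent on this machine
fi
echo "HCA $dev reports NUMA node $hca_node"
```

A non-negative value is the NUMA node whose PCIe lanes the HCA hangs off; processes and buffers bound to that node avoid the inter-socket QPI hop discussed in this thread.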