Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 an 1.7.2

2013-07-08 Thread Rolf vandeVaart
With respect to the CUDA-aware support, Ralph is correct.  The ability to send 
and receive GPU buffers is in the Open MPI 1.7 series.  And incremental 
improvements will be added to the Open MPI 1.7 series.  CUDA 5.0 is supported.
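In practical terms, CUDA-aware support in a 1.7-series build (configured with --with-cuda) means the application can hand device pointers straight to the MPI calls.  A minimal sketch of what that looks like (error checking omitted, message size arbitrary):

/* minimal sketch: sending directly from GPU memory with a CUDA-aware build */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    double *d_buf;                                   /* device pointer */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&d_buf, 1024 * sizeof(double));

    if (rank == 0)
        MPI_Send(d_buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                 /* no staging through host memory */

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}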



From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Saturday, July 06, 2013 5:14 PM
To: Open MPI Users
Subject: Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 an 
1.7.2

There was discussion of this on a prior email thread on the OMPI devel mailing 
list:

http://www.open-mpi.org/community/lists/devel/2013/05/12354.php


On Jul 6, 2013, at 2:01 PM, Michael Thomadakis <drmichaelt7...@gmail.com> wrote:


thanks,
Do you guys have any plan to support Intel Phi in the future? That is, running 
MPI code on the Phi cards or across the multicore and Phi, as Intel MPI does?
thanks...
Michael

On Sat, Jul 6, 2013 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
Rolf will have to answer the question on level of support. The CUDA code is not 
in the 1.6 series as it was developed after that series went "stable". It is in 
the 1.7 series, although the level of support will likely be incrementally 
increasing as that "feature" series continues to evolve.


On Jul 6, 2013, at 12:06 PM, Michael Thomadakis <drmichaelt7...@gmail.com> wrote:

> Hello OpenMPI,
>
> I am wondering what level of support is there for CUDA and GPUdirect on 
> OpenMPI 1.6.5 and 1.7.2.
>
> I saw the ./configure --with-cuda=CUDA_DIR option in the FAQ. However, it 
> seems that with configure v1.6.5 it was ignored.
>
> Can you identify GPU memory and send messages from it directly without 
> copying to host memory first?
>
>
> Or in general, what level of CUDA support is there on 1.6.5 and 1.7.2 ? Do 
> you support SDK 5.0 and above?
>
> Cheers ...
> Michael
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] How to select specific out of multiple interfaces for communication and support for heterogeneous fabrics

2013-07-08 Thread Jeff Squyres (jsquyres)
Open MPI may get confused if you end up with different receive queue 
specifications in your IB setup (in the "openib" Byte Transfer Layer (BTL) 
plugin that is used for point-to-point MPI communication transport in OMPI).  

If Open MPI doesn't work out of the box for you in a job that utilizes both QDR 
and FDR, you may need to override some defaults so that all receive queues are 
the same on both the QDR-enabled nodes and the FDR-enabled nodes.
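If it comes to that, the knob involved is the openib BTL's receive_queues MCA parameter; a sketch of how one might check and then pin it (the queue specification shown is purely illustrative, not a recommended value):

# See what each build/node type would use by default:
ompi_info --param btl openib | grep receive_queues

# Force one common specification on every node in the job:
mpirun -np 64 --hostfile hosts \
    --mca btl_openib_receive_queues "P,128,256,192,128:S,65536,256,128,32" \
    ./my_mpi_app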


On Jul 5, 2013, at 6:26 PM, Michael Thomadakis  wrote:

> Sorry on the mvapich2 reference :) 
> 
> All nodes are attached over a common 1GigE network. We wish of course that if 
> a node pair is also connected via a higher-speed fabric (IB FDR or 10GigE), 
> then that fabric would be used instead of the common 1GigE.
> 
> One question: suppose that we use nodes having either FDR or QDR IB 
> interfaces available, connected to one common IB fabric, all defined over a 
> common IP subnet: Will OpenMPI have any problem with this? Can MPI 
> communication take place over this type of hybrid IB fabric? We already have 
> a sub-cluster with QDR HCAs and we are attaching it to IB fabric with FDR 
> "backbone" and another cluster with FDR HCAs. 
> 
> Do you think there may be some issue with this? The HCAs are FDR and QDR 
> Mellanox devices and the switching is also over FDR Mellanox fabric. Mellanox 
> claims that at the IB level this is doable (i.e., FDR link pairs talk to each 
> other at FDR speeds and QDR link pairs at QDR).
> 
> I guess if we use the RC connection types then it does not matter to OpenMPI. 
> 
> thanks 
> Michael
> 
> 
> 
> 
> On Fri, Jul 5, 2013 at 4:59 PM, Ralph Castain  wrote:
> I can't speak for MVAPICH - you probably need to ask them about this 
> scenario. OMPI will automatically select whatever available transport can 
> reach the intended process. This requires that each communicating pair of 
> processes have access to at least one common transport.
> 
> So if a process that is on a node with only 1G-E wants to communicate with 
> another process, then the node where that other process is running must also 
> have access to a compatible Ethernet interface (1G can talk to 10G, so they 
> can have different capabilities) on that subnet (or on a subnet that knows 
> how to route to the other one). If both nodes have 10G-E as well as 1G-E 
> interfaces, then OMPI will automatically take the 10G interface as it is the 
> faster of the two.
> 
> Note this means that if a process is on a node that only has IB, and wants to 
> communicate to a process on a node that only has 1G-E, then the two processes 
> cannot communicate.
> 
> HTH
> Ralph
> 
> On Jul 5, 2013, at 2:34 PM, Michael Thomadakis  
> wrote:
> 
>> Hello OpenMPI
>> 
>> We are seriously considering deploying OpenMPI 1.6.5 for production (and 
>> 1.7.2 for testing) on HPC clusters which consist of nodes with different 
>> types of networking interfaces.
>> 
>> 
>> 1) Interface selection
>> 
>> We are using OpenMPI 1.6.5 and were wondering how one would go about 
>> selecting at run time which networking interface to use for MPI 
>> communications in case both IB, 10GigE, and 1GigE are present. 
>> 
>> This issue arises in a cluster with nodes that are equipped with different 
>> types of interfaces:
>> 
>> Some have both IB-QDR or FDR and 10- and 1-GigE. Others only have 10-GigE 
>> and 1-GigE and simply others only 1-GigE.
>> 
>> 
>> 2) OpenMPI 1.6.5 level of support for Heterogeneous Fabric
>> 
>> Can OpenMPI support running an MPI application using a mix of nodes with all 
>> of the above networking interface combinations ? 
>> 
>>   2.a) Can the same MPI code (SPMD or MPMD) have a subset of its ranks run 
>> on nodes with QDR IB and another subset on FDR IB simultaneously? These are 
>> Mellanox QDR and FDR HCAs. 
>> 
>> Mellanox mentioned to us that they support both QDR and FDR HCAs attached to 
>> the same IB subnet. Do you think MVAPICH2 will have any issue with this?
>> 
>> 2.b) Can the same MPI code (SPMD or MPMD) have a subset of its ranks run on 
>> nodes with IB and another subset over 10GiGE simultaneously? 
>> 
>> That is imagine nodes I1, I2, ..., IN having say QDR HCAs and nodes G1, G2, 
>> GM having only 10GigE interfaces. Could we have the same MPI application run 
>> across both types of nodes? 
>> 
>> Or should there be say 2 communicators with one of them explicitly overlaid 
>> on a IB only subnet and the other on a 10GigE only subnet? 
>> 
>> 
>> Please let me know if the above are not very clear.
>> 
>> Thank you much
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/l

Re: [OMPI users] OpenMPI 1.6.5 and IBM-AIX

2013-07-08 Thread Jeff Squyres (jsquyres)
+1.

This one seems like it could be as simple as a missing header file.  Try adding

#include "opal/constants.h"

in the timer_aix_component.c file.  
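If it helps to see it in context, a minimal sketch of the change (the neighboring include of opal_config.h is an assumption about the file's layout, not a quote of it):

/* near the top of opal/mca/timer/aix/timer_aix_component.c */
#include "opal_config.h"      /* assumed to already be present */
#include "opal/constants.h"   /* added: declares OPAL_SUCCESS */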


On Jul 6, 2013, at 1:08 PM, Ralph Castain  wrote:

> We haven't had access to an AIX machine in quite some time, so it isn't a big 
> surprise that things have bit-rotted. If you're willing to debug, we can try 
> to provide fixes. Just may take a bit to complete.
> 
> 
> On Jul 6, 2013, at 9:49 AM, Ilias Miroslav  wrote:
> 
>> Hi again,
>> 
>> even for GNU compilers the OpenMPI compilation fails on AIX:
>> .
>> .
>> .
>> Making all in mca/timer/aix
>> make[2]: Entering directory 
>> `/gpfs/home/ilias/bin/openmpi_gnu/openmpi-1.6.5/opal/mca/timer/aix'
>> CC timer_aix_component.lo
>> timer_aix_component.c: In function 'opal_timer_aix_open':
>> timer_aix_component.c:68:10: error: 'OPAL_SUCCESS' undeclared (first use in 
>> this function)
>> timer_aix_component.c:68:10: note: each undeclared identifier is reported 
>> only once for each function it appears in
>> timer_aix_component.c: At top level:
>> ../../../../opal/include/opal/sys/atomic.h:393:9: warning: 
>> 'opal_atomic_add_32' used but never defined [enabled by default]
>> ../../../../opal/include/opal/sys/atomic.h:403:9: warning: 
>> 'opal_atomic_sub_32' used but never defined [enabled by default]
>> make[2]: *** [timer_aix_component.lo] Error 1
>> make[2]: Leaving directory 
>> `/gpfs/home/ilias/bin/openmpi_gnu/openmpi-1.6.5/opal/mca/timer/aix'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory 
>> `/gpfs/home/ilias/bin/openmpi_gnu/openmpi-1.6.5/opal'
>> make: *** [all-recursive] Error 1
>> 
>> 
>> From: Ilias Miroslav
>> Sent: Saturday, July 06, 2013 1:51 PM
>> To: us...@open-mpi.org
>> Subject: OpenMPI 1.6.5 and IBM-AIX
>> 
>> Dear experts,
>> 
>> I am trying to build up OpenMPI 1.6.5 package with the AIX compiler suite:
>> 
>> ./configure --prefix=/gpfs/home/ilias/bin/openmpi_xl  CXX=xlC CC=xlc F77=xlf 
>> FC=xlf90
>> xl fortran is of version 13.01, xlc/C is 11.01
>> 
>> Configuration goes well, but the compilation fails. Any help, please ?
>> 
>> 
>> 
>> Making all in mca/timer/aix
>> make[2]: Entering directory 
>> `/gpfs/home/ilias/bin/openmpi_xl/openmpi-1.6.5/opal/mca/timer/aix'
>> CC timer_aix_component.lo
>> "timer_aix_component.c", line 68.10: 1506-045 (S) Undeclared identifier 
>> OPAL_SUCCESS.
>> "timer_aix_component.c", line 69.1: 1506-162 (W) No definition was found for 
>> function opal_atomic_sub_32. Storage class changed to extern.
>> "timer_aix_component.c", line 69.1: 1506-162 (W) No definition was found for 
>> function opal_atomic_add_32. Storage class changed to extern.
>> make[2]: *** [timer_aix_component.lo] Error 1
>> make[2]: Leaving directory 
>> `/gpfs/home/ilias/bin/openmpi_xl/openmpi-1.6.5/opal/mca/timer/aix'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory 
>> `/gpfs/home/ilias/bin/openmpi_xl/openmpi-1.6.5/opal'
>> make: *** [all-recursive] Error 1
>> ilias@147.213.80.175:~/bin/openmpi_xl/openmpi-1.6.5/.
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Trouble with MPI_Recv not filling buffer

2013-07-08 Thread Jeff Squyres (jsquyres)
On Jul 6, 2013, at 2:33 PM, Patrick Brückner 
 wrote:

> data p;
> p.collection = malloc(sizeof(int)*N);
> 
> printf("[%d] before receiving, data id %d at %d with direction 
> %d\n",me,p.id,p.position,p.direction);
> 
> MPI_Status data_status;
> MPI_Recv(&p,1,MPI_data,MPI_ANY_SOURCE,99,MPI_COMM_WORLD,&data_status);
> if(data_status.MPI_ERROR != MPI_SUCCESS) {
>printf("[%d] ERROR %d",me,data_status.MPI_ERROR);
>return -1;
> }
> printf("[%d] received status %d\n",data_status.MPI_ERROR);

I think you need "me" as the first printable argument in there.
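That is, something along these lines:

printf("[%d] received status %d\n", me, data_status.MPI_ERROR);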

> received++;
> printf("[%0d] received data %d (%d/%d) at position %d with direction 
> %d\n",me,p.id,received,expected,p.position,p.direction);
> --- snip ---
> 
> I get this output:
> 
> [1] before receiving, data id -1665002272 at 0 with direction 0
> [0] received status 0
> [1] received data -1665002272 (1/2) at position 0 with direction 0
> 
> I am wondering if you had any hint for me, why data is still not having the 
> correct data but just the old, uninitialized values, and why I don't get any 
> error.

My first guess would be that you created the MPI_data datatype incorrectly; 
that's what you should probably check into.
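For what it's worth, here is a minimal sketch of building such a struct datatype with MPI_Type_create_struct.  The layout of "data" is only an assumption based on the members used in your snippet, and note that the heap-allocated "collection" array cannot be described this way (only the pointer value lives inside the struct), so its contents have to be sent separately:

#include <stddef.h>
#include <mpi.h>

typedef struct {          /* assumed layout, for illustration only */
    int  id;
    int  position;
    int  direction;
    int *collection;      /* heap data: NOT covered by the datatype below */
} data;

static MPI_Datatype make_mpi_data_type(void)
{
    MPI_Datatype MPI_data;
    int          blocklens[3] = { 1, 1, 1 };
    MPI_Datatype types[3]     = { MPI_INT, MPI_INT, MPI_INT };
    MPI_Aint     displs[3], base;
    data         probe;

    MPI_Get_address(&probe, &base);
    MPI_Get_address(&probe.id,        &displs[0]);
    MPI_Get_address(&probe.position,  &displs[1]);
    MPI_Get_address(&probe.direction, &displs[2]);
    for (int i = 0; i < 3; i++)
        displs[i] -= base;                /* make displacements relative */

    MPI_Type_create_struct(3, blocklens, displs, types, &MPI_data);
    MPI_Type_commit(&MPI_data);
    return MPI_data;      /* p.collection must be sent with a separate call */
}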

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Jeff Squyres (jsquyres)
On Jul 6, 2013, at 4:59 PM, Michael Thomadakis  wrote:

> When your stack runs on SandyBridge nodes attached to HCAs over PCIe gen 3, do 
> you pay any special attention to the memory buffers according to which 
> socket/memory controller their physical memory belongs to?
> 
> For instance, if the HCA is attached to the PCIe gen 3 lanes of Socket 1, do you 
> do anything special when the read/write buffers map to physical memory 
> belonging to Socket 2? Or do you avoid using buffers mapping to memory that 
> belongs to (is accessible via) the other socket?

It is not *necessary* to ensure that buffers are NUMA-local to the PCI 
device that they are writing to, but it certainly results in lower latency to 
read/write to PCI devices (regardless of flavor) that are attached to an MPI 
process' local NUMA node.  The Hardware Locality (hwloc) tool "lstopo" can 
print a pretty picture of your server to show you where your PCI busses are 
connected.
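For example (standard hwloc invocations; the output file name is just an illustration):

lstopo                # whole machine, including PCI devices such as HCAs/NICs
lstopo --no-io        # same view without the I/O devices
lstopo topo.png       # write the picture to a file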

For TCP, Open MPI will use all TCP devices that it finds by default (because it 
is assumed that latency is so high that NUMA locality doesn't matter).  The 
openib (OpenFabrics) transport will use the "closest" HCA ports that it can 
find to each MPI process.  
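If the TCP defaults pick up interfaces you don't want used, they can be narrowed with the usual TCP BTL parameters, e.g. (interface names are illustrative):

mpirun -np 16 --mca btl_tcp_if_include eth2 ./a.out      # only use eth2 for TCP
mpirun -np 16 --mca btl_tcp_if_exclude lo,eth0 ./a.out   # skip loopback and the admin network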

In our upcoming Cisco ultra low latency BTL, it defaults to using the closest 
Cisco VIC ports that it can find for short messages (i.e., to minimize 
latency), but uses all available VICs for long messages (i.e., to maximize 
bandwidth).

> Has this situation improved with Ivy-Bridge systems or Haswell?

It's the same overall architecture (i.e., NUMA).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Michael Thomadakis
Hi Jeff,

thanks for the reply.

The issue is that when you read or write PCIe gen 3 data to non-local NUMA
memory, SandyBridge will use the inter-socket QPI to get this data across
to the other socket. I think there is a considerable limitation on PCIe I/O
traffic going over the inter-socket QPI. One way to get around this is
for reads to buffer all data into memory space local to the same socket and
then transfer it by code across to the other socket's physical memory.
For writes the same approach can be used, with an intermediary process copying
the data.

I was wondering if OpenMPI does any special memory mapping to work
around this. And if the situation has improved with Ivy Bridge (or Haswell).

thanks
Mike


On Mon, Jul 8, 2013 at 9:57 AM, Jeff Squyres (jsquyres)
wrote:

> On Jul 6, 2013, at 4:59 PM, Michael Thomadakis 
> wrote:
>
> > When you stack runs on SandyBridge nodes atached to HCAs ove PCI3 gen 3
> do you pay any special attention to the memory buffers according to which
> socket/memory controller  their physical memory belongs to?
> >
> > For instance, if the HCA is attached to the PCIgen3 lanes of Socket 1 do
> you do anything special when the read/write buffers map to physical memory
> belonging to Socket 2? Or do you7 avoid using buffers mapping ro memory
> that belongs (is accessible via) the other socket?
>
> It is not *necessary* to do ensure that buffers are NUMA-local to the PCI
> device that they are writing to, but it certainly results in lower latency
> to read/write to PCI devices (regardless of flavor) that are attached to an
> MPI process' local NUMA node.  The Hardware Locality (hwloc) tool "lstopo"
> can print a pretty picture of your server to show you where your PCI busses
> are connected.
>
> For TCP, Open MPI will use all TCP devices that it finds by default
> (because it is assumed that latency is so high that NUMA locality doesn't
> matter).  The openib (OpenFabrics) transport will use the "closest" HCA
> ports that it can find to each MPI process.
>
> In our upcoming Cisco ultra low latency BTL, it defaults to using the
> closest Cisco VIC ports that it can find for short messages (i.e., to
> minimize latency), but uses all available VICs for long messages (i.e., to
> maximize bandwidth).
>
> > Has this situation improved with Ivy-Brige systems or Haswell?
>
> It's the same overall architecture (i.e., NUMA).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 an 1.7.2

2013-07-08 Thread Michael Thomadakis
Thanks ...
Michael


On Mon, Jul 8, 2013 at 8:50 AM, Rolf vandeVaart wrote:

> With respect to the CUDA-aware support, Ralph is correct.  The ability to
> send and receive GPU buffers is in the Open MPI 1.7 series.  And
> incremental improvements will be added to the Open MPI 1.7 series.  CUDA
> 5.0 is supported.
>
>
> *From:* users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] *On
> Behalf Of *Ralph Castain
> *Sent:* Saturday, July 06, 2013 5:14 PM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI
> 1.6.5 an 1.7.2
>
>
> There was discussion of this on a prior email thread on the OMPI devel
> mailing list:
>
>
> http://www.open-mpi.org/community/lists/devel/2013/05/12354.php
>
>
> On Jul 6, 2013, at 2:01 PM, Michael Thomadakis 
> wrote:
>
>
>
> 
>
> thanks,
>
> Do you guys have any plan to support Intel Phi in the future? That is,
> running MPI code on the Phi cards or across the multicore and Phi, as Intel
> MPI does?
>
> thanks...
>
> Michael
>
>
> On Sat, Jul 6, 2013 at 2:36 PM, Ralph Castain  wrote:
>
> Rolf will have to answer the question on level of support. The CUDA code
> is not in the 1.6 series as it was developed after that series went
> "stable". It is in the 1.7 series, although the level of support will
> likely be incrementally increasing as that "feature" series continues to
> evolve.
>
>
>
> On Jul 6, 2013, at 12:06 PM, Michael Thomadakis 
> wrote:
>
> > Hello OpenMPI,
> >
> > I am wondering what level of support is there for CUDA and GPUdirect on
> OpenMPI 1.6.5 and 1.7.2.
> >
> > I saw the ./configure --with-cuda=CUDA_DIR option in the FAQ. However,
> it seems that with configure v1.6.5 it was ignored.
> >
> > Can you identify GPU memory and send messages from it directly without
> copying to host memory first?
> >
> >
> > Or in general, what level of CUDA support is there on 1.6.5 and 1.7.2 ?
> Do you support SDK 5.0 and above?
> >
> > Cheers ...
> > Michael
>
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Jeff Squyres (jsquyres)
On Jul 8, 2013, at 11:35 AM, Michael Thomadakis  
wrote:

> The issue is that when you read or write PCIe_gen 3 dat to a non-local NUMA 
> memory, SandyBridge will use the inter-socket QPIs to get this data across to 
> the other socket. I think there is considerable limitation in PCIe I/O 
> traffic data going over the inter-socket QPI. One way to get around this is 
> for reads to buffer all data into memory space local to the same socket and 
> then transfer them by code across to the other socket's physical memory. For 
> writes the same approach can be used with intermediary process copying data.

Sure, you'll cause congestion across the QPI network when you do non-local PCI 
reads/writes.  That's a given.

But I'm not aware of a hardware limitation on PCI-requested traffic across QPI 
(I could be wrong, of course -- I'm a software guy, not a hardware guy).  A 
simple test would be to bind an MPI process to a far NUMA node and run a simple 
MPI bandwidth test and see if you get better/same/worse bandwidth compared to 
binding an MPI process on a near NUMA socket.
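A sketch of that comparison (illustrative only: it assumes the HCA hangs off NUMA node 0 on both hosts, and uses IMB as one possible bandwidth benchmark):

# near case: processes and memory bound to the HCA's NUMA node
mpirun -np 2 -host node1,node2 numactl --cpunodebind=0 --membind=0 ./IMB-MPI1 PingPong
# far case: processes and memory bound to the other NUMA node
mpirun -np 2 -host node1,node2 numactl --cpunodebind=1 --membind=1 ./IMB-MPI1 PingPong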

But in terms of doing intermediate (pipelined) reads/writes to local NUMA 
memory before reading/writing to PCI, no, Open MPI does not do this.  Unless 
there is a PCI-QPI bandwidth constraint that we're unaware of, I'm not sure why 
you would do this -- it would likely add considerable complexity to the code 
and it would definitely lead to higher overall MPI latency.

Don't forget that the MPI paradigm is for the application to provide the 
send/receive buffer.  Meaning: MPI doesn't (always) control where the buffer is 
located (particularly for large messages).

> I was wondering if OpenMPI does anything special memory mapping to work 
> around this.

Just what I mentioned in the prior email.

> And if with Ivy Bridge (or Haswell) he situation has improved.

Open MPI doesn't treat these chips any different.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Michael Thomadakis
People have mentioned that they experience unexpected slowdowns in
PCIe gen 3 I/O when the pages map to a socket different from the one the HCA
connects to. It is speculated that the inter-socket QPI is not provisioned
to transfer more than 1 GiB/sec of PCIe gen 3 traffic. This situation may
not be in effect on all Sandy Bridge or Ivy Bridge systems.

Have you measured anything like this on your systems as well? That would
require using physical memory mapped to the socket without the HCA exclusively
for MPI messaging.

Mike


On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres)  wrote:

> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis 
> wrote:
>
> > The issue is that when you read or write PCIe_gen 3 dat to a non-local
> NUMA memory, SandyBridge will use the inter-socket QPIs to get this data
> across to the other socket. I think there is considerable limitation in
> PCIe I/O traffic data going over the inter-socket QPI. One way to get
> around this is for reads to buffer all data into memory space local to the
> same socket and then transfer them by code across to the other socket's
> physical memory. For writes the same approach can be used with intermediary
> process copying data.
>
> Sure, you'll cause congestion across the QPI network when you do non-local
> PCI reads/writes.  That's a given.
>
> But I'm not aware of a hardware limitation on PCI-requested traffic across
> QPI (I could be wrong, of course -- I'm a software guy, not a hardware
> guy).  A simple test would be to bind an MPI process to a far NUMA node and
> run a simple MPI bandwidth test and see if to get better/same/worse
> bandwidth compared to binding an MPI process on a near NUMA socket.
>
> But in terms of doing intermediate (pipelined) reads/writes to local NUMA
> memory before reading/writing to PCI, no, Open MPI does not do this.
>  Unless there is a PCI-QPI bandwidth constraint that we're unaware of, I'm
> not sure why you would do this -- it would likely add considerable
> complexity to the code and it would definitely lead to higher overall MPI
> latency.
>
> Don't forget that the MPI paradigm is for the application to provide the
> send/receive buffer.  Meaning: MPI doesn't (always) control where the
> buffer is located (particularly for large messages).
>
> > I was wondering if OpenMPI does anything special memory mapping to work
> around this.
>
> Just what I mentioned in the prior email.
>
> > And if with Ivy Bridge (or Haswell) he situation has improved.
>
> Open MPI doesn't treat these chips any different.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Brice Goglin
On a dual E5 2650 machine with FDR cards, I see the IMB Pingpong
throughput drop from 6000 to 5700MB/s when the memory isn't allocated on
the right socket (and latency increases from 0.8 to 1.4us). Of course
that's pingpong only, things will be worse on a memory-overloaded
machine. But I don't expect things to be "less worse" if you do an
intermediate copy through the memory near the HCA: you would overload
the QPI link as much as here, and you would overload the CPU even more
because of the additional copies.

Brice



On 08/07/2013 18:27, Michael Thomadakis wrote:
> People have mentioned that they experience unexpected slow downs in
> PCIe_gen3 I/O when the pages map to a socket different from the one
> the HCA connects to. It is speculated that the inter-socket QPI is not
> provisioned to transfer more than 1GiB/sec for PCIe_gen 3 traffic.
> This situation may not be in effect on all SandyBrige or IvyBridge
> systems.
>
> Have you measured anything like this on you systems as well? That
> would require using physical memory mapped to the socket w/o HCA
> exclusively for MPI messaging.
>
> Mike
>
>
> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres)
> mailto:jsquy...@cisco.com>> wrote:
>
> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis
> mailto:drmichaelt7...@gmail.com>> wrote:
>
> > The issue is that when you read or write PCIe_gen 3 dat to a
> non-local NUMA memory, SandyBridge will use the inter-socket QPIs
> to get this data across to the other socket. I think there is
> considerable limitation in PCIe I/O traffic data going over the
> inter-socket QPI. One way to get around this is for reads to
> buffer all data into memory space local to the same socket and
> then transfer them by code across to the other socket's physical
> memory. For writes the same approach can be used with intermediary
> process copying data.
>
> Sure, you'll cause congestion across the QPI network when you do
> non-local PCI reads/writes.  That's a given.
>
> But I'm not aware of a hardware limitation on PCI-requested
> traffic across QPI (I could be wrong, of course -- I'm a software
> guy, not a hardware guy).  A simple test would be to bind an MPI
> process to a far NUMA node and run a simple MPI bandwidth test and
> see if to get better/same/worse bandwidth compared to binding an
> MPI process on a near NUMA socket.
>
> But in terms of doing intermediate (pipelined) reads/writes to
> local NUMA memory before reading/writing to PCI, no, Open MPI does
> not do this.  Unless there is a PCI-QPI bandwidth constraint that
> we're unaware of, I'm not sure why you would do this -- it would
> likely add considerable complexity to the code and it would
> definitely lead to higher overall MPI latency.
>
> Don't forget that the MPI paradigm is for the application to
> provide the send/receive buffer.  Meaning: MPI doesn't (always)
> control where the buffer is located (particularly for large messages).
>
> > I was wondering if OpenMPI does anything special memory mapping
> to work around this.
>
> Just what I mentioned in the prior email.
>
> > And if with Ivy Bridge (or Haswell) he situation has improved.
>
> Open MPI doesn't treat these chips any different.
>
> --
> Jeff Squyres
> jsquy...@cisco.com 
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org 
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 an 1.7.2

2013-07-08 Thread Elken, Tom
Do you guys have any plan to support Intel Phi in the future? That is, running 
MPI code on the Phi cards or across the multicore and Phi, as Intel MPI does?
[Tom]
Hi Michael,
Because a Xeon Phi card acts a lot like a Linux host with an x86 architecture, 
you can build your own Open MPI libraries to serve this purpose.

Our team has used existing (an older 1.4.3 version of) Open MPI source to build 
an Open MPI for running MPI code on Intel Xeon Phi cards over Intel's (formerly 
QLogic's) True Scale InfiniBand fabric, and it works quite well.  We have not 
released a pre-built Open MPI as part of any Intel software release.   But I 
think if you have a compiler for Xeon Phi (Intel Compiler or GCC) and an 
interconnect for it, you should be able to build an Open MPI that works on Xeon 
Phi.
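A sketch of how such a cross-build is typically attempted (flags are illustrative only, not an Intel-released recipe; the exact options depend on your MPSS and compiler versions):

./configure --prefix=/opt/openmpi-mic \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    CC="icc -mmic" CXX="icpc -mmic" F77="ifort -mmic" FC="ifort -mmic"
make -j8 && make install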
Cheers,
Tom Elken
thanks...
Michael

On Sat, Jul 6, 2013 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
Rolf will have to answer the question on level of support. The CUDA code is not 
in the 1.6 series as it was developed after that series went "stable". It is in 
the 1.7 series, although the level of support will likely be incrementally 
increasing as that "feature" series continues to evolve.


On Jul 6, 2013, at 12:06 PM, Michael Thomadakis <drmichaelt7...@gmail.com> wrote:

> Hello OpenMPI,
>
> I am wondering what level of support is there for CUDA and GPUdirect on 
> OpenMPI 1.6.5 and 1.7.2.
>
> I saw the ./configure --with-cuda=CUDA_DIR option in the FAQ. However, it 
> seems that with configure v1.6.5 it was ignored.
>
> Can you identify GPU memory and send messages from it directly without 
> copying to host memory first?
>
>
> Or in general, what level of CUDA support is there on 1.6.5 and 1.7.2 ? Do 
> you support SDK 5.0 and above?
>
> Cheers ...
> Michael
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 an 1.7.2

2013-07-08 Thread Michael Thomadakis
Thanks Tom, that sounds good. I will give it a try as soon as our Phi host
here gets installed.

I assume that all the prerequisite libs and bins on the Phi side are
available when we download the Phi s/w stack from Intel's site, right ?

Cheers
Michael




On Mon, Jul 8, 2013 at 12:10 PM, Elken, Tom  wrote:

>Do you guys have any plan to support Intel Phi in the future? That is,
> running MPI code on the Phi cards or across the multicore and Phi, as Intel
> MPI does?
>
> *[Tom] *
>
> Hi Michael,
>
> Because a Xeon Phi card acts a lot like a Linux host with an x86
> architecture, you can build your own Open MPI libraries to serve this
> purpose.
>
> Our team has used existing (an older 1.4.3 version of) Open MPI source to
> build an Open MPI for running MPI code on Intel Xeon Phi cards over Intel’s
> (formerly QLogic’s) True Scale InfiniBand fabric, and it works quite well.
> We have not released a pre-built Open MPI as part of any Intel software
> release.   But I think if you have a compiler for Xeon Phi (Intel Compiler
> or GCC) and an interconnect for it, you should be able to build an Open MPI
> that works on Xeon Phi.  
>
> Cheers,
> Tom Elken
>
> thanks...
>
> Michael
>
>
> On Sat, Jul 6, 2013 at 2:36 PM, Ralph Castain  wrote:
>
> Rolf will have to answer the question on level of support. The CUDA code
> is not in the 1.6 series as it was developed after that series went
> "stable". It is in the 1.7 series, although the level of support will
> likely be incrementally increasing as that "feature" series continues to
> evolve.
>
>
>
> On Jul 6, 2013, at 12:06 PM, Michael Thomadakis 
> wrote:
>
> > Hello OpenMPI,
> >
> > I am wondering what level of support is there for CUDA and GPUdirect on
> OpenMPI 1.6.5 and 1.7.2.
> >
> > I saw the ./configure --with-cuda=CUDA_DIR option in the FAQ. However,
> it seems that with configure v1.6.5 it was ignored.
> >
> > Can you identify GPU memory and send messages from it directly without
> copying to host memory first?
> >
> >
> > Or in general, what level of CUDA support is there on 1.6.5 and 1.7.2 ?
> Do you support SDK 5.0 and above?
> >
> > Cheers ...
> > Michael
>
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] problem with MPI_Iexscan

2013-07-08 Thread Siegmar Gross
Hi,

today I installed openmpi-1.9a1r28730 and tried to test MPI_Iexscan()
on my machine (Solaris 10 sparc, Sun C 5.12). Unfortunately my program
breaks.

tyr xxx 105 mpicc iexscan.c 
tyr xxx 106 mpiexec -np 2 iexscan
[tyr:21094] *** An error occurred in MPI_Iexscan
[tyr:21094] *** reported by process [4097966081,0]
[tyr:21094] *** on communicator MPI_COMM_WORLD
[tyr:21094] *** MPI_ERR_INTERN: internal error
[tyr:21094] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
  will now abort,
[tyr:21094] ***and potentially your MPI job)
[tyr.informatik.hs-fulda.de:21092] 1 more process has sent help
  message help-mpi-errors.txt / mpi_errors_are_fatal
[tyr.informatik.hs-fulda.de:21092] Set MCA parameter
  "orte_base_help_aggregate" to 0 to see all help / error messages
tyr xxx 107 
tyr xxx 107 ompi_info |grep "MPI: "
Open MPI: 1.9a1r28730
tyr xxx 108 


That's the program I used for my test.

#include <stdio.h>
#include "mpi.h"

#define MAXLEN 1

int main(int argc, char *argv[])
{
   int out[MAXLEN], in[MAXLEN], i, j, k;
   int myself, tasks;
   MPI_Request request;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &myself);
   MPI_Comm_size(MPI_COMM_WORLD, &tasks);

   for(j = 1; j <= MAXLEN; j *= 10) {
  for(i = 0; i < j; i++) {
out[i] = i;
  }
  MPI_Iexscan(out, in, j, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &request);
  MPI_Wait(&request,  MPI_STATUS_IGNORE);

  if (myself != 0)
for(k = 0; k < j; k++) {
  if(in[k] != k * myself) {  
fprintf (stderr, "bad answer (%d) at index %d of %d "
 "(should be %d)\n", in[k], k, j, k*(myself));
break; 
  }
}
   }
   MPI_Barrier(MPI_COMM_WORLD);
   MPI_Finalize();
   return 0;
}


Do you have any ideas what's going wrong? Is the internal MPI error
a real internal error or is something wrong with my program, which
results in an internal error? Thank you very much for any help in
advance.


Kind regards

Siegmar



Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Michael Thomadakis
Hi Brice,

thanks for testing this out.

How did you make sure that the pinned pages used by the I/O adapter mapped
to the "other" socket's memory controller? Is pinning the MPI binary to a
socket sufficient to pin the space used for MPI I/O to that socket as well?
I think this is something done by and at the HCA device driver level.

Anyway, as long as the memory performance difference is at the levels you
mentioned, there is no "big" issue. Most likely the device driver gets
space from the same NUMA domain as the socket the HCA is attached to.

Thanks for trying it out
Michael






On Mon, Jul 8, 2013 at 11:45 AM, Brice Goglin  wrote:

>  On a dual E5 2650 machine with FDR cards, I see the IMB Pingpong
> throughput drop from 6000 to 5700MB/s when the memory isn't allocated on
> the right socket (and latency increases from 0.8 to 1.4us). Of course
> that's pingpong only, things will be worse on a memory-overloaded machine.
> But I don't expect things to be "less worse" if you do an intermediate copy
> through the memory near the HCA: you would overload the QPI link as much as
> here, and you would overload the CPU even more because of the additional
> copies.
>
> Brice
>
>
>
> On 08/07/2013 18:27, Michael Thomadakis wrote:
>
> People have mentioned that they experience unexpected slow downs in
> PCIe_gen3 I/O when the pages map to a socket different from the one the HCA
> connects to. It is speculated that the inter-socket QPI is not provisioned
> to transfer more than 1GiB/sec for PCIe_gen 3 traffic. This situation may
> not be in effect on all SandyBrige or IvyBridge systems.
>
>  Have you measured anything like this on you systems as well? That would
> require using physical memory mapped to the socket w/o HCA exclusively for
> MPI messaging.
>
>  Mike
>
>
> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis 
>> wrote:
>>
>> > The issue is that when you read or write PCIe_gen 3 dat to a non-local
>> NUMA memory, SandyBridge will use the inter-socket QPIs to get this data
>> across to the other socket. I think there is considerable limitation in
>> PCIe I/O traffic data going over the inter-socket QPI. One way to get
>> around this is for reads to buffer all data into memory space local to the
>> same socket and then transfer them by code across to the other socket's
>> physical memory. For writes the same approach can be used with intermediary
>> process copying data.
>>
>>  Sure, you'll cause congestion across the QPI network when you do
>> non-local PCI reads/writes.  That's a given.
>>
>> But I'm not aware of a hardware limitation on PCI-requested traffic
>> across QPI (I could be wrong, of course -- I'm a software guy, not a
>> hardware guy).  A simple test would be to bind an MPI process to a far NUMA
>> node and run a simple MPI bandwidth test and see if to get
>> better/same/worse bandwidth compared to binding an MPI process on a near
>> NUMA socket.
>>
>> But in terms of doing intermediate (pipelined) reads/writes to local NUMA
>> memory before reading/writing to PCI, no, Open MPI does not do this.
>>  Unless there is a PCI-QPI bandwidth constraint that we're unaware of, I'm
>> not sure why you would do this -- it would likely add considerable
>> complexity to the code and it would definitely lead to higher overall MPI
>> latency.
>>
>> Don't forget that the MPI paradigm is for the application to provide the
>> send/receive buffer.  Meaning: MPI doesn't (always) control where the
>> buffer is located (particularly for large messages).
>>
>> > I was wondering if OpenMPI does anything special memory mapping to work
>> around this.
>>
>>  Just what I mentioned in the prior email.
>>
>> > And if with Ivy Bridge (or Haswell) he situation has improved.
>>
>>  Open MPI doesn't treat these chips any different.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
> ___
> users mailing list
> users@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Brice Goglin
The driver doesn't allocate much memory here. Maybe some small control
buffers, but nothing significantly involved in large message transfer
performance. Everything critical here is allocated by user-space (either
MPI lib or application), so we just have to make sure we bind the
process memory properly. I used hwloc-bind to do that.
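For reference, that kind of binding looks roughly like this (host names and locations are illustrative; socket:1/node:1 being the package without the HCA):

mpirun -np 2 -host nodeA,nodeB \
    hwloc-bind --cpubind socket:1 --membind node:1 -- ./IMB-MPI1 PingPong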

Note that we have seen larger issues on older platforms. You basically
just need a big HCA and PCI link on a not-so-big machine. Fortunately that is
not very common with today's QPI links between Sandy Bridge sockets, which
are quite big compared to the PCI Gen3 x8 link to the HCA. On old AMD
platforms (and modern Intels with big GPUs), issues are not that
uncommon (we've seen up to 40% DMA bandwidth difference there).

Brice



On 08/07/2013 19:44, Michael Thomadakis wrote:
> Hi Brice, 
>
> thanks for testing this out.
>
> How did you make sure that the pinned pages used by the I/O adapter
> mapped to the "other" socket's memory controller ? Is pining the MPI
> binary to a socket sufficient to pin the space used for MPI I/O as
> well to that socket? I think this is something done by and at the HCA
> device driver level. 
>
> Anyways, as long as the memory performance difference is a the levels
> you mentioned then there is no "big" issue. Most likely the device
> driver get space from the same numa domain that of the socket the HCA
> is attached to. 
>
> Thanks for trying it out
> Michael
>
>
>
>
>
>
> On Mon, Jul 8, 2013 at 11:45 AM, Brice Goglin  > wrote:
>
> On a dual E5 2650 machine with FDR cards, I see the IMB Pingpong
> throughput drop from 6000 to 5700MB/s when the memory isn't
> allocated on the right socket (and latency increases from 0.8 to
> 1.4us). Of course that's pingpong only, things will be worse on a
> memory-overloaded machine. But I don't expect things to be "less
> worse" if you do an intermediate copy through the memory near the
> HCA: you would overload the QPI link as much as here, and you
> would overload the CPU even more because of the additional copies.
>
> Brice
>
>
>
> On 08/07/2013 18:27, Michael Thomadakis wrote:
>> People have mentioned that they experience unexpected slow downs
>> in PCIe_gen3 I/O when the pages map to a socket different from
>> the one the HCA connects to. It is speculated that the
>> inter-socket QPI is not provisioned to transfer more than
>> 1GiB/sec for PCIe_gen 3 traffic. This situation may not be in
>> effect on all SandyBrige or IvyBridge systems.
>>
>> Have you measured anything like this on you systems as well? That
>> would require using physical memory mapped to the socket w/o HCA
>> exclusively for MPI messaging.
>>
>> Mike
>>
>>
>> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres)
>> mailto:jsquy...@cisco.com>> wrote:
>>
>> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis
>> mailto:drmichaelt7...@gmail.com>>
>> wrote:
>>
>> > The issue is that when you read or write PCIe_gen 3 dat to
>> a non-local NUMA memory, SandyBridge will use the
>> inter-socket QPIs to get this data across to the other
>> socket. I think there is considerable limitation in PCIe I/O
>> traffic data going over the inter-socket QPI. One way to get
>> around this is for reads to buffer all data into memory space
>> local to the same socket and then transfer them by code
>> across to the other socket's physical memory. For writes the
>> same approach can be used with intermediary process copying data.
>>
>> Sure, you'll cause congestion across the QPI network when you
>> do non-local PCI reads/writes.  That's a given.
>>
>> But I'm not aware of a hardware limitation on PCI-requested
>> traffic across QPI (I could be wrong, of course -- I'm a
>> software guy, not a hardware guy).  A simple test would be to
>> bind an MPI process to a far NUMA node and run a simple MPI
>> bandwidth test and see if to get better/same/worse bandwidth
>> compared to binding an MPI process on a near NUMA socket.
>>
>> But in terms of doing intermediate (pipelined) reads/writes
>> to local NUMA memory before reading/writing to PCI, no, Open
>> MPI does not do this.  Unless there is a PCI-QPI bandwidth
>> constraint that we're unaware of, I'm not sure why you would
>> do this -- it would likely add considerable complexity to the
>> code and it would definitely lead to higher overall MPI latency.
>>
>> Don't forget that the MPI paradigm is for the application to
>> provide the send/receive buffer.  Meaning: MPI doesn't
>> (always) control where the buffer is located (particularly
>> for large messages).
>>
>> > I was wondering if OpenMPI does anything special memory
>> mapping to work around this.
>>
>>

Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Jeff Squyres (jsquyres)
Cisco hasn't been involved in IB for several years, so I can't comment on that 
directly.

That being said, our Cisco VIC devices are PCI gen *2*, but they are x16 (not 
x8).  We can get full bandwidth out of our 2*10Gb device from remote NUMA nodes 
on E5-2690-based machines (Sandy Bridge) for large messages.  In the lab, we 
have... tweaked... versions of those devices that give significantly higher 
bandwidth (it's no secret that the ASIC on these devices is capable of 80Gb).

We haven't looked for this specific issue, but I can confirm that we have seen 
the bandwidth that we expected out of our devices.


On Jul 8, 2013, at 12:27 PM, Michael Thomadakis  
wrote:

> People have mentioned that they experience unexpected slow downs in PCIe_gen3 
> I/O when the pages map to a socket different from the one the HCA connects 
> to. It is speculated that the inter-socket QPI is not provisioned to transfer 
> more than 1GiB/sec for PCIe_gen 3 traffic. This situation may not be in 
> effect on all SandyBrige or IvyBridge systems.
> 
> Have you measured anything like this on you systems as well? That would 
> require using physical memory mapped to the socket w/o HCA exclusively for 
> MPI messaging.
> 
> Mike
> 
> 
> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres)  
> wrote:
> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis  
> wrote:
> 
> > The issue is that when you read or write PCIe_gen 3 dat to a non-local NUMA 
> > memory, SandyBridge will use the inter-socket QPIs to get this data across 
> > to the other socket. I think there is considerable limitation in PCIe I/O 
> > traffic data going over the inter-socket QPI. One way to get around this is 
> > for reads to buffer all data into memory space local to the same socket and 
> > then transfer them by code across to the other socket's physical memory. 
> > For writes the same approach can be used with intermediary process copying 
> > data.
> 
> Sure, you'll cause congestion across the QPI network when you do non-local 
> PCI reads/writes.  That's a given.
> 
> But I'm not aware of a hardware limitation on PCI-requested traffic across 
> QPI (I could be wrong, of course -- I'm a software guy, not a hardware guy).  
> A simple test would be to bind an MPI process to a far NUMA node and run a 
> simple MPI bandwidth test and see if to get better/same/worse bandwidth 
> compared to binding an MPI process on a near NUMA socket.
> 
> But in terms of doing intermediate (pipelined) reads/writes to local NUMA 
> memory before reading/writing to PCI, no, Open MPI does not do this.  Unless 
> there is a PCI-QPI bandwidth constraint that we're unaware of, I'm not sure 
> why you would do this -- it would likely add considerable complexity to the 
> code and it would definitely lead to higher overall MPI latency.
> 
> Don't forget that the MPI paradigm is for the application to provide the 
> send/receive buffer.  Meaning: MPI doesn't (always) control where the buffer 
> is located (particularly for large messages).
> 
> > I was wondering if OpenMPI does anything special memory mapping to work 
> > around this.
> 
> Just what I mentioned in the prior email.
> 
> > And if with Ivy Bridge (or Haswell) he situation has improved.
> 
> Open MPI doesn't treat these chips any different.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Jeff Squyres (jsquyres)
On Jul 8, 2013, at 2:01 PM, Brice Goglin  wrote:

> The driver doesn't allocate much memory here. Maybe some small control 
> buffers, but nothing significantly involved in large message transfer 
> performance. Everything critical here is allocated by user-space (either MPI 
> lib or application), so we just have to make sure we bind the process memory 
> properly. I used hwloc-bind to do that.

+1

Remember that the point of IB and other operating-system bypass devices is that 
the driver is not involved in the fast path of sending / receiving.  One of the 
side-effects of that design point is that userspace does all the allocation of 
send / receive buffers.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 an 1.7.2

2013-07-08 Thread Elken, Tom

Thanks Tom, that sounds good. I will give it a try as soon as our Phi host here 
gets installed.

I assume that all the prerequisite libs and bins on the Phi side are available 
when we download the Phi s/w stack from Intel's site, right ?
[Tom]
Right.  When you install Intel's MPSS (Manycore Platform Software Stack), 
including following the section on "OFED Support" in the readme file, you 
should have all the prerequisite libs and bins.  Note that I have not built 
Open MPI for Xeon Phi for your interconnect, but it seems to me that it should 
work.

-Tom

Cheers
Michael



On Mon, Jul 8, 2013 at 12:10 PM, Elken, Tom <tom.el...@intel.com> wrote:
Do you guys have any plan to support Intel Phi in the future? That is, running 
MPI code on the Phi cards or across the multicore and Phi, as Intel MPI does?
[Tom]
Hi Michael,
Because a Xeon Phi card acts a lot like a Linux host with an x86 architecture, 
you can build your own Open MPI libraries to serve this purpose.

Our team has used existing (an older 1.4.3 version of) Open MPI source to build 
an Open MPI for running MPI code on Intel Xeon Phi cards over Intel's (formerly 
QLogic's) True Scale InfiniBand fabric, and it works quite well.  We have not 
released a pre-built Open MPI as part of any Intel software release.   But I 
think if you have a compiler for Xeon Phi (Intel Compiler or GCC) and an 
interconnect for it, you should be able to build an Open MPI that works on Xeon 
Phi.
Cheers,
Tom Elken
thanks...
Michael

On Sat, Jul 6, 2013 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
Rolf will have to answer the question on level of support. The CUDA code is not 
in the 1.6 series as it was developed after that series went "stable". It is in 
the 1.7 series, although the level of support will likely be incrementally 
increasing as that "feature" series continues to evolve.


On Jul 6, 2013, at 12:06 PM, Michael Thomadakis <drmichaelt7...@gmail.com> wrote:

> Hello OpenMPI,
>
> I am wondering what level of support is there for CUDA and GPUdirect on 
> OpenMPI 1.6.5 and 1.7.2.
>
> I saw the ./configure --with-cuda=CUDA_DIR option in the FAQ. However, it 
> seems that with configure v1.6.5 it was ignored.
>
> Can you identify GPU memory and send messages from it directly without 
> copying to host memory first?
>
>
> Or in general, what level of CUDA support is there on 1.6.5 and 1.7.2 ? Do 
> you support SDK 5.0 and above?
>
> Cheers ...
> Michael
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 an 1.7.2

2013-07-08 Thread Michael Thomadakis
Thanks Tom, I will test it out...
regards
Michael


On Mon, Jul 8, 2013 at 1:16 PM, Elken, Tom  wrote:

>
> Thanks Tom, that sounds good. I will give it a try as soon as our Phi host
> here host gets installed. 
>
>
> I assume that all the prerequisite libs and bins on the Phi side are
> available when we download the Phi s/w stack from Intel's site, right ?***
> *
>
> *[Tom] *
>
> *Right.  When you install Intel’s MPSS (Manycore Platform Software
> Stack), including following the section on “OFED Support” in the readme
> file, you should have all the prerequisite libs and bins.  Note that I have
> not built Open MPI for Xeon Phi for your interconnect, but it seems to me
> that it should work. *
>
>
> *-Tom*
>
>
> Cheers
>
> Michael
>
>
> On Mon, Jul 8, 2013 at 12:10 PM, Elken, Tom  wrote:
>
> Do you guys have any plan to support Intel Phi in the future? That is,
> running MPI code on the Phi cards or across the multicore and Phi, as Intel
> MPI does?
>
> *[Tom] *
>
> Hi Michael,
>
> Because a Xeon Phi card acts a lot like a Linux host with an x86
> architecture, you can build your own Open MPI libraries to serve this
> purpose.
>
> Our team has used existing (an older 1.4.3 version of) Open MPI source to
> build an Open MPI for running MPI code on Intel Xeon Phi cards over Intel’s
> (formerly QLogic’s) True Scale InfiniBand fabric, and it works quite well.
> We have not released a pre-built Open MPI as part of any Intel software
> release.   But I think if you have a compiler for Xeon Phi (Intel Compiler
> or GCC) and an interconnect for it, you should be able to build an Open MPI
> that works on Xeon Phi.  
>
> Cheers,
> Tom Elken
>
> thanks...
>
> Michael
>
>  
>
> On Sat, Jul 6, 2013 at 2:36 PM, Ralph Castain  wrote:
>
> Rolf will have to answer the question on level of support. The CUDA code
> is not in the 1.6 series as it was developed after that series went
> "stable". It is in the 1.7 series, although the level of support will
> likely be incrementally increasing as that "feature" series continues to
> evolve.
>
>
>
> On Jul 6, 2013, at 12:06 PM, Michael Thomadakis 
> wrote:
>
> > Hello OpenMPI,
> >
> > I am wondering what level of support is there for CUDA and GPUdirect on
> OpenMPI 1.6.5 and 1.7.2.
> >
> > I saw the ./configure --with-cuda=CUDA_DIR option in the FAQ. However,
> it seems that with configure v1.6.5 it was ignored.
> >
> > Can you identify GPU memory and send messages from it directly without
> copying to host memory first?
> >
> >
> > Or in general, what level of CUDA support is there on 1.6.5 and 1.7.2 ?
> Do you support SDK 5.0 and above?
> >
> > Cheers ...
> > Michael
>
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>  
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Michael Thomadakis
| The driver doesn't allocate much memory here. Maybe some small control
| buffers, but nothing significantly involved in large message transfer
| performance. Everything critical here is allocated by user-space (either
| MPI lib or application), so we just have to make sure we bind the
| process memory properly. I used hwloc-bind to do that.

I see ... So the user-level process (user or MPI library) sets aside memory
(malloc?) and then the OFED/IB stack sets up RDMA messaging with
addresses pointing back to that user physical memory. I guess before
running the MPI benchmark you requested a data memory allocation policy that
allocates pages "owned" by the other socket?

| Note that we have seen larger issues on older platforms. You basically
| just need a big HCA and PCI link on a not-so-big machine. Not very
| common fortunately with todays QPI links between Sandy-Bridge socket,
| those are quite big compared to PCI Gen3 8x links to the HCA. On
| old AMD platforms (and modern Intels with big GPUs), issues are not that
| uncommon (we've seen up to 40% DMA bandwidth difference
| there).

The issue that has been observed is with PCIe gen 3 traffic on attached I/O
which, say, reads data off of the HCA and has to store it to memory, but
that memory belongs to the other socket. In that case the PCIe data uses
the QPI links on SB to send these packets over to the other socket. It has
been speculated that the QPI links were NOT provisioned to transfer more than
1 GiB/sec of PCIe data alongside the regular inter-NUMA memory traffic. It may
be the case that Intel has re-provisioned QPI to be able to accommodate
more PCIe traffic.

Thanks again
Michael



On Mon, Jul 8, 2013 at 1:01 PM, Brice Goglin  wrote:

>  The driver doesn't allocate much memory here. Maybe some small control
> buffers, but nothing significantly involved in large message transfer
> performance. Everything critical here is allocated by user-space (either
> MPI lib or application), so we just have to make sure we bind the process
> memory properly. I used hwloc-bind to do that.
>
> Note that we have seen larger issues on older platforms. You basically
> just need a big HCA and PCI link on a not-so-big machine. Not very common
> fortunately with todays QPI links between Sandy-Bridge socket, those are
> quite big compared to PCI Gen3 8x links to the HCA. On old AMD platforms
> (and modern Intels with big GPUs), issues are not that uncommon (we've seen
> up to 40% DMA bandwidth difference there).
>
> Brice
>
>
>
> On 08/07/2013 19:44, Michael Thomadakis wrote:
>
>  Hi Brice,
>
>  thanks for testing this out.
>
>  How did you make sure that the pinned pages used by the I/O adapter
> mapped to the "other" socket's memory controller ? Is pining the MPI binary
> to a socket sufficient to pin the space used for MPI I/O as well to that
> socket? I think this is something done by and at the HCA device driver
> level.
>
>  Anyways, as long as the memory performance difference is a the levels
> you mentioned then there is no "big" issue. Most likely the device driver
> get space from the same numa domain that of the socket the HCA is attached
> to.
>
>  Thanks for trying it out
> Michael
>
>
>
>
>
>
>  On Mon, Jul 8, 2013 at 11:45 AM, Brice Goglin wrote:
>
>>  On a dual E5 2650 machine with FDR cards, I see the IMB Pingpong
>> throughput drop from 6000 to 5700MB/s when the memory isn't allocated on
>> the right socket (and latency increases from 0.8 to 1.4us). Of course
>> that's pingpong only, things will be worse on a memory-overloaded machine.
>> But I don't expect things to be "less worse" if you do an intermediate copy
>> through the memory near the HCA: you would overload the QPI link as much as
>> here, and you would overload the CPU even more because of the additional
>> copies.
>>
>> Brice
>>
>>
>>
>> On 08/07/2013 18:27, Michael Thomadakis wrote:
>>
>> People have mentioned that they experience unexpected slow downs in
>> PCIe_gen3 I/O when the pages map to a socket different from the one the HCA
>> connects to. It is speculated that the inter-socket QPI is not provisioned
>> to transfer more than 1GiB/sec for PCIe_gen 3 traffic. This situation may
>> not be in effect on all SandyBrige or IvyBridge systems.
>>
>>  Have you measured anything like this on you systems as well? That would
>> require using physical memory mapped to the socket w/o HCA exclusively for
>> MPI messaging.
>>
>>  Mike
>>
>>
>> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres) <
>> jsquy...@cisco.com> wrote:
>>
>>> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis <
>>> drmichaelt7...@gmail.com> wrote:
>>>
>>> > The issue is that when you read or write PCIe_gen 3 dat to a non-local
>>> NUMA memory, SandyBridge will use the inter-socket QPIs to get this data
>>> across to the other socket. I think there is considerable limitation in
>>> PCIe I/O traffic data going over the inter-socket QPI. One way to get
>>> around this is for reads to buffer all data into memory space local to the

Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Michael Thomadakis
| Remember that the point of IB and other operating-system bypass devices
| is that the driver is not involved in the fast path of sending /
| receiving.  One of the side-effects of that design point is that
| userspace does all the allocation of send / receive buffers.

That's a good point. It was not clear to me who allocates this memory and by
what logic. But for IB it definitely makes sense that the user provides
pointers to their own memory.

thanks
Michael




On Mon, Jul 8, 2013 at 1:07 PM, Jeff Squyres (jsquyres)
wrote:

> On Jul 8, 2013, at 2:01 PM, Brice Goglin  wrote:
>
> > The driver doesn't allocate much memory here. Maybe some small control
> buffers, but nothing significantly involved in large message transfer
> performance. Everything critical here is allocated by user-space (either
> MPI lib or application), so we just have to make sure we bind the process
> memory properly. I used hwloc-bind to do that.
>
> +1
>
> Remember that the point of IB and other operating-system bypass devices is
> that the driver is not involved in the fast path of sending / receiving.
>  One of the side-effects of that design point is that userspace does all
> the allocation of send / receive buffers.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 an 1.7.2

2013-07-08 Thread Tim Carlson

On Mon, 8 Jul 2013, Elken, Tom wrote:

It isn't quite so easy.

Out of the box, there is no gcc on the Phi card. You can use the cross 
compiler on the host, but you don't get gcc on the Phi by default.


See this post http://software.intel.com/en-us/forums/topic/382057

I really think you would need to build and install gcc on the Phi first.

My first pass at doing a cross-compile with the GNU compilers failed to 
produce something with OFED support (not surprising)


export PATH=/usr/linux-k1om-4.7/bin:$PATH
./configure --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
--disable-mpi-f77

checking if MCA component btl:openib can compile... no
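
(A quick sanity check I could do, on the assumption that the failure is simply
the k1om cross toolchain not seeing the OFED/verbs headers and libraries; the
little test program below is a throwaway:)

cat > verbs_check.c <<'EOF'
#include <infiniband/verbs.h>
/* reference a libibverbs symbol so the link step is meaningful */
int main(void) { ibv_get_device_list(0); return 0; }
EOF
x86_64-k1om-linux-gcc verbs_check.c -libverbs -o verbs_check \
  && echo "k1om toolchain can build against verbs"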


Tim



 

Thanks Tom, that sounds good. I will give it a try as soon as our Phi host
here gets installed.

 

I assume that all the prerequisite libs and bins on the Phi side are
available when we download the Phi s/w stack from Intel's site, right ?

[Tom]

Right.  When you install Intel's MPSS (Manycore Platform Software Stack),
including following the section on "OFED Support" in the readme file, you
should have all the prerequisite libs and bins.  Note that I have not built
Open MPI for Xeon Phi for your interconnect, but it seems to me that it
should work.

 

-Tom

 

Cheers

Michael

 

 

 

On Mon, Jul 8, 2013 at 12:10 PM, Elken, Tom  wrote:

Do you guys have any plan to support Intel Phi in the future? That is,
running MPI code on the Phi cards or across the multicore and Phi, as Intel
MPI does?

[Tom]

Hi Michael,

Because a Xeon Phi card acts a lot like a Linux host with an x86
architecture, you can build your own Open MPI libraries to serve this
purpose.

Our team has used existing (an older 1.4.3 version of) Open MPI source to
build an Open MPI for running MPI code on Intel Xeon Phi cards over Intel's
(formerly QLogic's) True Scale InfiniBand fabric, and it works quite well.
We have not released a pre-built Open MPI as part of any Intel software
release.   But I think if you have a compiler for Xeon Phi (Intel Compiler
or GCC) and an interconnect for it, you should be able to build an Open MPI
that works on Xeon Phi. 

Cheers,
Tom Elken

thanks...

Michael

 

On Sat, Jul 6, 2013 at 2:36 PM, Ralph Castain  wrote:

Rolf will have to answer the question on level of support. The CUDA code is
not in the 1.6 series as it was developed after that series went "stable".
It is in the 1.7 series, although the level of support will likely be
incrementally increasing as that "feature" series continues to evolve.



On Jul 6, 2013, at 12:06 PM, Michael Thomadakis 
wrote:

> Hello OpenMPI,
>
> I am wondering what level of support is there for CUDA and GPUdirect on
OpenMPI 1.6.5 and 1.7.2.
>
> I saw the ./configure --with-cuda=CUDA_DIR option in the FAQ. However, it
seems that with configure v1.6.5 it was ignored.
>
> Can you identify GPU memory and send messages from it directly without
copying to host memory first?
>
>
> Or in general, what level of CUDA support is there on 1.6.5 and 1.7.2 ? Do
you support SDK 5.0 and above?
>
> Cheers ...
> Michael

> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

 


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

 





Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 an 1.7.2

2013-07-08 Thread Michael Thomadakis
Hi Tim,


Well, in general (and not just on MIC) I usually build the MPI stacks using the
Intel compiler set. Have you run into s/w that requires GCC instead of the
Intel compilers (besides Nvidia CUDA)? Did you try to use the Intel compiler to
produce MIC-native code (the OpenMPI stack, for that matter)?

regards
Michael


On Mon, Jul 8, 2013 at 4:30 PM, Tim Carlson  wrote:

> On Mon, 8 Jul 2013, Elken, Tom wrote:
>
> It isn't quite so easy.
>
> Out of the box, there is no gcc on the Phi card. You can use the cross
> compiler on the host, but you don't get gcc on the Phi by default.
>
> See this post 
> http://software.intel.com/en-us/forums/topic/382057
>
> I really think you would need to build and install gcc on the Phi first.
>
> My first pass at doing a cross-compile with the GNU compilers failed to
> produce something with OFED support (not surprising)
>
> export PATH=/usr/linux-k1om-4.7/bin:$PATH
> ./configure --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
> --disable-mpi-f77
>
> checking if MCA component btl:openib can compile... no
>
>
> Tim
>
>
>
>>
>>
>> Thanks Tom, that sounds good. I will give it a try as soon as our Phi host
>> here host gets installed.
>>
>>
>>
>> I assume that all the prerequisite libs and bins on the Phi side are
>> available when we download the Phi s/w stack from Intel's site, right ?
>>
>> [Tom]
>>
>> Right.  When you install Intel's MPSS (Manycore Platform Software Stack),
>> including following the section on "OFED Support" in the readme file, you
>> should have all the prerequisite libs and bins.  Note that I have not
>> built
>> Open MPI for Xeon Phi for your interconnect, but it seems to me that it
>> should work.
>>
>>
>>
>> -Tom
>>
>>
>>
>> Cheers
>>
>> Michael
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Jul 8, 2013 at 12:10 PM, Elken, Tom  wrote:
>>
>> Do you guys have any plan to support Intel Phi in the future? That is,
>> running MPI code on the Phi cards or across the multicore and Phi, as
>> Intel
>> MPI does?
>>
>> [Tom]
>>
>> Hi Michael,
>>
>> Because a Xeon Phi card acts a lot like a Linux host with an x86
>> architecture, you can build your own Open MPI libraries to serve this
>> purpose.
>>
>> Our team has used existing (an older 1.4.3 version of) Open MPI source to
>> build an Open MPI for running MPI code on Intel Xeon Phi cards over
>> Intel's
>> (formerly QLogic's) True Scale InfiniBand fabric, and it works quite
>> well.
>> We have not released a pre-built Open MPI as part of any Intel software
>> release.   But I think if you have a compiler for Xeon Phi (Intel Compiler
>> or GCC) and an interconnect for it, you should be able to build an Open
>> MPI
>> that works on Xeon Phi.
>>
>> Cheers,
>> Tom Elken
>>
>> thanks...
>>
>> Michael
>>
>>
>>
>> On Sat, Jul 6, 2013 at 2:36 PM, Ralph Castain  wrote:
>>
>> Rolf will have to answer the question on level of support. The CUDA code
>> is
>> not in the 1.6 series as it was developed after that series went "stable".
>> It is in the 1.7 series, although the level of support will likely be
>> incrementally increasing as that "feature" series continues to evolve.
>>
>>
>>
>> On Jul 6, 2013, at 12:06 PM, Michael Thomadakis > >
>> wrote:
>>
>> > Hello OpenMPI,
>> >
>> > I am wondering what level of support is there for CUDA and GPUdirect on
>> OpenMPI 1.6.5 and 1.7.2.
>> >
>> > I saw the ./configure --with-cuda=CUDA_DIR option in the FAQ. However,
>> it
>> seems that with configure v1.6.5 it was ignored.
>> >
>> > Can you identify GPU memory and send messages from it directly without
>> copying to host memory first?
>> >
>> >
>> > Or in general, what level of CUDA support is there on 1.6.5 and 1.7.2 ?
>> Do
>> you support SDK 5.0 and above?
>> >
>> > Cheers ...
>> > Michael
>>
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>>
>>
>>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 an 1.7.2

2013-07-08 Thread Elken, Tom

Hi Tim,


Well, in general and not on MIC I usually build the MPI stacks using the Intel 
compiler set. Have you ran into s/w that requires GCC instead of Intel 
compilers (beside Nvidia Cuda)? Did you try to use Intel compiler to produce 
MIC native code (the OpenMPI stack for that matter)?
[Tom]
Good idea, Michael.  With the Intel Compiler, I would use the -mmic flag to
build MIC code.
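
A rough sketch of what that configure invocation might look like (untested
here; the install prefix is just a placeholder, and the Fortran layers are
disabled the same way as in the script further below):

./configure --host=x86_64-k1om-linux \
--enable-mpi-f77=no --enable-mpi-f90=no \
--prefix=/opt/openmpi-mic \
CC="icc -mmic" CXX="icpc -mmic" \
AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib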

Tim wrote: "My first pass at doing a cross-compile with the GNU compilers 
failed to produce something with OFED support (not surprising)

export PATH=/usr/linux-k1om-4.7/bin:$PATH
./configure --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
--disable-mpi-f77

checking if MCA component btl:openib can compile... no

Regarding a GNU cross compiler, this worked for one of our engineers building
for True Scale HCAs and PSM/infinipath.  But it may give useful tips for
building for btl:openib as well:

3. How to configure/compile OpenMPI:
   a). untar the openmpi tarball.
   b). edit configure in the top directory, add '-linfinipath' after
       '-lpsm_infinipath' (not necessary for messages, only for command lines).

   c). run the following script,
#!/bin/sh

./configure \
--host=x86_64-k1om-linux \
--enable-mpi-f77=no --enable-mpi-f90=no \
--with-psm=/.../psm-7.6 \
--prefix=/.../openmpi \
CC=x86_64-k1om-linux-gcc  CXX=x86_64-k1om-linux-g++ \
AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib

   d). run 'make' and 'make install'

OK, I see that they did not configure for mpi-f77 & mpif90, but perhaps this is 
still helpful, if the AR and RANLIB flags are important.
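
For the verbs path, an adaptation along these lines might be worth a try
(untested; it assumes the MPSS-provided k1om OFED headers and libraries are
visible to the cross toolchain, and both the --with-openib path and the prefix
are placeholders):

#!/bin/sh

./configure \
--host=x86_64-k1om-linux \
--enable-mpi-f77=no --enable-mpi-f90=no \
--with-openib=/usr \
--prefix=/opt/openmpi-mic \
CC=x86_64-k1om-linux-gcc  CXX=x86_64-k1om-linux-g++ \
AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib
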
-Tom



regards
Michael

On Mon, Jul 8, 2013 at 4:30 PM, Tim Carlson <tim.carl...@pnl.gov> wrote:
On Mon, 8 Jul 2013, Elken, Tom wrote:

It isn't quite so easy.

Out of the box, there is no gcc on the Phi card. You can use the cross compiler 
on the host, but you don't get gcc on the Phi by default.

See this post http://software.intel.com/en-us/forums/topic/382057

I really think you would need to build and install gcc on the Phi first.

My first pass at doing a cross-compile with the GNU compilers failed to produce 
something with OFED support (not surprising)

export PATH=/usr/linux-k1om-4.7/bin:$PATH
./configure --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
--disable-mpi-f77

checking if MCA component btl:openib can compile... no


Tim




Thanks Tom, that sounds good. I will give it a try as soon as our Phi host
here host gets installed.



I assume that all the prerequisite libs and bins on the Phi side are
available when we download the Phi s/w stack from Intel's site, right ?

[Tom]

Right.  When you install Intel's MPSS (Manycore Platform Software Stack),
including following the section on "OFED Support" in the readme file, you
should have all the prerequisite libs and bins.  Note that I have not built
Open MPI for Xeon Phi for your interconnect, but it seems to me that it
should work.



-Tom



Cheers

Michael







On Mon, Jul 8, 2013 at 12:10 PM, Elken, Tom <tom.el...@intel.com> wrote:

Do you guys have any plan to support Intel Phi in the future? That is,
running MPI code on the Phi cards or across the multicore and Phi, as Intel
MPI does?

[Tom]

Hi Michael,

Because a Xeon Phi card acts a lot like a Linux host with an x86
architecture, you can build your own Open MPI libraries to serve this
purpose.

Our team has used existing (an older 1.4.3 version of) Open MPI source to
build an Open MPI for running MPI code on Intel Xeon Phi cards over Intel's
(formerly QLogic's) True Scale InfiniBand fabric, and it works quite well.
We have not released a pre-built Open MPI as part of any Intel software
release.   But I think if you have a compiler for Xeon Phi (Intel Compiler
or GCC) and an interconnect for it, you should be able to build an Open MPI
that works on Xeon Phi.

Cheers,
Tom Elken

thanks...

Michael



On Sat, Jul 6, 2013 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:

Rolf will have to answer the question on level of support. The CUDA code is
not in the 1.6 series as it was developed after that series went "stable".
It is in the 1.7 series, although the level of support will likely be
incrementally increasing as that "feature" series continues to evolve.



On Jul 6, 2013, at 12:06 PM, Michael Thomadakis <drmichaelt7...@gmail.com> wrote:

> Hello OpenMPI,
>
> I am wondering what level of support is there for CUDA and GPUdirect on
OpenMPI 1.6.5 and 1.7.2.
>
> I saw the ./configure --with-cuda=CUDA_DIR option in the FAQ. However, it
seems that with configure v1.6.5 it was ignored.
>
> Can you identify GPU memory and send messages from it directly without
copying to host memory first?
>
>
> Or in general, what level of CUDA support is there on 1.6.5 and 1.7.2 ? Do
you support SDK 5.0 and above?
>
> Cheers ...
> Michael

> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mp

Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 an 1.7.2

2013-07-08 Thread Tim Carlson

On Mon, 8 Jul 2013, Elken, Tom wrote:

My mistake on the OFED bits. The host I was installing on did not have all 
of the MPSS software installed (my cluster admin node and not one of the 
compute nodes). Adding the intel-mic-ofed-card RPM fixed the problem with 
compiling the btl:openib bits with both the GNU and Intel compilers using 
the cross-compiler route (-mmic on the Intel side)


Still working on getting the resulting mpicc wrapper working on the MIC 
side. When I get a working example I'll post the results.
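
In case it is useful, the sequence I have in mind looks roughly like the
following (the install prefix, card name, and library locations are only
placeholders until I have actually verified it):

export PATH=/opt/openmpi-mic/bin:$PATH
mpicc -o hello_mic hello.c                     # wrapper invokes the k1om cross compiler
scp hello_mic mic0:/tmp/                       # push the binary to the card
scp -r /opt/openmpi-mic mic0:/tmp/openmpi-mic  # the Open MPI runtime has to be on the card too
ssh mic0 "LD_LIBRARY_PATH=/tmp/openmpi-mic/lib \
  /tmp/openmpi-mic/bin/mpirun -np 4 /tmp/hello_mic"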


Thanks!

Tim




 

Hi Tim,

 

 

Well, in general and not on MIC I usually build the MPI stacks using the
Intel compiler set. Have you ran into s/w that requires GCC instead of Intel
compilers (beside Nvidia Cuda)? Did you try to use Intel compiler to produce
MIC native code (the OpenMPI stack for that matter)? 

[Tom]

Good idea Michael,  With the Intel Compiler, I would use the -mmic flag to
build MIC code.

 

Tim wrote: "My first pass at doing a cross-compile with the GNU compilers
failed to produce something with OFED support (not surprising)

export PATH=/usr/linux-k1om-4.7/bin:$PATH
./configure --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
--disable-mpi-f77

checking if MCA component btl:openib can compile... no

 

Regarding a Gnu cross compiler, this worked for one of our engineers
building for True Scale HCAs and PSM/infinipath.  But it may give useful
tips for building for btl:openib as well:

 

3. How to configure/compile OpenMPI:

   a). untar the openmpi tarball.

   b). edit configure in top directory, add '-linfinipath' after
'-lpsm_infinipath "

       but not necessary for messages, only for command lines.

 

   c). run the following script,

#!/bin/sh

 

./configure \

--host=x86_64-k1om-linux \

--enable-mpi-f77=no --enable-mpi-f90=no \

--with-psm=/.../psm-7.6 \

--prefix=/.../openmpi \

CC=x86_64-k1om-linux-gcc  CXX=x86_64-k1om-linux-g++ \

AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib

 

   d). run 'make' and 'make install'

 

OK, I see that they did not configure for mpi-f77 & mpif90, but perhaps this
is still helpful, if the AR and RANLIB flags are important.

-Tom

 

 

 

regards

Michael

 

On Mon, Jul 8, 2013 at 4:30 PM, Tim Carlson  wrote:

On Mon, 8 Jul 2013, Elken, Tom wrote:

It isn't quite so easy.

Out of the box, there is no gcc on the Phi card. You can use the cross
compiler on the host, but you don't get gcc on the Phi by default.

See this post http://software.intel.com/en-us/forums/topic/382057

I really think you would need to build and install gcc on the Phi first.

My first pass at doing a cross-compile with the GNU compilers failed to
produce something with OFED support (not surprising)

export PATH=/usr/linux-k1om-4.7/bin:$PATH
./configure --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
--disable-mpi-f77

checking if MCA component btl:openib can compile... no


Tim

   


   

  Thanks Tom, that sounds good. I will give it a try as soon as
  our Phi host
  here host gets installed. 

   

  I assume that all the prerequisite libs and bins on the Phi side
  are
  available when we download the Phi s/w stack from Intel's site,
  right ?

  [Tom]

  Right.  When you install Intel's MPSS (Manycore Platform
  Software Stack),
  including following the section on "OFED Support" in the readme
  file, you
  should have all the prerequisite libs and bins.  Note that I
  have not built
  Open MPI for Xeon Phi for your interconnect, but it seems to me
  that it
  should work.

   

  -Tom

   

  Cheers

  Michael

   

   

   

  On Mon, Jul 8, 2013 at 12:10 PM, Elken, Tom
   wrote:

  Do you guys have any plan to support Intel Phi in the future?
  That is,
  running MPI code on the Phi cards or across the multicore and
  Phi, as Intel
  MPI does?

  [Tom]

  Hi Michael,

  Because a Xeon Phi card acts a lot like a Linux host with an x86
  architecture, you can build your own Open MPI libraries to serve
  this
  purpose.

  Our team has used existing (an older 1.4.3 version of) Open MPI
  source to
  build an Open MPI for running MPI code on Intel Xeon Phi cards
  over Intel's
  (formerly QLogic's) True Scale InfiniBand fabric, and it works
  quite well. 
  We have not released a pre-built Open MPI as part of any Intel
  software
  release.   But I think if you have a compiler for Xeon Phi
  (Intel Compiler
  or GCC) and an interconnect for it, you should be able to build
  an Open MPI
  that works on Xeon Phi. 

  Cheers,
  Tom Elken

  thanks...

  Michael

   

  On Sat, Jul 6, 2013 at 2:36 PM, Ralph Castain 
  wrote:

  Rolf will have to answer the question on level of support. The
  CUDA code is
  not in the 1.6 series as it was developed