[OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Brice Goglin
Hello,

I am trying to play with NVIDIA's GPUDirect. The test program shipped with
the GPUDirect tarball just does a basic MPI ping-pong between two
processes that allocate their buffers with cudaMallocHost instead of
malloc. It seems to work with Intel MPI, but Open MPI 1.5 hangs in the
first MPI_Send. Replacing the CUDA buffer with a normally malloc'ed
buffer makes the program work again. I assume that something goes wrong
when OMPI tries to register/pin the CUDA buffer in the IB stack (that's
what GPUDirect seems to be about), but I don't see why Intel MPI would
succeed there.
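
For context, the pattern being tested boils down to something like the sketch
below (my own minimal reconstruction, not the actual program from the tarball;
the message size and tag are arbitrary):

/* Minimal sketch: two-rank ping-pong over a buffer allocated with
 * cudaMallocHost() (page-locked host memory) instead of malloc(). */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int size = 1 << 20;   /* 1 MB message, arbitrary */
    int rank, nprocs;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Pinned host memory registered by the CUDA driver; swapping this
     * for malloc(size) is what makes the hang go away. */
    if (cudaMallocHost((void **)&buf, size) != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (rank == 0 && nprocs > 1) {
        MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);  /* hangs here with OMPI 1.5 */
        MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    cudaFreeHost(buf);
    MPI_Finalize();
    return 0;
}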

Has anybody ever looked at this?

FWIW, we're using OMPI 1.5, OFED 1.5.2, Intel MPI 4.0.0.28 and SLES11 w/
and w/o the gpudirect patch.

Thanks
Brice Goglin



Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Rolf vandeVaart
Hi Brice:
Yes, I have tried OMPI 1.5 with gpudirect and it worked for me.  You definitely 
need the patch or you will see exactly the behavior you described: a hang. One 
thing you could try is disabling the large-message RDMA in OMPI and seeing if 
that works.  That can be done by adjusting the openib BTL flags:

--mca btl_openib_flags 304
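
For reference, a full invocation with that workaround might look something like 
this (hostnames and the test binary name are placeholders):

mpirun -np 2 --host node1,node2 \
       --mca btl openib,self --mca btl_openib_flags 304 \
       ./gpudirect_pingpong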

Rolf 




Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Brice Goglin
On 28/02/2011 17:30, Rolf vandeVaart wrote:
> Hi Brice:
> Yes, I have tried OMPI 1.5 with gpudirect and it worked for me.  You 
> definitely need the patch or you will see exactly the behavior you described: 
> a hang. One thing you could try is disabling the large-message RDMA in OMPI 
> and seeing if that works.  That can be done by adjusting the openib BTL flags:
>
> --mca btl_openib_flags 304
>
> Rolf 
>   

Thanks Rolf. Adding this MCA parameter indeed worked around the hang.
The kernel is supposed to be properly patched for GPUDirect. Are you
aware of anything else we might need to make this work? Do we need to
rebuild some OFED kernel modules, for instance?

Also, is there any reliable/easy way to check whether GPUDirect works in our
kernel? (We had to manually fix the GPUDirect patch for SLES11.)

Brice



Re: [OMPI users] mpirun error: "error while loading shared libraries: libopen-rte.so.0: cannot open shared object file:"

2011-02-28 Thread Jeff Squyres
More specifically -- ensure that LD_LIBRARY_PATH is set properly *on all nodes 
where you are running Open MPI processes*.

For example, if you're using a hostfile to launch across multiple machines, 
ensure that your shell startup files (e.g., .bashrc) are set up to export your 
LD_LIBRARY_PATH properly, even for non-interactive logins.
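
A sketch of what that can look like (the install prefix below is illustrative):

# in ~/.bashrc on every node (before any early return for
# non-interactive shells, if your distribution adds one)
export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

Alternatively, launching with "mpirun --prefix /usr/local ..." (or building Open 
MPI with --enable-mpirun-prefix-by-default) avoids relying on the remote shell 
startup files.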


On Feb 27, 2011, at 4:49 PM, swagat mishra wrote:

> you need to set LD_LIBRARY_PATH to point to the folder where your shared 
> libraries are located, e.g.:
> export LD_LIBRARY_PATH=/usr/local/library/folder:$LD_LIBRARY_PATH
> 
> On Mon, Feb 28, 2011 at 3:03 AM, Sonyx Wonda  wrote:
> 
> Hello!
> 
> I am a newbie to openmpi and I am having some trouble running openmpi 
> programs. 
> I downloaded and installed the latest version from the web site 
> (openmpi-1.4.3) and the whole process completed successfully. Both 
> ./configure and make all install commands were successful. I am able to 
> compile open-mpi codes (using mpicc and mpiCC) as I did with the example 
> files provided within the source package, but I have a problem when it comes 
> to actually running the resulting executables. For example, when I tried to run 
> the "hello world" program with "mpirun -np 2 ./hello_c", I got the following 
> output:
> 
> hello_c: error while loading shared libraries: libopen-rte.so.0: cannot open 
> shared object file: No such file or directory
> 
> (I did find the libopen-rte.so.0 file in the /usr/local/lib/ folder)
> I have tried re-installing but this doesn't seem to work.
> I use Linux Mandriva 2007 with the bash shell. The attached compressed folder 
> contains the config.log file and the output from the ompi_info --all command 
> (ompi_info.out), and below is the value of the $PATH environment variable
> 
> /sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/usr/local/sbin:/usr/lib/qt3//bin:
> 
> Thanks in advance for your help.
> Regards.
> 
> 
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] RDMACM Differences

2011-02-28 Thread Jagga Soorma
Hello,

I am running into the following issue while trying to run osu_latency:

--
-bash-3.2$ mpiexec --mca btl openib,self -mca btl_openib_warn_default_gid_prefix 0 -np 2 --hostfile mpihosts /home/jagga/osu-micro-benchmarks-3.3/openmpi/ofed-1.5.2/bin/osu_latency
# OSU MPI Latency Test v3.3
# Size          Latency (us)
[amber04][[10252,1],1][connect/btl_openib_connect_oob.c:325:qp_connect_all] error modifing QP to RTR errno says Invalid argument
[amber04][[10252,1],1][connect/btl_openib_connect_oob.c:815:rml_recv_cb] error in endpoint reply start connect
--
mpiexec has exited due to process rank 1 with PID 6781 on
node amber04 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--
--

I can get around this by adding the "--mca btl_openib_cpc_include rdmacm"
option.  However, I have another host with a different HCA, with all the same
driver and software versions, on which I can run this same command successfully
without the rdmacm option.  What could be causing one of my environments
to fail while the other works fine (without the rdmacm option)?
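
In other words, on the failing host the benchmark only runs when started along 
the lines of (same hostfile and binary as above):

mpiexec --mca btl openib,self --mca btl_openib_cpc_include rdmacm -np 2 \
    --hostfile mpihosts \
    /home/jagga/osu-micro-benchmarks-3.3/openmpi/ofed-1.5.2/bin/osu_latency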

--
[root@amber03 ~]# ofed_info | grep OFED
MLNX_OFED_LINUX-1.5.2-1.0.0 (OFED-1.5.2-20101020-1520):
MLNX_OFED_LINUX-1.5.2-1.0.0
(/mswg/release/ofed-1.5.2-rpms/rnfs-utils/rnfs-utils-1.1.5-10.OFED.src.rpm):

[root@amber03 ~]# ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.7.9294
        node_guid:                      78e7:d103:0021:8884
        sys_image_guid:                 78e7:d103:0021:8887
        vendor_id:                      0x02c9
        vendor_part_id:                 26438
        hw_ver:                         0xB0
        board_id:                       HP_020003
        phys_port_cnt:                  2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       20
                        port_lmc:       0x00
                        link_layer:     IB

                port:   2
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     1024 (3)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     Ethernet
--

Any help would be greatly appreciated.

Thanks,
-J


Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Rolf vandeVaart

For GPUDirect to work with InfiniBand, you need to get some updated OFED 
bits from your InfiniBand vendor.

In terms of checking the driver updates, you can grep for the string 
get_driver_pages in the file /proc/kallsyms.  If it is there, then the Linux 
kernel is updated correctly.
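
For example, this is just the check described above:

grep get_driver_pages /proc/kallsyms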

GPUDirect functionality itself should be independent of the MPI you are using.

Rolf  





Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Brice Goglin
On 28/02/2011 19:49, Rolf vandeVaart wrote:
> For GPUDirect to work with InfiniBand, you need to get some updated OFED 
> bits from your InfiniBand vendor.
>
> In terms of checking the driver updates, you can grep for the string 
> get_driver_pages in the file /proc/kallsyms.  If it is there, then the Linux 
> kernel is updated correctly.
>   

The kernel looks OK then. But I couldn't find any kernel module (I tried
nvidia.ko and all the IB modules) that references this symbol, so I guess
my OFED kernel modules aren't OK. I'll check on the Mellanox website (we
have some very recent Mellanox ConnectX QDR boards).

thanks
Brice



Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Rolf vandeVaart
I have since learned that you can check /sys/module/ib_core/parameters/*, which 
will list a couple of GPUDirect entries if the driver is installed correctly 
and loaded.
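
For example, a quick way to look for them (a sketch; the exact entry names are 
shown later in this thread):

ls /sys/module/ib_core/parameters/ | grep gpu_direct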

Rolf




Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Pak Lui
Hi Brice, 

You will need the MLNX_OFED with GPUDirect support in order for this to work. I 
will check whether there's a release of it that supports SLES and let you know.

[pak@maia001 ~]$ /sbin/modinfo ib_core
filename:       /lib/modules/2.6.18-194.nvel5/updates/kernel/drivers/infiniband/core/ib_core.ko

parm:           gpu_direct_enable:Enable GPU Direct [default 1] (int)
parm:           gpu_direct_shares:GPU Direct Calls Number [default 0] (int)
parm:           gpu_direct_pages:GPU Direct Shared Pages Number [default 0] (int)
parm:           gpu_direct_fail:GPU Direct Failures Number [default 0] (int)

Once that IB driver is loaded, you should find additional counters available 
from ib_core. If you are using GPUDirect, the gpu_direct_shares and 
gpu_direct_pages counters will be incremented. The counters are located at:
/sys/module/ib_core/parameters/gpu_direct_shares
/sys/module/ib_core/parameters/gpu_direct_pages
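
For example, reading them before and after a run is a quick way to confirm that 
GPUDirect is actually being exercised:

cat /sys/module/ib_core/parameters/gpu_direct_shares \
    /sys/module/ib_core/parameters/gpu_direct_pages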

Regards,

- Pak





Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Pak Lui
Actually, since that GPUDirect support is not yet officially released, you may 
want to contact h...@mellanox.com to get the needed info and find out when the 
drivers will be released. Thanks!

- Pak

