[OMPI users] anybody tried OMPI with gpudirect?
Hello,

I am trying to play with NVIDIA's GPUDirect. The test program shipped with the GPUDirect tarball just does a basic MPI ping-pong between two processes that allocate their buffers with cudaMallocHost instead of malloc. It works with Intel MPI, but Open MPI 1.5 hangs in the first MPI_Send. Replacing the CUDA buffer with a normally malloc'ed buffer makes the program work again.

I assume that something goes wrong when OMPI tries to register/pin the CUDA buffer in the IB stack (that's what GPUDirect seems to be about), but I don't see why Intel MPI would succeed there. Has anybody ever looked at this?

FWIW, we're using OMPI 1.5, OFED 1.5.2, Intel MPI 4.0.0.28, and SLES11 with and without the GPUDirect patch.

Thanks
Brice Goglin
Re: [OMPI users] anybody tried OMPI with gpudirect?
Hi Brice:

Yes, I have tried OMPI 1.5 with GPUDirect and it worked for me. You definitely need the patch or you will see the behavior just as you described: a hang. One thing you could try is disabling the large-message RDMA in OMPI and see if that works. That can be done by adjusting the openib BTL flags:

--mca btl_openib_flags 304

Rolf
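For reference, a minimal sketch of how that would be put on the command line, assuming a two-node hostfile named hosts and a test binary called mpi_pingpong (both names are placeholders, not taken from this thread):

    # Run the GPUDirect ping-pong test with large-message RDMA disabled in
    # the openib BTL, per Rolf's suggestion.
    mpirun -np 2 --hostfile hosts \
           --mca btl openib,self \
           --mca btl_openib_flags 304 \
           ./mpi_pingpong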
Re: [OMPI users] anybody tried OMPI with gpudirect?
On 28/02/2011 17:30, Rolf vandeVaart wrote:
> One thing you could try is disabling the large-message RDMA in OMPI and see if that works. That can be done by adjusting the openib BTL flags:
>
> --mca btl_openib_flags 304

Thanks Rolf. Adding this MCA parameter indeed works around the hang.

The kernel is supposed to be properly patched for GPUDirect. Are you aware of anything else we might need to make this work? Do we need to rebuild some OFED kernel modules, for instance? Also, is there any reliable/easy way to check if GPUDirect works in our kernel? (We had to manually fix the GPUDirect patch for SLES11.)

Brice
Re: [OMPI users] mpirun error: "error while loading shared libraries: libopen-rte.so.0: cannot open shared object file:"
More specifically -- ensure that LD_LIBRARY_PATH is set properly *on all nodes where you are running Open MPI processes*. For example, if you're using a hostfile to launch across multiple machines, ensure that your shell startup files (e.g., .bashrc) are set up to set your LD_LIBRARY_PATH properly, even for non-interactive logins.

On Feb 27, 2011, at 4:49 PM, swagat mishra wrote:
> You need to set LD_LIBRARY_PATH to point to the folder where your shared libraries are located:
> LD_LIBRARY_PATH=/usr/local/library/folder
>
> On Mon, Feb 28, 2011 at 3:03 AM, Sonyx Wonda wrote:
>> Hello!
>> I am a newbie to Open MPI and I am having some trouble running Open MPI programs. I downloaded and installed the latest version from the web site (openmpi-1.4.3) and the whole process completed successfully: both ./configure and make all install were successful. I am able to compile Open MPI codes (using mpicc and mpiCC), as I did with the example files provided in the source package, but I have a problem when it comes to actually running the resulting executable. For example, when I tried to run the "hello world" program using mpirun -np 2 ./hello_c, I got the following output:
>>
>> hello_c: error while loading shared libraries: libopen-rte.so.0: cannot open shared object file: No such file or directory
>>
>> (I did find the libopen-rte.so.0 file in the /usr/local/lib/ folder.) I have tried re-installing, but this doesn't seem to help. I use Linux Mandriva 2007 with the bash shell. The attached archive contains the config.log file and the output of ompi_info --all (ompi_info.out), and below is the value of the $PATH environment variable:
>>
>> /sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/usr/local/sbin:/usr/lib/qt3//bin:
>>
>> Thanks in advance for your help.
>> Regards.

--
Jeff Squyres
jsquy...@cisco.com
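A minimal sketch of what that looks like in practice, assuming the default /usr/local install prefix from the report above; the node name node02 is a placeholder:

    # Append to ~/.bashrc (or your shell's startup file) on every node, so
    # that non-interactive logins started by mpirun also pick it up.
    export PATH=/usr/local/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

    # Quick checks: the runtime library should resolve locally, and the
    # variable should also be set in a non-interactive remote shell.
    ldd ./hello_c | grep libopen-rte
    ssh node02 'echo $LD_LIBRARY_PATH'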
[OMPI users] RDMACM Differences
Hello,

I am running into the following issue while trying to run osu_latency:

-bash-3.2$ mpiexec --mca btl openib,self --mca btl_openib_warn_default_gid_prefix 0 -np 2 --hostfile mpihosts /home/jagga/osu-micro-benchmarks-3.3/openmpi/ofed-1.5.2/bin/osu_latency
# OSU MPI Latency Test v3.3
# Size          Latency (us)
[amber04][[10252,1],1][connect/btl_openib_connect_oob.c:325:qp_connect_all] error modifing QP to RTR errno says Invalid argument
[amber04][[10252,1],1][connect/btl_openib_connect_oob.c:815:rml_recv_cb] error in endpoint reply start connect
--------------------------------------------------------------------------
mpiexec has exited due to process rank 1 with PID 6781 on node amber04 exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------

I can get around this by adding the "--mca btl_openib_cpc_include rdmacm" option. However, I have another host with a different HCA, running all the same driver and software versions, on which this same command succeeds without the rdmacm option. What could be causing one of my environments to fail while the other works fine (without the rdmacm option)?

[root@amber03 ~]# ofed_info | grep OFED
MLNX_OFED_LINUX-1.5.2-1.0.0 (OFED-1.5.2-20101020-1520):
MLNX_OFED_LINUX-1.5.2-1.0.0 (/mswg/release/ofed-1.5.2-rpms/rnfs-utils/rnfs-utils-1.1.5-10.OFED.src.rpm):

[root@amber03 ~]# ibv_devinfo
hca_id: mlx4_0
        transport:              InfiniBand (0)
        fw_ver:                 2.7.9294
        node_guid:              78e7:d103:0021:8884
        sys_image_guid:         78e7:d103:0021:8887
        vendor_id:              0x02c9
        vendor_part_id:         26438
        hw_ver:                 0xB0
        board_id:               HP_020003
        phys_port_cnt:          2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       20
                        port_lmc:       0x00
                        link_layer:     IB

                port:   2
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     1024 (3)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     Ethernet

Any help would be greatly appreciated.

Thanks,
-J
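For completeness, the workaround command line described above is simply the failing command with the extra connection-manager option added; nothing else changed:

    # Same osu_latency run, but telling the openib BTL to use the RDMA CM
    # connection method instead of the default OOB connect path that fails above.
    mpiexec --mca btl openib,self \
            --mca btl_openib_warn_default_gid_prefix 0 \
            --mca btl_openib_cpc_include rdmacm \
            -np 2 --hostfile mpihosts \
            /home/jagga/osu-micro-benchmarks-3.3/openmpi/ofed-1.5.2/bin/osu_latency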
Re: [OMPI users] anybody tried OMPI with gpudirect?
> Also, is there any reliable/easy way to check if GPUDirect works in our kernel?

For GPUDirect to work with InfiniBand, you need to get some updated OFED bits from your InfiniBand vendor.

In terms of checking the driver updates, you can grep for the string get_driver_pages in the file /proc/kallsyms. If it is there, then the Linux kernel is updated correctly. GPUDirect functioning should be independent of which MPI you are using.

Rolf
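In shell form, the check Rolf describes is just:

    # If the GPUDirect kernel patch is in place, this should print the symbol.
    grep get_driver_pages /proc/kallsyms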
Re: [OMPI users] anybody tried OMPI with gpudirect?
On 28/02/2011 19:49, Rolf vandeVaart wrote:
> For GPUDirect to work with InfiniBand, you need to get some updated OFED bits from your InfiniBand vendor.
>
> In terms of checking the driver updates, you can grep for the string get_driver_pages in the file /proc/kallsyms. If it is there, then the Linux kernel is updated correctly.

The kernel looks ok then. But I couldn't find any kernel module (I tried nvidia.ko and all the IB modules) that references this symbol, so I guess my OFED kernel modules aren't ok. I'll check on the Mellanox website (we have some very recent Mellanox ConnectX QDR boards).

Thanks
Brice
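One way to run that module search from a shell (a sketch only: it assumes the modules live under the standard /lib/modules path and still carry their symbol tables, and this is not necessarily how the OFED packages themselves check):

    # List kernel modules that reference get_driver_pages as an external
    # (undefined) symbol.
    find /lib/modules/$(uname -r) -name '*.ko' \
      -exec sh -c 'nm "$1" 2>/dev/null | grep -q get_driver_pages && echo "$1"' _ {} \;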
Re: [OMPI users] anybody tried OMPI with gpudirect?
> The kernel looks ok then. But I couldn't find any kernel module (I tried nvidia.ko and all the IB modules) that references this symbol, so I guess my OFED kernel modules aren't ok.

I have since learned that you can check /sys/module/ib_core/parameters/*, which will list a couple of GPUDirect files if the driver is installed correctly and loaded.

Rolf
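In other words, something like:

    # A GPUDirect-enabled ib_core exposes extra gpu_direct_* parameters here.
    ls /sys/module/ib_core/parameters/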
Re: [OMPI users] anybody tried OMPI with gpudirect?
Hi Brice,

You will need the MLNX_OFED with GPUDirect support in order for this to work. I will check whether there's a release of it that supports SLES and let you know.

[pak@maia001 ~]$ /sbin/modinfo ib_core
filename:       /lib/modules/2.6.18-194.nvel5/updates/kernel/drivers/infiniband/core/ib_core.ko
parm:           gpu_direct_enable:Enable GPU Direct [default 1] (int)
parm:           gpu_direct_shares:GPU Direct Calls Number [default 0] (int)
parm:           gpu_direct_pages:GPU Direct Shared Pages Number [default 0] (int)
parm:           gpu_direct_fail:GPU Direct Failures Number [default 0] (int)

Once that IB driver is loaded, you should find that additional counters are available from ib_core, and if you are using GPUDirect, the gpu_direct_shares and gpu_direct_pages counters will be incremented. The counters are located at:

/sys/module/ib_core/parameters/gpu_direct_shares
/sys/module/ib_core/parameters/gpu_direct_pages

Regards,

- Pak
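A simple way to confirm GPUDirect is actually being exercised is to read those counters before and after a run; a sketch, with the MPI command line and hostfile as placeholders:

    # The counters should increase across a run that actually uses GPUDirect.
    cat /sys/module/ib_core/parameters/gpu_direct_shares \
        /sys/module/ib_core/parameters/gpu_direct_pages
    mpirun -np 2 --hostfile hosts ./mpi_pingpong     # placeholder command
    cat /sys/module/ib_core/parameters/gpu_direct_shares \
        /sys/module/ib_core/parameters/gpu_direct_pages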
Re: [OMPI users] anybody tried OMPI with gpudirect?
Actually, that GPUDirect-enabled MLNX_OFED is not yet officially released, so you may want to contact h...@mellanox.com to get the needed information and to find out when the drivers will be released. Thanks!

- Pak