Re: [OMPI users] CUDA-aware codes not using GPU

2019-09-06 Thread Joshua Ladd via users
Did you build UCX with CUDA support (--with-cuda) ?
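(For reference, a quick way to check an existing UCX build, assuming the ucx_info binary from that build is first in the PATH, is to list its transports and look for the CUDA ones:

ucx_info -d | grep -i cuda    # should list CUDA transports such as cuda_copy/cuda_ipc if UCX was built with CUDA

If nothing is printed, the UCX library was most likely configured without --with-cuda.)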

Josh

On Thu, Sep 5, 2019 at 8:45 PM AFernandez via users <
users@lists.open-mpi.org> wrote:

> Hello OpenMPI Team,
>
> I'm trying to use CUDA-aware OpenMPI but the system simply ignores the GPU
> and the code runs on the CPUs. I've tried different software but will focus
> on the OSU benchmarks (collective and pt2pt communications). Let me provide
> some data about the configuration of the system:
>
> -OFED v4.17-1-rc2 (the NIC is virtualized but I also tried a Mellanox card
> with MOFED a few days ago and found the same issue)
>
> -CUDA v10.1
>
> -gdrcopy v1.3
>
> -UCX 1.6.0
>
> -OpenMPI 4.0.1
>
> Everything looks good (CUDA programs work fine, MPI programs run on
> the CPUs without any problem), and the ompi_info outputs what I was
> expecting (but maybe I'm missing something):
>
>
> mca:opal:base:param:opal_built_with_cuda_support:synonym:name:mpi_built_with_cuda_support
>
> mca:mpi:base:param:mpi_built_with_cuda_support:value:true
>
> mca:mpi:base:param:mpi_built_with_cuda_support:source:default
>
> mca:mpi:base:param:mpi_built_with_cuda_support:status:read-only
>
> mca:mpi:base:param:mpi_built_with_cuda_support:level:4
>
> mca:mpi:base:param:mpi_built_with_cuda_support:help:Whether CUDA GPU
> buffer support is built into library or not
>
> mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:0:false
>
> mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:1:true
>
> mca:mpi:base:param:mpi_built_with_cuda_support:deprecated:no
>
> mca:mpi:base:param:mpi_built_with_cuda_support:type:bool
>
>
> mca:mpi:base:param:mpi_built_with_cuda_support:synonym_of:name:opal_built_with_cuda_support
>
> mca:mpi:base:param:mpi_built_with_cuda_support:disabled:false
>
> The available btls are the usual self, openib, tcp & vader plus smcuda,
> uct & usnic. The full output from ompi_info is attached. If I try the flag
> '--mca opal_cuda_verbose 10', it doesn't output anything, which seems to
> agree with the lack of GPU use. If I try with '--mca btl smcuda', it makes
> no difference. I have also tried to specify the program to use host and
> device (e.g. mpirun -np 2 ./osu_latency D H) but got the same result. I am
> probably missing something but am not sure where else to look or what else
> to try.
>
> Thank you,
>
> AFernandez
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] UCX and MPI_THREAD_MULTIPLE

2019-09-06 Thread Paul Edmon via users
As a coda to this, I managed to get UCX 1.6.0 built with threading, and 
OpenMPI 4.0.1 built against it, using this: 
https://github.com/openucx/ucx/issues/4020


That appears to be working.
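(For anyone trying to reproduce this, a minimal sketch of the thread-enabled build, assuming hypothetical install prefixes; the exact flags we used on our system are in the issue linked above:

# UCX 1.6.0 with multi-threading support
./configure --prefix=/path/to/ucx-mt-install --enable-mt
make -j install

# Open MPI 4.0.1 built against that UCX
./configure --with-ucx=/path/to/ucx-mt-install --prefix=/path/to/ompi-install
make -j install
)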

-Paul Edmon-

On 8/26/19 9:20 PM, Joshua Ladd wrote:

**apropos  :-)

On Mon, Aug 26, 2019 at 9:19 PM Joshua Ladd wrote:


Hi, Paul

I must say, this is eerily appropo. I've just sent a request for
Wombat last week as I was planning to have my group start looking
at the performance of UCX OSC on IB. We are most interested in
ensuring UCX OSC MT performs well on Wombat. The bitbucket you're
referencing: is this the source code? Can we build and run it?


Best,

Josh

On Fri, Aug 23, 2019 at 9:37 PM Paul Edmon via users
<users@lists.open-mpi.org> wrote:

I forgot to include that we have not rebuilt this OpenMPI
4.0.1 against UCX 1.6.0 but rather against 1.5.1. When we
upgraded to 1.6.0, everything seemed to keep working for
OpenMPI when we swapped the UCX version without recompiling
(at least for normal rank-level MPI; we had to do the upgrade
to UCX to get MPI_THREAD_MULTIPLE to work at all).

-Paul Edmon-

On 8/23/2019 9:31 PM, Paul Edmon wrote:


Sure.  The code I'm using is the latest version of Wombat
(https://bitbucket.org/pmendygral/wombat-public/wiki/Home ,
I'm using an unreleased updated version as I know the devs). 
I'm using OMP_THREAD_NUM=12 and the command line is:

mpirun -np 16 --hostfile hosts ./wombat

Where the host file lists 4 machines, so 4 ranks per machine
and 12 threads per rank.  Each node has 48 Intel Cascade Lake
cores. I've also tried using the Slurm scheduler version
which is:

srun -n 16 -c 12 --mpi=pmix ./wombat

Which also hangs.  It works if I constrain to one or two
nodes but anything greater than that hangs.  As for network hardware:

[root@holy7c02101 ~]# ibstat
CA 'mlx5_0'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.25.6000
    Hardware version: 0
    Node GUID: 0xb8599f0300158f20
    System image GUID: 0xb8599f0300158f20
    Port 1:
    State: Active
    Physical state: LinkUp
    Rate: 100
    Base lid: 808
    LMC: 1
    SM lid: 584
    Capability mask: 0x2651e848
    Port GUID: 0xb8599f0300158f20
    Link layer: InfiniBand

[root@holy7c02101 ~]# lspci | grep Mellanox
58:00.0 Infiniband controller: Mellanox Technologies MT27800
Family [ConnectX-5]

As for the IB RDMA kernel stack, we are using the default drivers
that come with CentOS 7.6.1810, which is rdma-core 17.2-3.

I will note that I successfully ran an old version of Wombat
on all 30,000 cores of this system using OpenMPI 3.1.3 and
regular IB Verbs with no problem earlier this week, though
that was pure MPI ranks with no threads.  Nonetheless the
fabric itself is healthy and in good shape.  It seems to be
this edge case using the latest OpenMPI with UCX and threads
that is causing the hang-ups.  To be sure, the latest version
of Wombat (as, I believe, does the public version) uses
many of the state-of-the-art MPI RMA direct calls, so it's
definitely pushing the envelope in ways our typical user base
here will not.  Still, it would be good to iron out this kink
so that if users do hit it we have a solution.  As noted, UCX is
very new to us and thus it is entirely possible that we are
missing something in its interaction with OpenMPI.  Our MPI
is compiled thusly:


https://github.com/fasrc/helmod/blob/master/rpmbuild/SPECS/centos7/openmpi-4.0.1-fasrc01.spec

I will note that when I built this it was built against the
default version of UCX that comes with EPEL (1.5.1).  We only
built 1.6.0 ourselves because the EPEL package was not built
with MT enabled, which to me seems strange as I don't see any
reason not to build with MT enabled.  Anyway, that's the
deeper context.
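(As an aside, one way to confirm whether a given UCX install was built with MT, assuming its ucx_info is on the PATH, is that recent UCX releases print the configure line:

ucx_info -v    # check the "configured with:" line for --enable-mt
)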

-Paul Edmon-

On 8/23/2019 5:49 PM, Joshua Ladd via users wrote:

Paul,

Can you provide a repro and command line, please? Also, what
network hardware are you using?

Josh

On Fri, Aug 23, 2019 at 3:35 PM Paul Edmon via users
mailto:users@lists.open-mpi.org>>
wrote:

I have a code using MPI_THREAD_MULTIPLE along with
MPI-RMA that I'm running with
OpenMPI 4.0.1.  Since 4.0.1 requires UCX I have it
installed with
MT on (1.6.0 bui

Re: [OMPI users] CUDA-aware codes not using GPU

2019-09-06 Thread Akshay Venkatesh via users
Hi, Arturo.

Usually, for OpenMPI+UCX we use the following recipe

for UCX:


./configure --prefix=/path/to/ucx-cuda-install
--with-cuda=/usr/local/cuda --with-gdrcopy=/usr


make -j install


then OpenMPI:

./configure --with-cuda=/usr/local/cuda --with-ucx=/path/to/ucx-cuda-install

make -j install


Can you run with the following to see if it helps:

mpirun -np 2 --mca pml ucx --mca btl ^smcuda,openib ./osu_latency D H

There are details here that may be useful:
https://www.open-mpi.org/faq/?category=runcuda#run-ompi-cuda-ucx

Also, note that for short messages the D->H path for inter-node transfers may not
involve calls to the CUDA API (relevant if you're using nvprof to detect CUDA
activity) because the GPUDirect RDMA path and gdrcopy are used.

On Fri, Sep 6, 2019 at 7:36 AM Arturo Fernandez via users <
users@lists.open-mpi.org> wrote:

> Josh,
> Thank you. Yes, I built UCX with CUDA and gdrcopy support. I also had to
> disable numa (--disable-numa) as requested during the installation.
> AFernandez
>
> Joshua Ladd wrote
>
> Did you build UCX with CUDA support (--with-cuda) ?
>
> Josh
>
> On Thu, Sep 5, 2019 at 8:45 PM AFernandez via users <users@lists.open-mpi.org>
> wrote:
>
>> Hello OpenMPI Team,
>>
>> I'm trying to use CUDA-aware OpenMPI but the system simply ignores the
>> GPU and the code runs on the CPUs. I've tried different software but will
>> focus on the OSU benchmarks (collective and pt2pt communications). Let me
>> provide some data about the configuration of the system:
>>
>> -OFED v4.17-1-rc2 (the NIC is virtualized but I also tried a Mellanox
>> card with MOFED a few days ago and found the same issue)
>>
>> -CUDA v10.1
>>
>> -gdrcopy v1.3
>>
>> -UCX 1.6.0
>>
>> -OpenMPI 4.0.1
>>
>> Everything looks good (CUDA programs work fine, MPI programs run on
>> the CPUs without any problem), and the ompi_info outputs what I was
>> expecting (but maybe I'm missing something):
>>
>> mca:opal:base:param:opal_built_with_cuda_support:synonym:name:mpi_built_with_cuda_support
>>
>>
>> mca:mpi:base:param:mpi_built_with_cuda_support:value:true
>>
>> mca:mpi:base:param:mpi_built_with_cuda_support:source:default
>>
>> mca:mpi:base:param:mpi_built_with_cuda_support:status:read-only
>>
>> mca:mpi:base:param:mpi_built_with_cuda_support:level:4
>>
>> mca:mpi:base:param:mpi_built_with_cuda_support:help:Whether CUDA GPU
>> buffer support is built into library or not
>>
>> mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:0:false
>>
>> mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:1:true
>>
>> mca:mpi:base:param:mpi_built_with_cuda_support:deprecated:no
>>
>> mca:mpi:base:param:mpi_built_with_cuda_support:type:bool
>>
>> mca:mpi:base:param:mpi_built_with_cuda_support:synonym_of:name:opal_built_with_cuda_support
>>
>>
>> mca:mpi:base:param:mpi_built_with_cuda_support:disabled:false
>>
>> The available btls are the usual self, openib, tcp & vader plus smcuda,
>> uct & usnic. The full output from ompi_info is attached. If I try the flag
>> '--mca opal_cuda_verbose 10', it doesn't output anything, which seems to
>> agree with the lack of GPU use. If I try with '--mca btl smcuda', it makes
>> no difference. I have also tried to specify the program to use host and
>> device (e.g. mpirun -np 2 ./osu_latency D H) but got the same result. I am
>> probably missing something but am not sure where else to look or what else
>> to try.
>>
>> Thank you,
>>
>> AFernandez
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users



-- 
-Akshay
NVIDIA
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] Floating point overflow and tuning

2019-09-06 Thread Logan Stonebraker via users
I am working with Star-CCM+ 2019.1.1 Build 14.02.012
CentOS 7.6 kernel 3.10.0-957.21.3.el7.x86_64
Intel MPI Version 2018 Update 5 Build 20190404 (this is the version shipped with
Star-CCM+)
Also trying to make openmpi work (more on that later)
Cisco UCS b200 and c240 cluster using USNIC fabric over 10gbe
Intel(R) Xeon(R) CPU E5-2698
7 nodes, 280 total cores
enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
enic modinfo version: 3.2.210.22
enic loaded module version: 3.2.210.22
usnic_verbs modinfo version: 3.2.158.15
usnic_verbs loaded module version: 3.2.158.15
libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed
On batch runs of less than 5 hours everything works flawlessly: the jobs complete 
without error and are quite fast with DAPL, especially compared to the TCP btl.

However, when running with n-1 cores per node (273 total cores), at or around 5 
hours into a job the longer jobs die with a Star-CCM+ floating point exception. 
The same job completes fine with no more than 210 cores (30 cores on each of the 
7 nodes).  I would like to be able to use the 60 additional cores.  I am using 
PBS Pro with a 99-hour wall time.
Here is the overflow error.
--
Turbulent viscosity limited on 56 cells in Region
A floating point exception has occurred: floating point exception [Overflow].
The specific cause cannot be identified.  Please refer to the troubleshooting
section of the User's Guide.
Context: star.coupledflow.CoupledImplicitSolver
Command: Automation.Run
error: Server Error
--
I have not ruled out that I am missing some parameters or tuning with Intel MPI 
as this is a new cluster.
I am also trying to make Open MPI work.  I have Open MPI compiled, it runs, and I 
can see it is using the usnic fabric; however, it only runs with a very small 
number of CPUs.  With anything over about 2 cores per node it hangs indefinitely, 
right after the job starts.
I have compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because this is the 
version that the Star-CCM+ release I am running supports.  I am telling Star to 
use the Open MPI that I installed so it can support the Cisco USNIC fabric, which 
I can verify using Cisco's native tools (Star ships with its own Open MPI, by the 
way, but I'm not using it).
I am thinking that I need to tune Open MPI; tuning was also required with Intel 
MPI in order to run without an indefinite hang.
With Intel MPI prior to tuning, jobs with more than about 100 cores would hang 
forever until I added these parameters:
reference: 
https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/542591
reference: 
https://software.intel.com/en-us/articles/tuning-the-intel-mpi-library-advanced-techniques
export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
export I_MPI_DAPL_UD_RNDV_EP_NUM=2
export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647
After adding these parameters I can scale to 273 cores and it runs very fast, up 
until the point where it hits the floating point exception about 5 hours into 
the job.
I am struggling to find equivalent tuning parameters for Open MPI.
I have listed all the MCA parameters available with Open MPI and have tried 
setting these with no success.  I may not have the equivalent parameters listed 
here; this is what I have tried:
btl_max_send_size = 4096
btl_usnic_eager_limit = 2147483647
btl_usnic_rndv_eager_limit = 2147483647
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
btl_usnic_prio_sd_num = 8704
btl_usnic_prio_rd_num = 8704
btl_usnic_pack_lazy_threshold = -1
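(For what it's worth, the name = value lines above are already in the format that Open MPI reads from an MCA parameter file such as $HOME/.openmpi/mca-params.conf; the same settings can also be passed per job on the command line. A sketch, assuming a hypothetical ./run_case launch script:

mpirun --mca btl usnic,vader,self \
       --mca btl_usnic_sd_num 8208 --mca btl_usnic_rd_num 8208 \
       --mca btl_usnic_prio_sd_num 8704 --mca btl_usnic_prio_rd_num 8704 \
       -np 273 ./run_case
)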

Does anyone have any advice or ideas for:
1.) The floating point overflow issue, and
2.) Equivalent tuning parameters for Open MPI?
Many thanks in advance!
-Logan
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] CUDA-aware codes not using GPU

2019-09-06 Thread AFernandez via users
Hi Akshay,

I'm building both UCX and OpenMPI as you mention. The portions of the script 
read:

./configure --prefix=/usr/local/ucx-cuda-install 
--with-cuda=/usr/local/cuda-10.1  --with-gdrcopy=/home/odyhpc/gdrcopy 
--disable-numa

sudo make install

&

./configure --with-cuda=/usr/local/cuda-10.1 
--with-cuda-libdir=/usr/local/cuda-10.1/lib64 
--with-ucx=/usr/local/ucx-cuda-install --prefix=/opt/openmpi

sudo make all install
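(A quick sanity check that the resulting Open MPI actually picked up UCX, assuming /opt/openmpi/bin is first in the PATH:

ompi_info | grep -i ucx    # the pml ucx (and osc ucx) components should be listed
)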

As far as the job submission, I have tried several combinations with different 
MCAs (yesterday I forgot to include the '--mca pml ucx' flag as it had made no 
difference in the past). I just tried your suggested syntax (mpirun -np 2 --mca 
pml ucx --mca btl ^smcuda,openib ./osu_latency D H) with the same results. The 
latency times are of the same order no matter which flags I include. As far as 
checking GPU usage, I'm not familiar with 'nvprof' and am simply using the basic 
continuous info (nvidia-smi -l). I'm trying all of this in a cloud environment, 
and my suspicion is that there might be some interference (maybe because of 
some virtualization component), but I cannot pinpoint the cause.
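(In case it is useful, a minimal way to capture per-rank CUDA activity with nvprof, assuming nvprof is on the PATH, is to write one log per rank:

mpirun -np 2 --mca pml ucx nvprof -o osu.%q{OMPI_COMM_WORLD_RANK}.nvprof ./osu_latency D H

keeping in mind Akshay's note that the GPUDirect RDMA/gdrcopy path may not show CUDA API calls for short messages.)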

Thanks,

Arturo

 

From: Akshay Venkatesh  
Sent: Friday, September 06, 2019 11:14 AM
To: Open MPI Users 
Cc: Joshua Ladd ; Arturo Fernandez 
Subject: Re: [OMPI users] CUDA-aware codes not using GPU

 

Hi, Arturo.

 

Usually, for OpenMPI+UCX we use the following recipe 

 

for UCX:

 
./configure --prefix=/path/to/ucx-cuda-install --with-cuda=/usr/local/cuda 
--with-gdrcopy=/usr
 
make -j install


then OpenMPI:

 

./configure --with-cuda=/usr/local/cuda --with-ucx=/path/to/ucx-cuda-install
 
make -j install
 

Can you run with the following to see if it helps: 

 
mpirun -np 2 --mca pml ucx --mca btl ^smcuda,openib ./osu_latency D H

There are details here that may be useful: 
https://www.open-mpi.org/faq/?category=runcuda#run-ompi-cuda-ucx  

 

Also, note that for short messages the D->H path for inter-node transfers may not 
involve calls to the CUDA API (relevant if you're using nvprof to detect CUDA 
activity) because the GPUDirect RDMA path and gdrcopy are used.

 

On Fri, Sep 6, 2019 at 7:36 AM Arturo Fernandez via users 
<users@lists.open-mpi.org> wrote:

Josh, 

Thank you. Yes, I built UCX with CUDA and gdrcopy support. I also had to 
disable numa (--disable-numa) as requested during the installation. 

AFernandez 

 

Joshua Ladd wrote 

Did you build UCX with CUDA support (--with-cuda) ? 

 

Josh 

 

On Thu, Sep 5, 2019 at 8:45 PM AFernandez via users <users@lists.open-mpi.org> 
wrote: 

Hello OpenMPI Team, 

I'm trying to use CUDA-aware OpenMPI but the system simply ignores the GPU and 
the code runs on the CPUs. I've tried different software but will focus on the 
OSU benchmarks (collective and pt2pt communications). Let me provide some data 
about the configuration of the system: 

-OFED v4.17-1-rc2 (the NIC is virtualized but I also tried a Mellanox card with 
MOFED a few days ago and found the same issue) 

-CUDA v10.1 

-gdrcopy v1.3 

-UCX 1.6.0 

-OpenMPI 4.0.1 

Everything looks good (CUDA programs work fine, MPI programs run on the 
CPUs without any problem), and the ompi_info outputs what I was expecting (but 
maybe I'm missing something): 

mca:opal:base:param:opal_built_with_cuda_support:synonym:name:mpi_built_with_cuda_support
 

mca:mpi:base:param:mpi_built_with_cuda_support:value:true 

mca:mpi:base:param:mpi_built_with_cuda_support:source:default 

mca:mpi:base:param:mpi_built_with_cuda_support:status:read-only 

mca:mpi:base:param:mpi_built_with_cuda_support:level:4 

mca:mpi:base:param:mpi_built_with_cuda_support:help:Whether CUDA GPU buffer 
support is built into library or not 

mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:0:false 

mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:1:true 

mca:mpi:base:param:mpi_built_with_cuda_support:deprecated:no 

mca:mpi:base:param:mpi_built_with_cuda_support:type:bool 

mca:mpi:base:param:mpi_built_with_cuda_support:synonym_of:name:opal_built_with_cuda_support
 

mca:mpi:base:param:mpi_built_with_cuda_support:disabled:false 

The available btls are the usual self, openib, tcp & vader plus smcuda, uct & 
usnic. The full output from ompi_info is attached. If I try the flag '--mca 
opal_cuda_verbose 10', it doesn't output anything, which seems to agree with 
the lack of GPU use. If I try with '--mca btl smcuda', it makes no difference. 
I have also tried to specify the program to use host and device (e.g. mpirun 
-np 2 ./osu_latency D H) but got the same result. I am probably missing something 
but am not sure where else to look or what else to try. 

Thank you, 

AFernandez 

___ 
users mailing list 
users@lists.open-mpi.org   
https://lists.open-mpi.org/mailman/listinfo/users  
 

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users