apropos  :-)

On Mon, Aug 26, 2019 at 9:19 PM Joshua Ladd <jladd.m...@gmail.com> wrote:

> Hi, Paul
>
> I must say, this is eerily apropos. I sent a request for Wombat just last
> week, as I was planning to have my group start looking at the performance
> of UCX OSC on IB. We are most interested in ensuring that UCX OSC MT
> performs well on Wombat. The Bitbucket repo you're referencing: is this
> the source code? Can we build and run it?
>
>
> Best,
>
> Josh
>
> On Fri, Aug 23, 2019 at 9:37 PM Paul Edmon via users <
> users@lists.open-mpi.org> wrote:
>
>> I forgot to mention that we have not rebuilt this OpenMPI 4.0.1 against
>> UCX 1.6.0; it was built against 1.5.1.  When we upgraded to 1.6.0,
>> OpenMPI still seemed to work after we swapped in the new UCX version
>> without recompiling (at least for normal rank-level MPI; we had to do
>> the UCX upgrade to get MPI_THREAD_MULTIPLE to work at all).
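>>
>> For anyone checking their own setup, a minimal sketch of how to confirm
>> which UCX an OpenMPI install actually picks up at runtime (the install
>> prefix below is illustrative, not our actual path):
>>
>> # UCX library version visible on the node
>> ucx_info -v
>> # which libucp the UCX PML plugin is linked against
>> ldd /opt/openmpi-4.0.1/lib/openmpi/mca_pml_ucx.so | grep ucx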
>>
>> -Paul Edmon-
>> On 8/23/2019 9:31 PM, Paul Edmon wrote:
>>
>> Sure.  The code I'm using is the latest version of Wombat (
>> https://bitbucket.org/pmendygral/wombat-public/wiki/Home ; I'm running an
>> unreleased, updated version, as I know the devs).  I'm setting
>> OMP_NUM_THREADS=12 and the command line is:
>>
>> mpirun -np 16 --hostfile hosts ./wombat
>>
>> Where the host file lists 4 machines, so 4 ranks per machine and 12
>> threads per rank.  Each node has 48 Intel Cascade Lake cores.  I've also
>> tried launching via the Slurm scheduler, which is:
>>
>> srun -n 16 -c 12 --mpi=pmix ./wombat
>>
>> Which also hangs.  It works if I constrain the job to one or two nodes,
>> but anything larger than that hangs.  As for network hardware:
>>
>> [root@holy7c02101 ~]# ibstat
>> CA 'mlx5_0'
>>         CA type: MT4119
>>         Number of ports: 1
>>         Firmware version: 16.25.6000
>>         Hardware version: 0
>>         Node GUID: 0xb8599f0300158f20
>>         System image GUID: 0xb8599f0300158f20
>>         Port 1:
>>                 State: Active
>>                 Physical state: LinkUp
>>                 Rate: 100
>>                 Base lid: 808
>>                 LMC: 1
>>                 SM lid: 584
>>                 Capability mask: 0x2651e848
>>                 Port GUID: 0xb8599f0300158f20
>>                 Link layer: InfiniBand
>>
>> [root@holy7c02101 ~]# lspci | grep Mellanox
>> 58:00.0 Infiniband controller: Mellanox Technologies MT27800 Family
>> [ConnectX-5]
>>
>> As for the IB RDMA kernel stack, we are using the default drivers that
>> ship with CentOS 7.6.1810, which is rdma-core 17.2-3.
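>>
>> If it helps, a couple of quick checks (standard rdma-core and UCX
>> utilities; nothing here is site-specific):
>>
>> # confirm the kernel-stack version on a node
>> rpm -q rdma-core
>> # list the devices and transports UCX actually detects
>> ucx_info -d | grep -iE 'transport|device'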
>>
>> I will note that I successfully ran an old version of Wombat on all
>> 30,000 cores of this system earlier this week using OpenMPI 3.1.3 and
>> regular IB Verbs with no problem, though that was pure MPI ranks with no
>> threads.  Nonetheless, the fabric itself is healthy and in good shape.
>> It seems to be this edge case, using the latest OpenMPI with UCX and
>> threads, that is causing the hangs.  To be sure, the latest version of
>> Wombat (as I believe the public version does as well) uses many of the
>> state-of-the-art MPI RMA direct calls, so it's definitely pushing the
>> envelope in ways our typical user base here will not.  Still, it would
>> be good to iron out this kink so that if users do hit it we have a
>> solution.  As noted, UCX is very new to us, so it is entirely possible
>> that we are missing something in its interaction with OpenMPI.  Our MPI
>> is compiled as follows:
>>
>>
>> https://github.com/fasrc/helmod/blob/master/rpmbuild/SPECS/centos7/openmpi-4.0.1-fasrc01.spec
>>
>> I will note that this was built using the default version of UCX that
>> comes with EPEL (1.5.1).  We only built 1.6.0 ourselves because the
>> version provided by EPEL was not built with MT enabled, which seems
>> strange to me, as I don't see any reason not to build with MT enabled.
>> Anyway, that's the deeper context.
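>>
>> For reference, a minimal sketch of building UCX with MT support and
>> pointing OpenMPI at it (the prefix is illustrative, and any other
>> configure options we used are omitted):
>>
>> # build UCX 1.6.0 with multi-threading support
>> ./contrib/configure-release --prefix=/opt/ucx-1.6.0 --enable-mt
>> make -j && make install
>> # then rebuild OpenMPI against that UCX
>> ./configure --with-ucx=/opt/ucx-1.6.0 ...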
>>
>> -Paul Edmon-
>> On 8/23/2019 5:49 PM, Joshua Ladd via users wrote:
>>
>> Paul,
>>
>> Can you provide a repro and command line, please? Also, what network
>> hardware are you using?
>>
>> Josh
>>
>> On Fri, Aug 23, 2019 at 3:35 PM Paul Edmon via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> I have a code that uses MPI_THREAD_MULTIPLE along with MPI-RMA, which
>>> I'm running with OpenMPI 4.0.1.  Since 4.0.1 requires UCX, I have it
>>> installed with MT enabled (a 1.6.0 build).  The thing is that the code
>>> keeps stalling out when I go above a couple of nodes.  UCX is new to
>>> our environment; previously we just used the regular IB Verbs with no
>>> problem.  My guess is that there is either some option in OpenMPI I am
>>> missing or some variable in UCX I am not setting.  Any insight into
>>> what could be causing the stalls?
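>>>
>>> In case it helps with reproducing, a sketch of how one could pin a run
>>> to the UCX components and turn up UCX's logging (all standard OpenMPI
>>> MCA parameters and UCX environment variables; the values here are
>>> illustrative):
>>>
>>> mpirun -np 16 --hostfile hosts \
>>>     --mca pml ucx --mca osc ucx \
>>>     -x UCX_LOG_LEVEL=info \
>>>     ./wombat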
>>>
>>> -Paul Edmon-
>>>
>>
>>
>
>
