apropos :-)

On Mon, Aug 26, 2019 at 9:19 PM Joshua Ladd <jladd.m...@gmail.com> wrote:
> Hi, Paul
>
> I must say, this is eerily apropos. I just sent a request for Wombat
> last week, as I was planning to have my group start looking at the
> performance of UCX OSC on IB. We are most interested in ensuring UCX OSC
> MT performs well on Wombat. The Bitbucket you're referencing: is this the
> source code? Can we build and run it?
>
> Best,
>
> Josh
>
> On Fri, Aug 23, 2019 at 9:37 PM Paul Edmon via users
> <users@lists.open-mpi.org> wrote:
>
>> I forgot to include that we have not rebuilt this OpenMPI 4.0.1 against
>> UCX 1.6.0; it is still built against 1.5.1. When we upgraded to 1.6.0,
>> everything seemed to keep working for OpenMPI when we swapped the UCX
>> version without recompiling (at least for normal rank-level MPI; we had
>> to do the upgrade to UCX to get MPI_THREAD_MULTIPLE to work at all).
>>
>> -Paul Edmon-
>>
>> On 8/23/2019 9:31 PM, Paul Edmon wrote:
>>
>> Sure. The code I'm using is the latest version of Wombat
>> (https://bitbucket.org/pmendygral/wombat-public/wiki/Home; I'm using an
>> unreleased updated version, as I know the devs). I'm running with
>> OMP_NUM_THREADS=12 and the command line is:
>>
>> mpirun -np 16 --hostfile hosts ./wombat
>>
>> where the host file lists 4 machines, so 4 ranks per machine and 12
>> threads per rank. Each node has 48 Intel Cascade Lake cores. I've also
>> tried the Slurm scheduler version, which is:
>>
>> srun -n 16 -c 12 --mpi=pmix ./wombat
>>
>> which also hangs. It works if I constrain it to one or two nodes, but
>> anything greater than that hangs. As for network hardware:
>>
>> [root@holy7c02101 ~]# ibstat
>> CA 'mlx5_0'
>>         CA type: MT4119
>>         Number of ports: 1
>>         Firmware version: 16.25.6000
>>         Hardware version: 0
>>         Node GUID: 0xb8599f0300158f20
>>         System image GUID: 0xb8599f0300158f20
>>         Port 1:
>>                 State: Active
>>                 Physical state: LinkUp
>>                 Rate: 100
>>                 Base lid: 808
>>                 LMC: 1
>>                 SM lid: 584
>>                 Capability mask: 0x2651e848
>>                 Port GUID: 0xb8599f0300158f20
>>                 Link layer: InfiniBand
>>
>> [root@holy7c02101 ~]# lspci | grep Mellanox
>> 58:00.0 Infiniband controller: Mellanox Technologies MT27800 Family
>> [ConnectX-5]
>>
>> As for the IB RDMA kernel stack, we are using the default drivers that
>> come with CentOS 7.6.1810, which is rdma-core 17.2-3.
>>
>> I will note that earlier this week I successfully ran an old version of
>> Wombat on all 30,000 cores of this system using OpenMPI 3.1.3 and
>> regular IB Verbs with no problem, though that was pure MPI ranks with no
>> threads. So the fabric itself is healthy and in good shape. It seems to
>> be this edge case of the latest OpenMPI with UCX and threads that is
>> causing the hangs. To be sure, the latest version of Wombat (as I
>> believe the public version does as well) uses many of the
>> state-of-the-art MPI RMA direct calls, so it's definitely pushing the
>> envelope in ways our typical user base here will not. Still, it would be
>> good to iron out this kink so that if users do hit it we have a
>> solution. As noted, UCX is very new to us, so it is entirely possible
>> that we are missing something in its interaction with OpenMPI. Our MPI
>> is compiled as follows:
>>
>> https://github.com/fasrc/helmod/blob/master/rpmbuild/SPECS/centos7/openmpi-4.0.1-fasrc01.spec
>>
>> I will note that when I built this, it was built using the default
>> version of UCX that comes with EPEL (1.5.1). We only built 1.6.0
>> ourselves because the version provided by EPEL was not built with MT
>> enabled, which seems strange to me, as I don't see any reason not to
>> build with MT enabled.
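One quick, Wombat-independent sanity check on the MT side of this: print the thread level that MPI_Init_thread actually provides, since the standard allows it to come back lower than what was requested. A minimal sketch, illustrative only, built with the usual mpicc wrapper:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int requested = MPI_THREAD_MULTIPLE, provided, rank;

    /* Ask for full thread support; the library reports what it can do. */
    MPI_Init_thread(&argc, &argv, requested, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 && provided < requested)
        fprintf(stderr, "requested MPI_THREAD_MULTIPLE (%d) but got %d\n",
                requested, provided);

    MPI_Finalize();
    return 0;
}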
>> Anyways, that's the deeper context.
>>
>> -Paul Edmon-
>>
>> On 8/23/2019 5:49 PM, Joshua Ladd via users wrote:
>>
>> Paul,
>>
>> Can you provide a repro and command line, please? Also, what network
>> hardware are you using?
>>
>> Josh
>>
>> On Fri, Aug 23, 2019 at 3:35 PM Paul Edmon via users
>> <users@lists.open-mpi.org> wrote:
>>
>>> I have a code using MPI_THREAD_MULTIPLE along with MPI-RMA that I'm
>>> running with OpenMPI 4.0.1. Since 4.0.1 requires UCX, I have it
>>> installed with MT on (a 1.6.0 build). The thing is that the code keeps
>>> stalling out when I go above a couple of nodes. UCX is new to our
>>> environment, as previously we had just used regular IB Verbs with no
>>> problem. My guess is that there is either some option in OpenMPI I am
>>> missing or some variable in UCX I am not setting. Any insight into
>>> what could be causing the stalls?
>>>
>>> -Paul Edmon-
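For anyone who wants to poke at this without building Wombat, the pattern at issue is essentially MPI_THREAD_MULTIPLE plus passive-target MPI RMA spread across several nodes. Below is a minimal sketch of that pattern only; it is not Wombat's actual communication code, and the window size, the lock_all/flush synchronization, and the final local read (which assumes the unified memory model) are all illustrative choices:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request full thread support, as Wombat does, even though this
       sketch itself is single-threaded. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* One double per rank, exposed in an RMA window. */
    double *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = -1.0;

    /* Passive-target epoch: each rank puts its rank number into its
       right-hand neighbor's window and flushes the transfer. */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    int target = (rank + 1) % nranks;
    double val = (double)rank;
    MPI_Put(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_flush(target, win);
    MPI_Win_unlock_all(win);

    MPI_Barrier(MPI_COMM_WORLD);

    /* Reading the local window directly assumes the unified memory model. */
    printf("rank %d holds %.0f (expected %d)\n",
           rank, *base, (rank + nranks - 1) % nranks);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}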
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users