hi max,

>> you let pmix do its job and thus simply start the mpi parts with srun
>> instead of mpirun
>
> In this case, the srun command works fine for 'srun -N2 -n2 --mpi=pmix
> pingpong 100 100', but the IB connection is not used for the
> communication, only the tcp connection.
hmmmm, this is odd. you should check if ucx is used at all (there are
env variables to make it verbose; it should also spit out what
connection options it uses).
how did you set the `UCX_TLS` var? are you sure it's not in the job env?
(do an "env | grep UCX" before the srun)
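something along these lines should tell you (i'm going from memory on
the exact knobs, so double check them against the ucx docs; the pingpong
binary and node/task counts are just taken from your srun line):

    # what transports/devices does this ucx build actually see?
    ucx_info -d | grep -i -E 'transport|device'
    # make sure nothing in the job environment is already forcing tcp
    env | grep UCX
    # turn up ucx logging; at info level it reports what it picks (iirc)
    UCX_LOG_LEVEL=info srun -N2 -n2 --mpi=pmix pingpong 100 100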
> The output of 'pmix_info' is:
>
>       MCA bfrops: v21 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA bfrops: v12 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA bfrops: v20 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA bfrops: v3 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA gds: ds12 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA gds: hash (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA gds: ds21 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA pdl: pdlopen (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA pif: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v3.1.5)
>       MCA pif: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v3.1.5)
>       MCA pinstalldirs: env (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA pinstalldirs: config (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA plog: stdfd (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA plog: syslog (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA plog: default (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA pnet: tcp (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA pnet: test (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA preg: native (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA psec: native (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA psec: none (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA psensor: file (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA psensor: heartbeat (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA pshmem: mmap (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA ptl: usock (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>       MCA ptl: tcp (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>
> Isn't there supposed to appear something with ucx?
dunno, i would have to check our setup again.

stijn

> Thanks :)
> max
>
>> hi max,
>>
>>> I have set: 'UCX_TLS=tcp,self,sm' on the slurmd's.
>>> Is it better to build slurm without UCX support or should I simply
>>> install rdma-core?
>> i would look into using mellanox ofed with rdma-core, as it is what
>> mellanox is shifting towards or has already done (not sure what 4.9
>> has tbh). or leave the env vars, i think for pmix it's ok unless you
>> have very large clusters (but i'm no expert here).
>>
>>> How do I use ucx together with OpenMPI and srun now?
>>> It works when I set this manually:
>>> 'mpirun -np 2 -H lsm218,lsm219 --mca pml ucx -x UCX_TLS=rc -x
>>> UCX_NET_DEVICES=mlx5_0:1 pingpong 1000 1000'.
>>> But if I put srun before mpirun four tasks will be created, two on
>>> each node.
>> you let pmix do its job and thus simply start the mpi parts with srun
>> instead of mpirun
>>
>>     srun pingpong 1000 1000
>>
>> if you must tune UCX (as in: the default behaviour is not ok), also
>> set it via env vars. (at least try to use the defaults, it's pretty
>> good i think)
>>
>> (shameless plug: one of my colleagues set up a tech talk with openmpi
>> people wrt pmix, ucx, openmpi etc; see
>> https://github.com/easybuilders/easybuild/issues/630 for details and
>> a link to the youtube recording)
>>
>> stijn
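ps: re the "set it via env vars" bit in the quoted mail above: if the
defaults really won't pick the IB port and you have to pin it, the
env-var equivalent of your mpirun line would be roughly the following
(UCX_TLS=rc and UCX_NET_DEVICES=mlx5_0:1 are copied from your mpirun
example, and OMPI_MCA_pml=ucx is the env-var form of '--mca pml ucx';
treat this as a sketch, i have not tested it on your setup):

    # same tuning as the mpirun line, but via the environment so that
    # srun/pmix pass it through to the tasks
    export OMPI_MCA_pml=ucx
    export UCX_TLS=rc
    export UCX_NET_DEVICES=mlx5_0:1
    srun -N2 -n2 --mpi=pmix pingpong 1000 1000

but as said, try the plain defaults first.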