I have most of the code ready, but I still have trouble with OPAL_MODEX_RECV. I am using the following lines, based on the code from orte/test/mpi/pmix.c:

    OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", &value, OPAL_INT);

This sets rc to 0. For receiving:

    OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", &wildcard_rank, &ptr, OPAL_INT);

Here rc is always set to -13. Is this how it is supposed to work, or do I have to do it differently?

		Adrian
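A note on the modex calls above: rc = -13 appears to be OPAL_ERR_NOT_FOUND (opal/constants.h). One plausible cause is that a value published by an individual rank has to be retrieved with that rank's process name, whereas the wildcard-rank form used in orte/test/mpi/pmix.c is meant for job-level keys. The sketch below only illustrates that pairing; the function names, the peer argument and the header path are assumptions for illustration, not code from this thread.

    /* Sketch only - assumes Open MPI 4.0.x internals; header path and
     * function names are illustrative. */
    #include "opal/mca/pmix/pmix.h"  /* OPAL_MODEX_SEND_VALUE / OPAL_MODEX_RECV_VALUE */

    /* Called early (e.g. during component init), before the modex is
     * exchanged in MPI_Init: publish this rank's namespace ID for peers
     * on the same node. */
    static int publish_user_ns_id(int my_ns_id)
    {
        int rc;
        OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", &my_ns_id, OPAL_INT);
        return rc;
    }

    /* Called later (e.g. in add_procs), after the modex has been exchanged:
     * fetch the value a specific peer published. */
    static int peer_user_ns_id(opal_process_name_t *peer, int *ns_id_out)
    {
        int rc;
        int *ptr = ns_id_out;  /* fixed-size values are written into *ptr */

        OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", peer, &ptr, OPAL_INT);
        return rc;             /* OPAL_ERR_NOT_FOUND if the key was never published */
    }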
On Mon, Jul 22, 2019 at 02:03:20PM +0000, Ralph Castain via users wrote:
> If that works, then it might be possible to include the namespace ID in the job-info provided by PMIx at startup - would have to investigate, so please confirm that the modex option works first.
>
> > On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users <users@lists.open-mpi.org> wrote:
> >
> > Adrian,
> >
> > An option is to involve the modex.
> >
> > Each task would OPAL_MODEX_SEND() its own namespace ID, and then OPAL_MODEX_RECV() the one from its peers and decide whether CMA support can be enabled.
> >
> > Cheers,
> >
> > Gilles
> >
> > On 7/22/2019 4:53 PM, Adrian Reber via users wrote:
> >> I had a look at it and I am not sure if it really makes sense.
> >>
> >> In btl_vader_{put,get}.c it would be easy to check the user namespace ID of the other process, but the function would then just return OPAL_ERROR a bit earlier instead of as a result of process_vm_{read,write}v(). Nothing would really change.
> >>
> >> A better place for the check would be mca_btl_vader_check_single_copy(), but I do not know if at that point the PID of the other processes is already known. I am not sure if I can check the user namespace ID of the other processes there.
> >>
> >> Any recommendations on how to do this?
> >>
> >> Adrian
> >>
> >> On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote:
> >>> Patches are always welcome. What would be great is a nice big warning that CMA support is disabled because the processes are in different namespaces. Ideally all MPI processes should be in the same namespace to ensure the best performance.
> >>>
> >>> -Nathan
> >>>
> >>>> On Jul 21, 2019, at 2:53 PM, Adrian Reber via users <users@lists.open-mpi.org> wrote:
> >>>>
> >>>> For completeness I am mentioning my results here as well.
> >>>>
> >>>> Mounting file systems in the container can only work if user namespaces are used, and even if the user IDs are all the same (in each container and on the host), the kernel checks, before allowing ptrace, that the processes are in the same user namespace (in addition to being owned by the same user). This check - same user namespace - fails, and so process_vm_readv() and process_vm_writev() also fail.
> >>>>
> >>>> So Open MPI's checks are currently not enough to detect whether 'cma' can be used. Checking for the same user namespace would also be necessary.
> >>>>
> >>>> Is this a use case important enough to accept a patch for it?
> >>>>
> >>>> Adrian
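One way to implement the check discussed above, assuming the peer's PID is available at that point (vader already needs the peer PID to call process_vm_readv()), is to compare the user namespaces of the two processes via /proc: two processes are in the same user namespace exactly when /proc/<pid>/ns/user refers to the same object, i.e. the same st_dev/st_ino. The helper below is an illustration, not existing Open MPI code.

    /* Illustrative helper, not from the Open MPI tree. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static bool same_user_namespace(pid_t peer_pid)
    {
        char path[64];
        struct stat mine, theirs;

        if (stat("/proc/self/ns/user", &mine) < 0) {
            return true;   /* kernel without user namespace support: nothing to compare */
        }
        snprintf(path, sizeof(path), "/proc/%d/ns/user", (int) peer_pid);
        if (stat(path, &theirs) < 0) {
            return false;  /* cannot even inspect the peer, so CMA will not work either */
        }
        return mine.st_dev == theirs.st_dev && mine.st_ino == theirs.st_ino;
    }

Alternatively, each rank could publish the st_ino of its own /proc/self/ns/user through the modex, as suggested above, so the comparison does not depend on already knowing the peer's PID.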
> >>>>> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote:
> >>>>> Gilles,
> >>>>>
> >>>>> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' indeed helps.
> >>>>>
> >>>>> The default seems to be 'cma', and that seems to use process_vm_readv() and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but telling Podman to give the process CAP_SYS_PTRACE with '--cap-add=SYS_PTRACE' does not seem to be enough. I am not sure yet if this is related to the fact that Podman is running rootless. I will continue to investigate, but now I know where to look. Thanks!
> >>>>>
> >>>>> Adrian
> >>>>>
> >>>>>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users wrote:
> >>>>>> Adrian,
> >>>>>>
> >>>>>> Can you try
> >>>>>>
> >>>>>>     mpirun --mca btl_vader_copy_mechanism none ...
> >>>>>>
> >>>>>> Please double check the MCA parameter name, I am AFK.
> >>>>>>
> >>>>>> IIRC, the default copy mechanism used by vader directly accesses the remote process address space, and this requires some permission (ptrace?) that might be dropped by podman.
> >>>>>>
> >>>>>> Note that Open MPI might not detect that both MPI tasks run on the same node because of podman. If you use UCX, then btl/vader is not used at all (pml/ucx is used instead).
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Gilles
> >>>>>>
> >>>>>> Sent from my iPod
> >>>>>>
> >>>>>>> On Jul 12, 2019, at 18:33, Adrian Reber via users <users@lists.open-mpi.org> wrote:
> >>>>>>>
> >>>>>>> So upstream Podman was really fast and merged a PR which makes my wrapper unnecessary:
> >>>>>>>
> >>>>>>> Add support for --env-host: https://github.com/containers/libpod/pull/3557
> >>>>>>>
> >>>>>>> As commented in the PR, I can now start mpirun with Podman without a wrapper:
> >>>>>>>
> >>>>>>> $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host --security-opt label=disable -v /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test /home/mpi/ring
> >>>>>>> Rank 0 has cleared MPI_Init
> >>>>>>> Rank 1 has cleared MPI_Init
> >>>>>>> Rank 0 has completed ring
> >>>>>>> Rank 0 has completed MPI_Barrier
> >>>>>>> Rank 1 has completed ring
> >>>>>>> Rank 1 has completed MPI_Barrier
> >>>>>>>
> >>>>>>> This example was using TCP; on an InfiniBand based system I have to map the InfiniBand devices into the container:
> >>>>>>>
> >>>>>>> $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host -v /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable --userns=keep-id --device /dev/infiniband/uverbs0 --device /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test /home/mpi/ring
> >>>>>>> Rank 0 has cleared MPI_Init
> >>>>>>> Rank 1 has cleared MPI_Init
> >>>>>>> Rank 0 has completed ring
> >>>>>>> Rank 0 has completed MPI_Barrier
> >>>>>>> Rank 1 has completed ring
> >>>>>>> Rank 1 has completed MPI_Barrier
> >>>>>>>
> >>>>>>> This is all running without root and only using Podman's rootless support.
> >>>>>>>
> >>>>>>> Running multiple processes on one system, however, still gives me an error. If I disable vader, I guess Open MPI uses TCP for localhost communication, and that works. But with vader it fails.
> >>>>>>>
> >>>>>>> The first error message I get is a segfault:
> >>>>>>>
> >>>>>>> [test1:00001] *** Process received signal ***
> >>>>>>> [test1:00001] Signal: Segmentation fault (11)
> >>>>>>> [test1:00001] Signal code: Address not mapped (1)
> >>>>>>> [test1:00001] Failing at address: 0x7fb7b1552010
> >>>>>>> [test1:00001] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
> >>>>>>> [test1:00001] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
> >>>>>>> [test1:00001] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
> >>>>>>> [test1:00001] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
> >>>>>>> [test1:00001] [ 4] /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
> >>>>>>> [test1:00001] [ 5] /home/mpi/ring[0x400b76]
> >>>>>>> [test1:00001] [ 6] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813]
> >>>>>>> [test1:00001] [ 7] /home/mpi/ring[0x4008be]
> >>>>>>> [test1:00001] *** End of error message ***
> >>>>>>>
> >>>>>>> Guessing that vader uses shared memory, this is expected to fail with all the namespace isolations in place (maybe not with a segfault), because each container has its own shared memory. So the next step was to use the host's IPC and PID namespaces and mount /dev/shm:
> >>>>>>>
> >>>>>>> '-v /dev/shm:/dev/shm --ipc=host --pid=host'
> >>>>>>>
> >>>>>>> This does not segfault, but still does not look correct:
> >>>>>>>
> >>>>>>> Rank 0 has cleared MPI_Init
> >>>>>>> Rank 1 has cleared MPI_Init
> >>>>>>> Rank 2 has cleared MPI_Init
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> Rank 0 has completed ring
> >>>>>>> Rank 2 has completed ring
> >>>>>>> Rank 0 has completed MPI_Barrier
> >>>>>>> Rank 1 has completed ring
> >>>>>>> Rank 2 has completed MPI_Barrier
> >>>>>>> Rank 1 has completed MPI_Barrier
> >>>>>>>
> >>>>>>> This is using the Open MPI ring.c example with SIZE increased from 20 to 20000.
> >>>>>>>
> >>>>>>> Any recommendations on what vader needs to communicate correctly?
> >>>>>>>
> >>>>>>> Adrian
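For reference, errno = 1 is EPERM, which is what process_vm_readv() returns when the kernel's ptrace access check fails - for example because the two processes are in different user namespaces, as discussed earlier in the thread. The small standalone test below (written for this write-up, not taken from the thread) can show whether cross-memory attach works between two processes under a given container setup: start one copy without arguments in one container, note the printed PID and buffer address, then run a second copy in another container (with --pid=host, so the PID is meaningful in both) passing those two values as arguments.

    /* cma_check.c - rough illustration of the CMA permission check. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static char buffer[64] = "hello from the target process";

    int main(int argc, char *argv[])
    {
        if (argc < 3) {
            /* target mode: advertise where the reader should look */
            printf("target: pid=%d addr=%p\n", (int) getpid(), (void *) buffer);
            sleep(60);  /* keep the process alive for the reader */
            return 0;
        }

        /* reader mode: try to read the target's buffer */
        pid_t pid = (pid_t) atoi(argv[1]);
        void *remote_addr = (void *) (uintptr_t) strtoull(argv[2], NULL, 0);

        char local[sizeof(buffer)];
        struct iovec local_iov  = { .iov_base = local,       .iov_len = sizeof(local)  };
        struct iovec remote_iov = { .iov_base = remote_addr, .iov_len = sizeof(buffer) };

        ssize_t n = process_vm_readv(pid, &local_iov, 1, &remote_iov, 1, 0);
        if (n < 0) {
            perror("process_vm_readv");  /* EPERM matches the "errno = 1" lines above */
            return 1;
        }
        printf("read %zd bytes: %s\n", n, local);
        return 0;
    }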
> >>>>>>>
> >>>>>>>> On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users wrote:
> >>>>>>>> Gilles,
> >>>>>>>>
> >>>>>>>> thanks for pointing out the environment variables. I quickly created a wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables (grep "\(PMIX\|OMPI\)"). Now it works:
> >>>>>>>>
> >>>>>>>> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id --net=host mpi-test /home/mpi/hello
> >>>>>>>>
> >>>>>>>> Hello, world (2 procs total)
> >>>>>>>> --> Process # 0 of 2 is alive. ->test1
> >>>>>>>> --> Process # 1 of 2 is alive. ->test2
> >>>>>>>>
> >>>>>>>> I need to tell Podman to mount /tmp from the host into the container. As I am running rootless, I also need to tell Podman to use the same user ID in the container as outside (so that the Open MPI files in /tmp can be shared), and I am running without a network namespace.
> >>>>>>>>
> >>>>>>>> So this is now with the full isolation Podman provides, except for the network namespace. Thanks for your help!
> >>>>>>>>
> >>>>>>>> Adrian
> >>>>>>>>
> >>>>>>>>> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via users wrote:
> >>>>>>>>> Adrian,
> >>>>>>>>>
> >>>>>>>>> the MPI application relies on some environment variables (they typically start with OMPI_ and PMIX_).
> >>>>>>>>>
> >>>>>>>>> The MPI application internally uses a PMIx client that must be able to contact a PMIx server located on the same host (the server is included in mpirun and in the orted daemon(s) spawned on the remote hosts).
> >>>>>>>>>
> >>>>>>>>> If podman provides some isolation between the app inside the container (e.g. /home/mpi/hello) and the outside world (e.g. mpirun/orted), that won't be an easy ride.
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>>
> >>>>>>>>> Gilles
> >>>>>>>>>
> >>>>>>>>>> On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
> >>>>>>>>>> I did a quick test to see if I can use Podman in combination with Open MPI:
> >>>>>>>>>>
> >>>>>>>>>> [test@test1 ~]$ mpirun --hostfile ~/hosts podman run quay.io/adrianreber/mpi-test /home/mpi/hello
> >>>>>>>>>>
> >>>>>>>>>> Hello, world (1 procs total)
> >>>>>>>>>> --> Process # 0 of 1 is alive. ->789b8fb622ef
> >>>>>>>>>>
> >>>>>>>>>> Hello, world (1 procs total)
> >>>>>>>>>> --> Process # 0 of 1 is alive. ->749eb4e1c01a
> >>>>>>>>>>
> >>>>>>>>>> The test program (hello) is taken from https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
> >>>>>>>>>>
> >>>>>>>>>> The problem with this is that each process thinks it is process 0 of 1, instead of
> >>>>>>>>>>
> >>>>>>>>>> Hello, world (2 procs total)
> >>>>>>>>>> --> Process # 1 of 2 is alive. ->test1
> >>>>>>>>>> --> Process # 0 of 2 is alive. ->test2
> >>>>>>>>>>
> >>>>>>>>>> My question is: how is the rank determined? What resources do I need to have in my container to correctly determine the rank?
> >>>>>>>>>>
> >>>>>>>>>> This is Podman 1.4.2 and Open MPI 4.0.1.
> >>>>>>>>>>
> >>>>>>>>>> Adrian
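As explained earlier in the thread, the rank reaches each process through the OMPI_*/PMIX_* environment variables and through the PMIx server that mpirun/orted runs on each host, so everything mpirun puts into the environment has to survive the podman run invocation. A throwaway diagnostic like the one below (written for this write-up; the exact set of PMIX_* variable names differs between PMIx versions) shows what a process actually sees inside the container. If none of this is visible to the process, it falls back to running as a singleton, which is why every rank reports "0 of 1".

    /* printenv_rank.c - diagnostic sketch, not part of Open MPI.
     * Launch it the same way as the hello example, e.g.
     *   mpirun --hostfile ~/hosts podman run ... mpi-test /home/mpi/printenv_rank */
    #include <stdio.h>
    #include <stdlib.h>

    static void show(const char *name)
    {
        const char *value = getenv(name);
        printf("%-22s = %s\n", name, value ? value : "(not set)");
    }

    int main(void)
    {
        show("OMPI_COMM_WORLD_RANK");  /* rank as exported by Open MPI */
        show("OMPI_COMM_WORLD_SIZE");  /* number of ranks in the job */
        show("PMIX_RANK");             /* rank as seen by the PMIx client */
        show("PMIX_NAMESPACE");        /* PMIx job namespace */
        show("PMIX_SERVER_URI2");      /* PMIx server contact info (name varies by PMIx version) */
        return 0;
    }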
> >>
> >> Adrian