Just add it to the existing modex. -Nathan
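
For reference, the namespace ID each rank would publish through the modex can be taken from /proc/self/ns/user: the kernel exposes namespace identity as that file's inode number (see namespaces(7)), so two processes are in the same user namespace exactly when the inodes match. A minimal, compilable sketch - not Open MPI code; the modex plumbing would follow the OPAL_MODEX_SEND_VALUE/OPAL_MODEX_RECV_VALUE calls quoted below:

    /* Sketch, not Open MPI code: obtain an identifier for the calling
     * process's user namespace.  The kernel exposes namespace identity
     * as the inode number of /proc/self/ns/user; two processes share a
     * user namespace exactly when these inode numbers are equal. */
    #include <inttypes.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat st;

        if (stat("/proc/self/ns/user", &st) != 0) {
            perror("stat(/proc/self/ns/user)");
            return 1;
        }
        /* This is the value each rank would publish (e.g. via the
         * OPAL_MODEX_SEND_VALUE call quoted below) and compare against
         * what its local peers published before enabling CMA. */
        printf("user namespace id: %" PRIu64 "\n", (uint64_t) st.st_ino);
        return 0;
    }
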
> On Jul 22, 2019, at 12:20 PM, Adrian Reber via users <users@lists.open-mpi.org> wrote:
>
> I have most of the code ready, but I still have trouble doing OPAL_MODEX_RECV. I am using the following lines, based on the code from orte/test/mpi/pmix.c:
>
> OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", &value, OPAL_INT);
>
> This sets rc to 0. For receiving:
>
> OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", &wildcard_rank, &ptr, OPAL_INT);
>
> and rc is always set to -13. Is this how it is supposed to work, or do I have to do it differently?
>
> Adrian
>
>> On Mon, Jul 22, 2019 at 02:03:20PM +0000, Ralph Castain via users wrote:
>> If that works, then it might be possible to include the namespace ID in the job-info provided by PMIx at startup - would have to investigate, so please confirm that the modex option works first.
>>
>>> On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users <users@lists.open-mpi.org> wrote:
>>>
>>> Adrian,
>>>
>>> An option is to involve the modex.
>>>
>>> Each task would OPAL_MODEX_SEND() its own namespace ID, and then OPAL_MODEX_RECV() the one from its peers and decide whether CMA support can be enabled.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>> On 7/22/2019 4:53 PM, Adrian Reber via users wrote:
>>>> I had a look at it and I am not sure it really makes sense.
>>>>
>>>> In btl_vader_{put,get}.c it would be easy to check for the user namespace ID of the other process, but the function would then just return OPAL_ERROR a bit earlier instead of as a result of process_vm_{read,write}v(). Nothing would really change.
>>>>
>>>> A better place for the check would be mca_btl_vader_check_single_copy(), but I do not know if at this point the PID of the other processes is already known. I am not sure if I can check for the user namespace ID of the other processes.
>>>>
>>>> Any recommendations on how to do this?
>>>>
>>>> Adrian
>>>>
>>>>> On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote:
>>>>> Patches are always welcome. What would be great is a nice big warning that CMA support is disabled because the processes are in different namespaces. Ideally all MPI processes should be in the same namespace to ensure the best performance.
>>>>>
>>>>> -Nathan
>>>>>
>>>>>> On Jul 21, 2019, at 2:53 PM, Adrian Reber via users <users@lists.open-mpi.org> wrote:
>>>>>>
>>>>>> For completeness I am mentioning my results here as well.
>>>>>>
>>>>>> Mounting file systems in the container can only work if user namespaces are used, and even if the user IDs are all the same (in each container and on the host), the kernel also checks, before allowing ptrace, that the processes are in the same user namespace (in addition to being owned by the same user). This check - same user namespace - fails, and so process_vm_readv() and process_vm_writev() fail as well.
>>>>>>
>>>>>> So Open MPI's checks are currently not enough to detect whether 'cma' can be used. Checking for the same user namespace would also be necessary.
>>>>>>
>>>>>> Is this a use case important enough to accept a patch for it?
>>>>>>
>>>>>> Adrian
>>>>>>
>>>>>>> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote:
>>>>>>> Gilles,
>>>>>>>
>>>>>>> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' does indeed help.
>>>>>>>
>>>>>>> The default seems to be 'cma', which appears to use process_vm_readv() and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but telling Podman to give the process CAP_SYS_PTRACE with '--cap-add=SYS_PTRACE' does not seem to be enough. I am not sure yet whether this is related to the fact that Podman is running rootless. I will continue to investigate, but now I know where to look. Thanks!
>>>>>>>
>>>>>>> Adrian
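
For reference, vader's 'cma' mechanism boils down to process_vm_readv()/process_vm_writev() calls such as the standalone probe sketched below (not Open MPI code; the PID, address and length are supplied on the command line). The kernel guards these calls with the same access check as a ptrace attach, so when the target process lives in a different user namespace the call fails with EPERM even for the same UID - which is what the "Read -1, ... errno = 1" output further down in the thread corresponds to (errno 1 is EPERM).

    /* Sketch: the single-copy primitive behind vader's "cma" mechanism.
     * Usage: ./cma_probe <pid> <hex-address> <len>
     * If <pid> is in a different user namespace, the ptrace access check
     * fails and process_vm_readv() returns -1 with errno set to EPERM.
     * (A bogus address in an otherwise accessible process yields EFAULT.) */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    int main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s <pid> <hex-address> <len>\n", argv[0]);
            return 1;
        }
        pid_t pid = (pid_t) atoi(argv[1]);
        void *remote_addr = (void *) (uintptr_t) strtoull(argv[2], NULL, 16);
        size_t len = (size_t) strtoull(argv[3], NULL, 10);

        char *buf = malloc(len ? len : 1);
        struct iovec local  = { .iov_base = buf,         .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

        ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
        if (n < 0) {
            fprintf(stderr, "process_vm_readv: %s (errno %d)\n",
                    strerror(errno), errno);
            free(buf);
            return 1;
        }
        printf("read %zd bytes\n", n);
        free(buf);
        return 0;
    }
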
>>>>>>>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users wrote:
>>>>>>>> Adrian,
>>>>>>>>
>>>>>>>> Can you try
>>>>>>>> mpirun --mca btl_vader_copy_mechanism none ...
>>>>>>>>
>>>>>>>> Please double check the MCA parameter name, I am AFK.
>>>>>>>>
>>>>>>>> IIRC, the default copy mechanism used by vader directly accesses the remote process address space, and this requires some permission (ptrace?) that might be dropped by podman.
>>>>>>>>
>>>>>>>> Note Open MPI might not detect that both MPI tasks run on the same node because of podman. If you use UCX, then btl/vader is not used at all (pml/ucx is used instead).
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>> Sent from my iPod
>>>>>>>>
>>>>>>>>> On Jul 12, 2019, at 18:33, Adrian Reber via users <users@lists.open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>> So upstream Podman was really fast and merged a PR which makes my wrapper unnecessary:
>>>>>>>>>
>>>>>>>>> Add support for --env-host: https://github.com/containers/libpod/pull/3557
>>>>>>>>>
>>>>>>>>> As commented in the PR, I can now start mpirun with Podman without a wrapper:
>>>>>>>>>
>>>>>>>>> $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host --security-opt label=disable -v /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test /home/mpi/ring
>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>> Rank 0 has completed ring
>>>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>>>> Rank 1 has completed ring
>>>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>>>>
>>>>>>>>> This example was using TCP; on an InfiniBand based system I have to map the InfiniBand devices into the container:
>>>>>>>>>
>>>>>>>>> $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host -v /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable --userns=keep-id --device /dev/infiniband/uverbs0 --device /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test /home/mpi/ring
>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>> Rank 0 has completed ring
>>>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>>>> Rank 1 has completed ring
>>>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>>>>
>>>>>>>>> This is all running without root and only using Podman's rootless support.
>>>>>>>>>
>>>>>>>>> Running multiple processes on one system, however, still gives me an error. If I disable vader, I guess Open MPI uses TCP for localhost communication, and that works. But with vader it fails.
>>>>>>>>>
>>>>>>>>> The first error message I get is a segfault:
>>>>>>>>>
>>>>>>>>> [test1:00001] *** Process received signal ***
>>>>>>>>> [test1:00001] Signal: Segmentation fault (11)
>>>>>>>>> [test1:00001] Signal code: Address not mapped (1)
>>>>>>>>> [test1:00001] Failing at address: 0x7fb7b1552010
>>>>>>>>> [test1:00001] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
>>>>>>>>> [test1:00001] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
>>>>>>>>> [test1:00001] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
>>>>>>>>> [test1:00001] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
>>>>>>>>> [test1:00001] [ 4] /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
>>>>>>>>> [test1:00001] [ 5] /home/mpi/ring[0x400b76]
>>>>>>>>> [test1:00001] [ 6] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813]
>>>>>>>>> [test1:00001] [ 7] /home/mpi/ring[0x4008be]
>>>>>>>>> [test1:00001] *** End of error message ***
>>>>>>>>>
>>>>>>>>> Guessing that vader uses shared memory, this is expected to fail with all the namespace isolations in place - maybe not with a segfault, but each container has its own shared memory. So the next step was to use the host's IPC and PID namespaces and mount /dev/shm:
>>>>>>>>>
>>>>>>>>> '-v /dev/shm:/dev/shm --ipc=host --pid=host'
>>>>>>>>>
>>>>>>>>> This does not segfault, but still does not look correct:
>>>>>>>>>
>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>> Rank 2 has cleared MPI_Init
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> Rank 0 has completed ring
>>>>>>>>> Rank 2 has completed ring
>>>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>>>> Rank 1 has completed ring
>>>>>>>>> Rank 2 has completed MPI_Barrier
>>>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>>>>
>>>>>>>>> This is using the Open MPI ring.c example with SIZE increased from 20 to 20000.
>>>>>>>>>
>>>>>>>>> Any recommendations on what vader needs to communicate correctly?
>>>>>>>>>
>>>>>>>>> Adrian
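
The conclusion reached in the newer messages above - the processes must share a user namespace for CMA to work - suggests a check roughly like the sketch below. This is not existing Open MPI code, just an illustration of how, given a local peer's PID, one could decide whether to keep CMA enabled or fall back to another copy mechanism:

    /* Sketch only, not existing Open MPI code: decide whether CMA can be
     * used with a local peer by comparing user-namespace inodes.
     * Usage: ./same_userns <pid> */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static int same_user_namespace(pid_t peer)
    {
        struct stat self_st, peer_st;
        char path[64];

        if (stat("/proc/self/ns/user", &self_st) != 0) {
            return 0;               /* cannot tell: be conservative, disable CMA */
        }
        snprintf(path, sizeof(path), "/proc/%ld/ns/user", (long) peer);
        if (stat(path, &peer_st) != 0) {
            return 0;               /* peer not inspectable: disable CMA */
        }
        return self_st.st_ino == peer_st.st_ino;
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        pid_t peer = (pid_t) atol(argv[1]);
        printf("CMA %s\n",
               same_user_namespace(peer) ? "usable" : "should be disabled");
        return 0;
    }
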
>>>>>>>>>> On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users wrote:
>>>>>>>>>> Gilles,
>>>>>>>>>>
>>>>>>>>>> thanks for pointing out the environment variables. I quickly created a wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables (grep "\(PMIX\|OMPI\)"). Now it works:
>>>>>>>>>>
>>>>>>>>>> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id --net=host mpi-test /home/mpi/hello
>>>>>>>>>>
>>>>>>>>>> Hello, world (2 procs total)
>>>>>>>>>> --> Process # 0 of 2 is alive. ->test1
>>>>>>>>>> --> Process # 1 of 2 is alive. ->test2
>>>>>>>>>>
>>>>>>>>>> I need to tell Podman to mount /tmp from the host into the container. As I am running rootless, I also need to tell Podman to use the same user ID inside the container as outside (so that the Open MPI files in /tmp can be shared), and I am also running without a network namespace.
>>>>>>>>>>
>>>>>>>>>> So this is now with the full Podman-provided isolation except the network namespace. Thanks for your help!
>>>>>>>>>>
>>>>>>>>>> Adrian
>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via users wrote:
>>>>>>>>>>> Adrian,
>>>>>>>>>>>
>>>>>>>>>>> the MPI application relies on some environment variables (they typically start with OMPI_ and PMIX_).
>>>>>>>>>>>
>>>>>>>>>>> The MPI application internally uses a PMIx client that must be able to contact a PMIx server located on the same host (the server is included in mpirun and the orted daemon(s) spawned on the remote hosts).
>>>>>>>>>>>
>>>>>>>>>>> If podman provides some isolation between the app inside the container (e.g. /home/mpi/hello) and the outside world (e.g. mpirun/orted), that won't be an easy ride.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Gilles
>>>>>>>>>>>
>>>>>>>>>>>> On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
>>>>>>>>>>>> I did a quick test to see if I can use Podman in combination with Open MPI:
>>>>>>>>>>>>
>>>>>>>>>>>> [test@test1 ~]$ mpirun --hostfile ~/hosts podman run quay.io/adrianreber/mpi-test /home/mpi/hello
>>>>>>>>>>>>
>>>>>>>>>>>> Hello, world (1 procs total)
>>>>>>>>>>>> --> Process # 0 of 1 is alive. ->789b8fb622ef
>>>>>>>>>>>>
>>>>>>>>>>>> Hello, world (1 procs total)
>>>>>>>>>>>> --> Process # 0 of 1 is alive. ->749eb4e1c01a
>>>>>>>>>>>>
>>>>>>>>>>>> The test program (hello) is taken from https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
>>>>>>>>>>>>
>>>>>>>>>>>> The problem is that each process thinks it is process 0 of 1 instead of:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello, world (2 procs total)
>>>>>>>>>>>> --> Process # 1 of 2 is alive. ->test1
>>>>>>>>>>>> --> Process # 0 of 2 is alive. ->test2
>>>>>>>>>>>>
>>>>>>>>>>>> My question is: how is the rank determined? What resources do I need to have in my container to correctly determine the rank?
>>>>>>>>>>>>
>>>>>>>>>>>> This is Podman 1.4.2 and Open MPI 4.0.1.
>>>>>>>>>>>>
>>>>>>>>>>>> Adrian
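
A quick way to see whether the launch environment Gilles describes actually reaches the containerized process is to run a small diagnostic like the one below inside 'podman run'. It mirrors the grep used for the wrapper above and only lists variables with the OMPI_ or PMIX_ prefix (a sketch, not Open MPI code):

    /* Small diagnostic (sketch): print the OMPI_* and PMIX_* variables
     * visible to this process.  If these never make it into the
     * container, every rank starts as a singleton and reports
     * "Process # 0 of 1". */
    #include <stdio.h>
    #include <string.h>

    extern char **environ;

    int main(void)
    {
        for (char **e = environ; *e != NULL; e++) {
            if (strncmp(*e, "OMPI_", 5) == 0 || strncmp(*e, "PMIX_", 5) == 0) {
                puts(*e);
            }
        }
        return 0;
    }
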
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users