I have most of the code ready, but I am still having trouble with
OPAL_MODEX_RECV. I am using the following lines, based on the code from
orte/test/mpi/pmix.c:

OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", &value, OPAL_INT);

This sets rc to 0 (OPAL_SUCCESS). For receiving:

OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", &wildcard_rank, &ptr, OPAL_INT);

and rc is always set to -13, which looks like OPAL_ERR_NOT_FOUND if I read
opal/constants.h correctly. Is this how it is supposed to work, or do I
have to do it differently?
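
For reference, here is roughly the complete fragment I am using; the
wildcard_rank handling follows what ompi_mpi_init.c does for job-level keys,
so that part is my guess and may be exactly what is wrong:

/* Sketch only -- assumes this sits inside the Open MPI tree with the usual
 * internal headers, so OPAL_MODEX_*, OMPI_PROC_MY_NAME and
 * OMPI_NAME_WILDCARD are available (as in ompi_mpi_init.c). */
int rc;
int my_ns_id = 0;            /* user namespace ID of this process */
int peer_ns_id = -1;
int *ptr = &peer_ns_id;
opal_process_name_t wildcard_rank;

/* publish my own value */
OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", &my_ns_id, OPAL_INT);

/* read a value back -- this is the call that always returns -13 */
wildcard_rank.jobid = OMPI_PROC_MY_NAME->jobid;
wildcard_rank.vpid = OMPI_NAME_WILDCARD->vpid;
OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", &wildcard_rank, &ptr, OPAL_INT);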

                Adrian

On Mon, Jul 22, 2019 at 02:03:20PM +0000, Ralph Castain via users wrote:
> If that works, then it might be possible to include the namespace ID in the 
> job-info provided by PMIx at startup - would have to investigate, so please 
> confirm that the modex option works first.
> 
> > On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users 
> > <users@lists.open-mpi.org> wrote:
> > 
> > Adrian,
> > 
> > 
> > An option is to involve the modex.
> > 
> > Each task would OPAL_MODEX_SEND() its own namespace ID, then
> > OPAL_MODEX_RECV() the one from its peers, and decide whether CMA
> > support can be enabled.
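
The namespace ID each task would send can presumably be taken from
/proc/self/ns/user (two processes are in the same user namespace exactly
when the device/inode pair of that file matches); a rough, untested sketch:

#include <stdint.h>
#include <sys/stat.h>

/* Return an identifier for the user namespace this process runs in:
 * the inode of /proc/self/ns/user, or 0 if it cannot be determined. */
static uint64_t my_user_ns_id(void)
{
    struct stat st;

    if (0 != stat("/proc/self/ns/user", &st)) {
        return 0;
    }
    return (uint64_t) st.st_ino;
}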
> > 
> > 
> > Cheers,
> > 
> > 
> > Gilles
> > 
> > On 7/22/2019 4:53 PM, Adrian Reber via users wrote:
> >> I had a look at it and not sure if it really makes sense.
> >> 
> >> In btl_vader_{put,get}.c it would be easy to check for the user
> >> namespace ID of the other process, but the function would then just
> >> return OPAL_ERROR a bit earlier instead of as a result of
> >> process_vm_{read,write}v(). Nothing would really change.
> >> 
> >> A better place for the check would be mca_btl_vader_check_single_copy(),
> >> but I do not know whether the PIDs of the other processes are already
> >> known at that point, so I am not sure I can check the user namespace IDs
> >> of the other processes there.
> >> 
> >> Any recommendations how to do this?
> >> 
> >>            Adrian
> >> 
> >> On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote:
> >>> Patches are always welcome. What would be great is a nice big warning 
> >>> that CMA support is disabled because the processes are on different 
> >>> namespaces. Ideally all MPI processes should be on the same namespace to 
> >>> ensure the best performance.
> >>> 
> >>> -Nathan
> >>> 
> >>>> On Jul 21, 2019, at 2:53 PM, Adrian Reber via users 
> >>>> <users@lists.open-mpi.org> wrote:
> >>>> 
> >>>> For completeness I am mentioning my results also here.
> >>>> 
> >>>> To be able to mount file systems in the container, user namespaces have
> >>>> to be used. And even if the user IDs are all the same (in each container
> >>>> and on the host), the kernel's ptrace permission check also requires the
> >>>> processes to be in the same user namespace (in addition to being owned
> >>>> by the same user). This check - same user namespace - fails, and so
> >>>> process_vm_readv() and process_vm_writev() fail as well.
> >>>> 
> >>>> So Open MPI's checks are currently not enough to detect whether 'cma'
> >>>> can be used. Checking for the same user namespace would also be necessary.
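
Such a check could presumably also be done from user space by comparing the
namespace files of the two PIDs; a minimal sketch, not tied to any particular
place in the vader code:

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* True only if both processes are known to share a user namespace.
 * stat() on another process' ns file may itself be refused, in which
 * case we conservatively report "different". */
static bool same_user_namespace(pid_t a, pid_t b)
{
    char path_a[64], path_b[64];
    struct stat st_a, st_b;

    snprintf(path_a, sizeof(path_a), "/proc/%d/ns/user", (int) a);
    snprintf(path_b, sizeof(path_b), "/proc/%d/ns/user", (int) b);
    if (0 != stat(path_a, &st_a) || 0 != stat(path_b, &st_b)) {
        return false;
    }
    return st_a.st_dev == st_b.st_dev && st_a.st_ino == st_b.st_ino;
}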
> >>>> 
> >>>> Is this a use case important enough to accept a patch for it?
> >>>> 
> >>>>        Adrian
> >>>> 
> >>>>> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote:
> >>>>> Gilles,
> >>>>> 
> >>>>> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps
> >>>>> indeed.
> >>>>> 
> >>>>> The default seems to be 'cma' and that seems to use process_vm_readv()
> >>>>> and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but
> >>>>> telling Podman to give the process CAP_SYS_PTRACE with 
> >>>>> '--cap-add=SYS_PTRACE'
> >>>>> does not seem to be enough. Not sure yet if this is related to the fact
> >>>>> that Podman is running rootless. I will continue to investigate, but now
> >>>>> I know where to look. Thanks!
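
For reference, a CMA transfer boils down to a single call along these lines;
when the ptrace permission check fails it returns -1 with errno set to EPERM.
A minimal sketch, not vader's actual code:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Copy len bytes from address 'remote_addr' in process 'pid' into buf. */
static ssize_t cma_read(pid_t pid, void *remote_addr, void *buf, size_t len)
{
    struct iovec local  = { .iov_base = buf,         .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
    ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);

    if (n < 0) {
        perror("process_vm_readv");   /* EPERM in the failing case */
    }
    return n;
}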
> >>>>> 
> >>>>>        Adrian
> >>>>> 
> >>>>>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via 
> >>>>>> users wrote:
> >>>>>> Adrian,
> >>>>>> 
> >>>>>> Can you try
> >>>>>> mpirun --mca btl_vader_copy_mechanism none ...
> >>>>>> 
> >>>>>> Please double check the MCA parameter name, I am AFK
> >>>>>> 
> >>>>>> IIRC, the default copy mechanism used by vader directly accesses the 
> >>>>>> remote process address space, and this requires some permission 
> >>>>>> (ptrace?) that might be dropped by podman.
> >>>>>> 
> >>>>>> Note Open MPI might not detect that both MPI tasks run on the same node 
> >>>>>> because of podman.
> >>>>>> If you use UCX, then btl/vader is not used at all (pml/ucx is used 
> >>>>>> instead)
> >>>>>> 
> >>>>>> 
> >>>>>> Cheers,
> >>>>>> 
> >>>>>> Gilles
> >>>>>> 
> >>>>>> Sent from my iPod
> >>>>>> 
> >>>>>>> On Jul 12, 2019, at 18:33, Adrian Reber via users 
> >>>>>>> <users@lists.open-mpi.org> wrote:
> >>>>>>> 
> >>>>>>> So upstream Podman was really fast and merged a PR which makes my
> >>>>>>> wrapper unnecessary:
> >>>>>>> 
> >>>>>>> Add support for --env-host : 
> >>>>>>> https://github.com/containers/libpod/pull/3557
> >>>>>>> 
> >>>>>>> As commented in the PR I can now start mpirun with Podman without a
> >>>>>>> wrapper:
> >>>>>>> 
> >>>>>>> $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun 
> >>>>>>> podman run --env-host --security-opt label=disable -v 
> >>>>>>> /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host 
> >>>>>>> mpi-test /home/mpi/ring
> >>>>>>> Rank 0 has cleared MPI_Init
> >>>>>>> Rank 1 has cleared MPI_Init
> >>>>>>> Rank 0 has completed ring
> >>>>>>> Rank 0 has completed MPI_Barrier
> >>>>>>> Rank 1 has completed ring
> >>>>>>> Rank 1 has completed MPI_Barrier
> >>>>>>> 
> >>>>>>> This example was using TCP; on an InfiniBand-based system I have to
> >>>>>>> map the InfiniBand devices into the container:
> >>>>>>> 
> >>>>>>> $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base 
> >>>>>>> /tmp/podman-mpirun podman run --env-host -v 
> >>>>>>> /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable 
> >>>>>>> --userns=keep-id --device /dev/infiniband/uverbs0 --device 
> >>>>>>> /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host 
> >>>>>>> mpi-test /home/mpi/ring
> >>>>>>> Rank 0 has cleared MPI_Init
> >>>>>>> Rank 1 has cleared MPI_Init
> >>>>>>> Rank 0 has completed ring
> >>>>>>> Rank 0 has completed MPI_Barrier
> >>>>>>> Rank 1 has completed ring
> >>>>>>> Rank 1 has completed MPI_Barrier
> >>>>>>> 
> >>>>>>> This is all running without root and only using Podman's rootless
> >>>>>>> support.
> >>>>>>> 
> >>>>>>> Running multiple processes on one system, however, still gives me an
> >>>>>>> error. If I disable vader, I guess Open MPI uses TCP for localhost
> >>>>>>> communication, and that works. With vader it fails.
> >>>>>>> 
> >>>>>>> The first error message I get is a segfault:
> >>>>>>> 
> >>>>>>> [test1:00001] *** Process received signal ***
> >>>>>>> [test1:00001] Signal: Segmentation fault (11)
> >>>>>>> [test1:00001] Signal code: Address not mapped (1)
> >>>>>>> [test1:00001] Failing at address: 0x7fb7b1552010
> >>>>>>> [test1:00001] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
> >>>>>>> [test1:00001] [ 1] 
> >>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
> >>>>>>> [test1:00001] [ 2] 
> >>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
> >>>>>>> [test1:00001] [ 3] 
> >>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
> >>>>>>> [test1:00001] [ 4] 
> >>>>>>> /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
> >>>>>>> [test1:00001] [ 5] /home/mpi/ring[0x400b76]
> >>>>>>> [test1:00001] [ 6] 
> >>>>>>> /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813]
> >>>>>>> [test1:00001] [ 7] /home/mpi/ring[0x4008be]
> >>>>>>> [test1:00001] *** End of error message ***
> >>>>>>> 
> >>>>>>> Guessing that vader uses shared memory, this is expected to fail with
> >>>>>>> all the namespace isolation in place (maybe not with a segfault, but
> >>>>>>> each container has its own shared memory). So the next step was to use
> >>>>>>> the host's IPC and PID namespaces and to mount /dev/shm:
> >>>>>>> 
> >>>>>>> '-v /dev/shm:/dev/shm --ipc=host --pid=host'
> >>>>>>> 
> >>>>>>> Which does not segfault, but still does not look correct:
> >>>>>>> 
> >>>>>>> Rank 0 has cleared MPI_Init
> >>>>>>> Rank 1 has cleared MPI_Init
> >>>>>>> Rank 2 has cleared MPI_Init
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>>>>> Rank 0 has completed ring
> >>>>>>> Rank 2 has completed ring
> >>>>>>> Rank 0 has completed MPI_Barrier
> >>>>>>> Rank 1 has completed ring
> >>>>>>> Rank 2 has completed MPI_Barrier
> >>>>>>> Rank 1 has completed MPI_Barrier
> >>>>>>> 
> >>>>>>> This is using the Open MPI ring.c example with SIZE increased from 20 
> >>>>>>> to 20000.
> >>>>>>> 
> >>>>>>> Any recommendations on what vader needs to communicate correctly?
> >>>>>>> 
> >>>>>>>       Adrian
> >>>>>>> 
> >>>>>>>> On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users 
> >>>>>>>> wrote:
> >>>>>>>> Gilles,
> >>>>>>>> 
> >>>>>>>> thanks for pointing out the environment variables. I quickly created 
> >>>>>>>> a
> >>>>>>>> wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables
> >>>>>>>> (grep "\(PMIX\|OMPI\)"). Now it works:
> >>>>>>>> 
> >>>>>>>> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id 
> >>>>>>>> --net=host mpi-test /home/mpi/hello
> >>>>>>>> 
> >>>>>>>> Hello, world (2 procs total)
> >>>>>>>>   --> Process #   0 of   2 is alive. ->test1
> >>>>>>>>   --> Process #   1 of   2 is alive. ->test2
> >>>>>>>> 
> >>>>>>>> I need to tell Podman to mount /tmp from the host into the container.
> >>>>>>>> As I am running rootless, I also need to tell Podman to use the same
> >>>>>>>> user ID inside the container as outside (so that the Open MPI files
> >>>>>>>> in /tmp can be shared), and I am also running without a network
> >>>>>>>> namespace.
> >>>>>>>> 
> >>>>>>>> So this is now with the full Podman-provided isolation except for the
> >>>>>>>> network namespace. Thanks for your help!
> >>>>>>>> 
> >>>>>>>>       Adrian
> >>>>>>>> 
> >>>>>>>>> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via 
> >>>>>>>>> users wrote:
> >>>>>>>>> Adrian,
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> the MPI application relies on some environment variables (they
> >>>>>>>>> typically start with OMPI_ and PMIX_).
> >>>>>>>>> 
> >>>>>>>>> The MPI application internally uses a PMIx client that must be able
> >>>>>>>>> to contact a PMIx server located on the same host (the server is
> >>>>>>>>> included in mpirun and in the orted daemon(s) spawned on the remote
> >>>>>>>>> hosts).
> >>>>>>>>> 
> >>>>>>>>> If podman provides some isolation between the app inside the
> >>>>>>>>> container (e.g. /home/mpi/hello) and the outside world (e.g.
> >>>>>>>>> mpirun/orted), that won't be an easy ride.
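
A quick way to see what mpirun/orted actually passes into the container is to
print those variables from inside it; a small illustration (the exact variable
names can differ between versions, these are the ones I would look for first):

#include <stdio.h>
#include <stdlib.h>

/* Print a few of the variables the launcher sets for each rank; if they
 * are missing inside the container, every process comes up as rank 0 of 1. */
int main(void)
{
    const char *vars[] = { "OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE",
                           "PMIX_RANK", "PMIX_NAMESPACE" };
    for (size_t i = 0; i < sizeof(vars) / sizeof(vars[0]); i++) {
        const char *v = getenv(vars[i]);
        printf("%s=%s\n", vars[i], v ? v : "(unset)");
    }
    return 0;
}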
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> Cheers,
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> Gilles
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>>> On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
> >>>>>>>>>> I did a quick test to see if I can use Podman in combination with 
> >>>>>>>>>> Open
> >>>>>>>>>> MPI:
> >>>>>>>>>> 
> >>>>>>>>>> [test@test1 ~]$ mpirun --hostfile ~/hosts podman run 
> >>>>>>>>>> quay.io/adrianreber/mpi-test /home/mpi/hello
> >>>>>>>>>> 
> >>>>>>>>>> Hello, world (1 procs total)
> >>>>>>>>>>    --> Process #   0 of   1 is alive. ->789b8fb622ef
> >>>>>>>>>> 
> >>>>>>>>>> Hello, world (1 procs total)
> >>>>>>>>>>    --> Process #   0 of   1 is alive. ->749eb4e1c01a
> >>>>>>>>>> 
> >>>>>>>>>> The test program (hello) is taken from 
> >>>>>>>>>> https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> The problem with this is that each process thinks it is process 0 
> >>>>>>>>>> of 1
> >>>>>>>>>> instead of
> >>>>>>>>>> 
> >>>>>>>>>> Hello, world (2 procs total)
> >>>>>>>>>>    --> Process #   1 of   2 is alive.  ->test1
> >>>>>>>>>>    --> Process #   0 of   2 is alive.  ->test2
> >>>>>>>>>> 
> >>>>>>>>>> My question is: how is the rank determined? What resources do I
> >>>>>>>>>> need to have in my container to correctly determine the rank?
> >>>>>>>>>> 
> >>>>>>>>>> This is Podman 1.4.2 and Open MPI 4.0.1.
> >>>>>>>>>> 
> >>>>>>>>>>       Adrian
> >>            Adrian
> >> 
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
