Just add it to the existing modex.

-Nathan

> On Jul 22, 2019, at 12:20 PM, Adrian Reber via users 
> <users@lists.open-mpi.org> wrote:
> 
> I have most of the code ready, but I still have troubles doing
> OPAL_MODEX_RECV. I am using the following lines, based on the code from
> orte/test/mpi/pmix.c:
> 
> OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", &value, OPAL_INT);
> 
> This sets rc to 0. For receiving:
> 
> OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", &wildcard_rank, &ptr, OPAL_INT);
> 
> and rc is always set to -13 (OPAL_ERR_NOT_FOUND, I think). Is this how it is
> supposed to work, or do I have to do it differently?
> 
>        Adrian
> 
>> On Mon, Jul 22, 2019 at 02:03:20PM +0000, Ralph Castain via users wrote:
>> If that works, then it might be possible to include the namespace ID in the 
>> job-info provided by PMIx at startup - I would have to investigate, so please 
>> confirm that the modex option works first.
>> 
>>> On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users 
>>> <users@lists.open-mpi.org> wrote:
>>> 
>>> Adrian,
>>> 
>>> 
>>> An option is to involve the modex.
>>> 
>>> Each task would OPAL_MODEX_SEND() its own namespace ID, and then 
>>> OPAL_MODEX_RECV() the one from its peers, and decide whether CMA support 
>>> can be enabled.
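>>> 
>>> Roughly along these lines (just a sketch, untested; the key name, the 
>>> header path and the get_my_user_ns_id() helper are placeholders - see the 
>>> OPAL_MODEX_* macros in opal/mca/pmix/pmix.h for the exact semantics):
>>> 
>>> #include "opal/mca/pmix/pmix.h"   /* OPAL_MODEX_SEND_VALUE/RECV_VALUE */
>>> 
>>> static int check_cma_allowed(opal_process_name_t peer)
>>> {
>>>     int rc;
>>>     int my_ns_id = get_my_user_ns_id();   /* placeholder helper */
>>>     int peer_ns = -1, *peer_ns_id = &peer_ns;
>>> 
>>>     /* publish my own user namespace ID; in real code this would be done
>>>        once during component init, not per peer */
>>>     OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id",
>>>                           &my_ns_id, OPAL_INT);
>>>     if (OPAL_SUCCESS != rc) {
>>>         return rc;
>>>     }
>>> 
>>>     /* fetch the peer's value and compare it with my own */
>>>     OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", &peer,
>>>                           (void **)&peer_ns_id, OPAL_INT);
>>>     if (OPAL_SUCCESS != rc) {
>>>         return rc;
>>>     }
>>>     return (*peer_ns_id == my_ns_id) ? OPAL_SUCCESS : OPAL_ERR_NOT_SUPPORTED;
>>> }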
>>> 
>>> 
>>> Cheers,
>>> 
>>> 
>>> Gilles
>>> 
>>>> On 7/22/2019 4:53 PM, Adrian Reber via users wrote:
>>>> I had a look at it and I am not sure if it really makes sense.
>>>> 
>>>> In btl_vader_{put,get}.c it would be easy to check for the user
>>>> namespace ID of the other process, but the function would then just
>>>> return OPAL_ERROR a bit earlier instead of as a result of
>>>> process_vm_{read,write}v(). Nothing would really change.
>>>> 
>>>> A better place for the check would be mca_btl_vader_check_single_copy()
>>>> but I do not know whether the PIDs of the other processes are already
>>>> known at that point, so I am not sure whether I can check the user
>>>> namespace IDs of the other processes there.
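>>>> 
>>>> For the namespace comparison itself I was thinking of something like the
>>>> following (just a sketch; it assumes the peer's PID is known and its /proc
>>>> entry is visible, which is exactly the part I am unsure about):
>>>> 
>>>> #include <stdbool.h>
>>>> #include <stdio.h>
>>>> #include <sys/stat.h>
>>>> #include <sys/types.h>
>>>> 
>>>> /* Two processes are in the same user namespace exactly when their
>>>>    /proc/<pid>/ns/user links refer to the same inode on the same device,
>>>>    see namespaces(7). */
>>>> static bool same_user_ns(pid_t pid_a, pid_t pid_b)
>>>> {
>>>>     char path_a[64], path_b[64];
>>>>     struct stat st_a, st_b;
>>>> 
>>>>     snprintf(path_a, sizeof(path_a), "/proc/%d/ns/user", (int)pid_a);
>>>>     snprintf(path_b, sizeof(path_b), "/proc/%d/ns/user", (int)pid_b);
>>>>     if (0 != stat(path_a, &st_a) || 0 != stat(path_b, &st_b)) {
>>>>         return false;   /* cannot tell, so treat CMA as not usable */
>>>>     }
>>>>     return st_a.st_dev == st_b.st_dev && st_a.st_ino == st_b.st_ino;
>>>> }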
>>>> 
>>>> Any recommendations on how to do this?
>>>> 
>>>>        Adrian
>>>> 
>>>>> On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote:
>>>>> Patches are always welcome. What would be great is a nice big warning 
>>>>> that CMA support is disabled because the processes are in different 
>>>>> user namespaces. Ideally all MPI processes should be in the same user 
>>>>> namespace to ensure the best performance.
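>>>>> 
>>>>> Something along these lines is what I have in mind (a sketch only - the
>>>>> help file name, the topic and the flag holding the result of the new
>>>>> check are made up here):
>>>>> 
>>>>> #include "opal/util/show_help.h"
>>>>> 
>>>>> if (!procs_share_user_ns) {
>>>>>     /* big, user-visible warning instead of a silent failure */
>>>>>     opal_show_help("help-btl-vader.txt", "cma-across-user-namespaces",
>>>>>                    true);
>>>>>     /* ...and then behave as if btl_vader_single_copy_mechanism=none */
>>>>> }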
>>>>> 
>>>>> -Nathan
>>>>> 
>>>>>> On Jul 21, 2019, at 2:53 PM, Adrian Reber via users 
>>>>>> <users@lists.open-mpi.org> wrote:
>>>>>> 
>>>>>> For completeness I am mentioning my results also here.
>>>>>> 
>>>>>> To be able to mount file systems in the container, user namespaces have
>>>>>> to be used. Even if the user IDs are all the same (in each container and
>>>>>> on the host), the kernel's ptrace access check also requires the
>>>>>> processes to be in the same user namespace (in addition to being owned
>>>>>> by the same user). This check - same user namespace - fails, and so
>>>>>> process_vm_readv() and process_vm_writev() fail as well.
>>>>>> 
>>>>>> So Open MPI's checks are currently not enough to detect whether 'cma'
>>>>>> can be used. Checking for the same user namespace would also be
>>>>>> necessary.
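>>>>>> 
>>>>>> This is easy to reproduce outside of Open MPI (a quick sketch, not the
>>>>>> code from the BTL: run one copy in each container and pass it the PID of
>>>>>> the other one; the remote address is an arbitrary value because across
>>>>>> user namespaces the permission check fails before any address is used):
>>>>>> 
>>>>>> #define _GNU_SOURCE
>>>>>> #include <errno.h>
>>>>>> #include <stdio.h>
>>>>>> #include <stdlib.h>
>>>>>> #include <string.h>
>>>>>> #include <sys/types.h>
>>>>>> #include <sys/uio.h>
>>>>>> 
>>>>>> int main(int argc, char *argv[])
>>>>>> {
>>>>>>     char buf[16];
>>>>>>     struct iovec local  = { .iov_base = buf, .iov_len = sizeof(buf) };
>>>>>>     struct iovec remote = { .iov_base = (void *)0x400000,
>>>>>>                             .iov_len  = sizeof(buf) };
>>>>>> 
>>>>>>     if (argc < 2) {
>>>>>>         fprintf(stderr, "usage: %s <pid>\n", argv[0]);
>>>>>>         return 1;
>>>>>>     }
>>>>>>     if (process_vm_readv((pid_t)atoi(argv[1]), &local, 1,
>>>>>>                          &remote, 1, 0) < 0) {
>>>>>>         /* across user namespaces this fails with EPERM, exactly like
>>>>>>            the CMA path in btl/vader */
>>>>>>         printf("process_vm_readv: %s\n", strerror(errno));
>>>>>>     }
>>>>>>     return 0;
>>>>>> }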
>>>>>> 
>>>>>> Is this a use case important enough to accept a patch for it?
>>>>>> 
>>>>>>       Adrian
>>>>>> 
>>>>>>> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote:
>>>>>>> Gilles,
>>>>>>> 
>>>>>>> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps
>>>>>>> indeed.
>>>>>>> 
>>>>>>> The default seems to be 'cma' and that seems to use process_vm_readv()
>>>>>>> and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but
>>>>>>> telling Podman to give the process CAP_SYS_PTRACE with 
>>>>>>> '--cap-add=SYS_PTRACE'
>>>>>>> does not seem to be enough. Not sure yet if this is related to the fact
>>>>>>> that Podman is running rootless. I will continue to investigate, but now
>>>>>>> I know where to look. Thanks!
>>>>>>> 
>>>>>>>       Adrian
>>>>>>> 
>>>>>>>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via 
>>>>>>>> users wrote:
>>>>>>>> Adrian,
>>>>>>>> 
>>>>>>>> Can you try
>>>>>>>> mpirun --mca btl_vader_copy_mechanism none ...
>>>>>>>> 
>>>>>>>> Please double check the MCA parameter name, I am AFK
>>>>>>>> 
>>>>>>>> IIRC, the default copy mechanism used by vader directly accesses the 
>>>>>>>> remote process address space, and this requires some permission 
>>>>>>>> (ptrace?) that might be dropped by podman.
>>>>>>>> 
>>>>>>>> Note that Open MPI might not detect that both MPI tasks run on the same 
>>>>>>>> node because of podman.
>>>>>>>> If you use UCX, then btl/vader is not used at all (pml/ucx is used 
>>>>>>>> instead).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> 
>>>>>>>> Gilles
>>>>>>>> 
>>>>>>>> Sent from my iPod
>>>>>>>> 
>>>>>>>>> On Jul 12, 2019, at 18:33, Adrian Reber via users 
>>>>>>>>> <users@lists.open-mpi.org> wrote:
>>>>>>>>> 
>>>>>>>>> So upstream Podman was really fast and merged a PR which makes my
>>>>>>>>> wrapper unnecessary:
>>>>>>>>> 
>>>>>>>>> Add support for --env-host : 
>>>>>>>>> https://github.com/containers/libpod/pull/3557
>>>>>>>>> 
>>>>>>>>> As commented in the PR I can now start mpirun with Podman without a
>>>>>>>>> wrapper:
>>>>>>>>> 
>>>>>>>>> $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun 
>>>>>>>>> podman run --env-host --security-opt label=disable -v 
>>>>>>>>> /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host 
>>>>>>>>> mpi-test /home/mpi/ring
>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>> Rank 0 has completed ring
>>>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>>>> Rank 1 has completed ring
>>>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>>>> 
>>>>>>>>> This example was using TCP; on an InfiniBand based system I have
>>>>>>>>> to map the InfiniBand devices into the container.
>>>>>>>>> 
>>>>>>>>> $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base 
>>>>>>>>> /tmp/podman-mpirun podman run --env-host -v 
>>>>>>>>> /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable 
>>>>>>>>> --userns=keep-id --device /dev/infiniband/uverbs0 --device 
>>>>>>>>> /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host 
>>>>>>>>> mpi-test /home/mpi/ring
>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>> Rank 0 has completed ring
>>>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>>>> Rank 1 has completed ring
>>>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>>>> 
>>>>>>>>> This is all running without root and only using Podman's rootless
>>>>>>>>> support.
>>>>>>>>> 
>>>>>>>>> Running multiple processes on one system, however, still gives me an
>>>>>>>>> error. If I disable vader, I guess Open MPI uses TCP for localhost
>>>>>>>>> communication, and that works. But with vader it fails.
>>>>>>>>> 
>>>>>>>>> The first error message I get is a segfault:
>>>>>>>>> 
>>>>>>>>> [test1:00001] *** Process received signal ***
>>>>>>>>> [test1:00001] Signal: Segmentation fault (11)
>>>>>>>>> [test1:00001] Signal code: Address not mapped (1)
>>>>>>>>> [test1:00001] Failing at address: 0x7fb7b1552010
>>>>>>>>> [test1:00001] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
>>>>>>>>> [test1:00001] [ 1] 
>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
>>>>>>>>> [test1:00001] [ 2] 
>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
>>>>>>>>> [test1:00001] [ 3] 
>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
>>>>>>>>> [test1:00001] [ 4] 
>>>>>>>>> /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
>>>>>>>>> [test1:00001] [ 5] /home/mpi/ring[0x400b76]
>>>>>>>>> [test1:00001] [ 6] 
>>>>>>>>> /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813]
>>>>>>>>> [test1:00001] [ 7] /home/mpi/ring[0x4008be]
>>>>>>>>> [test1:00001] *** End of error message ***
>>>>>>>>> 
>>>>>>>>> Guessing that vader uses shared memory, this is expected to fail with
>>>>>>>>> all the namespace isolations in place - maybe not with a segfault, but
>>>>>>>>> each container has its own shared memory. So the next step was to use
>>>>>>>>> the host's IPC and PID namespaces and mount /dev/shm:
>>>>>>>>> 
>>>>>>>>> '-v /dev/shm:/dev/shm --ipc=host --pid=host'
>>>>>>>>> 
>>>>>>>>> Which does not segfault, but still does not look correct:
>>>>>>>>> 
>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>> Rank 2 has cleared MPI_Init
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>> Rank 0 has completed ring
>>>>>>>>> Rank 2 has completed ring
>>>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>>>> Rank 1 has completed ring
>>>>>>>>> Rank 2 has completed MPI_Barrier
>>>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>>>> 
>>>>>>>>> This is using the Open MPI ring.c example with SIZE increased from 20 
>>>>>>>>> to 20000.
>>>>>>>>> 
>>>>>>>>> Any recommendations on what vader needs in order to communicate correctly?
>>>>>>>>> 
>>>>>>>>>      Adrian
>>>>>>>>> 
>>>>>>>>>> On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users 
>>>>>>>>>> wrote:
>>>>>>>>>> Gilles,
>>>>>>>>>> 
>>>>>>>>>> thanks for pointing out the environment variables. I quickly created
>>>>>>>>>> a wrapper which tells Podman to re-export all OMPI_ and PMIX_
>>>>>>>>>> variables (grep "\(PMIX\|OMPI\)"). Now it works:
>>>>>>>>>> 
>>>>>>>>>> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id 
>>>>>>>>>> --net=host mpi-test /home/mpi/hello
>>>>>>>>>> 
>>>>>>>>>> Hello, world (2 procs total)
>>>>>>>>>>  --> Process #   0 of   2 is alive. ->test1
>>>>>>>>>>  --> Process #   1 of   2 is alive. ->test2
>>>>>>>>>> 
>>>>>>>>>> I need to tell Podman to mount /tmp from the host into the container.
>>>>>>>>>> As I am running rootless, I also need to tell Podman to use the same
>>>>>>>>>> user ID in the container as outside (so that the Open MPI files in
>>>>>>>>>> /tmp can be shared), and I am also running without a network
>>>>>>>>>> namespace.
>>>>>>>>>> 
>>>>>>>>>> So this is now with the full Podman-provided isolation except for the
>>>>>>>>>> network namespace. Thanks for your help!
>>>>>>>>>> 
>>>>>>>>>>      Adrian
>>>>>>>>>> 
>>>>>>>>>>> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via 
>>>>>>>>>>> users wrote:
>>>>>>>>>>> Adrian,
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> The MPI application relies on some environment variables (they
>>>>>>>>>>> typically start with OMPI_ and PMIX_).
>>>>>>>>>>> 
>>>>>>>>>>> The MPI application internally uses a PMIx client that must be able
>>>>>>>>>>> to contact a PMIx server located on the same host (the server is
>>>>>>>>>>> included in mpirun and in the orted daemon(s) spawned on the remote
>>>>>>>>>>> hosts).
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> If podman provides some isolation between the app inside the
>>>>>>>>>>> container (e.g. /home/mpi/hello) and the outside world (e.g.
>>>>>>>>>>> mpirun/orted), that won't be an easy ride.
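>>>>>>>>>>> 
>>>>>>>>>>> A quick way to see whether that environment makes it into the
>>>>>>>>>>> container is to run a tiny check program instead of the MPI binary
>>>>>>>>>>> (just a sketch; the exact variable names differ between Open MPI and
>>>>>>>>>>> PMIx versions):
>>>>>>>>>>> 
>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>> #include <stdlib.h>
>>>>>>>>>>> 
>>>>>>>>>>> int main(void)
>>>>>>>>>>> {
>>>>>>>>>>>     /* a few of the variables mpirun/orted set for each rank */
>>>>>>>>>>>     const char *vars[] = { "OMPI_COMM_WORLD_RANK",
>>>>>>>>>>>                            "OMPI_COMM_WORLD_SIZE",
>>>>>>>>>>>                            "PMIX_RANK", "PMIX_NAMESPACE" };
>>>>>>>>>>>     for (size_t i = 0; i < sizeof(vars) / sizeof(vars[0]); i++) {
>>>>>>>>>>>         const char *v = getenv(vars[i]);
>>>>>>>>>>>         printf("%s=%s\n", vars[i], v ? v : "(not set)");
>>>>>>>>>>>     }
>>>>>>>>>>>     return 0;
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> If these are not set inside the container, the PMIx client cannot
>>>>>>>>>>> find its server and each process will come up as a singleton
>>>>>>>>>>> (1 proc total).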
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Gilles
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
>>>>>>>>>>>> I did a quick test to see if I can use Podman in combination with
>>>>>>>>>>>> Open MPI:
>>>>>>>>>>>> 
>>>>>>>>>>>> [test@test1 ~]$ mpirun --hostfile ~/hosts podman run 
>>>>>>>>>>>> quay.io/adrianreber/mpi-test /home/mpi/hello
>>>>>>>>>>>> 
>>>>>>>>>>>> Hello, world (1 procs total)
>>>>>>>>>>>>   --> Process #   0 of   1 is alive. ->789b8fb622ef
>>>>>>>>>>>> 
>>>>>>>>>>>> Hello, world (1 procs total)
>>>>>>>>>>>>   --> Process #   0 of   1 is alive. ->749eb4e1c01a
>>>>>>>>>>>> 
>>>>>>>>>>>> The test program (hello) is taken from 
>>>>>>>>>>>> https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> The problem with this is that each process thinks it is process 0
>>>>>>>>>>>> of 1 instead of
>>>>>>>>>>>> 
>>>>>>>>>>>> Hello, world (2 procs total)
>>>>>>>>>>>>   --> Process #   1 of   2 is alive.  ->test1
>>>>>>>>>>>>   --> Process #   0 of   2 is alive.  ->test2
>>>>>>>>>>>> 
>>>>>>>>>>>> My question is: how is the rank determined? What resources do I
>>>>>>>>>>>> need to have in my container to correctly determine the rank?
>>>>>>>>>>>> 
>>>>>>>>>>>> This is Podman 1.4.2 and Open MPI 4.0.1.
>>>>>>>>>>>> 
>>>>>>>>>>>>      Adrian
