Re: [OMPI users] How is the rank determined (Open MPI and Podman)

2019-07-12 Thread Adrian Reber via users
So upstream Podman was really fast and merged a PR which makes my
wrapper unnecessary:

 Add support for --env-host : https://github.com/containers/libpod/pull/3557

As commented in the PR I can now start mpirun with Podman without a
wrapper:

$ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman 
run --env-host --security-opt label=disable -v 
/tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test 
/home/mpi/ring
Rank 0 has cleared MPI_Init
Rank 1 has cleared MPI_Init
Rank 0 has completed ring
Rank 0 has completed MPI_Barrier
Rank 1 has completed ring
Rank 1 has completed MPI_Barrier

This example was using TCP; on an InfiniBand-based system I have
to map the InfiniBand devices into the container:

$ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base 
/tmp/podman-mpirun podman run --env-host -v 
/tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable 
--userns=keep-id --device /dev/infiniband/uverbs0 --device 
/dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test 
/home/mpi/ring
Rank 0 has cleared MPI_Init
Rank 1 has cleared MPI_Init
Rank 0 has completed ring
Rank 0 has completed MPI_Barrier
Rank 1 has completed ring
Rank 1 has completed MPI_Barrier

This is all running without root and only using Podman's rootless
support.

Running multiple processes on one system, however, still gives me an
error. If I disable vader, I assume Open MPI falls back to TCP for
localhost communication, and that works. With vader it fails.

The first error message I get is a segfault:

[test1:1] *** Process received signal ***
[test1:1] Signal: Segmentation fault (11)
[test1:1] Signal code: Address not mapped (1)
[test1:1] Failing at address: 0x7fb7b1552010
[test1:1] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
[test1:1] [ 1] 
/usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
[test1:1] [ 2] 
/usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
[test1:1] [ 3] 
/usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
[test1:1] [ 4] 
/usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
[test1:1] [ 5] /home/mpi/ring[0x400b76]
[test1:1] [ 6] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813]
[test1:1] [ 7] /home/mpi/ring[0x4008be]
[test1:1] *** End of error message ***

Assuming that vader uses shared memory, this is expected to fail with
all the namespace isolation in place; maybe not with a segfault, but
each container has its own shared memory. So the next step was to use
the host's IPC and PID namespaces and to mount /dev/shm:

 '-v /dev/shm:/dev/shm --ipc=host --pid=host'

This does not segfault, but still does not look correct:

Rank 0 has cleared MPI_Init
Rank 1 has cleared MPI_Init
Rank 2 has cleared MPI_Init
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
Rank 0 has completed ring
Rank 2 has completed ring
Rank 0 has completed MPI_Barrier
Rank 1 has completed ring
Rank 2 has completed MPI_Barrier
Rank 1 has completed MPI_Barrier

This is using the Open MPI ring.c example with SIZE increased from 20 to 2.
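
For reference, here is a minimal sketch in C of this kind of ring test
(illustration only, not the exact ring.c used above): every rank passes an
integer token around the ring SIZE times and then synchronizes with
MPI_Barrier, printing the same kind of progress messages as in the output.

    /* Minimal ring sketch (illustration only, not the exact ring.c used above). */
    #include <mpi.h>
    #include <stdio.h>

    #define SIZE 20                     /* number of trips around the ring */

    int main(int argc, char *argv[])
    {
        int rank, nprocs, token = 0, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        printf("Rank %d has cleared MPI_Init\n", rank);

        int next = (rank + 1) % nprocs;
        int prev = (rank + nprocs - 1) % nprocs;

        /* rank 0 injects the token, every other rank forwards it */
        for (i = 0; i < SIZE; i++) {
            if (rank == 0) {
                MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
                MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
            }
        }
        printf("Rank %d has completed ring\n", rank);

        MPI_Barrier(MPI_COMM_WORLD);
        printf("Rank %d has completed MPI_Barrier\n", rank);

        MPI_Finalize();
        return 0;
    }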

Any recommendations on what vader needs in order to communicate correctly?

Adrian

On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users wrote:
> Gilles,
> 
> thanks for pointing out the environment variables. I quickly created a
> wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables
> (grep "\(PMIX\|OMPI\)"). Now it works:
> 
> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id 
> --net=host mpi-test /home/mpi/hello
> 
>  Hello, world (2 procs total)
> --> Process #   0 of   2 is alive. ->test1
> --> Process #   1 of   2 is alive. ->test2
> 
> I need to tell Podman to mount /tmp from the host into the container.
> As I am running rootless, I also need to tell Podman to use the same
> user ID in the container as outside (so that the Open MPI files in
> /tmp can be shared), and I am also running without a network namespace.
> 
> So this is now running with the full Podman-provided isolation except
> for the network namespace. Thanks for your help!
> 
>   Adrian
> 
> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via users wrote:
> > Adrian,
> > 
> > 
> > the MPI application relies on some environment variables (they typically
> > start with OMPI_ and PMIX_).
> > 
> > The MPI ap

Re: [OMPI users] How is the rank determined (Open MPI and Podman)

2019-07-12 Thread Gilles Gouaillardet via users
Adrian,

Can you try
mpirun --mca btl_vader_copy_mechanism none ...

Please double check the MCA parameter name, I am AFK

IIRC, the default copy mechanism used by vader directly accesses the remote 
process address space, and this requires some permission (ptrace?) that might 
be dropped by podman.

Note that Open MPI might not detect that both MPI tasks run on the same node
because of podman.
If you use UCX, then btl/vader is not used at all (pml/ucx is used instead).


Cheers,

Gilles

Sent from my iPod

> On Jul 12, 2019, at 18:33, Adrian Reber via users wrote:
> 
> [quoted text trimmed; duplicate of Adrian's message above]

Re: [OMPI users] How is the rank determined (Open MPI and Podman)

2019-07-12 Thread Adrian Reber via users
Gilles,

thanks again. Adding '--mca btl_vader_single_copy_mechanism none' does
indeed help.

The default seems to be 'cma', which uses process_vm_readv() and
process_vm_writev(). That seems to require CAP_SYS_PTRACE, but telling
Podman to give the process CAP_SYS_PTRACE with '--cap-add=SYS_PTRACE'
does not seem to be enough. Not sure yet if this is related to the fact
that Podman is running rootless. I will continue to investigate, but now
I know where to look. Thanks!
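
For what it's worth, CMA boils down to process_vm_readv()/process_vm_writev()
calls roughly like the sketch below (illustration only; the PID and the remote
address are placeholders, not anything vader actually uses). When the caller
lacks ptrace permission on the target process, the call returns -1 with
errno = EPERM (1), which would match the "Read -1, expected 8, errno = 1"
messages above.

    /* Sketch of a CMA-style read (illustration only; pid and address are
     * placeholders). Without ptrace permission on the target process this
     * fails with -1 / EPERM (errno 1). */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    int main(void)
    {
        pid_t target_pid = 12345;                 /* placeholder: peer rank's PID */
        char buf[8];
        struct iovec local  = { .iov_base = buf,
                                .iov_len  = sizeof buf };
        struct iovec remote = { .iov_base = (void *)0x400000,   /* placeholder */
                                .iov_len  = sizeof buf };

        ssize_t n = process_vm_readv(target_pid, &local, 1, &remote, 1, 0);
        if (n < 0)
            printf("Read %zd, expected %zu, errno = %d (%s)\n",
                   n, sizeof buf, errno, strerror(errno));
        return 0;
    }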

Adrian

On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users wrote:
> [quoted text trimmed; duplicate of Gilles' reply and Adrian's original message above]