Hi Craig,

You are experiencing problems because you don't have a transport installed
that UCX can use for oshmem.

You either need to buy a ConnectX-4/5 HCA from Mellanox (and maybe a
switch) and install it on your system, or else install XPMEM
(https://github.com/hjelmn/xpmem). Note that there is currently a bug
in UCX that you may hit if you go the XPMEM-only route:

https://github.com/open-mpi/ompi/issues/5083
and
https://github.com/openucx/ucx/issues/2588

If you are just running on a single node and want to experiment with the
OpenSHMEM programming model, and do not have Mellanox mlx5 equipment
installed on the node, you are much better off using SOS (Sandia
OpenSHMEM) over OFI libfabric:

https://github.com/Sandia-OpenSHMEM/SOS
https://github.com/ofiwg/libfabric/releases

For SOS you will need to install the hydra launcher as well:

http://www.mpich.org/downloads/

I really wish Google would do a better job of surfacing my responses to
this type of problem. I seem to respond to this exact problem on this
mailing list every couple of months.


Howard


2018-05-09 13:11 GMT-06:00 Craig Reese <cfre...@super.org>:

>
> I'm trying to play with oshmem on a single node (just to have a way to do
> some simple
> experimentation and playing around) and having spectacular problems:
>
> CentOS 6.9 (gcc 4.4.7)
> built and installed ucx 1.3.0
> built and installed openmpi-3.1.0
>
> [cfreese]$ cat oshmem.c
>
> #include <mpp/shmem.h>
> int
> main() {
>     shmem_init();
> }
>
> [cfreese]$ mpicc oshmem.c -loshmem
>
> [cfreese]$ shmemrun -np 2 ./a.out
>
> [ucs1l:30118] mca: base: components_register: registering framework spml
> components
> [ucs1l:30118] mca: base: components_register: found loaded component ucx
> [ucs1l:30119] mca: base: components_register: registering framework spml
> components
> [ucs1l:30119] mca: base: components_register: found loaded component ucx
> [ucs1l:30119] mca: base: components_register: component ucx register
> function successful
> [ucs1l:30118] mca: base: components_register: component ucx register
> function successful
> [ucs1l:30119] mca: base: components_open: opening spml components
> [ucs1l:30119] mca: base: components_open: found loaded component ucx
> [ucs1l:30118] mca: base: components_open: opening spml components
> [ucs1l:30118] mca: base: components_open: found loaded component ucx
> [ucs1l:30119] mca: base: components_open: component ucx open function
> successful
> [ucs1l:30118] mca: base: components_open: component ucx open function
> successful
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
> mca_spml_base_select() select: initializing spml component ucx
> [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173
> - mca_spml_ucx_component_init() in ucx, my priority is 21
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
> mca_spml_base_select() select: initializing spml component ucx
> [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173
> - mca_spml_ucx_component_init() in ucx, my priority is 21
> [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184
> - mca_spml_ucx_component_init() *** ucx initialized ****
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
> mca_spml_base_select() select: init returned priority 21
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
> mca_spml_base_select() selected ucx best priority 21
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
> mca_spml_base_select() select: component ucx selected
> [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
> mca_spml_ucx_enable() *** ucx ENABLED ****
> [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184
> - mca_spml_ucx_component_init() *** ucx initialized ****
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
> mca_spml_base_select() select: init returned priority 21
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
> mca_spml_base_select() selected ucx best priority 21
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
> mca_spml_base_select() select: component ucx selected
> [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
> mca_spml_ucx_enable() *** ucx ENABLED ****
>
> here's where I think the real issue is....
>
> [1525891910.424102] [ucs1l:30119:0]         select.c:316  UCX  ERROR no
> remote registered memory access transport to <no debug data>: mm/posix -
> Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0
> - no put short, self/self - Destination is unreachable
> [1525891910.424104] [ucs1l:30118:0]         select.c:316  UCX  ERROR no
> remote registered memory access transport to <no debug data>: mm/posix -
> Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0
> - no put short, self/self - Destination is unreachable
>
> [ucs1l:30119] Error ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:293 -
> mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is unreachable
> [ucs1l:30118] Error ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:293 -
> mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is unreachable
> *** glibc detected *** ./a.out: double free or corruption (!prev):
> 0x0000000000bb0f10 ***
> *** glibc detected *** ./a.out: double free or corruption (!prev):
> 0x0000000000f98ef0 ***
> ======= Backtrace: =========
> ======= Backtrace: =========
> /lib64/libc.so.6[0x338d875dee]
> /lib64/libc.so.6[0x338d875dee]
> /lib64/libc.so.6[0x338d878c80]
> /lib64/libc.so.6[0x338d878c80]
> /opt/openmpi-3.1.0/lib/liboshmem.so.40(mca_spml_ucx_add_procs+0x2dc)[
> 0x7fea58e4637c]
> /opt/openmpi-3.1.0/lib/liboshmem.so.40(mca_spml_ucx_add_procs+0x2dc)[
> 0x7f1dc261437c]
> /opt/openmpi-3.1.0/lib/liboshmem.so.40(oshmem_shmem_
> init+0x273)[0x7fea58e07833]
> /opt/openmpi-3.1.0/lib/liboshmem.so.40(oshmem_shmem_
> init+0x273)[0x7f1dc25d5833]
> /opt/openmpi-3.1.0/lib/liboshmem.so.40(pshmem_init+0x28)[0x7f1dc25d8438]
> ./a.out[0x40061d]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x338d81ed1d]
> ./a.out[0x400559]
> ======= Memory map: ========
> [ucs1l:30118] *** Process received signal ***
> [ucs1l:30118] Signal: Aborted (6)
> [ucs1l:30118] Signal code:  (-6)
>    .
>    .
>    .
>
> So it looks like UCX is found, but none of the underlying "transports"
> work. Futzing with
> ucx_info I do see posix, sysv, tcp, self is known within UCX...
>
> # Memory domain: posix
> #            component: posix
> #             allocate: unlimited
> #           remote key: 37 bytes
> #
> #   Transport: mm
> #
> #   Device: posix
> #
> #      capabilities:
> #            bandwidth: 6911.00 MB/sec
> #              latency: 80 nsec
> #             overhead: 10 nsec
> #            put_short: <= 4294967295
> #            put_bcopy: unlimited
> #            get_bcopy: unlimited
> #             am_short: <= 92
> #             am_bcopy: <= 8k
> #           atomic_add: 32, 64 bit, cpu
> #          atomic_fadd: 32, 64 bit, cpu
> #          atomic_swap: 32, 64 bit, cpu
> #         atomic_cswap: 32, 64 bit, cpu
> #           connection: to iface
> #             priority: 0
> #       device address: 8 bytes
> #        iface address: 16 bytes
> #       error handling: none
> # ...
>
> (with various futzing around with parameters I think I was able to get the
> UCX ucx_perftest to
> do something, so I'm not convinced it's completely a UCX fault).
>
> I'm guessing there's something simple that I'm missing to get oshmem/ucx
> configured, but I've been unable to find much
> with regard to setting up and using UCX so I'm hoping someone might be
> able to point me in the right direction.
>
> Thanks.
>
>
>
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
