I'm trying to run oshmem on a single node (just for some simple
experimentation) and am having spectacular problems:

CentOS 6.9 (gcc 4.4.7)
built and installed ucx 1.3.0
built and installed openmpi-3.1.0

   [cfreese]$ cat oshmem.c

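   #include <mpp/shmem.h>
   int
   main(void) {
        shmem_init();      /* the failure happens inside this call */
        shmem_finalize();  /* release OpenSHMEM resources before exit */
        return 0;
   }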

   [cfreese]$ mpicc oshmem.c -loshmem

   [cfreese]$ shmemrun -np 2 ./a.out

   [ucs1l:30118] mca: base: components_register: registering framework
   spml components
   [ucs1l:30118] mca: base: components_register: found loaded component ucx
   [ucs1l:30119] mca: base: components_register: registering framework
   spml components
   [ucs1l:30119] mca: base: components_register: found loaded component ucx
   [ucs1l:30119] mca: base: components_register: component ucx register
   function successful
   [ucs1l:30118] mca: base: components_register: component ucx register
   function successful
   [ucs1l:30119] mca: base: components_open: opening spml components
   [ucs1l:30119] mca: base: components_open: found loaded component ucx
   [ucs1l:30118] mca: base: components_open: opening spml components
   [ucs1l:30118] mca: base: components_open: found loaded component ucx
   [ucs1l:30119] mca: base: components_open: component ucx open
   function successful
   [ucs1l:30118] mca: base: components_open: component ucx open
   function successful
   [ucs1l:30119]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
   mca_spml_base_select() select: initializing spml component ucx
   [ucs1l:30119]
   ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173 -
   mca_spml_ucx_component_init() in ucx, my priority is 21
   [ucs1l:30118]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
   mca_spml_base_select() select: initializing spml component ucx
   [ucs1l:30118]
   ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173 -
   mca_spml_ucx_component_init() in ucx, my priority is 21
   [ucs1l:30118]
   ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184 -
   mca_spml_ucx_component_init() *** ucx initialized ****
   [ucs1l:30118]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
   mca_spml_base_select() select: init returned priority 21
   [ucs1l:30118]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
   mca_spml_base_select() selected ucx best priority 21
   [ucs1l:30118]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
   mca_spml_base_select() select: component ucx selected
   [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
   mca_spml_ucx_enable() *** ucx ENABLED ****
   [ucs1l:30119]
   ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184 -
   mca_spml_ucx_component_init() *** ucx initialized ****
   [ucs1l:30119]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
   mca_spml_base_select() select: init returned priority 21
   [ucs1l:30119]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
   mca_spml_base_select() selected ucx best priority 21
   [ucs1l:30119]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
   mca_spml_base_select() select: component ucx selected
   [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
   mca_spml_ucx_enable() *** ucx ENABLED ****

Here's where I think the real issue is:

   [1525891910.424102] [ucs1l:30119:0] select.c:316  UCX  ERROR no
   remote registered memory access transport to <no debug data>:
   mm/posix - Destination is unreachable, mm/sysv - Destination is
   unreachable, tcp/eth0 - no put short, self/self - Destination is
   unreachable
   [1525891910.424104] [ucs1l:30118:0]         select.c:316  UCX ERROR
   no remote registered memory access transport to <no debug data>:
   mm/posix - Destination is unreachable, mm/sysv - Destination is
   unreachable, tcp/eth0 - no put short, self/self - Destination is
   unreachable

   [ucs1l:30119] Error
   ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:293 -
   mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is
   unreachable
   [ucs1l:30118] Error
   ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:293 -
   mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is
   unreachable
   *** glibc detected *** ./a.out: double free or corruption (!prev):
   0x0000000000bb0f10 ***
   *** glibc detected *** ./a.out: double free or corruption (!prev):
   0x0000000000f98ef0 ***
   ======= Backtrace: =========
   ======= Backtrace: =========
   /lib64/libc.so.6[0x338d875dee]
   /lib64/libc.so.6[0x338d875dee]
   /lib64/libc.so.6[0x338d878c80]
   /lib64/libc.so.6[0x338d878c80]
   
   /opt/openmpi-3.1.0/lib/liboshmem.so.40(mca_spml_ucx_add_procs+0x2dc)[0x7fea58e4637c]
   /opt/openmpi-3.1.0/lib/liboshmem.so.40(mca_spml_ucx_add_procs+0x2dc)[0x7f1dc261437c]
   /opt/openmpi-3.1.0/lib/liboshmem.so.40(oshmem_shmem_init+0x273)[0x7fea58e07833]
   /opt/openmpi-3.1.0/lib/liboshmem.so.40(oshmem_shmem_init+0x273)[0x7f1dc25d5833]
   /opt/openmpi-3.1.0/lib/liboshmem.so.40(pshmem_init+0x28)[0x7f1dc25d8438]
   ./a.out[0x40061d]
   /lib64/libc.so.6(__libc_start_main+0xfd)[0x338d81ed1d]
   ./a.out[0x400559]
   ======= Memory map: ========
   [ucs1l:30118] *** Process received signal ***
   [ucs1l:30118] Signal: Aborted (6)
   [ucs1l:30118] Signal code:  (-6)
       .
       .
       .

So it looks like UCX is found, but none of the underlying "transports" work. Poking around
with ucx_info, I do see that posix, sysv, tcp, and self are known to UCX:

   # Memory domain: posix
   #            component: posix
   #             allocate: unlimited
   #           remote key: 37 bytes
   #
   #   Transport: mm
   #
   #   Device: posix
   #
   #      capabilities:
   #            bandwidth: 6911.00 MB/sec
   #              latency: 80 nsec
   #             overhead: 10 nsec
   #            put_short: <= 4294967295
   #            put_bcopy: unlimited
   #            get_bcopy: unlimited
   #             am_short: <= 92
   #             am_bcopy: <= 8k
   #           atomic_add: 32, 64 bit, cpu
   #          atomic_fadd: 32, 64 bit, cpu
   #          atomic_swap: 32, 64 bit, cpu
   #         atomic_cswap: 32, 64 bit, cpu
   #           connection: to iface
   #             priority: 0
   #       device address: 8 bytes
   #        iface address: 16 bytes
   #       error handling: none
   # ...

(With some experimenting with parameters I think I was able to get UCX's ucx_perftest to
do something, so I'm not convinced this is entirely a UCX fault.)
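
To make the UCX error above easier to read, here is a rough sketch in C that splits the list following "<no debug data>:" into one transport/reason pair per line. The helper name print_transport_failures is made up for illustration and is not part of UCX or Open MPI; it also assumes the message has already been joined onto a single line and fits in 512 bytes.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a UCX or Open MPI function).
 * Takes the comma-separated "transport - reason" list from the UCX error
 * and prints one entry per line. Returns the number of entries found. */
static int print_transport_failures(const char *msg, FILE *out)
{
    char buf[512];
    char *entry, *next, *sep;
    int count = 0;

    /* Work on a local copy because the string is modified while splitting. */
    strncpy(buf, msg, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    entry = buf;
    while (entry != NULL && *entry != '\0') {
        /* Entries are separated by ", ". */
        next = strstr(entry, ", ");
        if (next != NULL) {
            *next = '\0';
            next += 2;
        }
        /* Within an entry, " - " separates the transport from the reason. */
        sep = strstr(entry, " - ");
        if (sep != NULL) {
            *sep = '\0';
            fprintf(out, "%-10s %s\n", entry, sep + 3);
            count++;
        }
        entry = next;
    }
    return count;
}
```

Calling it with the list from the error and stdout gives four entries: mm/posix, mm/sysv, and self/self are all "Destination is unreachable", while tcp/eth0 is the only one rejected for a different reason ("no put short").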

I'm guessing there's something simple that I'm missing to get oshmem/ucx configured, but I've been unable to find much about setting up and using UCX, so I'm hoping someone can point me in the right direction.

Thanks.


