I'm trying out oshmem on a single node (just to have a simple setup for
experimenting) and am running into spectacular problems:
CentOS 6.9 (gcc 4.4.7)
built and installed ucx 1.3.0
built and installed openmpi-3.1.0
[cfreese]$ cat oshmem.c
#include <mpp/shmem.h>

int main(void) {
    shmem_init();
    shmem_finalize();
    return 0;
}
[cfreese]$ mpicc oshmem.c -loshmem
[cfreese]$ shmemrun -np 2 ./a.out
[ucs1l:30118] mca: base: components_register: registering framework spml components
[ucs1l:30118] mca: base: components_register: found loaded component ucx
[ucs1l:30119] mca: base: components_register: registering framework spml components
[ucs1l:30119] mca: base: components_register: found loaded component ucx
[ucs1l:30119] mca: base: components_register: component ucx register function successful
[ucs1l:30118] mca: base: components_register: component ucx register function successful
[ucs1l:30119] mca: base: components_open: opening spml components
[ucs1l:30119] mca: base: components_open: found loaded component ucx
[ucs1l:30118] mca: base: components_open: opening spml components
[ucs1l:30118] mca: base: components_open: found loaded component ucx
[ucs1l:30119] mca: base: components_open: component ucx open function successful
[ucs1l:30118] mca: base: components_open: component ucx open function successful
[ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 - mca_spml_base_select() select: initializing spml component ucx
[ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173 - mca_spml_ucx_component_init() in ucx, my priority is 21
[ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 - mca_spml_base_select() select: initializing spml component ucx
[ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173 - mca_spml_ucx_component_init() in ucx, my priority is 21
[ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184 - mca_spml_ucx_component_init() *** ucx initialized ****
[ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 - mca_spml_base_select() select: init returned priority 21
[ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 - mca_spml_base_select() selected ucx best priority 21
[ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 - mca_spml_base_select() select: component ucx selected
[ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 - mca_spml_ucx_enable() *** ucx ENABLED ****
[ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184 - mca_spml_ucx_component_init() *** ucx initialized ****
[ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 - mca_spml_base_select() select: init returned priority 21
[ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 - mca_spml_base_select() selected ucx best priority 21
[ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 - mca_spml_base_select() select: component ucx selected
[ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 - mca_spml_ucx_enable() *** ucx ENABLED ****
Here's where I think the real issue is:
[1525891910.424102] [ucs1l:30119:0] select.c:316 UCX ERROR no remote registered memory access transport to <no debug data>: mm/posix - Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0 - no put short, self/self - Destination is unreachable
[1525891910.424104] [ucs1l:30118:0] select.c:316 UCX ERROR no remote registered memory access transport to <no debug data>: mm/posix - Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0 - no put short, self/self - Destination is unreachable
[ucs1l:30119] Error ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:293 - mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is unreachable
[ucs1l:30118] Error ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:293 - mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is unreachable
*** glibc detected *** ./a.out: double free or corruption (!prev): 0x0000000000bb0f10 ***
*** glibc detected *** ./a.out: double free or corruption (!prev): 0x0000000000f98ef0 ***
======= Backtrace: =========
======= Backtrace: =========
/lib64/libc.so.6[0x338d875dee]
/lib64/libc.so.6[0x338d875dee]
/lib64/libc.so.6[0x338d878c80]
/lib64/libc.so.6[0x338d878c80]
/opt/openmpi-3.1.0/lib/liboshmem.so.40(mca_spml_ucx_add_procs+0x2dc)[0x7fea58e4637c]
/opt/openmpi-3.1.0/lib/liboshmem.so.40(mca_spml_ucx_add_procs+0x2dc)[0x7f1dc261437c]
/opt/openmpi-3.1.0/lib/liboshmem.so.40(oshmem_shmem_init+0x273)[0x7fea58e07833]
/opt/openmpi-3.1.0/lib/liboshmem.so.40(oshmem_shmem_init+0x273)[0x7f1dc25d5833]
/opt/openmpi-3.1.0/lib/liboshmem.so.40(pshmem_init+0x28)[0x7f1dc25d8438]
./a.out[0x40061d]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x338d81ed1d]
./a.out[0x400559]
======= Memory map: ========
[ucs1l:30118] *** Process received signal ***
[ucs1l:30118] Signal: Aborted (6)
[ucs1l:30118] Signal code: (-6)
.
.
.
So it looks like UCX is found, but none of the underlying "transports"
work. Futzing with ucx_info, I do see that posix, sysv, tcp, and self
are known within UCX:
# Memory domain: posix
# component: posix
# allocate: unlimited
# remote key: 37 bytes
#
# Transport: mm
#
# Device: posix
#
# capabilities:
# bandwidth: 6911.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 92
# am_bcopy: <= 8k
# atomic_add: 32, 64 bit, cpu
# atomic_fadd: 32, 64 bit, cpu
# atomic_swap: 32, 64 bit, cpu
# atomic_cswap: 32, 64 bit, cpu
# connection: to iface
# priority: 0
# device address: 8 bytes
# iface address: 16 bytes
# error handling: none
# ...
(After various futzing around with parameters, I think I was able to get
UCX's ucx_perftest to do something, so I'm not convinced it's entirely
a UCX fault.)
I'm guessing there's something simple that I'm missing to get oshmem/ucx
configured, but I've been unable to find much documentation on setting
up and using UCX, so I'm hoping someone can point me in the right
direction.
Thanks.
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users