Progress! I can run my trivial program on the local Phi, but not on the
other Phi in the system. Here are the interesting parts.

A pretty good recipe with last night's nightly master:
$ ./configure --prefix=/home/ariebs/mic/mpi-nightly \
    CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
    LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --enable-orterun-prefix-by-default \
    --disable-io-romio --disable-mpi-fortran \
    --enable-debug
$ make && make install
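A quick sanity check of the cross-build may be worth running right after
the install (a sketch; readelf is the stock binutils tool, and the orted
path follows from the prefix above):

$ file /home/ariebs/mic/mpi-nightly/bin/orted
    # should report: ELF 64-bit LSB executable, Intel Xeon Phi
    # coprocessor (k1om)
$ readelf -d /home/ariebs/mic/mpi-nightly/bin/orted | grep NEEDED
    # lists the shared libraries orted needs at run time; with an icc
    # build, expect Intel runtime libraries such as libimf.so among them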
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 \
    --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 \
    --mca spml yoda --mca btl openib,sm,self $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$
However, I can't seem to cross the fabric, even though I can ssh freely
back and forth between mic0 and mic1. Running the next two tests from
mic0, it certainly seems like the second one should work, too:
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 \
    --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 \
    --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
libraries: libimf.so: cannot open shared object file: No such file
or directory
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
...
$
(Note that I get the same results with "--mca btl openib,sm,self"....)
$ ssh mic1 file /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF
64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1
(SYSV), dynamically linked, not stripped
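Two more checks that might localize this (a sketch; ldd assumes the
card's Linux image provides it):

$ ssh mic1 'echo $LD_LIBRARY_PATH'
    # non-interactive ssh often skips the startup files that source
    # compilervars.sh, so this may come back empty even though an
    # interactive login works fine
$ ssh mic1 'ldd /home/ariebs/mic/mpi-nightly/bin/orted'
    # shows which dependencies resolve in exactly the environment
    # that orted will be started in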
$ shmemrun \
    -x LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so \
    -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
libraries: libimf.so: cannot open shared object file: No such file
or directory
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
Following here is
- IB information
- the failing case, run with lots of debugging information

(As you might imagine, I've tried 17 ways from Sunday to ensure that
libimf.so is found.)
$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx4_0              24be05ffffa57160
    scif0               4c79bafffe4402b6
$ ibv_devinfo
hca_id: mlx4_0
    transport:          InfiniBand (0)
    fw_ver:             2.11.1250
    node_guid:          24be:05ff:ffa5:7160
    sys_image_guid:     24be:05ff:ffa5:7163
    vendor_id:          0x02c9
    vendor_part_id:     4099
    hw_ver:             0x0
    phys_port_cnt:      2
    port: 1
        state:          PORT_ACTIVE (4)
        max_mtu:        2048 (4)
        active_mtu:     2048 (4)
        sm_lid:         8
        port_lid:       86
        port_lmc:       0x00
        link_layer:     InfiniBand
    port: 2
        state:          PORT_DOWN (1)
        max_mtu:        2048 (4)
        active_mtu:     2048 (4)
        sm_lid:         0
        port_lid:       0
        port_lmc:       0x00
        link_layer:     InfiniBand
hca_id: scif0
    transport:          SCIF (2)
    fw_ver:             0.0.1
    node_guid:          4c79:baff:fe44:02b6
    sys_image_guid:     4c79:baff:fe44:02b6
    vendor_id:          0x8086
    vendor_part_id:     0
    hw_ver:             0x1
    phys_port_cnt:      1
    port: 1
        state:          PORT_ACTIVE (4)
        max_mtu:        4096 (5)
        active_mtu:     4096 (5)
        sm_lid:         1
        port_lid:       1001
        port_lmc:       0x00
        link_layer:     SCIF
$ shmemrun \
    -x LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so \
    -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp \
    --mca plm_base_verbose 5 --mca memheap_base_verbose 100 $PWD/mic.out
[atl1-01-mic0:191024] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:191024] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:191024] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:191024] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:191024] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:191024] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:191024] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:191024] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:191024] plm:base:set_hnp_name: initial bias 191024 nodename hash 4121194178
[atl1-01-mic0:191024] plm:base:set_hnp_name: final jobfam 29012
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:191024] [[29012,0],0] plm:base:receive start comm
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_job
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:191024] [[29012,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:191024] [[29012,0],0] using dash_host
[atl1-01-mic0:191024] [[29012,0],0] checking node mic1
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm add new daemon [[29012,0],1]
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm assigning new daemon [[29012,0],1] to node mic1
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: launching vm
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: local shell: 0 (bash)
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: assuming same remote shell as local shell
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: remote shell: 0 (bash)
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: final template argv:
    /usr/bin/ssh <template>
        PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ;
        LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
        DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;
        /home/ariebs/mic/mpi-nightly/bin/orted --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om
        -mca ess "env" -mca orte_ess_jobid "1901330432" -mca orte_ess_vpid "<template>"
        -mca orte_ess_num_procs "2"
        -mca orte_hnp_uri "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
        --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp"
        --mca plm_base_verbose "5" --mca memheap_base_verbose "100"
        -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh:launch daemon 0 not a child of mine
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: adding node mic1 to launch list
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: activating launch event
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: recording launch of daemon [[29012,0],1]
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: executing: (/usr/bin/ssh)
    [/usr/bin/ssh mic1
        PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ;
        LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
        DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;
        /home/ariebs/mic/mpi-nightly/bin/orted --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om
        -mca ess "env" -mca orte_ess_jobid "1901330432" -mca orte_ess_vpid 1
        -mca orte_ess_num_procs "2"
        -mca orte_hnp_uri "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
        --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp"
        --mca plm_base_verbose "5" --mca memheap_base_verbose "100"
        -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"]
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
libraries: libimf.so: cannot open shared object file: No such file
or directory
[atl1-01-mic0:191024] [[29012,0],0] daemon 1 failed with status 127
[atl1-01-mic0:191024] [[29012,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
[atl1-01-mic0:191024] [[29012,0],0] plm:base:receive stop comm
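One thing the rsh template above makes visible: the launcher prepends only
/home/ariebs/mic/mpi-nightly/lib to LD_LIBRARY_PATH when it starts the
remote orted, and the LD_PRELOAD passed via -x never appears in that ssh
command line (-x is forwarded to the application processes, not to the
daemon, as far as I can tell). An untested workaround consistent with that
template is simply to put libimf.so where the daemon's LD_LIBRARY_PATH
already points:

$ ssh mic1 'ln -s /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so \
    /home/ariebs/mic/mpi-nightly/lib/libimf.so'
    # a copy works too, if the Intel tree isn't mounted on the card;
    # the file(1) check above suggests it is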
On 04/13/2015 08:50 AM, Andy Riebs wrote:
Hi Ralph,
Here are the results with last night's "master" nightly,
openmpi-dev-1487-g9c6d452.tar.bz2, and adding the
memheap_base_verbose option (yes, it looks like the "ERROR_LOG"
problem has gone away):
$ cat /proc/sys/kernel/shmmax
33554432
$ cat /proc/sys/kernel/shmall
2097152
$ cat /proc/sys/kernel/shmmni
4096
$ export SHMEM_SYMMETRIC_HEAP=1M
$ shmemrun -H localhost -N 2 --mca sshmem mmap \
    --mca plm_base_verbose 5 --mca memheap_base_verbose 100 $PWD/mic.out
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:190439] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:190439] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:190439] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:190439] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439 nodename hash 4121194178
[atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875
[atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:190439] [[31875,0],0] using dash_host
[atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190439] [[31875,0],0] ignoring myself
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof for job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1] registered
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] is not a dynamic spawn
[atl1-01-mic0:190441] mca: base: components_register: registering memheap components
[atl1-01-mic0:190441] mca: base: components_register: found loaded component buddy
[atl1-01-mic0:190441] mca: base: components_register: component buddy has no register or open function
[atl1-01-mic0:190442] mca: base: components_register: registering memheap components
[atl1-01-mic0:190442] mca: base: components_register: found loaded component buddy
[atl1-01-mic0:190442] mca: base: components_register: component buddy has no register or open function
[atl1-01-mic0:190442] mca: base: components_register: found loaded component ptmalloc
[atl1-01-mic0:190442] mca: base: components_register: component ptmalloc has no register or open function
[atl1-01-mic0:190441] mca: base: components_register: found loaded component ptmalloc
[atl1-01-mic0:190441] mca: base: components_register: component ptmalloc has no register or open function
[atl1-01-mic0:190441] mca: base: components_open: opening memheap components
[atl1-01-mic0:190441] mca: base: components_open: found loaded component buddy
[atl1-01-mic0:190441] mca: base: components_open: component buddy open function successful
[atl1-01-mic0:190441] mca: base: components_open: found loaded component ptmalloc
[atl1-01-mic0:190441] mca: base: components_open: component ptmalloc open function successful
[atl1-01-mic0:190442] mca: base: components_open: opening memheap components
[atl1-01-mic0:190442] mca: base: components_open: found loaded component buddy
[atl1-01-mic0:190442] mca: base: components_open: component buddy open function successful
[atl1-01-mic0:190442] mca: base: components_open: found loaded component ptmalloc
[atl1-01-mic0:190442] mca: base: components_open: component ptmalloc open function successful
[atl1-01-mic0:190442] base/memheap_base_alloc.c:38 - mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 segments by method: 1
[atl1-01-mic0:190441] base/memheap_base_alloc.c:38 - mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 segments by method: 1
[atl1-01-mic0:190442] base/memheap_base_static.c:205 - _load_segments() add: 00600000-00601000 rw-p 00000000 00:11 6029314 /home/ariebs/bench/hello/mic.out
[atl1-01-mic0:190441] base/memheap_base_static.c:205 - _load_segments() add: 00600000-00601000 rw-p 00000000 00:11 6029314 /home/ariebs/bench/hello/mic.out
[atl1-01-mic0:190442] base/memheap_base_static.c:75 - mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 segments
[atl1-01-mic0:190442] base/memheap_base_register.c:39 - mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 270532608 bytes type=0x1 id=0xFFFFFFFF
[atl1-01-mic0:190441] base/memheap_base_static.c:75 - mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 segments
[atl1-01-mic0:190441] base/memheap_base_register.c:39 - mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 270532608 bytes type=0x1 id=0xFFFFFFFF
[atl1-01-mic0:190442] Error base/memheap_base_register.c:130 - _reg_segment() Failed to register segment
[atl1-01-mic0:190441] Error base/memheap_base_register.c:130 - _reg_segment() Failed to register segment
[atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open SHMEM developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0)
with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Local host: atl1-01-mic0
  PID:        190441
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[31875,1],0]
  Exit code:    255
--------------------------------------------------------------------------
[atl1-01-mic0:190439] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:190439] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:190439] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm
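A couple of observations on this run, for whatever they're worth. The
memheap tried to allocate 270532608 bytes (256 MB plus 2 MB of overhead)
even though SHMEM_SYMMETRIC_HEAP=1M was exported; the working runs above
used the SHMEM_SYMMETRIC_HEAP_SIZE spelling, so the 1M setting may simply
not have been picked up here. And if a SysV-style allocator is ever
selected, a 258 MB request is far above the kernel.shmmax of 33554432
shown above; raising the limits is a one-liner (illustrative values, and
likely moot while sshmem mmap is forced):

$ sudo sysctl -w kernel.shmmax=536870912    # largest single SysV segment, bytes
$ sudo sysctl -w kernel.shmall=2097152      # total SysV shm, in pages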
On 04/12/2015 03:09 PM, Ralph Castain wrote:
Sorry about that - I hadn’t brought it over to the 1.8 branch yet. I’ve
done so now, which means the ERROR_LOG shouldn’t show up any more. It
won’t fix the memheap problem, though.

You might try adding “--mca memheap_base_verbose 100” to your cmd line
so we can see why none of the memheap components are being selected.
Hi Ralph,

Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:190189] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:190189] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:190189] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:190189] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189 nodename hash 4121194178
[atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
[atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:190189] [[32137,0],0] using dash_host
[atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190189] [[32137,0],0] ignoring myself
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof for job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1] registered
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is not a dynamic spawn
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open SHMEM developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0)
with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Local host: atl1-01-mic0
  PID:        190192
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[32137,1],0]
  Exit code:    255
--------------------------------------------------------------------------
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm
On 04/11/2015 07:41 PM, Ralph Castain wrote:
Got it - thanks. I fixed that ERROR_LOG issue (I think - please verify).
I suspect the memheap issue relates to something else, but I probably
need to let the OSHMEM folks comment on it.
Everything is built on the Xeon side, with the icc "-mmic" switch. I then
ssh into one of the Phis, and run shmemrun from there.
On 04/11/2015 12:00 PM, Ralph Castain wrote:
Let me try to understand the setup a little better. Are you running
shmemrun on the Phi itself? Or is it running on the host processor, and
you are trying to spawn a process onto the Phi?
Hi Ralph,

Yes, this is attempting to get OSHMEM to run on the Phi.
I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly \
    CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
    LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio \
    --disable-mpi-fortran \
    --enable-debug \
    --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud

(Note that I had to add "oob-ud" to the --enable-mca-no-build option, as
the build complained that mca oob/ud needed mca common-verbs.)
With that configuration, here is what I am seeing now...

$ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:189895] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:189895] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:189895] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 nodename hash 4121194178
[atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419
[atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start comm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:189895] [[32419,0],0] using dash_host
[atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0
[atl1-01-mic0:189895] [[32419,0],0] ignoring myself
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1] registered
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is not a dynamic spawn
[atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open SHMEM developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 1 (pid 189899, host=atl1-01-mic0)
with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Local host: atl1-01-mic0
  PID:        189899
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[32419,1],1]
  Exit code:    255
--------------------------------------------------------------------------
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:189895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop comm
On 04/10/2015 06:37 PM, Ralph Castain wrote:
Andy - could you please try the current 1.8.5 nightly tarball and see if
it helps? The error log indicates that it is failing to get the topology
from some daemon, I'm assuming the one on the Phi?

You might also add --enable-debug to that configure line and then put
-mca plm_base_verbose on the shmemrun cmd to get more help
Summary: MPI jobs work fine, SHMEM jobs work just often enough to be
tantalizing, on an Intel Xeon Phi/MIC system.

Longer version

Thanks to the excellent write-up last June
(<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>), I
have been able to build a version of Open MPI for the Xeon Phi
coprocessor that runs MPI jobs on the Phi coprocessor with no problem,
but not SHMEM jobs. Just at the point where I was about to document the
problems I was having with SHMEM, my trivial SHMEM job worked. And then
failed when I tried to run it again, immediately afterwards. I have a
feeling I may be in uncharted territory here.
Environment
- RHEL 6.5
- Intel Composer XE 2015
- Xeon Phi/MIC
----------------
Configuration

$ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
$ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
$ ./configure --prefix=/home/ariebs/mic/mpi \
    CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
    LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio \
    --disable-vt --disable-mpi-fortran \
    --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
$ make
$ make install
----------------
Test program

#include <stdio.h>
#include <stdlib.h>
#include <shmem.h>

int main(int argc, char **argv)
{
    int me, num_pe;

    shmem_init();
    num_pe = num_pes();
    me = my_pe();
    printf("Hello World from process %d of %d\n", me, num_pe);
    exit(0);
}
----------------
Building the program

export PATH=/home/ariebs/mic/mpi/bin:$PATH
export PATH=/usr/linux-k1om-4.7/bin/:$PATH
source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH

icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
    -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
    -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
    -lm -ldl -lutil \
    -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
    -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
    -o mic.out shmem_hello.c
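For what it's worth, the installed OSHMEM wrapper can usually generate
that link line itself (a sketch, assuming the wrapper behaves in this
cross-compile setup):

$ /home/ariebs/mic/mpi/bin/oshcc -mmic -std=gnu99 -o mic.out shmem_hello.c
$ /home/ariebs/mic/mpi/bin/oshcc -showme
    # prints the underlying icc command, handy for comparing against
    # the hand-rolled link line above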
----------------
Running the program

(Note that the program had been consistently failing. Then, when I logged
back into the system to capture the results, it worked once, and then
immediately failed when I tried again, as shown below. Logging in and out
isn't sufficient to correct the problem. Overall, I think I had 3
successful runs in 30-40 attempts.)
$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
[atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open SHMEM developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0)
with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Local host: atl1-01-mic0
  PID:        189383
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[30881,1],0]
  Exit code:    255
--------------------------------------------------------------------------
Any thoughts about where to go from here?

Andy
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP