Gilles and Ralph, thanks!
$ shmemrun -H mic0,mic1 -n 2 -x SHMEM_SYMMETRIC_HEAP_SIZE=1M $PWD/mic.out
[atl1-01-mic0:192474] [[29886,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
Hello World from process 0 of 2
Hello World from process 1 of 2
$
This was built with the openmpi-dev-1487-g9c6d452.tar.bz2 nightly
master. Oddly, -static-intel didn't work. Fortunately, -rpath did.
I'll follow up in the next day or so with the winning build recipes for both MPI and the user app to wrap up this note and, one hopes, save others from some frustration in the future.
Andy
On 04/14/2015 11:10 PM, Ralph Castain wrote:
I think Gilles may be correct here. In reviewing the code, it appears we have never (going back to the 1.6 series, at least) forwarded the local LD_LIBRARY_PATH to the remote node when exec'ing the orted. The only thing we have done is to set the PATH and LD_LIBRARY_PATH to support the OMPI prefix - not any supporting libs.

What we have required, therefore, is that your path be set up properly in the remote .bashrc (or pick your shell) to handle the libraries.
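For example, something along these lines in the shell startup file on each MIC card (a sketch only - the install and Intel compiler paths are the ones that appear later in this thread, so adjust for your setup, and make sure the lines run before any "exit if not interactive" test in .bashrc) would let even the non-interactive ssh session that starts orted resolve libimf.so:

    # ~/.bashrc on the MIC card (sketch; adjust paths to your install)
    export PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH
    export LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH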
As I indicated, the -x option only forwards envars to the application procs themselves, not the orted. I could try to add another cmd line option to forward things for the orted, but the concern we've had in the past (and still harbor) is that the ssh cmd line is limited in length. Thus, adding some potentially long paths to support this option could overwhelm it and cause failures.

I'd try the static method first, or perhaps the LDFLAGS Gilles suggested.
Andy,
What about reconfiguring Open MPI with
LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" ?

IIRC, another option is LDFLAGS="-static-intel".
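Putting that together with the configure line Andy posted further down in the thread, the reconfigure could look roughly like this (a sketch only, not a verified recipe):

    $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
        --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
        AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
        --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
        --enable-orterun-prefix-by-default --enable-debug \
        LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic"
    $ make && make install

The idea is that the Intel runtime path gets baked into orted and the Open MPI libraries themselves, so the remote daemon can find libimf.so without any help from the environment.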
Last but not least, you can always replace orted with a simple script that sets the LD_LIBRARY_PATH and then execs the original orted.
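A minimal sketch of such a wrapper (hypothetical names: it assumes the real daemon has been renamed to orted.bin, with this script installed in its place as orted and made executable):

    #!/bin/sh
    # orted wrapper (sketch): make the Intel runtime libraries visible,
    # then hand control to the real daemon with the original arguments.
    LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH
    exec /home/ariebs/mic/mpi-nightly/bin/orted.bin "$@"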
Do you have the same behaviour on non-MIC hardware when Open MPI is compiled with the Intel compilers? If it works on non-MIC hardware, the root cause could be the sshd_config on the MIC, which may not be set up to accept LD_LIBRARY_PATH.
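A quick way to check that theory (a sketch; it assumes the card runs a standard OpenSSH sshd with its config in /etc/ssh/sshd_config, and only matters if the variable is being passed through ssh's environment forwarding rather than set on the remote command line):

    # See which environment variables sshd on the card is willing to accept
    # from the client; a variable is only passed through if an AcceptEnv
    # directive lists it.
    $ ssh mic1 grep -i '^AcceptEnv' /etc/ssh/sshd_config
    # If LD_LIBRARY_PATH is missing, adding "AcceptEnv LD_LIBRARY_PATH" to
    # sshd_config on the card and restarting sshd would allow it.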
my 0.02 US$
Gilles
On 4/14/2015 11:20 PM, Ralph Castain wrote:
Hmmm… certainly looks that way. I'll investigate.
Hi Ralph,
Still no happiness... It looks like my LD_LIBRARY_PATH just isn't getting propagated?
$ ldd /home/ariebs/mic/mpi-nightly/bin/orted
    linux-vdso.so.1 =>  (0x00007fffa1d3b000)
    libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x00002ab6ce464000)
    libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x00002ab6ce7d3000)
    libm.so.6 => /lib64/libm.so.6 (0x00002ab6cebbd000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002ab6ceded000)
    librt.so.1 => /lib64/librt.so.1 (0x00002ab6ceff1000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00002ab6cf1f9000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ab6cf3fc000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ab6cf60f000)
    libc.so.6 => /lib64/libc.so.6 (0x00002ab6cf82c000)
    libimf.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so (0x00002ab6cfb84000)
    libsvml.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so (0x00002ab6cffd6000)
    libirng.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so (0x00002ab6d086f000)
    libintlc.so.5 => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 (0x00002ab6d0a82000)
    /lib64/ld-linux-k1om.so.2 (0x00002ab6ce243000)
$ echo $LD_LIBRARY_PATH
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca plm_base_verbose 5 --mca memheap_base_verbose 100 --leave-session-attached --mca mca_component_show_load_errors 1 $PWD/mic.out
--------------------------------------------------------------------------
A deprecated MCA variable value was specified in the environment or on the
command line. Deprecated MCA variables should be avoided; they may disappear
in future releases.
Deprecated variable: mca_component_show_load_errors
New variable: mca_base_component_show_load_errors
--------------------------------------------------------------------------
[atl1-02-mic0:16183] mca:base:select:( plm) Querying component [rsh]
[atl1-02-mic0:16183] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-02-mic0:16183] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-02-mic0:16183] mca:base:select:( plm) Querying component [isolated]
[atl1-02-mic0:16183] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-02-mic0:16183] mca:base:select:( plm) Querying component [slurm]
[atl1-02-mic0:16183] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-02-mic0:16183] mca:base:select:( plm) Selected component [rsh]
[atl1-02-mic0:16183] plm:base:set_hnp_name: initial bias 16183 nodename hash 4238360777
[atl1-02-mic0:16183] plm:base:set_hnp_name: final jobfam 33630
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-02-mic0:16183] [[33630,0],0] plm:base:receive start comm
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_job
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm creating map
[atl1-02-mic0:16183] [[33630,0],0] setup:vm: working unmanaged allocation
[atl1-02-mic0:16183] [[33630,0],0] using dash_host
[atl1-02-mic0:16183] [[33630,0],0] checking node mic1
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm add new daemon [[33630,0],1]
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm assigning new daemon [[33630,0],1] to node mic1
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: launching vm
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: local shell: 0 (bash)
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: assuming same remote shell as local shell
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: remote shell: 0 (bash)
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: final template argv:
    /usr/bin/ssh <template> PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted -mca orte_leave_session_attached "1" --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "2203975680" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "2" -mca orte_hnp_uri "2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" --mca mca_component_show_load_errors "1" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh:launch daemon 0 not a child of mine
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: adding node mic1 to launch list
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: activating launch event
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: recording launch of daemon [[33630,0],1]
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh mic1 PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted -mca orte_leave_session_attached "1" --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "2203975680" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" -mca orte_hnp_uri "2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" --mca mca_component_show_load_errors "1" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"]
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
[atl1-02-mic0:16183] [[33630,0],0] daemon 1 failed with status 127
[atl1-02-mic0:16183] [[33630,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
[atl1-02-mic0:16183] [[33630,0],0] plm:base:receive stop comm
On 04/13/2015 07:47 PM, Ralph Castain wrote:
Weird. I'm not sure what to try at that point - IIRC, building static won't resolve this problem (but you could try and see). You could add the following to the cmd line and see if it tells us anything useful:

--leave-session-attached --mca mca_component_show_load_errors 1

You might also do an ldd on /home/ariebs/mic/mpi-nightly/bin/orted and see where it is looking for libimf since it (and not mic.out) is the one complaining.
Ralph
Ralph and Nathan,
The problem may be something trivial, as I don't typically use "shmemrun" to start jobs. With the following, I *think* I've demonstrated that the problem library is where it belongs on the remote system:
$ ldd mic.out
    linux-vdso.so.1 =>  (0x00007fffb83ff000)
    liboshmem.so.0 => /home/ariebs/mic/mpi-nightly/lib/liboshmem.so.0 (0x00002b059cfbb000)
    libmpi.so.0 => /home/ariebs/mic/mpi-nightly/lib/libmpi.so.0 (0x00002b059d35a000)
    libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x00002b059d7e3000)
    libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x00002b059db53000)
    libm.so.6 => /lib64/libm.so.6 (0x00002b059df3d000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002b059e16c000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00002b059e371000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b059e574000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b059e786000)
    libc.so.6 => /lib64/libc.so.6 (0x00002b059e9a4000)
    librt.so.1 => /lib64/librt.so.1 (0x00002b059ecfc000)
    libimf.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so (0x00002b059ef04000)
    libsvml.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so (0x00002b059f356000)
    libirng.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so (0x00002b059fbef000)
    libintlc.so.5 => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 (0x00002b059fe02000)
    /lib64/ld-linux-k1om.so.2 (0x00002b059cd9a000)
$ echo $LD_LIBRARY_PATH
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib
$ ssh mic1 file /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 (SYSV), dynamically linked, not stripped
$ shmemrun -H mic1 -N 2 --mca btl scif,self $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
...
On 04/13/2015 04:25 PM, Nathan Hjelm wrote:
For talking between PHIs on the same system I recommend using the scif
BTL NOT tcp.
That said, it looks like the LD_LIBRARY_PATH is wrong on the remote
system. It looks like it can't find the intel compiler libraries.
-Nathan Hjelm
HPC-5, LANL
On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote:
Progress! I can run my trivial program on the local PHI, but not on the other PHI on the system. Here are the interesting parts:
A pretty good recipe with last night's nightly master:
$ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
    --enable-orterun-prefix-by-default \
    --enable-debug
$ make && make install
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml yoda --mca btl openib,sm,self $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$
However, I can't seem to cross the fabric. I can ssh freely back and forth between mic0 and mic1, yet when I run the next two tests from mic0, it certainly seems like the second one should work, too:
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to
use.
* compilation of the orted with dynamic libraries when static are
required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
...
$
(Note that I get the same results with "--mca btl openib,sm,self"....)
$ ssh mic1 file /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 (SYSV), dynamically linked, not stripped
$ shmemrun -x LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to
use.
* compilation of the orted with dynamic libraries when static are
required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
Following here is
- IB information
- Running the failing case with lots of debugging information. (As you might imagine, I've tried 17 ways from Sunday to try to ensure that libimf.so is found.)
$ ibv_devices
device node GUID
------ ----------------
mlx4_0 24be05ffffa57160
scif0 4c79bafffe4402b6
$ ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.11.1250
node_guid: 24be:05ff:ffa5:7160
sys_image_guid: 24be:05ff:ffa5:7163
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 8
port_lid: 86
port_lmc: 0x00
link_layer: InfiniBand
port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: InfiniBand
hca_id: scif0
transport: SCIF (2)
fw_ver: 0.0.1
node_guid: 4c79:baff:fe44:02b6
sys_image_guid: 4c79:baff:fe44:02b6
vendor_id: 0x8086
vendor_part_id: 0
hw_ver: 0x1
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1001
port_lmc: 0x00
link_layer: SCIF
$ shmemrun -x LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca plm_base_verbose 5 --mca memheap_base_verbose 100 $PWD/mic.out
[atl1-01-mic0:191024] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:191024] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
rsh path NULL
[atl1-01-mic0:191024] mca:base:select:( plm) Query of component [rsh] set
priority to 10
[atl1-01-mic0:191024] mca:base:select:( plm) Querying component
[isolated]
[atl1-01-mic0:191024] mca:base:select:( plm) Query of component
[isolated] set priority to 0
[atl1-01-mic0:191024] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:191024] mca:base:select:( plm) Skipping component [slurm].
Query failed to return a module
[atl1-01-mic0:191024] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:191024] plm:base:set_hnp_name: initial bias 191024 nodename
hash 4121194178
[atl1-01-mic0:191024] plm:base:set_hnp_name: final jobfam 29012
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh_setup on agent ssh : rsh path
NULL
[atl1-01-mic0:191024] [[29012,0],0] plm:base:receive start comm
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_job
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:191024] [[29012,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:191024] [[29012,0],0] using dash_host
[atl1-01-mic0:191024] [[29012,0],0] checking node mic1
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm add new daemon
[[29012,0],1]
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm assigning new daemon
[[29012,0],1] to node mic1
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: launching vm
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: local shell: 0 (bash)
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: assuming same remote shell as
local shell
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: remote shell: 0 (bash)
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template>
PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export
LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ;
export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted
--hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca
orte_ess_jobid "1901330432" -mca orte_ess_vpid "<template>" -mca
orte_ess_num_procs "2" -mca orte_hnp_uri
"1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
--tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca
rmaps_ppr_n_pernode "2"
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh:launch daemon 0 not a child of
mine
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: adding node mic1 to launch
list
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: activating launch event
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: recording launch of daemon
[[29012,0],1]
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: executing: (/usr/bin/ssh)
[/usr/bin/ssh mic1 PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ;
export PATH ;
LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export
LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ;
export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted
--hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca
orte_ess_jobid "1901330432" -mca orte_ess_vpid 1 -mca orte_ess_num_procs
"2" -mca orte_hnp_uri
"1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
--tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca
rmaps_ppr_n_pernode "2"]
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
libraries: libimf.so: cannot open shared object file: No such file or
directory
[atl1-01-mic0:191024] [[29012,0],0] daemon 1 failed with status 127
[atl1-01-mic0:191024] [[29012,0],0] plm:base:orted_cmd sending orted_exit
commands
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to
use.
* compilation of the orted with dynamic libraries when static are
required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
[atl1-01-mic0:191024] [[29012,0],0] plm:base:receive stop comm
On 04/13/2015 08:50 AM, Andy Riebs wrote:
Hi Ralph,
Here are the results with last night's "master" nightly,
openmpi-dev-1487-g9c6d452.tar.bz2, and adding the memheap_base_verbose
option (yes, it looks like the "ERROR_LOG" problem has gone away):
$ cat /proc/sys/kernel/shmmax
33554432
$ cat /proc/sys/kernel/shmall
2097152
$ cat /proc/sys/kernel/shmmni
4096
$ export SHMEM_SYMMETRIC_HEAP=1M
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5
--mca memheap_base_verbose 100 $PWD/mic.out
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
rsh path NULL
[atl1-01-mic0:190439] mca:base:select:( plm) Query of component [rsh]
set priority to 10
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component
[isolated]
[atl1-01-mic0:190439] mca:base:select:( plm) Query of component
[isolated] set priority to 0
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:190439] mca:base:select:( plm) Skipping component
[slurm]. Query failed to return a module
[atl1-01-mic0:190439] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439
nodename hash 4121194178
[atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875
[atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : rsh
path NULL
[atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged
allocation
[atl1-01-mic0:190439] [[31875,0],0] using dash_host
[atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190439] [[31875,0],0] ignoring myself
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in
allocation
[atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job
[31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof for
job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1] registered
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] is not
a dynamic spawn
[atl1-01-mic0:190441] mca: base: components_register: registering
memheap components
[atl1-01-mic0:190441] mca: base: components_register: found loaded
component buddy
[atl1-01-mic0:190441] mca: base: components_register: component buddy
has no register or open function
[atl1-01-mic0:190442] mca: base: components_register: registering
memheap components
[atl1-01-mic0:190442] mca: base: components_register: found loaded
component buddy
[atl1-01-mic0:190442] mca: base: components_register: component buddy
has no register or open function
[atl1-01-mic0:190442] mca: base: components_register: found loaded
component ptmalloc
[atl1-01-mic0:190442] mca: base: components_register: component ptmalloc
has no register or open function
[atl1-01-mic0:190441] mca: base: components_register: found loaded
component ptmalloc
[atl1-01-mic0:190441] mca: base: components_register: component ptmalloc
has no register or open function
[atl1-01-mic0:190441] mca: base: components_open: opening memheap
components
[atl1-01-mic0:190441] mca: base: components_open: found loaded component
buddy
[atl1-01-mic0:190441] mca: base: components_open: component buddy open
function successful
[atl1-01-mic0:190441] mca: base: components_open: found loaded component
ptmalloc
[atl1-01-mic0:190441] mca: base: components_open: component ptmalloc
open function successful
[atl1-01-mic0:190442] mca: base: components_open: opening memheap
components
[atl1-01-mic0:190442] mca: base: components_open: found loaded component
buddy
[atl1-01-mic0:190442] mca: base: components_open: component buddy open
function successful
[atl1-01-mic0:190442] mca: base: components_open: found loaded component
ptmalloc
[atl1-01-mic0:190442] mca: base: components_open: component ptmalloc
open function successful
[atl1-01-mic0:190442] base/memheap_base_alloc.c:38 -
mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1
segments by method: 1
[atl1-01-mic0:190441] base/memheap_base_alloc.c:38 -
mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1
segments by method: 1
[atl1-01-mic0:190442] base/memheap_base_static.c:205 - _load_segments()
add: 00600000-00601000 rw-p 00000000 00:11
6029314 /home/ariebs/bench/hello/mic.out
[atl1-01-mic0:190441] base/memheap_base_static.c:205 - _load_segments()
add: 00600000-00601000 rw-p 00000000 00:11
6029314 /home/ariebs/bench/hello/mic.out
[atl1-01-mic0:190442] base/memheap_base_static.c:75 -
mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2
segments
[atl1-01-mic0:190442] base/memheap_base_register.c:39 -
mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000
270532608 bytes type=0x1 id=0xFFFFFFFF
[atl1-01-mic0:190441] base/memheap_base_static.c:75 -
mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2
segments
[atl1-01-mic0:190441] base/memheap_base_register.c:39 -
mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000
270532608 bytes type=0x1 id=0xFFFFFFFF
[atl1-01-mic0:190442] Error base/memheap_base_register.c:130 -
_reg_segment() Failed to register segment
[atl1-01-mic0:190441] Error base/memheap_base_register.c:130 -
_reg_segment() Failed to register segment
[atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM
failed to initialize - aborting
[atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM
failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):
mca_memheap_base_select() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0) with
errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.
Local host: atl1-01-mic0
PID: 190441
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending
orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero
status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[31875,1],0]
Exit code: 255
--------------------------------------------------------------------------
[atl1-01-mic0:190439] 1 more process has sent help message
help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" to 0
to see all help / error messages
[atl1-01-mic0:190439] 1 more process has sent help message
help-shmem-api.txt / shmem-abort
[atl1-01-mic0:190439] 1 more process has sent help message
help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm
On 04/12/2015 03:09 PM, Ralph Castain wrote:
Sorry about that - I hadn't brought it over to the 1.8 branch yet.
I've done so now, which means the ERROR_LOG shouldn't show up any
more. It won't fix the memheap problem, though.
You might try adding "--mca memheap_base_verbose 100" to your cmd line
so we can see why none of the memheap components are being selected.
On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com> wrote:
Hi Ralph,
Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca
plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component
[rsh]
[atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent
ssh : rsh path NULL
[atl1-01-mic0:190189] mca:base:select:( plm) Query of component
[rsh] set priority to 10
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component
[isolated]
[atl1-01-mic0:190189] mca:base:select:( plm) Query of component
[isolated] set priority to 0
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component
[slurm]
[atl1-01-mic0:190189] mca:base:select:( plm) Skipping component
[slurm]. Query failed to return a module
[atl1-01-mic0:190189] mca:base:select:( plm) Selected component
[rsh]
[atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189
nodename hash 4121194178
[atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
[atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh
path NULL
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged
allocation
[atl1-01-mic0:190189] [[32137,0],0] using dash_host
[atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190189] [[32137,0],0] ignoring myself
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in
allocation
[atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in
file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job
[32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof
for job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1]
registered
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is
not a dynamic spawn
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):
mca_memheap_base_select() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM
failed to initialize - aborting
[atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM
failed to initialize - aborting
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0)
with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that
all
of its peer processes in the job will be killed properly. You
should
double check that everything has shut down cleanly.
Local host: atl1-01-mic0
PID: 190192
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending
orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero
status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[32137,1],0]
Exit code: 255
--------------------------------------------------------------------------
[atl1-01-mic0:190189] 1 more process has sent help message
help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate"
to 0 to see all help / error messages
[atl1-01-mic0:190189] 1 more process has sent help message
help-shmem-api.txt / shmem-abort
[atl1-01-mic0:190189] 1 more process has sent help message
help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all
killed
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm
On 04/11/2015 07:41 PM, Ralph Castain wrote:
Got it - thanks. I fixed that ERROR_LOG issue (I think - please verify). I suspect the memheap issue relates to something else, but I probably need to let the OSHMEM folks comment on it.
On Apr 11, 2015, at 9:52 AM, Andy Riebs <andy.ri...@hp.com> wrote:
Everything is built on the Xeon side, with the icc "-mmic"
switch. I then ssh into one of the PHIs, and run shmemrun from
there.
On 04/11/2015 12:00 PM, Ralph Castain wrote:
Let me try to understand the setup a little better. Are you
running shmemrun on the PHI itself? Or is it running on the
host processor, and you are trying to spawn a process onto the
Phi?
On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:
Hi Ralph,
Yes, this is attempting to get OSHMEM to run on the Phi.
I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
    --enable-debug \
    --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud

(Note that I had to add "oob-ud" to the "--enable-mca-no-build" option, as the build complained that mca oob/ud needed mca common-verbs.)
With that configuration, here is what I am seeing now...
$ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:189895] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:189895] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:189895] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 nodename hash 4121194178
[atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419
[atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start comm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:189895] [[32419,0],0] using dash_host
[atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0
[atl1-01-mic0:189895] [[32419,0],0] ignoring myself
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1] registered
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is not a dynamic spawn
[atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):
mca_memheap_base_select() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 1 (pid 189899, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.
Local host: atl1-01-mic0
PID: 189899
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[32419,1],1]
Exit code: 255
--------------------------------------------------------------------------
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:189895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop comm
On 04/10/2015 06:37 PM, Ralph Castain wrote:
Andy - could you please try the current 1.8.5 nightly tarball and see if it helps? The error log indicates that it is failing to get the topology from some daemon, I'm assuming the one on the Phi?

You might also add --enable-debug to that configure line and then put -mca plm_base_verbose on the shmemrun cmd to get more help.
On Apr 10, 2015, at 11:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:

Summary: MPI jobs work fine, SHMEM jobs work just often enough to be tantalizing, on an Intel Xeon Phi/MIC system.

Longer version

Thanks to the excellent write-up last June
(<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>), I have been able to build a version of Open MPI for the Xeon Phi coprocessor that runs MPI jobs on the Phi coprocessor with no problem, but not SHMEM jobs. Just at the point where I was about to document the problems I was having with SHMEM, my trivial SHMEM job worked. And then failed when I tried to run it again, immediately afterwards. I have a feeling I may be in uncharted territory here.
Environment
* RHEL 6.5
* Intel Composer XE 2015
* Xeon Phi/MIC
----------------
Configuration

$ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
$ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
$ ./configure --prefix=/home/ariebs/mic/mpi \
    CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
    LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio \
    --disable-vt --disable-mpi-fortran \
    --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
$ make
$ make install
----------------
Test program

#include <stdio.h>
#include <stdlib.h>
#include <shmem.h>

int main(int argc, char **argv)
{
    int me, num_pe;

    shmem_init();
    num_pe = num_pes();
    me = my_pe();
    printf("Hello World from process %d of %d\n", me, num_pe);
    exit(0);
}
----------------
Building the program

export PATH=/home/ariebs/mic/mpi/bin:$PATH
export PATH=/usr/linux-k1om-4.7/bin/:$PATH
source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH

icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
    -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
    -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
    -lm -ldl -lutil \
    -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
    -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
    -o mic.out shmem_hello.c
----------------
Running the program

(Note that the program had been consistently failing. Then, when I logged back into the system to capture the results, it worked once, and then immediately failed when I tried again, as shown below. Logging in and out isn't sufficient to correct the problem. Overall, I think I had 3 successful runs in 30-40 attempts.)
$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
[atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):
mca_memheap_base_select() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.
Local host: atl1-01-mic0
PID: 189383
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[30881,1],0]
Exit code: 255
--------------------------------------------------------------------------
Any thoughts about where to go from here?
Andy
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP