Gilles and Ralph, thanks!

$ shmemrun -H mic0,mic1 -n 2 -x SHMEM_SYMMETRIC_HEAP_SIZE=1M $PWD/mic.out
[atl1-01-mic0:192474] [[29886,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
Hello World from process 0 of 2
Hello World from process 1 of 2
$

This was built with the openmpi-dev-1487-g9c6d452.tar.bz2 nightly master. Oddly, -static-intel didn't work. Fortunately, -rpath did. I'll follow up in the next day or so with the winning build recipes for both MPI and the user app to wrap up this note and, one hopes, save others from some frustration in the future.
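
For reference while the full recipes are pending, the -rpath approach amounts to baking the Intel MIC runtime directory into the Open MPI binaries at link time, so orted no longer needs LD_LIBRARY_PATH at launch. This is only a sketch reusing the LDFLAGS and paths quoted below; the "..." stands in for the rest of the cross-compile configure line:

```
# Sketch: rebuild Open MPI with an rpath to the Intel MIC runtime.
# The elided options ("...") are the rest of the cross-compile recipe.
./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
    LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" \
    ...
make && make install
```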

Andy


On 04/14/2015 11:10 PM, Ralph Castain wrote:
I think Gilles may be correct here. In reviewing the code, it appears we have never (going back to the 1.6 series, at least) forwarded the local LD_LIBRARY_PATH to the remote node when exec’ing the orted. The only thing we have done is to set the PATH and LD_LIBRARY_PATH to support the OMPI prefix - not any supporting libs.

What we have required, therefore, is that your path be set up properly in the remote .bashrc (or pick your shell) to handle the libraries.
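
As a concrete sketch (assuming the Intel 15.0 MIC runtime path that appears later in this thread), the remote-side setup is a one-line export in the shell startup file:

```shell
# Goes in ~/.bashrc on each MIC node, above any "return unless interactive"
# guard: bash started non-interactively by sshd still reads ~/.bashrc,
# but stops at such a guard before reaching lines below it.
export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
```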

As I indicated, the -x option only forwards envars to the application procs themselves, not the orted. I could try to add another cmd line option to forward things for the orted, but the concern we’ve had in the past (and still harbor) is that the ssh cmd line is limited in length. Thus, adding some potentially long paths to support this option could overwhelm it and cause failures.

I’d try the static method first, or perhaps the LDFLAGS Gilles suggested.


On Apr 14, 2015, at 5:11 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Andy,

what about reconfiguring Open MPI with LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" ?

IIRC, another option is: LDFLAGS="-static-intel"

last but not least, you can always replace orted with a simple script that sets LD_LIBRARY_PATH and then execs the original orted
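
That wrapper could look like the following minimal sketch; the "orted.real" rename is hypothetical, and the paths are the ones from this thread:

```
#!/bin/sh
# Hypothetical shim installed as /home/ariebs/mic/mpi-nightly/bin/orted
# (the real daemon having first been renamed to orted.real).
# Prepend the Intel MIC runtime directory, then exec the real orted
# with all original arguments.
LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
exec /home/ariebs/mic/mpi-nightly/bin/orted.real "$@"
```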

do you have the same behaviour on non-MIC hardware when Open MPI is compiled with the Intel compilers?
if it works on non-MIC hardware, the root cause could be the sshd_config on the MIC, which does not
accept LD_LIBRARY_PATH from the client environment
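
If that suspicion is right, the usual OpenSSH mechanism is AcceptEnv/SendEnv; a hedged sketch (a stock sshd passes only explicitly whitelisted variables, and the client must also ask to send them):

```
# Server side: /etc/ssh/sshd_config on the MIC, then restart sshd
AcceptEnv LD_LIBRARY_PATH

# Client side: ~/.ssh/config on the launching node
Host mic*
    SendEnv LD_LIBRARY_PATH
```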

my 0.02 US$

Gilles

On 4/14/2015 11:20 PM, Ralph Castain wrote:
Hmmm…certainly looks that way. I’ll investigate.

On Apr 14, 2015, at 6:06 AM, Andy Riebs <andy.ri...@hp.com> wrote:

Hi Ralph,

Still no happiness... It looks like my LD_LIBRARY_PATH just isn't getting propagated?

$ ldd /home/ariebs/mic/mpi-nightly/bin/orted
        linux-vdso.so.1 =>  (0x00007fffa1d3b000)
        libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x00002ab6ce464000)
        libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x00002ab6ce7d3000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ab6cebbd000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ab6ceded000)
        librt.so.1 => /lib64/librt.so.1 (0x00002ab6ceff1000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002ab6cf1f9000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ab6cf3fc000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ab6cf60f000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ab6cf82c000)
        libimf.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so (0x00002ab6cfb84000)
        libsvml.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so (0x00002ab6cffd6000)
        libirng.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so (0x00002ab6d086f000)
        libintlc.so.5 => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 (0x00002ab6d0a82000)
        /lib64/ld-linux-k1om.so.2 (0x00002ab6ce243000)

$ echo $LD_LIBRARY_PATH
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib

$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca plm_base_verbose 5 --mca memheap_base_verbose 100 --leave-session-attached --mca mca_component_show_load_errors 1 $PWD/mic.out
--------------------------------------------------------------------------
A deprecated MCA variable value was specified in the environment or
on the command line.  Deprecated MCA variables should be avoided;
they may disappear in future releases.

  Deprecated variable: mca_component_show_load_errors
  New variable:        mca_base_component_show_load_errors
--------------------------------------------------------------------------
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [rsh]
[atl1-02-mic0:16183] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-02-mic0:16183] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [isolated]
[atl1-02-mic0:16183] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [slurm]
[atl1-02-mic0:16183] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[atl1-02-mic0:16183] mca:base:select:(  plm) Selected component [rsh]
[atl1-02-mic0:16183] plm:base:set_hnp_name: initial bias 16183 nodename hash 4238360777
[atl1-02-mic0:16183] plm:base:set_hnp_name: final jobfam 33630
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-02-mic0:16183] [[33630,0],0] plm:base:receive start comm
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_job
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm creating map
[atl1-02-mic0:16183] [[33630,0],0] setup:vm: working unmanaged allocation
[atl1-02-mic0:16183] [[33630,0],0] using dash_host
[atl1-02-mic0:16183] [[33630,0],0] checking node mic1
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm add new daemon [[33630,0],1]
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm assigning new daemon [[33630,0],1] to node mic1
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: launching vm
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: local shell: 0 (bash)
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: assuming same remote shell as local shell
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: remote shell: 0 (bash)
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: final template argv:
        /usr/bin/ssh <template>     PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted -mca orte_leave_session_attached "1" --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "2203975680" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "2" -mca orte_hnp_uri "2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" --mca mca_component_show_load_errors "1" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh:launch daemon 0 not a child of mine
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: adding node mic1 to launch list
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: activating launch event
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: recording launch of daemon [[33630,0],1]
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh mic1     PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted -mca orte_leave_session_attached "1" --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "2203975680" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" -mca orte_hnp_uri "2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" --mca mca_component_show_load_errors "1" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"]
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
[atl1-02-mic0:16183] [[33630,0],0] daemon 1 failed with status 127
[atl1-02-mic0:16183] [[33630,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
[atl1-02-mic0:16183] [[33630,0],0] plm:base:receive stop comm


On 04/13/2015 07:47 PM, Ralph Castain wrote:
Weird. I’m not sure what to try at that point - IIRC, building static won’t resolve this problem (but you could try and see). You could add the following to the cmd line and see if it tells us anything useful:

--leave-session-attached --mca mca_component_show_load_errors 1

You might also do an ldd on /home/ariebs/mic/mpi-nightly/bin/orted and see where it is looking for libimf, since it (and not mic.out) is the one complaining.


On Apr 13, 2015, at 1:58 PM, Andy Riebs <andy.ri...@hp.com> wrote:

Ralph and Nathan,

The problem may be something trivial, as I don't typically use "shmemrun" to start jobs. With the following, I *think* I've demonstrated that the problem library is where it belongs on the remote system:

$ ldd mic.out
        linux-vdso.so.1 =>  (0x00007fffb83ff000)
        liboshmem.so.0 => /home/ariebs/mic/mpi-nightly/lib/liboshmem.so.0 (0x00002b059cfbb000)
        libmpi.so.0 => /home/ariebs/mic/mpi-nightly/lib/libmpi.so.0 (0x00002b059d35a000)
        libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x00002b059d7e3000)
        libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x00002b059db53000)
        libm.so.6 => /lib64/libm.so.6 (0x00002b059df3d000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002b059e16c000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002b059e371000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b059e574000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b059e786000)
        libc.so.6 => /lib64/libc.so.6 (0x00002b059e9a4000)
        librt.so.1 => /lib64/librt.so.1 (0x00002b059ecfc000)
        libimf.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so (0x00002b059ef04000)
        libsvml.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so (0x00002b059f356000)
        libirng.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so (0x00002b059fbef000)
        libintlc.so.5 => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 (0x00002b059fe02000)
        /lib64/ld-linux-k1om.so.2 (0x00002b059cd9a000)
$ echo $LD_LIBRARY_PATH
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib
$ ssh mic1 file /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 (SYSV), dynamically linked, not stripped
$ shmemrun -H mic1 -N 2 --mca btl scif,self $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
...


On 04/13/2015 04:25 PM, Nathan Hjelm wrote:
For talking between PHIs on the same system I recommend using the scif
BTL NOT tcp.

That said, it looks like the LD_LIBRARY_PATH is wrong on the remote
system. It looks like it can't find the intel compiler libraries.

-Nathan Hjelm
HPC-5, LANL

On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote:
   Progress! I can run my trivial program on the local PHI, but not on the
   other PHI on the system. Here are the interesting parts:

   A pretty good recipe with last night's nightly master:

   $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
       --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
       AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
       --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
       --enable-orterun-prefix-by-default \
       --enable-debug
   $ make && make install
   $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
   yoda --mca btl sm,self,tcp $PWD/mic.out
   Hello World from process 0 of 2
   Hello World from process 1 of 2
   $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
   yoda --mca btl openib,sm,self $PWD/mic.out
   Hello World from process 0 of 2
   Hello World from process 1 of 2
   $

   However, I can't seem to cross the fabric. I can ssh freely back and forth
   between mic0 and mic1, so running the next two tests from mic0, it
   certainly seems like the second one should work, too:

   $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda
   --mca btl sm,self,tcp $PWD/mic.out
   Hello World from process 0 of 2
   Hello World from process 1 of 2
   $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda
   --mca btl sm,self,tcp $PWD/mic.out
   /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
   libraries: libimf.so: cannot open shared object file: No such file or
   directory
   --------------------------------------------------------------------------
   ORTE was unable to reliably start one or more daemons.
   This usually is caused by:

   * not finding the required libraries and/or binaries on
     one or more nodes. Please check your PATH and LD_LIBRARY_PATH
     settings, or configure OMPI with --enable-orterun-prefix-by-default

   * lack of authority to execute on one or more specified nodes.
     Please verify your allocation and authorities.

   * the inability to write startup files into /tmp
   (--tmpdir/orte_tmpdir_base).
     Please check with your sys admin to determine the correct location to
   use.

   *  compilation of the orted with dynamic libraries when static are
   required
     (e.g., on Cray). Please check your configure cmd line and consider using
     one of the contrib/platform definitions for your system type.

   * an inability to create a connection back to mpirun due to a
     lack of common network interfaces and/or no route found between
     them. Please check network connectivity (including firewalls
     and network routing requirements).
    ...
   $

   (Note that I get the same results with "--mca btl openib,sm,self"....)

   $ ssh mic1 file
   /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
   /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF
   64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1
   (SYSV), dynamically linked, not stripped
   $ shmemrun -x
   LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
   -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
   /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
   libraries: libimf.so: cannot open shared object file: No such file or
   directory
   --------------------------------------------------------------------------
   ORTE was unable to reliably start one or more daemons.
   This usually is caused by:

   * not finding the required libraries and/or binaries on
     one or more nodes. Please check your PATH and LD_LIBRARY_PATH
     settings, or configure OMPI with --enable-orterun-prefix-by-default

   * lack of authority to execute on one or more specified nodes.
     Please verify your allocation and authorities.

   * the inability to write startup files into /tmp
   (--tmpdir/orte_tmpdir_base).
     Please check with your sys admin to determine the correct location to
   use.

   *  compilation of the orted with dynamic libraries when static are
   required
     (e.g., on Cray). Please check your configure cmd line and consider using
     one of the contrib/platform definitions for your system type.

   * an inability to create a connection back to mpirun due to a
     lack of common network interfaces and/or no route found between
     them. Please check network connectivity (including firewalls
     and network routing requirements).

   Following here is
   - IB information
   - Running the failing case with lots of debugging information. (As you
   might imagine, I've tried 17 ways from Sunday to try to ensure that
   libimf.so is found.)

   $ ibv_devices
       device                 node GUID
       ------              ----------------
       mlx4_0              24be05ffffa57160
       scif0               4c79bafffe4402b6
   $ ibv_devinfo
   hca_id: mlx4_0
           transport:                      InfiniBand (0)
           fw_ver:                         2.11.1250
           node_guid:                      24be:05ff:ffa5:7160
           sys_image_guid:                 24be:05ff:ffa5:7163
           vendor_id:                      0x02c9
           vendor_part_id:                 4099
           hw_ver:                         0x0
           phys_port_cnt:                  2
                   port:   1
                           state:                  PORT_ACTIVE (4)
                           max_mtu:                2048 (4)
                           active_mtu:             2048 (4)
                           sm_lid:                 8
                           port_lid:               86
                           port_lmc:               0x00
                           link_layer:             InfiniBand

                   port:   2
                           state:                  PORT_DOWN (1)
                           max_mtu:                2048 (4)
                           active_mtu:             2048 (4)
                           sm_lid:                 0
                           port_lid:               0
                           port_lmc:               0x00
                           link_layer:             InfiniBand

   hca_id: scif0
           transport:                      SCIF (2)
           fw_ver:                         0.0.1
           node_guid:                      4c79:baff:fe44:02b6
           sys_image_guid:                 4c79:baff:fe44:02b6
           vendor_id:                      0x8086
           vendor_part_id:                 0
           hw_ver:                         0x1
           phys_port_cnt:                  1
                   port:   1
                           state:                  PORT_ACTIVE (4)
                           max_mtu:                4096 (5)
                           active_mtu:             4096 (5)
                           sm_lid:                 1
                           port_lid:               1001
                           port_lmc:               0x00
                           link_layer:             SCIF

   $ shmemrun -x
   LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
   -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca plm_base_verbose
   5 --mca memheap_base_verbose 100 $PWD/mic.out
   [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component [rsh]
   [atl1-01-mic0:191024] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
   rsh path NULL
   [atl1-01-mic0:191024] mca:base:select:(  plm) Query of component [rsh] set
   priority to 10
   [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component
   [isolated]
   [atl1-01-mic0:191024] mca:base:select:(  plm) Query of component
   [isolated] set priority to 0
   [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component [slurm]
   [atl1-01-mic0:191024] mca:base:select:(  plm) Skipping component [slurm].
   Query failed to return a module
   [atl1-01-mic0:191024] mca:base:select:(  plm) Selected component [rsh]
   [atl1-01-mic0:191024] plm:base:set_hnp_name: initial bias 191024 nodename
   hash 4121194178
   [atl1-01-mic0:191024] plm:base:set_hnp_name: final jobfam 29012
   [atl1-01-mic0:191024] [[29012,0],0] plm:rsh_setup on agent ssh : rsh path
   NULL
   [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive start comm
   [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_job
   [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm
   [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm creating map
   [atl1-01-mic0:191024] [[29012,0],0] setup:vm: working unmanaged allocation
   [atl1-01-mic0:191024] [[29012,0],0] using dash_host
   [atl1-01-mic0:191024] [[29012,0],0] checking node mic1
   [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm add new daemon
   [[29012,0],1]
   [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm assigning new daemon
   [[29012,0],1] to node mic1
   [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: launching vm
   [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: local shell: 0 (bash)
   [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: assuming same remote shell as
   local shell
   [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: remote shell: 0 (bash)
   [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: final template argv:
           /usr/bin/ssh <template>    
   PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ;
   LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export
   LD_LIBRARY_PATH ;
   DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ;
   export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted
   --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca
   orte_ess_jobid "1901330432" -mca orte_ess_vpid "<template>" -mca
   orte_ess_num_procs "2" -mca orte_hnp_uri
   "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
   --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
   plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca
   rmaps_ppr_n_pernode "2"
   [atl1-01-mic0:191024] [[29012,0],0] plm:rsh:launch daemon 0 not a child of
   mine
   [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: adding node mic1 to launch
   list
   [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: activating launch event
   [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: recording launch of daemon
   [[29012,0],1]
   [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: executing: (/usr/bin/ssh)
   [/usr/bin/ssh mic1     PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ;
   export PATH ;
   LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export
   LD_LIBRARY_PATH ;
   DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ;
   export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted
   --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca
   orte_ess_jobid "1901330432" -mca orte_ess_vpid 1 -mca orte_ess_num_procs
   "2" -mca orte_hnp_uri
   "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
   --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
   plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca
   rmaps_ppr_n_pernode "2"]
   /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
   libraries: libimf.so: cannot open shared object file: No such file or
   directory
   [atl1-01-mic0:191024] [[29012,0],0] daemon 1 failed with status 127
   [atl1-01-mic0:191024] [[29012,0],0] plm:base:orted_cmd sending orted_exit
   commands
   --------------------------------------------------------------------------
   ORTE was unable to reliably start one or more daemons.
   This usually is caused by:

   * not finding the required libraries and/or binaries on
     one or more nodes. Please check your PATH and LD_LIBRARY_PATH
     settings, or configure OMPI with --enable-orterun-prefix-by-default

   * lack of authority to execute on one or more specified nodes.
     Please verify your allocation and authorities.

   * the inability to write startup files into /tmp
   (--tmpdir/orte_tmpdir_base).
     Please check with your sys admin to determine the correct location to
   use.

   *  compilation of the orted with dynamic libraries when static are
   required
     (e.g., on Cray). Please check your configure cmd line and consider using
     one of the contrib/platform definitions for your system type.

   * an inability to create a connection back to mpirun due to a
     lack of common network interfaces and/or no route found between
     them. Please check network connectivity (including firewalls
     and network routing requirements).
   --------------------------------------------------------------------------
   [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive stop comm

   On 04/13/2015 08:50 AM, Andy Riebs wrote:

     Hi Ralph,

     Here are the results with last night's "master" nightly,
     openmpi-dev-1487-g9c6d452.tar.bz2, and adding the memheap_base_verbose
     option (yes, it looks like the "ERROR_LOG" problem has gone away):

     $ cat /proc/sys/kernel/shmmax
     33554432
     $ cat /proc/sys/kernel/shmall
     2097152
     $ cat /proc/sys/kernel/shmmni
     4096
     $ export SHMEM_SYMMETRIC_HEAP=1M
     $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca plm_base_verbose 5
     --mca memheap_base_verbose 100 $PWD/mic.out
     [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component [rsh]
     [atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
     rsh path NULL
     [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component [rsh]
     set priority to 10
     [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component
     [isolated]
     [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component
     [isolated] set priority to 0
     [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component [slurm]
     [atl1-01-mic0:190439] mca:base:select:(  plm) Skipping component
     [slurm]. Query failed to return a module
     [atl1-01-mic0:190439] mca:base:select:(  plm) Selected component [rsh]
     [atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439
     nodename hash 4121194178
     [atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875
     [atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : rsh
     path NULL
     [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm
     [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job
     [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm
     [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map
     [atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged
     allocation
     [atl1-01-mic0:190439] [[31875,0],0] using dash_host
     [atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0
     [atl1-01-mic0:190439] [[31875,0],0] ignoring myself
     [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in
     allocation
     [atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1]
     [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job
     [31875,1]
     [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof for
     job [31875,1]
     [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1] registered
     [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] is not
     a dynamic spawn
     [atl1-01-mic0:190441] mca: base: components_register: registering
     memheap components
     [atl1-01-mic0:190441] mca: base: components_register: found loaded
     component buddy
     [atl1-01-mic0:190441] mca: base: components_register: component buddy
     has no register or open function
     [atl1-01-mic0:190442] mca: base: components_register: registering
     memheap components
     [atl1-01-mic0:190442] mca: base: components_register: found loaded
     component buddy
     [atl1-01-mic0:190442] mca: base: components_register: component buddy
     has no register or open function
     [atl1-01-mic0:190442] mca: base: components_register: found loaded
     component ptmalloc
     [atl1-01-mic0:190442] mca: base: components_register: component ptmalloc
     has no register or open function
     [atl1-01-mic0:190441] mca: base: components_register: found loaded
     component ptmalloc
     [atl1-01-mic0:190441] mca: base: components_register: component ptmalloc
     has no register or open function
     [atl1-01-mic0:190441] mca: base: components_open: opening memheap
     components
     [atl1-01-mic0:190441] mca: base: components_open: found loaded component
     buddy
     [atl1-01-mic0:190441] mca: base: components_open: component buddy open
     function successful
     [atl1-01-mic0:190441] mca: base: components_open: found loaded component
     ptmalloc
     [atl1-01-mic0:190441] mca: base: components_open: component ptmalloc
     open function successful
     [atl1-01-mic0:190442] mca: base: components_open: opening memheap
     components
     [atl1-01-mic0:190442] mca: base: components_open: found loaded component
     buddy
     [atl1-01-mic0:190442] mca: base: components_open: component buddy open
     function successful
     [atl1-01-mic0:190442] mca: base: components_open: found loaded component
     ptmalloc
     [atl1-01-mic0:190442] mca: base: components_open: component ptmalloc
     open function successful
     [atl1-01-mic0:190442] base/memheap_base_alloc.c:38 -
     mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1
     segments by method: 1
     [atl1-01-mic0:190441] base/memheap_base_alloc.c:38 -
     mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1
     segments by method: 1
     [atl1-01-mic0:190442] base/memheap_base_static.c:205 - _load_segments()
     add: 00600000-00601000 rw-p 00000000 00:11
     6029314                            /home/ariebs/bench/hello/mic.out
     [atl1-01-mic0:190441] base/memheap_base_static.c:205 - _load_segments()
     add: 00600000-00601000 rw-p 00000000 00:11
     6029314                            /home/ariebs/bench/hello/mic.out
     [atl1-01-mic0:190442] base/memheap_base_static.c:75 -
     mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2
     segments
     [atl1-01-mic0:190442] base/memheap_base_register.c:39 -
     mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000
     270532608 bytes type=0x1 id=0xFFFFFFFF
     [atl1-01-mic0:190441] base/memheap_base_static.c:75 -
     mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2
     segments
     [atl1-01-mic0:190441] base/memheap_base_register.c:39 -
     mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000
     270532608 bytes type=0x1 id=0xFFFFFFFF
     [atl1-01-mic0:190442] Error base/memheap_base_register.c:130 -
     _reg_segment() Failed to register segment
     [atl1-01-mic0:190441] Error base/memheap_base_register.c:130 -
     _reg_segment() Failed to register segment
     [atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM
     failed to initialize - aborting
     [atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM
     failed to initialize - aborting
     --------------------------------------------------------------------------
     It looks like SHMEM_INIT failed for some reason; your parallel process
     is
     likely to abort.  There are many reasons that a parallel process can
     fail during SHMEM_INIT; some of which are due to configuration or
     environment
     problems.  This failure appears to be an internal failure; here's some
     additional information (which may only be relevant to an Open SHMEM
     developer):

       mca_memheap_base_select() failed
       --> Returned "Error" (-1) instead of "Success" (0)
     --------------------------------------------------------------------------
     --------------------------------------------------------------------------
     SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0) with
     errorcode -1.
     --------------------------------------------------------------------------
     --------------------------------------------------------------------------
     A SHMEM process is aborting at a time when it cannot guarantee that all
     of its peer processes in the job will be killed properly.  You should
     double check that everything has shut down cleanly.

     Local host: atl1-01-mic0
     PID:        190441
     --------------------------------------------------------------------------
     -------------------------------------------------------
     Primary job  terminated normally, but 1 process returned
     a non-zero exit code.. Per user-direction, the job has been aborted.
     -------------------------------------------------------
     [atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending
     orted_exit commands
     --------------------------------------------------------------------------
     shmemrun detected that one or more processes exited with non-zero
     status, thus causing
     the job to be terminated. The first process to do so was:

       Process name: [[31875,1],0]
       Exit code:    255
     --------------------------------------------------------------------------
     [atl1-01-mic0:190439] 1 more process has sent help message
     help-shmem-runtime.txt / shmem_init:startup:internal-failure
     [atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" to 0
     to see all help / error messages
     [atl1-01-mic0:190439] 1 more process has sent help message
     help-shmem-api.txt / shmem-abort
     [atl1-01-mic0:190439] 1 more process has sent help message
     help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
     [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm

     On 04/12/2015 03:09 PM, Ralph Castain wrote:

       Sorry about that - I hadn't brought it over to the 1.8 branch yet.
       I've done so now, which means the ERROR_LOG shouldn't show up any
       more. It won't fix the memheap problem, though.
       You might try adding "--mca memheap_base_verbose 100" to your cmd line
       so we can see why none of the memheap components are being selected.
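
        A sketch of what that invocation might look like, combining the
        shmemrun command used earlier in this thread with the verbosity
        flag Ralph suggests (the option values come from the messages
        above; the exact layout is only illustrative):

        ```shell
        # Re-run the failing case with memheap component-selection tracing
        # enabled, so the selection failure is visible in the output.
        # Host, process count, and binary path follow the earlier posts.
        shmemrun -H localhost -N 2 --mca sshmem mmap \
                 --mca memheap_base_verbose 100 \
                 --mca plm_base_verbose 5 $PWD/mic.out
        ```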

         On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com> wrote:
         Hi Ralph,

         Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:

         $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
         plm_base_verbose 5 $PWD/mic.out
         [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
         [rsh]
         [atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent
         ssh : rsh path NULL
         [atl1-01-mic0:190189] mca:base:select:(  plm) Query of component
         [rsh] set priority to 10
         [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
         [isolated]
         [atl1-01-mic0:190189] mca:base:select:(  plm) Query of component
         [isolated] set priority to 0
         [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
         [slurm]
         [atl1-01-mic0:190189] mca:base:select:(  plm) Skipping component
         [slurm]. Query failed to return a module
         [atl1-01-mic0:190189] mca:base:select:(  plm) Selected component
         [rsh]
         [atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189
         nodename hash 4121194178
         [atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
         [atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh
         path NULL
         [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
         [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
         [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
         [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
         [atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged
         allocation
         [atl1-01-mic0:190189] [[32137,0],0] using dash_host
         [atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
         [atl1-01-mic0:190189] [[32137,0],0] ignoring myself
         [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in
         allocation
         [atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
         [atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in
         file base/plm_base_launch_support.c at line 440
         [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job
         [32137,1]
         [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof
         for job [32137,1]
         [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1]
         registered
         [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is
         not a dynamic spawn
         --------------------------------------------------------------------------
         It looks like SHMEM_INIT failed for some reason; your parallel
         process is
         likely to abort.  There are many reasons that a parallel process can
         fail during SHMEM_INIT; some of which are due to configuration or
         environment
         problems.  This failure appears to be an internal failure; here's
         some
         additional information (which may only be relevant to an Open SHMEM
         developer):

           mca_memheap_base_select() failed
           --> Returned "Error" (-1) instead of "Success" (0)
         --------------------------------------------------------------------------
         [atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM
         failed to initialize - aborting
         [atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM
         failed to initialize - aborting
         --------------------------------------------------------------------------
         SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0)
         with errorcode -1.
         --------------------------------------------------------------------------
         --------------------------------------------------------------------------
         A SHMEM process is aborting at a time when it cannot guarantee that
         all
         of its peer processes in the job will be killed properly.  You
         should
         double check that everything has shut down cleanly.

         Local host: atl1-01-mic0
         PID:        190192
         --------------------------------------------------------------------------
         -------------------------------------------------------
         Primary job  terminated normally, but 1 process returned
         a non-zero exit code.. Per user-direction, the job has been aborted.
         -------------------------------------------------------
         [atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending
         orted_exit commands
         --------------------------------------------------------------------------
         shmemrun detected that one or more processes exited with non-zero
         status, thus causing
         the job to be terminated. The first process to do so was:

           Process name: [[32137,1],0]
           Exit code:    255
         --------------------------------------------------------------------------
         [atl1-01-mic0:190189] 1 more process has sent help message
         help-shmem-runtime.txt / shmem_init:startup:internal-failure
         [atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate"
         to 0 to see all help / error messages
         [atl1-01-mic0:190189] 1 more process has sent help message
         help-shmem-api.txt / shmem-abort
         [atl1-01-mic0:190189] 1 more process has sent help message
         help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all
         killed
         [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm

         On 04/11/2015 07:41 PM, Ralph Castain wrote:

           Got it - thanks. I fixed that ERROR_LOG issue (I think- please
           verify). I suspect the memheap issue relates to something else,
           but I probably need to let the OSHMEM folks comment on it

             On Apr 11, 2015, at 9:52 AM, Andy Riebs <andy.ri...@hp.com>
             wrote:
             Everything is built on the Xeon side, with the icc "-mmic"
             switch. I then ssh into one of the PHIs, and run shmemrun from
             there.

             On 04/11/2015 12:00 PM, Ralph Castain wrote:

               Let me try to understand the setup a little better. Are you
               running shmemrun on the PHI itself? Or is it running on the
               host processor, and you are trying to spawn a process onto the
               Phi?

                 On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com>
                 wrote:
                 Hi Ralph,

                 Yes, this is attempting to get OSHMEM to run on the Phi.

                 I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured
                 it with

                  $ ./configure --prefix=/home/ariebs/mic/mpi-nightly \
                      CC="icc -mmic" CXX="icpc -mmic" \
                      --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
                      AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
                      LD=x86_64-k1om-linux-ld \
                      --enable-mpirun-prefix-by-default \
                      --disable-io-romio --disable-mpi-fortran \
                      --enable-debug \
                      --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud

                 (Note that I had to add "oob-ud" to the
                 "--enable-mca-no-build" option, as the build complained that
                 mca oob/ud needed mca common-verbs.)

                 With that configuration, here is what I am seeing now...

                 $ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
                 $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
                 plm_base_verbose 5 $PWD/mic.out
                 [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
                 component [rsh]
                 [atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on
                 agent ssh : rsh path NULL
                 [atl1-01-mic0:189895] mca:base:select:(  plm) Query of
                 component [rsh] set priority to 10
                 [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
                 component [isolated]
                 [atl1-01-mic0:189895] mca:base:select:(  plm) Query of
                 component [isolated] set priority to 0
                 [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
                 component [slurm]
                 [atl1-01-mic0:189895] mca:base:select:(  plm) Skipping
                 component [slurm]. Query failed to return a module
                 [atl1-01-mic0:189895] mca:base:select:(  plm) Selected
                 component [rsh]
                 [atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias
                 189895 nodename hash 4121194178
                 [atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam
                 32419
                 [atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent
                 ssh : rsh path NULL
                 [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start
                 comm
                 [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
                 [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
                 [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
                 creating map
                 [atl1-01-mic0:189895] [[32419,0],0] setup:vm: working
                 unmanaged allocation
                 [atl1-01-mic0:189895] [[32419,0],0] using dash_host
                 [atl1-01-mic0:189895] [[32419,0],0] checking node
                 atl1-01-mic0
                 [atl1-01-mic0:189895] [[32419,0],0] ignoring myself
                 [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only
                 HNP in allocation
                 [atl1-01-mic0:189895] [[32419,0],0] complete_setup on job
                 [32419,1]
                 [atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not
                 found in file base/plm_base_launch_support.c at line 440
                 [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for
                 job [32419,1]
                 [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring
                 up iof for job [32419,1]
                 [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch
                 [32419,1] registered
                 [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job
                 [32419,1] is not a dynamic spawn
                 [atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init()
                 SHMEM failed to initialize - aborting
                 [atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init()
                 SHMEM failed to initialize - aborting
                 --------------------------------------------------------------------------
                 It looks like SHMEM_INIT failed for some reason; your
                 parallel process is
                 likely to abort.  There are many reasons that a parallel
                 process can
                 fail during SHMEM_INIT; some of which are due to
                 configuration or environment
                 problems.  This failure appears to be an internal failure;
                 here's some
                 additional information (which may only be relevant to an
                 Open SHMEM
                 developer):

                   mca_memheap_base_select() failed
                   --> Returned "Error" (-1) instead of "Success" (0)
                 --------------------------------------------------------------------------
                 --------------------------------------------------------------------------
                 SHMEM_ABORT was invoked on rank 1 (pid 189899,
                 host=atl1-01-mic0) with errorcode -1.
                 --------------------------------------------------------------------------
                 --------------------------------------------------------------------------
                 A SHMEM process is aborting at a time when it cannot
                 guarantee that all
                 of its peer processes in the job will be killed properly. 
                 You should
                 double check that everything has shut down cleanly.

                 Local host: atl1-01-mic0
                 PID:        189899
                 --------------------------------------------------------------------------
                 -------------------------------------------------------
                 Primary job  terminated normally, but 1 process returned
                 a non-zero exit code.. Per user-direction, the job has been
                 aborted.
                 -------------------------------------------------------
                 [atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd
                 sending orted_exit commands
                 --------------------------------------------------------------------------
                 shmemrun detected that one or more processes exited with
                 non-zero status, thus causing
                 the job to be terminated. The first process to do so was:

                   Process name: [[32419,1],1]
                   Exit code:    255
                 --------------------------------------------------------------------------
                 [atl1-01-mic0:189895] 1 more process has sent help message
                 help-shmem-runtime.txt / shmem_init:startup:internal-failure
                 [atl1-01-mic0:189895] Set MCA parameter
                 "orte_base_help_aggregate" to 0 to see all help / error
                 messages
                 [atl1-01-mic0:189895] 1 more process has sent help message
                 help-shmem-api.txt / shmem-abort
                 [atl1-01-mic0:189895] 1 more process has sent help message
                 help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee
                 all killed
                 [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop
                 comm

                 On 04/10/2015 06:37 PM, Ralph Castain wrote:

                   Andy - could you please try the current 1.8.5 nightly
                   tarball and see if it helps? The error log indicates that
                    it is failing to get the topology from some daemon, I'm
                    assuming the one on the Phi?
                    You might also add --enable-debug to that configure line
                   and then put -mca plm_base_verbose on the shmemrun cmd to
                   get more help

                     On Apr 10, 2015, at 11:55 AM, Andy Riebs
                     <andy.ri...@hp.com> wrote:
                     Summary: MPI jobs work fine, SHMEM jobs work just often
                     enough to be tantalizing, on an Intel Xeon Phi/MIC
                     system.

                     Longer version

                     Thanks to the excellent write-up last June
                     (<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>),
                     I have been able to build a version of Open MPI for the
                     Xeon Phi coprocessor that runs MPI jobs on the Phi
                     coprocessor with no problem, but not SHMEM jobs.  Just
                     at the point where I was about to document the problems
                     I was having with SHMEM, my trivial SHMEM job worked.
                     And then failed when I tried to run it again,
                      immediately afterwards. I have a feeling I may be in
                      uncharted territory here.

                     Environment
                       * RHEL 6.5
                       * Intel Composer XE 2015
                       * Xeon Phi/MIC
                     ----------------

                     Configuration

                     $ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
                     $ source
                     /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh
                     intel64
                      $ ./configure --prefix=/home/ariebs/mic/mpi \
                          CC="icc -mmic" CXX="icpc -mmic" \
                          --build=x86_64-unknown-linux-gnu \
                          --host=x86_64-k1om-linux \
                          AR=x86_64-k1om-linux-ar \
                          RANLIB=x86_64-k1om-linux-ranlib \
                          LD=x86_64-k1om-linux-ld \
                          --enable-mpirun-prefix-by-default \
                          --disable-io-romio \
                          --disable-vt --disable-mpi-fortran \
                          --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
                     $ make
                     $ make install

                     ----------------

                     Test program

                     #include <stdio.h>
                     #include <stdlib.h>
                     #include <shmem.h>
                     int main(int argc, char **argv)
                     {
                             int me, num_pe;
                             shmem_init();
                             num_pe = num_pes();
                             me = my_pe();
                              printf("Hello World from process %d of %d\n", me, num_pe);
                             exit(0);
                     }

                     ----------------

                     Building the program

                      export PATH=/home/ariebs/mic/mpi/bin:$PATH
                      export PATH=/usr/linux-k1om-4.7/bin/:$PATH
                      source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
                      export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH

                      icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
                          -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib \
                          -Wl,--enable-new-dtags \
                          -L/home/ariebs/mic/mpi/lib \
                          -loshmem -lmpi -lopen-rte -lopen-pal \
                          -lm -ldl -lutil \
                          -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
                          -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
                          -o mic.out shmem_hello.c

                     ----------------

                     Running the program

                     (Note that the program had been consistently failing.
                     Then, when I logged back into the system to capture the
                     results, it worked once,  and then immediately failed
                     when I tried again, as shown below. Logging in and out
                     isn't sufficient to correct the problem. Overall, I
                     think I had 3 successful runs in 30-40 attempts.)

                     $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
                     [atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not
                     found in file base/plm_base_launch_support.c at line 426
                     Hello World from process 0 of 2
                     Hello World from process 1 of 2
                     $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
                     [atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not
                     found in file base/plm_base_launch_support.c at line 426
                     [atl1-01-mic0:189383] Error: pshmem_init.c:61 -
                     shmem_init() SHMEM failed to initialize - aborting
                     --------------------------------------------------------------------------
                     It looks like SHMEM_INIT failed for some reason; your
                     parallel process is
                     likely to abort.  There are many reasons that a parallel
                     process can
                     fail during SHMEM_INIT; some of which are due to
                     configuration or environment
                     problems.  This failure appears to be an internal
                     failure; here's some
                     additional information (which may only be relevant to an
                     Open SHMEM
                     developer):

                       mca_memheap_base_select() failed
                       --> Returned "Error" (-1) instead of "Success" (0)
                     --------------------------------------------------------------------------
                     --------------------------------------------------------------------------
                     SHMEM_ABORT was invoked on rank 0 (pid 189383,
                     host=atl1-01-mic0) with errorcode -1.
                     --------------------------------------------------------------------------
                     --------------------------------------------------------------------------
                     A SHMEM process is aborting at a time when it cannot
                     guarantee that all
                     of its peer processes in the job will be killed
                     properly.  You should
                     double check that everything has shut down cleanly.

                     Local host: atl1-01-mic0
                     PID:        189383
                     --------------------------------------------------------------------------
                     -------------------------------------------------------
                     Primary job  terminated normally, but 1 process returned
                     a non-zero exit code.. Per user-direction, the job has
                     been aborted.
                     -------------------------------------------------------
                     --------------------------------------------------------------------------
                     shmemrun detected that one or more processes exited with
                     non-zero status, thus causing
                     the job to be terminated. The first process to do so
                     was:

                       Process name: [[30881,1],0]
                       Exit code:    255
                     --------------------------------------------------------------------------

                     Any thoughts about where to go from here?

                     Andy

 --
 Andy Riebs
 Hewlett-Packard Company
 High Performance Computing
 +1 404 648 9024
 My opinions are not necessarily those of HP

                     _______________________________________________
                     users mailing list
                     us...@open-mpi.org
                     Subscription:
                     http://www.open-mpi.org/mailman/listinfo.cgi/users
                     Link to this post:
                     http://www.open-mpi.org/community/lists/users/2015/04/26670.php

 _______________________________________________
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26676.php

                 _______________________________________________
                 users mailing list
                 us...@open-mpi.org
                 Subscription:
                 http://www.open-mpi.org/mailman/listinfo.cgi/users
                 Link to this post:
                 http://www.open-mpi.org/community/lists/users/2015/04/26678.php

 _______________________________________________
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26679.php

             _______________________________________________
             users mailing list
             us...@open-mpi.org
             Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
             Link to this post:
             http://www.open-mpi.org/community/lists/users/2015/04/26680.php

 _______________________________________________
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26682.php

         _______________________________________________
         users mailing list
         us...@open-mpi.org
         Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
         Link to this post:
         http://www.open-mpi.org/community/lists/users/2015/04/26683.php

 _______________________________________________
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26684.php
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26697.php


_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26699.php

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26700.php



_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26706.php

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26716.php



_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26718.php

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26731.php



_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26732.php
