Ralph and Nathan,

The problem may be something trivial, as I don't typically use "shmemrun" to start jobs. With the following, I *think* I've demonstrated that the problem library is where it belongs on the remote system:

$ ldd mic.out
        linux-vdso.so.1 => (0x00007fffb83ff000)
        liboshmem.so.0 => /home/ariebs/mic/mpi-nightly/lib/liboshmem.so.0 (0x00002b059cfbb000)
        libmpi.so.0 => /home/ariebs/mic/mpi-nightly/lib/libmpi.so.0 (0x00002b059d35a000)
        libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x00002b059d7e3000)
        libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x00002b059db53000)
        libm.so.6 => /lib64/libm.so.6 (0x00002b059df3d000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002b059e16c000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002b059e371000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b059e574000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b059e786000)
        libc.so.6 => /lib64/libc.so.6 (0x00002b059e9a4000)
        librt.so.1 => /lib64/librt.so.1 (0x00002b059ecfc000)
        libimf.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so (0x00002b059ef04000)
        libsvml.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so (0x00002b059f356000)
        libirng.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so (0x00002b059fbef000)
        libintlc.so.5 => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 (0x00002b059fe02000)
        /lib64/ld-linux-k1om.so.2 (0x00002b059cd9a000)

$ echo $LD_LIBRARY_PATH
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib

$ ssh mic1 file /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 (SYSV), dynamically linked, not stripped

$ shmemrun -H mic1 -N 2 --mca btl scif,self $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
...
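A quick way to pin down which side is missing the library is to ask the same kind of non-interactive ssh shell that orted runs under, rather than an interactive login; a minimal sketch, assuming the paths above and that ldd is available on the card:

$ ssh mic1 'echo $LD_LIBRARY_PATH'
# compilervars.sh is typically sourced only by interactive/login shells,
# so the Intel .../compiler/lib/mic directory is probably absent here
$ ssh mic1 'ldd /home/ariebs/mic/mpi-nightly/bin/orted | grep "not found"'
#        libimf.so => not found

If that reproduces the failure, exporting the directory from a file the non-interactive shell does read (e.g. ~/.bashrc on the card) should let orted start:

export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH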
On 04/13/2015 04:25 PM, Nathan Hjelm wrote:
For talking between PHIs on the same system I recommend using the scif BTL, NOT tcp. That said, it looks like the LD_LIBRARY_PATH is wrong on the remote system: it can't find the Intel compiler libraries.

-Nathan Hjelm
HPC-5, LANL

On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote:

Progress! I can run my trivial program on the local PHI, but not the other PHI on the system. Here are the interesting parts.

A pretty good recipe with last night's nightly master:

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
    --enable-orterun-prefix-by-default \
    --enable-debug
$ make && make install

$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml yoda --mca btl openib,sm,self $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$

However, I can't seem to cross the fabric. I can ssh freely back and forth between mic0 and mic1. Yet, running the next two tests from mic0, it certainly seems like the second one should work, too:

$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
...
$

(Note that I get the same results with "--mca btl openib,sm,self"....)
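If touching the shell startup files on the cards is undesirable, another option is to redirect the launch through a wrapper script that sets the environment before exec'ing the real daemon. A hedged, untested sketch: the wrapper path here is made up, and it relies on the standard orte_launch_agent MCA parameter pointing at a script that exists on each card:

#!/bin/sh
# /home/ariebs/mic/orted-wrap.sh (hypothetical path, must be present on each card)
# Prepend the Intel MIC runtime so orted can resolve libimf.so, then exec the real orted.
export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
exec /home/ariebs/mic/mpi-nightly/bin/orted "$@"

$ chmod +x /home/ariebs/mic/orted-wrap.sh
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp \
    --mca orte_launch_agent /home/ariebs/mic/orted-wrap.sh $PWD/mic.out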
$ ssh mic1 file /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 (SYSV), dynamically linked, not stripped

$ shmemrun -x LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).

Following here is:
* IB information
* Running the failing case with lots of debugging information.

(As you might imagine, I've tried 17 ways from Sunday to try to ensure that libimf.so is found.)

$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx4_0              24be05ffffa57160
    scif0               4c79bafffe4402b6

$ ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.11.1250
        node_guid:                      24be:05ff:ffa5:7160
        sys_image_guid:                 24be:05ff:ffa5:7163
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        phys_port_cnt:                  2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         8
                        port_lid:       86
                        port_lmc:       0x00
                        link_layer:     InfiniBand
                port:   2
                        state:          PORT_DOWN (1)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     InfiniBand
hca_id: scif0
        transport:                      SCIF (2)
        fw_ver:                         0.0.1
        node_guid:                      4c79:baff:fe44:02b6
        sys_image_guid:                 4c79:baff:fe44:02b6
        vendor_id:                      0x8086
        vendor_part_id:                 0
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         1
                        port_lid:       1001
                        port_lmc:       0x00
                        link_layer:     SCIF
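One aside on the listing above: scif0 is itself exposed as a verbs device (transport SCIF), so the openib BTL can bind to it as well as to mlx4_0. If the openib path is ever exercised between the two cards, restricting it to one interface keeps the test unambiguous; a sketch, untested here, using the standard btl_openib_if_include parameter:

$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0,mic1 -N 2 --mca spml yoda \
    --mca btl openib,sm,self --mca btl_openib_if_include scif0 $PWD/mic.out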
$ shmemrun -x LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca plm_base_verbose 5 --mca memheap_base_verbose 100 $PWD/mic.out
[atl1-01-mic0:191024] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:191024] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:191024] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:191024] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:191024] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:191024] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:191024] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:191024] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:191024] plm:base:set_hnp_name: initial bias 191024 nodename hash 4121194178
[atl1-01-mic0:191024] plm:base:set_hnp_name: final jobfam 29012
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:191024] [[29012,0],0] plm:base:receive start comm
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_job
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:191024] [[29012,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:191024] [[29012,0],0] using dash_host
[atl1-01-mic0:191024] [[29012,0],0] checking node mic1
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm add new daemon [[29012,0],1]
[atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm assigning new daemon [[29012,0],1] to node mic1
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: launching vm
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: local shell: 0 (bash)
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: assuming same remote shell as local shell
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: remote shell: 0 (bash)
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: final template argv:
        /usr/bin/ssh <template> PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "1901330432" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "2" -mca orte_hnp_uri "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh:launch daemon 0 not a child of mine
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: adding node mic1 to launch list
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: activating launch event
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: recording launch of daemon [[29012,0],1]
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh mic1 PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "1901330432" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" -mca orte_hnp_uri "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"]
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
[atl1-01-mic0:191024] [[29012,0],0] daemon 1 failed with status 127
[atl1-01-mic0:191024] [[29012,0],0] plm:base:orted_cmd sending orted_exit commands
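The final template argv above is the telltale: the launcher prepends only /home/ariebs/mic/mpi-nightly/lib, and the trailing $LD_LIBRARY_PATH expands on mic1 in a non-interactive shell where the Intel directories are missing. That also explains why the LD_PRELOAD attempt changes nothing: -x exports variables to the application processes, and orted dies before any application is launched. The failure should be reproducible by hand with essentially the same line; a sketch, assuming the paths above:

$ ssh mic1 'LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH \
    /home/ariebs/mic/mpi-nightly/bin/orted'
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory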
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
--------------------------------------------------------------------------
[atl1-01-mic0:191024] [[29012,0],0] plm:base:receive stop comm

On 04/13/2015 08:50 AM, Andy Riebs wrote:

Hi Ralph,

Here are the results with last night's "master" nightly, openmpi-dev-1487-g9c6d452.tar.bz2, and adding the memheap_base_verbose option (yes, it looks like the "ERROR_LOG" problem has gone away):

$ cat /proc/sys/kernel/shmmax
33554432
$ cat /proc/sys/kernel/shmall
2097152
$ cat /proc/sys/kernel/shmmni
4096
$ export SHMEM_SYMMETRIC_HEAP=1M
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 --mca memheap_base_verbose 100 $PWD/mic.out
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:190439] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:190439] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:190439] mca:base:select:( plm) Skipping component [slurm].
Query failed to return a module
[atl1-01-mic0:190439] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439 nodename hash 4121194178
[atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875
[atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:190439] [[31875,0],0] using dash_host
[atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190439] [[31875,0],0] ignoring myself
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof for job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1] registered
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] is not a dynamic spawn
[atl1-01-mic0:190441] mca: base: components_register: registering memheap components
[atl1-01-mic0:190441] mca: base: components_register: found loaded component buddy
[atl1-01-mic0:190441] mca: base: components_register: component buddy has no register or open function
[atl1-01-mic0:190442] mca: base: components_register: registering memheap components
[atl1-01-mic0:190442] mca: base: components_register: found loaded component buddy
[atl1-01-mic0:190442] mca: base: components_register: component buddy has no register or open function
[atl1-01-mic0:190442] mca: base: components_register: found loaded component ptmalloc
[atl1-01-mic0:190442] mca: base: components_register: component ptmalloc has no register or open function
[atl1-01-mic0:190441] mca: base: components_register: found loaded component ptmalloc
[atl1-01-mic0:190441] mca: base: components_register: component ptmalloc has no register or open function
[atl1-01-mic0:190441] mca: base: components_open: opening memheap components
[atl1-01-mic0:190441] mca: base: components_open: found loaded component buddy
[atl1-01-mic0:190441] mca: base: components_open: component buddy open function successful
[atl1-01-mic0:190441] mca: base: components_open: found loaded component ptmalloc
[atl1-01-mic0:190441] mca: base: components_open: component ptmalloc open function successful
[atl1-01-mic0:190442] mca: base: components_open: opening memheap components
[atl1-01-mic0:190442] mca: base: components_open: found loaded component buddy
[atl1-01-mic0:190442] mca: base: components_open: component buddy open function successful
[atl1-01-mic0:190442] mca: base: components_open: found loaded component ptmalloc
[atl1-01-mic0:190442] mca: base: components_open: component ptmalloc open function successful
[atl1-01-mic0:190442] base/memheap_base_alloc.c:38 - mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 segments by method: 1
[atl1-01-mic0:190441] base/memheap_base_alloc.c:38 - mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 segments by method: 1
[atl1-01-mic0:190442] base/memheap_base_static.c:205 - _load_segments() add: 00600000-00601000 rw-p 00000000 00:11 6029314 /home/ariebs/bench/hello/mic.out
[atl1-01-mic0:190441] base/memheap_base_static.c:205 - _load_segments() add: 00600000-00601000 rw-p 00000000 00:11 6029314 /home/ariebs/bench/hello/mic.out
[atl1-01-mic0:190442] base/memheap_base_static.c:75 - mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 segments
[atl1-01-mic0:190442] base/memheap_base_register.c:39 - mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 270532608 bytes type=0x1 id=0xFFFFFFFF
[atl1-01-mic0:190441] base/memheap_base_static.c:75 - mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 segments
[atl1-01-mic0:190441] base/memheap_base_register.c:39 - mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 270532608 bytes type=0x1 id=0xFFFFFFFF
[atl1-01-mic0:190442] Error base/memheap_base_register.c:130 - _reg_segment() Failed to register segment
[atl1-01-mic0:190441] Error base/memheap_base_register.c:130 - _reg_segment() Failed to register segment
[atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during SHMEM_INIT; some of which are due to configuration or environment problems.

This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open SHMEM developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly. You should double check that everything has shut down cleanly.

  Local host: atl1-01-mic0
  PID:        190441
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing the job to be terminated.
The first process to do so was:

  Process name: [[31875,1],0]
  Exit code:    255
--------------------------------------------------------------------------
[atl1-01-mic0:190439] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:190439] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:190439] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm
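Two details in that run are worth pausing on. First, it exports SHMEM_SYMMETRIC_HEAP rather than SHMEM_SYMMETRIC_HEAP_SIZE (the name used with -x in the later runs), so the 1M request may never take effect and the default heap size may be used instead. Second, the memheap log shows 270532608 bytes being allocated per segment against a kernel.shmmax of 33554432, so any System V shared-memory path would be refused outright. A quick arithmetic check, plus a possible root-only workaround, assuming the SysV limit is what bites here:

$ echo $(( 270532608 / 1048576 ))          # 258 MB requested for the heap segment
$ echo $(( 33554432 / 1048576 ))           # 32 MB shmmax ceiling
$ sudo sysctl -w kernel.shmmax=536870912   # raise to 512 MB before retrying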
On 04/12/2015 03:09 PM, Ralph Castain wrote:

Sorry about that - I hadn't brought it over to the 1.8 branch yet. I've done so now, which means the ERROR_LOG shouldn't show up any more. It won't fix the memheap problem, though. You might try adding "--mca memheap_base_verbose 100" to your cmd line so we can see why none of the memheap components are being selected.

On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com> wrote:

Hi Ralph,

Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:

$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:190189] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:190189] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:190189] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:190189] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189 nodename hash 4121194178
[atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
[atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:190189] [[32137,0],0] using dash_host
[atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190189] [[32137,0],0] ignoring myself
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof for job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1] registered
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is not a dynamic spawn
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during SHMEM_INIT; some of which are due to configuration or environment problems.

This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open SHMEM developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly. You should double check that everything has shut down cleanly.

  Local host: atl1-01-mic0
  PID:        190192
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[32137,1],0]
  Exit code:    255
--------------------------------------------------------------------------
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm

On 04/11/2015 07:41 PM, Ralph Castain wrote:

Got it - thanks. I fixed that ERROR_LOG issue (I think - please verify). I suspect the memheap issue relates to something else, but I probably need to let the OSHMEM folks comment on it.

On Apr 11, 2015, at 9:52 AM, Andy Riebs <andy.ri...@hp.com> wrote:

Everything is built on the Xeon side, with the icc "-mmic" switch. I then ssh into one of the PHIs and run shmemrun from there.

On 04/11/2015 12:00 PM, Ralph Castain wrote:

Let me try to understand the setup a little better: are you running shmemrun on the PHI itself, or is it running on the host processor while you try to spawn a process onto the Phi?

On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:

Hi Ralph,

Yes, this is attempting to get OSHMEM to run on the Phi.
I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
    --enable-debug --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud

(Note that I had to add "oob-ud" to the "--enable-mca-no-build" option, as the build complained that mca oob/ud needed mca common-verbs.)

With that configuration, here is what I am seeing now...

$ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:189895] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:189895] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:189895] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 nodename hash 4121194178
[atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419
[atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start comm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:189895] [[32419,0],0] using dash_host
[atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0
[atl1-01-mic0:189895] [[32419,0],0] ignoring myself
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1] registered
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is not a dynamic spawn
[atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during SHMEM_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open SHMEM developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 1 (pid 189899, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly. You should double check that everything has shut down cleanly.

  Local host: atl1-01-mic0
  PID:        189899
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[32419,1],1]
  Exit code:    255
--------------------------------------------------------------------------
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:189895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop comm
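Given that every variation ends in mca_memheap_base_select() failing, it may also be worth confirming which sshmem and memheap components actually got built and installed; a sketch, assuming the nightly installed the oshmem_info tool next to shmemrun:

$ /home/ariebs/mic/mpi-nightly/bin/oshmem_info --param sshmem all
$ /home/ariebs/mic/mpi-nightly/bin/oshmem_info --param memheap all
$ ls /home/ariebs/mic/mpi-nightly/lib/openmpi | grep -E 'sshmem|memheap'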
On 04/10/2015 06:37 PM, Ralph Castain wrote:

Andy - could you please try the current 1.8.5 nightly tarball and see if it helps? The error log indicates that it is failing to get the topology from some daemon, I'm assuming the one on the Phi? You might also add --enable-debug to that configure line and then put -mca plm_base_verbose on the shmemrun cmd to get more help.

On Apr 10, 2015, at 11:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:

Summary: MPI jobs work fine, SHMEM jobs work just often enough to be tantalizing, on an Intel Xeon Phi/MIC system.

Longer version: Thanks to the excellent write-up last June (<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>), I have been able to build a version of Open MPI for the Xeon Phi coprocessor that runs MPI jobs on the Phi coprocessor with no problem, but not SHMEM jobs. Just at the point where I was about to document the problems I was having with SHMEM, my trivial SHMEM job worked. And then it failed when I tried to run it again, immediately afterwards. I have a feeling I may be in uncharted territory here.

Environment
* RHEL 6.5
* Intel Composer XE 2015
* Xeon Phi/MIC

----------------
Configuration

$ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
$ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
$ ./configure --prefix=/home/ariebs/mic/mpi \
    CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
    LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio \
    --disable-vt --disable-mpi-fortran \
    --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
$ make
$ make install

----------------
Test program

#include <stdio.h>
#include <stdlib.h>
#include <shmem.h>

int main(int argc, char **argv)
{
    int me, num_pe;

    shmem_init();
    num_pe = num_pes();
    me = my_pe();
    /* me and num_pe are ints, so print with %d (the original %ld was mismatched) */
    printf("Hello World from process %d of %d\n", me, num_pe);
    exit(0);
}

----------------
Building the program

export PATH=/home/ariebs/mic/mpi/bin:$PATH
export PATH=/usr/linux-k1om-4.7/bin/:$PATH
source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH

icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
    -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
    -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
    -lm -ldl -lutil \
    -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
    -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
    -o mic.out shmem_hello.c

----------------
Running the program

(Note that the program had been consistently failing. Then, when I logged back into the system to capture the results, it worked once, and then immediately failed when I tried again, as shown below. Logging in and out isn't sufficient to correct the problem. Overall, I think I had 3 successful runs in 30-40 attempts.)

$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
[atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during SHMEM_INIT; some of which are due to configuration or environment problems.

This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open SHMEM developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly. You should double check that everything has shut down cleanly.
  Local host: atl1-01-mic0
  PID:        189383
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[30881,1],0]
  Exit code:    255
--------------------------------------------------------------------------

Any thoughts about where to go from here?

Andy

--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP