For talking between Phis on the same system, I recommend using the scif BTL, not tcp.
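The BTL is selected by name just like sm/tcp. Assuming the scif component actually got built (it needs the SCIF headers visible at configure time; ompi_info will tell you), a run between the two cards would look something like:

   $ ompi_info | grep -i scif
   $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0,mic1 -N 2 \
         --mca spml yoda --mca btl scif,sm,self $PWD/mic.out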
That said, it looks like the LD_LIBRARY_PATH is wrong on the remote system: orted can't find the Intel compiler libraries. -Nathan Hjelm, HPC-5, LANL On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote: > Progress! I can run my trivial program on the local PHI, but not the > other PHI, on the system. Here are the interesting parts: > > A pretty good recipe with last night's nightly master: > > $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" > CXX="icpc -mmic" \ > --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \ > AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib > LD=x86_64-k1om-linux-ld \ > --enable-mpirun-prefix-by-default --disable-io-romio > --disable-mpi-fortran \ > --enable-orterun-prefix-by-default \ > --enable-debug > $ make && make install > $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml > yoda --mca btl sm,self,tcp $PWD/mic.out > Hello World from process 0 of 2 > Hello World from process 1 of 2 > $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml > yoda --mca btl openib,sm,self $PWD/mic.out > Hello World from process 0 of 2 > Hello World from process 1 of 2 > $ > > However, I can't seem to cross the fabric. I can ssh freely back and forth > between mic0 and mic1. However, running the next 2 tests from mic0, it > certainly seems like the second one should work, too: > > $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda > --mca btl sm,self,tcp $PWD/mic.out > Hello World from process 0 of 2 > Hello World from process 1 of 2 > $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda > --mca btl sm,self,tcp $PWD/mic.out > /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared > libraries: libimf.so: cannot open shared object file: No such file or > directory > -------------------------------------------------------------------------- > ORTE was unable to reliably start one or more daemons. > This usually is caused by: > > * not finding the required libraries and/or binaries on > one or more nodes. Please check your PATH and LD_LIBRARY_PATH > settings, or configure OMPI with --enable-orterun-prefix-by-default > > * lack of authority to execute on one or more specified nodes. > Please verify your allocation and authorities. > > * the inability to write startup files into /tmp > (--tmpdir/orte_tmpdir_base). > Please check with your sys admin to determine the correct location to > use. > > * compilation of the orted with dynamic libraries when static are > required > (e.g., on Cray). Please check your configure cmd line and consider using > one of the contrib/platform definitions for your system type. > > * an inability to create a connection back to mpirun due to a > lack of common network interfaces and/or no route found between > them. Please check network connectivity (including firewalls > and network routing requirements). > ... > $ > > (Note that I get the same results with "--mca btl openib,sm,self"....)
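To be clear about why: the plm verbose output further down shows the rsh launcher starting orted over ssh with only /home/ariebs/mic/mpi-nightly/lib prepended to LD_LIBRARY_PATH, so the Intel runtime directory has to come from the remote shell's own startup environment. Two sketches, assuming the paths from your output and that bash on the MIC reads ~/.bashrc for non-interactive ssh commands:

   $ ssh mic1 'echo "export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:\$LD_LIBRARY_PATH" >> ~/.bashrc'

or take the shared-library dependency out of the picture entirely by linking the Intel runtime statically into Open MPI (icc supports -static-intel), i.e. add LDFLAGS=-static-intel to your configure line.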
> > $ ssh mic1 file > /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so > /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF > 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 > (SYSV), dynamically linked, not stripped > $ shmemrun -x > > LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so > -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out > /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared > libraries: libimf.so: cannot open shared object file: No such file or > directory > -------------------------------------------------------------------------- > ORTE was unable to reliably start one or more daemons. > This usually is caused by: > > * not finding the required libraries and/or binaries on > one or more nodes. Please check your PATH and LD_LIBRARY_PATH > settings, or configure OMPI with --enable-orterun-prefix-by-default > > * lack of authority to execute on one or more specified nodes. > Please verify your allocation and authorities. > > * the inability to write startup files into /tmp > (--tmpdir/orte_tmpdir_base). > Please check with your sys admin to determine the correct location to > use. > > * compilation of the orted with dynamic libraries when static are > required > (e.g., on Cray). Please check your configure cmd line and consider using > one of the contrib/platform definitions for your system type. > > * an inability to create a connection back to mpirun due to a > lack of common network interfaces and/or no route found between > them. Please check network connectivity (including firewalls > and network routing requirements). > > Following here is > - IB information > - Running the failing case with lots of debugging information. (As you > might imagine, I've tried 17 ways from Sunday to try to ensure that > libimf.so is found.) 
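That explains the 17 ways: -x only exports variables to the application processes after the daemons are already up, so LD_PRELOAD passed through it never reaches orted, which is the thing failing to load here. A quick way to see what the daemon will actually resolve in a non-interactive shell (path from your prefix):

   $ ssh mic1 'ldd /home/ariebs/mic/mpi-nightly/bin/orted | grep "not found"'

If libimf.so shows up as "not found" there, no shmemrun argument will fix it; the path has to be in the environment ssh hands the remote shell (or linked in statically, as above).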
> > $ ibv_devices > device node GUID > ------ ---------------- > mlx4_0 24be05ffffa57160 > scif0 4c79bafffe4402b6 > $ ibv_devinfo > hca_id: mlx4_0 > transport: InfiniBand (0) > fw_ver: 2.11.1250 > node_guid: 24be:05ff:ffa5:7160 > sys_image_guid: 24be:05ff:ffa5:7163 > vendor_id: 0x02c9 > vendor_part_id: 4099 > hw_ver: 0x0 > phys_port_cnt: 2 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 8 > port_lid: 86 > port_lmc: 0x00 > link_layer: InfiniBand > > port: 2 > state: PORT_DOWN (1) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 0 > port_lid: 0 > port_lmc: 0x00 > link_layer: InfiniBand > > hca_id: scif0 > transport: SCIF (2) > fw_ver: 0.0.1 > node_guid: 4c79:baff:fe44:02b6 > sys_image_guid: 4c79:baff:fe44:02b6 > vendor_id: 0x8086 > vendor_part_id: 0 > hw_ver: 0x1 > phys_port_cnt: 1 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 4096 (5) > active_mtu: 4096 (5) > sm_lid: 1 > port_lid: 1001 > port_lmc: 0x00 > link_layer: SCIF > > $ shmemrun -x > > LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so > -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca plm_base_verbose > 5 --mca memheap_base_verbose 100 $PWD/mic.out > [atl1-01-mic0:191024] mca:base:select:( plm) Querying component [rsh] > [atl1-01-mic0:191024] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : > rsh path NULL > [atl1-01-mic0:191024] mca:base:select:( plm) Query of component [rsh] set > priority to 10 > [atl1-01-mic0:191024] mca:base:select:( plm) Querying component > [isolated] > [atl1-01-mic0:191024] mca:base:select:( plm) Query of component > [isolated] set priority to 0 > [atl1-01-mic0:191024] mca:base:select:( plm) Querying component [slurm] > [atl1-01-mic0:191024] mca:base:select:( plm) Skipping component [slurm]. 
> Query failed to return a module > [atl1-01-mic0:191024] mca:base:select:( plm) Selected component [rsh] > [atl1-01-mic0:191024] plm:base:set_hnp_name: initial bias 191024 nodename > hash 4121194178 > [atl1-01-mic0:191024] plm:base:set_hnp_name: final jobfam 29012 > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh_setup on agent ssh : rsh path > NULL > [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive start comm > [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_job > [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm > [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm creating map > [atl1-01-mic0:191024] [[29012,0],0] setup:vm: working unmanaged allocation > [atl1-01-mic0:191024] [[29012,0],0] using dash_host > [atl1-01-mic0:191024] [[29012,0],0] checking node mic1 > [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm add new daemon > [[29012,0],1] > [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm assigning new daemon > [[29012,0],1] to node mic1 > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: launching vm > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: local shell: 0 (bash) > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: assuming same remote shell as > local shell > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: remote shell: 0 (bash) > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: final template argv: > /usr/bin/ssh <template> > PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; > LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export > LD_LIBRARY_PATH ; > DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; > export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted > --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca > orte_ess_jobid "1901330432" -mca orte_ess_vpid "<template>" -mca > orte_ess_num_procs "2" -mca orte_hnp_uri > > "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1" > --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca > plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca > rmaps_ppr_n_pernode "2" > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh:launch daemon 0 not a child of > mine > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: adding node mic1 to launch > list > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: activating launch event > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: recording launch of daemon > [[29012,0],1] > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: executing: (/usr/bin/ssh) > [/usr/bin/ssh mic1 PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; > export PATH ; > LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export > LD_LIBRARY_PATH ; > DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; > export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted > --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca > orte_ess_jobid "1901330432" -mca orte_ess_vpid 1 -mca orte_ess_num_procs > "2" -mca orte_hnp_uri > > "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1" > --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca > plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca > rmaps_ppr_n_pernode "2"] > /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared > libraries: libimf.so: cannot open shared object file: No such file or > directory > [atl1-01-mic0:191024] [[29012,0],0] daemon 1 failed with status 127 > [atl1-01-mic0:191024] [[29012,0],0] plm:base:orted_cmd sending orted_exit > commands > 
-------------------------------------------------------------------------- > ORTE was unable to reliably start one or more daemons. > This usually is caused by: > > * not finding the required libraries and/or binaries on > one or more nodes. Please check your PATH and LD_LIBRARY_PATH > settings, or configure OMPI with --enable-orterun-prefix-by-default > > * lack of authority to execute on one or more specified nodes. > Please verify your allocation and authorities. > > * the inability to write startup files into /tmp > (--tmpdir/orte_tmpdir_base). > Please check with your sys admin to determine the correct location to > use. > > * compilation of the orted with dynamic libraries when static are > required > (e.g., on Cray). Please check your configure cmd line and consider using > one of the contrib/platform definitions for your system type. > > * an inability to create a connection back to mpirun due to a > lack of common network interfaces and/or no route found between > them. Please check network connectivity (including firewalls > and network routing requirements). > -------------------------------------------------------------------------- > [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive stop comm > > On 04/13/2015 08:50 AM, Andy Riebs wrote: > > Hi Ralph, > > Here are the results with last night's "master" nightly, > openmpi-dev-1487-g9c6d452.tar.bz2, and adding the memheap_base_verbose > option (yes, it looks like the "ERROR_LOG" problem has gone away): > > $ cat /proc/sys/kernel/shmmax > 33554432 > $ cat /proc/sys/kernel/shmall > 2097152 > $ cat /proc/sys/kernel/shmmni > 4096 > $ export SHMEM_SYMMETRIC_HEAP=1M > $ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 > --mca memheap_base_verbose 100 $PWD/mic.out > [atl1-01-mic0:190439] mca:base:select:( plm) Querying component [rsh] > [atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : > rsh path NULL > [atl1-01-mic0:190439] mca:base:select:( plm) Query of component [rsh] > set priority to 10 > [atl1-01-mic0:190439] mca:base:select:( plm) Querying component > [isolated] > [atl1-01-mic0:190439] mca:base:select:( plm) Query of component > [isolated] set priority to 0 > [atl1-01-mic0:190439] mca:base:select:( plm) Querying component [slurm] > [atl1-01-mic0:190439] mca:base:select:( plm) Skipping component > [slurm]. 
Query failed to return a module > [atl1-01-mic0:190439] mca:base:select:( plm) Selected component [rsh] > [atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439 > nodename hash 4121194178 > [atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875 > [atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : rsh > path NULL > [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm > [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job > [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm > [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map > [atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged > allocation > [atl1-01-mic0:190439] [[31875,0],0] using dash_host > [atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0 > [atl1-01-mic0:190439] [[31875,0],0] ignoring myself > [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in > allocation > [atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1] > [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job > [31875,1] > [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof for > job [31875,1] > [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1] registered > [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] is not > a dynamic spawn > [atl1-01-mic0:190441] mca: base: components_register: registering > memheap components > [atl1-01-mic0:190441] mca: base: components_register: found loaded > component buddy > [atl1-01-mic0:190441] mca: base: components_register: component buddy > has no register or open function > [atl1-01-mic0:190442] mca: base: components_register: registering > memheap components > [atl1-01-mic0:190442] mca: base: components_register: found loaded > component buddy > [atl1-01-mic0:190442] mca: base: components_register: component buddy > has no register or open function > [atl1-01-mic0:190442] mca: base: components_register: found loaded > component ptmalloc > [atl1-01-mic0:190442] mca: base: components_register: component ptmalloc > has no register or open function > [atl1-01-mic0:190441] mca: base: components_register: found loaded > component ptmalloc > [atl1-01-mic0:190441] mca: base: components_register: component ptmalloc > has no register or open function > [atl1-01-mic0:190441] mca: base: components_open: opening memheap > components > [atl1-01-mic0:190441] mca: base: components_open: found loaded component > buddy > [atl1-01-mic0:190441] mca: base: components_open: component buddy open > function successful > [atl1-01-mic0:190441] mca: base: components_open: found loaded component > ptmalloc > [atl1-01-mic0:190441] mca: base: components_open: component ptmalloc > open function successful > [atl1-01-mic0:190442] mca: base: components_open: opening memheap > components > [atl1-01-mic0:190442] mca: base: components_open: found loaded component > buddy > [atl1-01-mic0:190442] mca: base: components_open: component buddy open > function successful > [atl1-01-mic0:190442] mca: base: components_open: found loaded component > ptmalloc > [atl1-01-mic0:190442] mca: base: components_open: component ptmalloc > open function successful > [atl1-01-mic0:190442] base/memheap_base_alloc.c:38 - > mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 > segments by method: 1 > [atl1-01-mic0:190441] base/memheap_base_alloc.c:38 - > mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 > segments by method: 1 > [atl1-01-mic0:190442] base/memheap_base_static.c:205 - 
_load_segments() > add: 00600000-00601000 rw-p 00000000 00:11 > 6029314 /home/ariebs/bench/hello/mic.out > [atl1-01-mic0:190441] base/memheap_base_static.c:205 - _load_segments() > add: 00600000-00601000 rw-p 00000000 00:11 > 6029314 /home/ariebs/bench/hello/mic.out > [atl1-01-mic0:190442] base/memheap_base_static.c:75 - > mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 > segments > [atl1-01-mic0:190442] base/memheap_base_register.c:39 - > mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 > 270532608 bytes type=0x1 id=0xFFFFFFFF > [atl1-01-mic0:190441] base/memheap_base_static.c:75 - > mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 > segments > [atl1-01-mic0:190441] base/memheap_base_register.c:39 - > mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 > 270532608 bytes type=0x1 id=0xFFFFFFFF > [atl1-01-mic0:190442] Error base/memheap_base_register.c:130 - > _reg_segment() Failed to register segment > [atl1-01-mic0:190441] Error base/memheap_base_register.c:130 - > _reg_segment() Failed to register segment > [atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM > failed to initialize - aborting > [atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM > failed to initialize - aborting > > -------------------------------------------------------------------------- > It looks like SHMEM_INIT failed for some reason; your parallel process > is > likely to abort. There are many reasons that a parallel process can > fail during SHMEM_INIT; some of which are due to configuration or > environment > problems. This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open SHMEM > developer): > > mca_memheap_base_select() failed > --> Returned "Error" (-1) instead of "Success" (0) > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0) with > errorcode -1. > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > A SHMEM process is aborting at a time when it cannot guarantee that all > of its peer processes in the job will be killed properly. You should > double check that everything has shut down cleanly. > > Local host: atl1-01-mic0 > PID: 190441 > > -------------------------------------------------------------------------- > ------------------------------------------------------- > Primary job terminated normally, but 1 process returned > a non-zero exit code.. Per user-direction, the job has been aborted. > ------------------------------------------------------- > [atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending > orted_exit commands > > -------------------------------------------------------------------------- > shmemrun detected that one or more processes exited with non-zero > status, thus causing > the job to be terminated. 
The first process to do so was: > > Process name: [[31875,1],0] > Exit code: 255 > > -------------------------------------------------------------------------- > [atl1-01-mic0:190439] 1 more process has sent help message > help-shmem-runtime.txt / shmem_init:startup:internal-failure > [atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" to 0 > to see all help / error messages > [atl1-01-mic0:190439] 1 more process has sent help message > help-shmem-api.txt / shmem-abort > [atl1-01-mic0:190439] 1 more process has sent help message > help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed > [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm > > On 04/12/2015 03:09 PM, Ralph Castain wrote: > > Sorry about that - I hadn't brought it over to the 1.8 branch yet. > I've done so now, which means the ERROR_LOG shouldn't show up any > more. It won't fix the memheap problem, though. > You might try adding "--mca memheap_base_verbose 100" to your cmd line > so we can see why none of the memheap components are being selected. > > On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com> wrote: > Hi Ralph, > > Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2: > > $ shmemrun -H localhost -N 2 --mca sshmem mmap --mca > plm_base_verbose 5 $PWD/mic.out > [atl1-01-mic0:190189] mca:base:select:( plm) Querying component > [rsh] > [atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent > ssh : rsh path NULL > [atl1-01-mic0:190189] mca:base:select:( plm) Query of component > [rsh] set priority to 10 > [atl1-01-mic0:190189] mca:base:select:( plm) Querying component > [isolated] > [atl1-01-mic0:190189] mca:base:select:( plm) Query of component > [isolated] set priority to 0 > [atl1-01-mic0:190189] mca:base:select:( plm) Querying component > [slurm] > [atl1-01-mic0:190189] mca:base:select:( plm) Skipping component > [slurm]. Query failed to return a module > [atl1-01-mic0:190189] mca:base:select:( plm) Selected component > [rsh] > [atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189 > nodename hash 4121194178 > [atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137 > [atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh > path NULL > [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm > [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job > [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm > [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map > [atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged > allocation > [atl1-01-mic0:190189] [[32137,0],0] using dash_host > [atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0 > [atl1-01-mic0:190189] [[32137,0],0] ignoring myself > [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in > allocation > [atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1] > [atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in > file base/plm_base_launch_support.c at line 440 > [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job > [32137,1] > [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof > for job [32137,1] > [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1] > registered > [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is > not a dynamic spawn > > -------------------------------------------------------------------------- > It looks like SHMEM_INIT failed for some reason; your parallel > process is > likely to abort. 
There are many reasons that a parallel process can > fail during SHMEM_INIT; some of which are due to configuration or > environment > problems. This failure appears to be an internal failure; here's > some > additional information (which may only be relevant to an Open SHMEM > developer): > > mca_memheap_base_select() failed > --> Returned "Error" (-1) instead of "Success" (0) > > -------------------------------------------------------------------------- > [atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM > failed to initialize - aborting > [atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM > failed to initialize - aborting > > -------------------------------------------------------------------------- > SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0) > with errorcode -1. > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > A SHMEM process is aborting at a time when it cannot guarantee that > all > of its peer processes in the job will be killed properly. You > should > double check that everything has shut down cleanly. > > Local host: atl1-01-mic0 > PID: 190192 > > -------------------------------------------------------------------------- > ------------------------------------------------------- > Primary job terminated normally, but 1 process returned > a non-zero exit code.. Per user-direction, the job has been aborted. > ------------------------------------------------------- > [atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending > orted_exit commands > > -------------------------------------------------------------------------- > shmemrun detected that one or more processes exited with non-zero > status, thus causing > the job to be terminated. The first process to do so was: > > Process name: [[32137,1],0] > Exit code: 255 > > -------------------------------------------------------------------------- > [atl1-01-mic0:190189] 1 more process has sent help message > help-shmem-runtime.txt / shmem_init:startup:internal-failure > [atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate" > to 0 to see all help / error messages > [atl1-01-mic0:190189] 1 more process has sent help message > help-shmem-api.txt / shmem-abort > [atl1-01-mic0:190189] 1 more process has sent help message > help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all > killed > [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm > > On 04/11/2015 07:41 PM, Ralph Castain wrote: > > Got it - thanks. I fixed that ERROR_LOG issue (I think- please > verify). I suspect the memheap issue relates to something else, > but I probably need to let the OSHMEM folks comment on it > > On Apr 11, 2015, at 9:52 AM, Andy Riebs <andy.ri...@hp.com> > wrote: > Everything is built on the Xeon side, with the icc "-mmic" > switch. I then ssh into one of the PHIs, and run shmemrun from > there. > > On 04/11/2015 12:00 PM, Ralph Castain wrote: > > Let me try to understand the setup a little better. Are you > running shmemrun on the PHI itself? Or is it running on the > host processor, and you are trying to spawn a process onto the > Phi? > > On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com> > wrote: > Hi Ralph, > > Yes, this is attempting to get OSHMEM to run on the Phi. 
> > I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured > it with > > $ ./configure --prefix=/home/ariebs/mic/mpi-nightly > CC="icc -mmic" CXX="icpc -mmic" \ > --build=x86_64-unknown-linux-gnu > --host=x86_64-k1om-linux \ > AR=x86_64-k1om-linux-ar > RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \ > --enable-mpirun-prefix-by-default > --disable-io-romio --disable-mpi-fortran \ > --enable-debug > > --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud > > (Note that I had to add "oob-ud" to the > "--enable-mca-no-build" option, as the build complained that > mca oob/ud needed mca common-verbs.) > > With that configuration, here is what I am seeing now... > > $ export SHMEM_SYMMETRIC_HEAP_SIZE=1G > $ shmemrun -H localhost -N 2 --mca sshmem mmap --mca > plm_base_verbose 5 $PWD/mic.out > [atl1-01-mic0:189895] mca:base:select:( plm) Querying > component [rsh] > [atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on > agent ssh : rsh path NULL > [atl1-01-mic0:189895] mca:base:select:( plm) Query of > component [rsh] set priority to 10 > [atl1-01-mic0:189895] mca:base:select:( plm) Querying > component [isolated] > [atl1-01-mic0:189895] mca:base:select:( plm) Query of > component [isolated] set priority to 0 > [atl1-01-mic0:189895] mca:base:select:( plm) Querying > component [slurm] > [atl1-01-mic0:189895] mca:base:select:( plm) Skipping > component [slurm]. Query failed to return a module > [atl1-01-mic0:189895] mca:base:select:( plm) Selected > component [rsh] > [atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias > 189895 nodename hash 4121194178 > [atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam > 32419 > [atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent > ssh : rsh path NULL > [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start > comm > [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job > [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm > [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm > creating map > [atl1-01-mic0:189895] [[32419,0],0] setup:vm: working > unmanaged allocation > [atl1-01-mic0:189895] [[32419,0],0] using dash_host > [atl1-01-mic0:189895] [[32419,0],0] checking node > atl1-01-mic0 > [atl1-01-mic0:189895] [[32419,0],0] ignoring myself > [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only > HNP in allocation > [atl1-01-mic0:189895] [[32419,0],0] complete_setup on job > [32419,1] > [atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not > found in file base/plm_base_launch_support.c at line 440 > [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for > job [32419,1] > [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring > up iof for job [32419,1] > [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch > [32419,1] registered > [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job > [32419,1] is not a dynamic spawn > [atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() > SHMEM failed to initialize - aborting > [atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() > SHMEM failed to initialize - aborting > > -------------------------------------------------------------------------- > It looks like SHMEM_INIT failed for some reason; your > parallel process is > likely to abort. There are many reasons that a parallel > process can > fail during SHMEM_INIT; some of which are due to > configuration or environment > problems. 
> > This failure appears to be an internal failure; > here's some > additional information (which may only be relevant to an > Open SHMEM > developer): > > mca_memheap_base_select() failed > --> Returned "Error" (-1) instead of "Success" (0) > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > SHMEM_ABORT was invoked on rank 1 (pid 189899, > host=atl1-01-mic0) with errorcode -1. > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > A SHMEM process is aborting at a time when it cannot > guarantee that all > of its peer processes in the job will be killed properly. > You should > double check that everything has shut down cleanly. > > Local host: atl1-01-mic0 > PID: 189899 > > -------------------------------------------------------------------------- > ------------------------------------------------------- > Primary job terminated normally, but 1 process returned > a non-zero exit code.. Per user-direction, the job has been > aborted. > ------------------------------------------------------- > [atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd > sending orted_exit commands > > -------------------------------------------------------------------------- > shmemrun detected that one or more processes exited with > non-zero status, thus causing > the job to be terminated. The first process to do so was: > > Process name: [[32419,1],1] > Exit code: 255 > > -------------------------------------------------------------------------- > [atl1-01-mic0:189895] 1 more process has sent help message > help-shmem-runtime.txt / shmem_init:startup:internal-failure > [atl1-01-mic0:189895] Set MCA parameter > "orte_base_help_aggregate" to 0 to see all help / error > messages > [atl1-01-mic0:189895] 1 more process has sent help message > help-shmem-api.txt / shmem-abort > [atl1-01-mic0:189895] 1 more process has sent help message > help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee > all killed > [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop > comm > > On 04/10/2015 06:37 PM, Ralph Castain wrote: > > Andy - could you please try the current 1.8.5 nightly > tarball and see if it helps? The error log indicates that > it is failing to get the topology from some daemon, I'm > assuming the one on the Phi? > You might also add --enable-debug to that configure line > and then put -mca plm_base_verbose on the shmemrun cmd to > get more help > > On Apr 10, 2015, at 11:55 AM, Andy Riebs > <andy.ri...@hp.com> wrote: > Summary: MPI jobs work fine, SHMEM jobs work just often > enough to be tantalizing, on an Intel Xeon Phi/MIC > system. > > Longer version > > Thanks to the excellent write-up last June > > (<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>), > I have been able to build a version of Open MPI for the > Xeon Phi coprocessor that runs MPI jobs on the Phi > coprocessor with no problem, but not SHMEM jobs. Just > at the point where I was about to document the problems > I was having with SHMEM, my trivial SHMEM job worked. > And then failed when I tried to run it again, > immediately afterwards. I have a feeling I may be in > uncharted territory here. 
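One more observation on the memheap failure up-thread: that run exported SHMEM_SYMMETRIC_HEAP=1M, but the variable used successfully at the top of this thread is SHMEM_SYMMETRIC_HEAP_SIZE. The verbose log shows the heap actually coming out at 270532608 bytes (258 MB, i.e. a default-sized heap rather than 1 MB), so the setting never took effect; registering a 258 MB segment is far more likely to trip over kernel.shmmax (33554432, i.e. 32 MB, per the values you printed) or a low locked-memory limit than the 1 MB you intended. A cheap thing to rule out first (a guess, not a confirmed root cause):

   $ ulimit -l
   $ export SHMEM_SYMMETRIC_HEAP_SIZE=1M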
> > Environment > * RHEL 6.5 > * Intel Composer XE 2015 > * Xeon Phi/MIC > ---------------- > > Configuration > > $ export PATH=/usr/linux-k1om-4.7/bin/:$PATH > $ source > /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh > intel64 > $ ./configure --prefix=/home/ariebs/mic/mpi \ > CC="icc -mmic" CXX="icpc -mmic" \ > --build=x86_64-unknown-linux-gnu > --host=x86_64-k1om-linux \ > AR=x86_64-k1om-linux-ar > RANLIB=x86_64-k1om-linux-ranlib \ > LD=x86_64-k1om-linux-ld \ > --enable-mpirun-prefix-by-default --disable-io-romio > \ > --disable-vt --disable-mpi-fortran \ > > --enable-mca-no-build=btl-usnic,btl-openib,common-verbs > $ make > $ make install > > ---------------- > > Test program > > #include <stdio.h> > #include <stdlib.h> > #include <shmem.h> > int main(int argc, char **argv) > { > int me, num_pe; > shmem_init(); > num_pe = num_pes(); > me = my_pe(); > printf("Hello World from process %d of %d\n", > me, num_pe); > exit(0); > } > > ---------------- > > Building the program > > export PATH=/home/ariebs/mic/mpi/bin:$PATH > export PATH=/usr/linux-k1om-4.7/bin/:$PATH > source > /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh > intel64 > export > > LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH > > icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include > -pthread \ > -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib > -Wl,--enable-new-dtags \ > -L/home/ariebs/mic/mpi/lib -loshmem -lmpi > -lopen-rte -lopen-pal \ > -lm -ldl -lutil \ > -Wl,-rpath > > -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic > \ > > -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic > \ > -o mic.out shmem_hello.c > > ---------------- > > Running the program > > (Note that the program had been consistently failing. > Then, when I logged back into the system to capture the > results, it worked once, and then immediately failed > when I tried again, as shown below. Logging in and out > isn't sufficient to correct the problem. Overall, I > think I had 3 successful runs in 30-40 attempts.) > > $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out > [atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not > found in file base/plm_base_launch_support.c at line 426 > Hello World from process 0 of 2 > Hello World from process 1 of 2 > $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out > [atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not > found in file base/plm_base_launch_support.c at line 426 > [atl1-01-mic0:189383] Error: pshmem_init.c:61 - > shmem_init() SHMEM failed to initialize - aborting > > -------------------------------------------------------------------------- > It looks like SHMEM_INIT failed for some reason; your > parallel process is > likely to abort. There are many reasons that a parallel > process can > fail during SHMEM_INIT; some of which are due to > configuration or environment > problems. This failure appears to be an internal > failure; here's some > additional information (which may only be relevant to an > Open SHMEM > developer): > > mca_memheap_base_select() failed > --> Returned "Error" (-1) instead of "Success" (0) > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > SHMEM_ABORT was invoked on rank 0 (pid 189383, > host=atl1-01-mic0) with errorcode -1. 
> > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > A SHMEM process is aborting at a time when it cannot > guarantee that all > of its peer processes in the job will be killed > properly. You should > double check that everything has shut down cleanly. > > Local host: atl1-01-mic0 > PID: 189383 > > -------------------------------------------------------------------------- > ------------------------------------------------------- > Primary job terminated normally, but 1 process returned > a non-zero exit code.. Per user-direction, the job has > been aborted. > ------------------------------------------------------- > > -------------------------------------------------------------------------- > shmemrun detected that one or more processes exited with > non-zero status, thus causing > the job to be terminated. The first process to do so > was: > > Process name: [[30881,1],0] > Exit code: 255 > > -------------------------------------------------------------------------- > > Any thoughts about where to go from here? > > Andy > > -- > Andy Riebs > Hewlett-Packard Company > High Performance Computing > +1 404 648 9024 > My opinions are not necessarily those of HP
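For what it's worth, once orted comes up cleanly on both cards, a sanity check across the fabric with scif preferred over tcp (one process per card; hosts and heap size as used earlier in the thread) would look something like:

   $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0,mic1 -N 1 \
         --mca spml yoda --mca btl scif,sm,self $PWD/mic.out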