I don’t see that LD_PRELOAD showing up on the ssh path, Andy

> /usr/bin/ssh mic1     PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export 
> PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; 
> export LD_LIBRARY_PATH ; 
> DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; 
> export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted 
> --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca 
> orte_ess_jobid "1901330432" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" 
> -mca orte_hnp_uri 
> "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1" 
> --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose 
> "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca rmaps_ppr_n_pernode 
> “2"

The -x option doesn’t affect the ssh launch line; it only forwards the value into 
the application’s environment once the daemon is running. Since orted itself is 
failing to load libimf.so, you’ll need to include that library’s directory in the 
LD_LIBRARY_PATH seen by the remote shell that starts orted.
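
For what it’s worth, here’s a minimal sketch (untested here; it assumes bash on 
mic1 reads ~/.bashrc for non-interactive ssh commands, and it reuses the libimf.so 
path from your "ssh mic1 file" output below). Either of these should let orted 
itself resolve libimf.so before the application ever starts:

  # Option 1: export the Intel MIC runtime directory on mic1 itself, near the
  # top of ~/.bashrc (before any "if not interactive, return" test):
  $ ssh mic1 'echo "export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:\$LD_LIBRARY_PATH" >> ~/.bashrc'

  # Option 2: put a symlink to libimf.so into the directory that the ssh launch
  # line already prepends to LD_LIBRARY_PATH on the remote side:
  $ ssh mic1 'ln -s /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so /home/ariebs/mic/mpi-nightly/lib/libimf.so'

The second option works because the launch line already exports 
LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH before 
invoking orted.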


> On Apr 13, 2015, at 1:06 PM, Andy Riebs <andy.ri...@hp.com> wrote:
> 
> Progress!  I can run my trivial program on the local PHI, but not on the other 
> PHI in the system. Here are the interesting parts:
> 
> A pretty good recipe with last night's nightly master:
> 
> $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc 
> -mmic" \
>     --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>      AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib  
> LD=x86_64-k1om-linux-ld \
>      --enable-mpirun-prefix-by-default --disable-io-romio 
> --disable-mpi-fortran \
>      --enable-orterun-prefix-by-default \
>      --enable-debug
> $ make && make install
> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml yoda 
> --mca btl sm,self,tcp $PWD/mic.out
> Hello World from process 0 of 2
> Hello World from process 1 of 2
> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml yoda 
> --mca btl openib,sm,self $PWD/mic.out
> Hello World from process 0 of 2
> Hello World from process 1 of 2
> $ 
> 
> However, I can't seem to cross the fabric, even though I can ssh freely back 
> and forth between mic0 and mic1. Running the next two tests from mic0, it 
> certainly seems like the second one should work, too:
> 
> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda --mca 
> btl sm,self,tcp $PWD/mic.out
> Hello World from process 0 of 2
> Hello World from process 1 of 2
> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda --mca 
> btl sm,self,tcp $PWD/mic.out
> /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: 
> libimf.so: cannot open shared object file: No such file or directory
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
>  ...
> $
> 
> (Note that I get the same results with "--mca btl openib,sm,self"....)
> 
> 
> $ ssh mic1 file 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF 64-bit 
> LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 (SYSV), 
> dynamically linked, not stripped
> $ shmemrun -x 
> LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so 
> -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
> /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: 
> libimf.so: cannot open shared object file: No such file or directory
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> 
> What follows is:
> - IB information
> - The failing case, run with lots of debugging output. (As you might imagine, 
> I've tried 17 ways from Sunday to ensure that libimf.so is found.)
> 
> $ ibv_devices
>     device                 node GUID
>     ------              ----------------
>     mlx4_0              24be05ffffa57160
>     scif0               4c79bafffe4402b6
> $ ibv_devinfo
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.11.1250
>         node_guid:                      24be:05ff:ffa5:7160
>         sys_image_guid:                 24be:05ff:ffa5:7163
>         vendor_id:                      0x02c9
>         vendor_part_id:                 4099
>         hw_ver:                         0x0
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 8
>                         port_lid:               86
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
> 
>                 port:   2
>                         state:                  PORT_DOWN (1)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
> 
> hca_id: scif0
>         transport:                      SCIF (2)
>         fw_ver:                         0.0.1
>         node_guid:                      4c79:baff:fe44:02b6
>         sys_image_guid:                 4c79:baff:fe44:02b6
>         vendor_id:                      0x8086
>         vendor_part_id:                 0
>         hw_ver:                         0x1
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 1
>                         port_lid:               1001
>                         port_lmc:               0x00
>                         link_layer:             SCIF
> 
> $ shmemrun -x 
> LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so 
> -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca plm_base_verbose 5 
> --mca memheap_base_verbose 100 $PWD/mic.out
> [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component [rsh]
> [atl1-01-mic0:191024] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
> path NULL
> [atl1-01-mic0:191024] mca:base:select:(  plm) Query of component [rsh] set 
> priority to 10
> [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component [isolated]
> [atl1-01-mic0:191024] mca:base:select:(  plm) Query of component [isolated] 
> set priority to 0
> [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component [slurm]
> [atl1-01-mic0:191024] mca:base:select:(  plm) Skipping component [slurm]. 
> Query failed to return a module
> [atl1-01-mic0:191024] mca:base:select:(  plm) Selected component [rsh]
> [atl1-01-mic0:191024] plm:base:set_hnp_name: initial bias 191024 nodename 
> hash 4121194178
> [atl1-01-mic0:191024] plm:base:set_hnp_name: final jobfam 29012
> [atl1-01-mic0:191024] [[29012,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive start comm
> [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_job
> [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm
> [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm creating map
> [atl1-01-mic0:191024] [[29012,0],0] setup:vm: working unmanaged allocation
> [atl1-01-mic0:191024] [[29012,0],0] using dash_host
> [atl1-01-mic0:191024] [[29012,0],0] checking node mic1
> [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm add new daemon 
> [[29012,0],1]
> [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm assigning new daemon 
> [[29012,0],1] to node mic1
> [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: launching vm
> [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: local shell: 0 (bash)
> [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: assuming same remote shell as 
> local shell
> [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: remote shell: 0 (bash)
> [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: final template argv:
>         /usr/bin/ssh <template>     
> PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; 
> LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export 
> LD_LIBRARY_PATH ; 
> DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; 
> export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted 
> --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca 
> orte_ess_jobid "1901330432" -mca orte_ess_vpid "<template>" -mca 
> orte_ess_num_procs "2" -mca orte_hnp_uri 
> "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1" 
> --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose 
> "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca rmaps_ppr_n_pernode 
> "2"
> [atl1-01-mic0:191024] [[29012,0],0] plm:rsh:launch daemon 0 not a child of 
> mine
> [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: adding node mic1 to launch list
> [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: activating launch event
> [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: recording launch of daemon 
> [[29012,0],1]
> [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: executing: (/usr/bin/ssh) 
> [/usr/bin/ssh mic1     PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export 
> PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; 
> export LD_LIBRARY_PATH ; 
> DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; 
> export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted 
> --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca 
> orte_ess_jobid "1901330432" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" 
> -mca orte_hnp_uri 
> "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1" 
> --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose 
> "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca rmaps_ppr_n_pernode 
> "2"]
> /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: 
> libimf.so: cannot open shared object file: No such file or directory
> [atl1-01-mic0:191024] [[29012,0],0] daemon 1 failed with status 127
> [atl1-01-mic0:191024] [[29012,0],0] plm:base:orted_cmd sending orted_exit 
> commands
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive stop comm
> 
> 
> 
> On 04/13/2015 08:50 AM, Andy Riebs wrote:
>> Hi Ralph,
>> 
>> Here are the results with last night's "master" nightly, 
>> openmpi-dev-1487-g9c6d452.tar.bz2, and adding the memheap_base_verbose 
>> option (yes, it looks like the "ERROR_LOG" problem has gone away):
>> 
>> $ cat /proc/sys/kernel/shmmax
>> 33554432
>> $ cat /proc/sys/kernel/shmall
>> 2097152
>> $ cat /proc/sys/kernel/shmmni
>> 4096
>> $ export SHMEM_SYMMETRIC_HEAP=1M
>> $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca plm_base_verbose 5 
>> --mca memheap_base_verbose 100 $PWD/mic.out
>> [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component [rsh]
>> [atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
>> path NULL
>> [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component [rsh] set 
>> priority to 10
>> [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component [isolated]
>> [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component [isolated] 
>> set priority to 0
>> [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component [slurm]
>> [atl1-01-mic0:190439] mca:base:select:(  plm) Skipping component [slurm]. 
>> Query failed to return a module
>> [atl1-01-mic0:190439] mca:base:select:(  plm) Selected component [rsh]
>> [atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439 nodename 
>> hash 4121194178
>> [atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875
>> [atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : rsh path 
>> NULL
>> [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm
>> [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job
>> [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm
>> [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map
>> [atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged allocation
>> [atl1-01-mic0:190439] [[31875,0],0] using dash_host
>> [atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0
>> [atl1-01-mic0:190439] [[31875,0],0] ignoring myself
>> [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in allocation
>> [atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1]
>> [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job [31875,1]
>> [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof for job 
>> [31875,1]
>> [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1] registered
>> [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] is not a 
>> dynamic spawn
>> [atl1-01-mic0:190441] mca: base: components_register: registering memheap 
>> components
>> [atl1-01-mic0:190441] mca: base: components_register: found loaded component 
>> buddy
>> [atl1-01-mic0:190441] mca: base: components_register: component buddy has no 
>> register or open function
>> [atl1-01-mic0:190442] mca: base: components_register: registering memheap 
>> components
>> [atl1-01-mic0:190442] mca: base: components_register: found loaded component 
>> buddy
>> [atl1-01-mic0:190442] mca: base: components_register: component buddy has no 
>> register or open function
>> [atl1-01-mic0:190442] mca: base: components_register: found loaded component 
>> ptmalloc
>> [atl1-01-mic0:190442] mca: base: components_register: component ptmalloc has 
>> no register or open function
>> [atl1-01-mic0:190441] mca: base: components_register: found loaded component 
>> ptmalloc
>> [atl1-01-mic0:190441] mca: base: components_register: component ptmalloc has 
>> no register or open function
>> [atl1-01-mic0:190441] mca: base: components_open: opening memheap components
>> [atl1-01-mic0:190441] mca: base: components_open: found loaded component 
>> buddy
>> [atl1-01-mic0:190441] mca: base: components_open: component buddy open 
>> function successful
>> [atl1-01-mic0:190441] mca: base: components_open: found loaded component 
>> ptmalloc
>> [atl1-01-mic0:190441] mca: base: components_open: component ptmalloc open 
>> function successful
>> [atl1-01-mic0:190442] mca: base: components_open: opening memheap components
>> [atl1-01-mic0:190442] mca: base: components_open: found loaded component 
>> buddy
>> [atl1-01-mic0:190442] mca: base: components_open: component buddy open 
>> function successful
>> [atl1-01-mic0:190442] mca: base: components_open: found loaded component 
>> ptmalloc
>> [atl1-01-mic0:190442] mca: base: components_open: component ptmalloc open 
>> function successful
>> [atl1-01-mic0:190442] base/memheap_base_alloc.c:38 - 
>> mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 
>> segments by method: 1
>> [atl1-01-mic0:190441] base/memheap_base_alloc.c:38 - 
>> mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 
>> segments by method: 1
>> [atl1-01-mic0:190442] base/memheap_base_static.c:205 - _load_segments() add: 
>> 00600000-00601000 rw-p 00000000 00:11 6029314                                
>>   /home/ariebs/bench/hello/mic.out
>> [atl1-01-mic0:190441] base/memheap_base_static.c:205 - _load_segments() add: 
>> 00600000-00601000 rw-p 00000000 00:11 6029314                                
>>   /home/ariebs/bench/hello/mic.out
>> [atl1-01-mic0:190442] base/memheap_base_static.c:75 - 
>> mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 
>> segments
>> [atl1-01-mic0:190442] base/memheap_base_register.c:39 - 
>> mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 
>> 270532608 bytes type=0x1 id=0xFFFFFFFF
>> [atl1-01-mic0:190441] base/memheap_base_static.c:75 - 
>> mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 
>> segments
>> [atl1-01-mic0:190441] base/memheap_base_register.c:39 - 
>> mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 
>> 270532608 bytes type=0x1 id=0xFFFFFFFF
>> [atl1-01-mic0:190442] Error base/memheap_base_register.c:130 - 
>> _reg_segment() Failed to register segment
>> [atl1-01-mic0:190441] Error base/memheap_base_register.c:130 - 
>> _reg_segment() Failed to register segment
>> [atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to 
>> initialize - aborting
>> [atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to 
>> initialize - aborting
>> --------------------------------------------------------------------------
>> It looks like SHMEM_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during SHMEM_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open SHMEM
>> developer):
>> 
>>   mca_memheap_base_select() failed
>>   --> Returned "Error" (-1) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0) with 
>> errorcode -1.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> A SHMEM process is aborting at a time when it cannot guarantee that all
>> of its peer processes in the job will be killed properly.  You should
>> double check that everything has shut down cleanly.
>> 
>> Local host: atl1-01-mic0
>> PID:        190441
>> --------------------------------------------------------------------------
>> -------------------------------------------------------
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> [atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending orted_exit 
>> commands
>> --------------------------------------------------------------------------
>> shmemrun detected that one or more processes exited with non-zero status, 
>> thus causing
>> the job to be terminated. The first process to do so was:
>> 
>>   Process name: [[31875,1],0]
>>   Exit code:    255
>> --------------------------------------------------------------------------
>> [atl1-01-mic0:190439] 1 more process has sent help message 
>> help-shmem-runtime.txt / shmem_init:startup:internal-failure
>> [atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" to 0 to 
>> see all help / error messages
>> [atl1-01-mic0:190439] 1 more process has sent help message 
>> help-shmem-api.txt / shmem-abort
>> [atl1-01-mic0:190439] 1 more process has sent help message 
>> help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
>> [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm
>> 
>> 
>> 
>> On 04/12/2015 03:09 PM, Ralph Castain wrote:
>>> Sorry about that - I hadn’t brought it over to the 1.8 branch yet. I’ve 
>>> done so now, which means the ERROR_LOG shouldn’t show up any more. It won’t 
>>> fix the memheap problem, though.
>>> 
>>> You might try adding “--mca memheap_base_verbose 100” to your cmd line so 
>>> we can see why none of the memheap components are being selected.
>>> 
>>> 
>>>> On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com 
>>>> <mailto:andy.ri...@hp.com>> wrote:
>>>> 
>>>> Hi Ralph,
>>>> 
>>>> Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2 
>>>> <https://www.open-mpi.org/nightly/v1.8/openmpi-v1.8.4-202-gc2da6a5.tar.bz2>:
>>>> 
>>>> $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca plm_base_verbose 5 
>>>> $PWD/mic.out
>>>> [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component [rsh]
>>>> [atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : 
>>>> rsh path NULL
>>>> [atl1-01-mic0:190189] mca:base:select:(  plm) Query of component [rsh] set 
>>>> priority to 10
>>>> [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component [isolated]
>>>> [atl1-01-mic0:190189] mca:base:select:(  plm) Query of component 
>>>> [isolated] set priority to 0
>>>> [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component [slurm]
>>>> [atl1-01-mic0:190189] mca:base:select:(  plm) Skipping component [slurm]. 
>>>> Query failed to return a module
>>>> [atl1-01-mic0:190189] mca:base:select:(  plm) Selected component [rsh]
>>>> [atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189 nodename 
>>>> hash 4121194178
>>>> [atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
>>>> [atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh path 
>>>> NULL
>>>> [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
>>>> [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
>>>> [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
>>>> [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
>>>> [atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged allocation
>>>> [atl1-01-mic0:190189] [[32137,0],0] using dash_host
>>>> [atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
>>>> [atl1-01-mic0:190189] [[32137,0],0] ignoring myself
>>>> [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in 
>>>> allocation
>>>> [atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
>>>> [atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in file 
>>>> base/plm_base_launch_support.c at line 440
>>>> [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job [32137,1]
>>>> [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof for job 
>>>> [32137,1]
>>>> [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1] registered
>>>> [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is not a 
>>>> dynamic spawn
>>>> --------------------------------------------------------------------------
>>>> It looks like SHMEM_INIT failed for some reason; your parallel process is
>>>> likely to abort.  There are many reasons that a parallel process can
>>>> fail during SHMEM_INIT; some of which are due to configuration or 
>>>> environment
>>>> problems.  This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open SHMEM
>>>> developer):
>>>> 
>>>>   mca_memheap_base_select() failed
>>>>   --> Returned "Error" (-1) instead of "Success" (0)
>>>> --------------------------------------------------------------------------
>>>> [atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM failed 
>>>> to initialize - aborting
>>>> [atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM failed 
>>>> to initialize - aborting
>>>> --------------------------------------------------------------------------
>>>> SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0) with 
>>>> errorcode -1.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> A SHMEM process is aborting at a time when it cannot guarantee that all
>>>> of its peer processes in the job will be killed properly.  You should
>>>> double check that everything has shut down cleanly.
>>>> 
>>>> Local host: atl1-01-mic0
>>>> PID:        190192
>>>> --------------------------------------------------------------------------
>>>> -------------------------------------------------------
>>>> Primary job  terminated normally, but 1 process returned
>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>> [atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending orted_exit 
>>>> commands
>>>> --------------------------------------------------------------------------
>>>> shmemrun detected that one or more processes exited with non-zero status, 
>>>> thus causing
>>>> the job to be terminated. The first process to do so was:
>>>> 
>>>>   Process name: [[32137,1],0]
>>>>   Exit code:    255
>>>> --------------------------------------------------------------------------
>>>> [atl1-01-mic0:190189] 1 more process has sent help message 
>>>> help-shmem-runtime.txt / shmem_init:startup:internal-failure
>>>> [atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate" to 0 to 
>>>> see all help / error messages
>>>> [atl1-01-mic0:190189] 1 more process has sent help message 
>>>> help-shmem-api.txt / shmem-abort
>>>> [atl1-01-mic0:190189] 1 more process has sent help message 
>>>> help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
>>>> [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm
>>>> 
>>>> 
>>>> On 04/11/2015 07:41 PM, Ralph Castain wrote:
>>>>> Got it - thanks. I fixed that ERROR_LOG issue (I think- please verify). I 
>>>>> suspect the memheap issue relates to something else, but I probably need 
>>>>> to let the OSHMEM folks comment on it
>>>>> 
>>>>> 
>>>>>> On Apr 11, 2015, at 9:52 AM, Andy Riebs <andy.ri...@hp.com 
>>>>>> <mailto:andy.ri...@hp.com>> wrote:
>>>>>> 
>>>>>> Everything is built on the Xeon side, with the icc "-mmic" switch. I 
>>>>>> then ssh into one of the PHIs, and run shmemrun from there.
>>>>>> 
>>>>>> 
>>>>>> On 04/11/2015 12:00 PM, Ralph Castain wrote:
>>>>>>> Let me try to understand the setup a little better. Are you running 
>>>>>>> shmemrun on the PHI itself? Or is it running on the host processor, and 
>>>>>>> you are trying to spawn a process onto the Phi?
>>>>>>> 
>>>>>>> 
>>>>>>>> On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com 
>>>>>>>> <mailto:andy.ri...@hp.com>> wrote:
>>>>>>>> 
>>>>>>>> Hi Ralph,
>>>>>>>> 
>>>>>>>> Yes, this is attempting to get OSHMEM to run on the Phi.
>>>>>>>> 
>>>>>>>> I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with
>>>>>>>> 
>>>>>>>> $ ./configure --prefix=/home/ariebs/mic/mpi-nightly    CC=icc -mmic 
>>>>>>>> CXX=icpc -mmic    \
>>>>>>>>     --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux    \
>>>>>>>>      AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib  
>>>>>>>> LD=x86_64-k1om-linux-ld   \
>>>>>>>>      --enable-mpirun-prefix-by-default --disable-io-romio     
>>>>>>>> --disable-mpi-fortran    \
>>>>>>>>      --enable-debug     
>>>>>>>> --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
>>>>>>>> 
>>>>>>>> (Note that I had to add "oob-ud" to the "--enable-mca-no-build" 
>>>>>>>> option, as the build complained that mca oob/ud needed mca 
>>>>>>>> common-verbs.)
>>>>>>>> 
>>>>>>>> With that configuration, here is what I am seeing now...
>>>>>>>> 
>>>>>>>> $ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
>>>>>>>> $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca plm_base_verbose 
>>>>>>>> 5 $PWD/mic.out
>>>>>>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [rsh]
>>>>>>>> [atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh 
>>>>>>>> : rsh path NULL
>>>>>>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Query of component [rsh] 
>>>>>>>> set priority to 10
>>>>>>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Querying component 
>>>>>>>> [isolated]
>>>>>>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Query of component 
>>>>>>>> [isolated] set priority to 0
>>>>>>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Querying component 
>>>>>>>> [slurm]
>>>>>>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Skipping component 
>>>>>>>> [slurm]. Query failed to return a module
>>>>>>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Selected component [rsh]
>>>>>>>> [atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 
>>>>>>>> nodename hash 4121194178
>>>>>>>> [atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh 
>>>>>>>> path NULL
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start comm
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged 
>>>>>>>> allocation
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] using dash_host
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] ignoring myself
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in 
>>>>>>>> allocation
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1]
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in file 
>>>>>>>> base/plm_base_launch_support.c at line 440
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job 
>>>>>>>> [32419,1]
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof for 
>>>>>>>> job [32419,1]
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1] 
>>>>>>>> registered
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is 
>>>>>>>> not a dynamic spawn
>>>>>>>> [atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM 
>>>>>>>> failed to initialize - aborting
>>>>>>>> [atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM 
>>>>>>>> failed to initialize - aborting
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> It looks like SHMEM_INIT failed for some reason; your parallel process 
>>>>>>>> is
>>>>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>>>>> fail during SHMEM_INIT; some of which are due to configuration or 
>>>>>>>> environment
>>>>>>>> problems.  This failure appears to be an internal failure; here's some
>>>>>>>> additional information (which may only be relevant to an Open SHMEM
>>>>>>>> developer):
>>>>>>>> 
>>>>>>>>   mca_memheap_base_select() failed
>>>>>>>>   --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> SHMEM_ABORT was invoked on rank 1 (pid 189899, host=atl1-01-mic0) with 
>>>>>>>> errorcode -1.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> A SHMEM process is aborting at a time when it cannot guarantee that all
>>>>>>>> of its peer processes in the job will be killed properly.  You should
>>>>>>>> double check that everything has shut down cleanly.
>>>>>>>> 
>>>>>>>> Local host: atl1-01-mic0
>>>>>>>> PID:        189899
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> -------------------------------------------------------
>>>>>>>> Primary job  terminated normally, but 1 process returned
>>>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>> -------------------------------------------------------
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd sending 
>>>>>>>> orted_exit commands
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> shmemrun detected that one or more processes exited with non-zero 
>>>>>>>> status, thus causing
>>>>>>>> the job to be terminated. The first process to do so was:
>>>>>>>> 
>>>>>>>>   Process name: [[32419,1],1]
>>>>>>>>   Exit code:    255
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> [atl1-01-mic0:189895] 1 more process has sent help message 
>>>>>>>> help-shmem-runtime.txt / shmem_init:startup:internal-failure
>>>>>>>> [atl1-01-mic0:189895] Set MCA parameter "orte_base_help_aggregate" to 
>>>>>>>> 0 to see all help / error messages
>>>>>>>> [atl1-01-mic0:189895] 1 more process has sent help message 
>>>>>>>> help-shmem-api.txt / shmem-abort
>>>>>>>> [atl1-01-mic0:189895] 1 more process has sent help message 
>>>>>>>> help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
>>>>>>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop comm
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 04/10/2015 06:37 PM, Ralph Castain wrote:
>>>>>>>>> Andy - could you please try the current 1.8.5 nightly tarball and see 
>>>>>>>>> if it helps? The error log indicates that it is failing to get the 
>>>>>>>>> topology from some daemon, I�m assuming the one on the Phi?
>>>>>>>>> 
>>>>>>>>> You might also add --enable-debug to that configure line and then put 
>>>>>>>>> -mca plm_base_verbose on the shmemrun cmd to get more help
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Apr 10, 2015, at 11:55 AM, Andy Riebs <andy.ri...@hp.com 
>>>>>>>>>> <mailto:andy.ri...@hp.com>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Summary: MPI jobs work fine, SHMEM jobs work just often enough to be 
>>>>>>>>>> tantalizing, on an Intel Xeon Phi/MIC system.
>>>>>>>>>> 
>>>>>>>>>> Longer version
>>>>>>>>>> 
>>>>>>>>>> Thanks to the excellent write-up last June 
>>>>>>>>>> (<https://www.open-mpi.org/community/lists/users/2014/06/24711.php> 
>>>>>>>>>> <https://www.open-mpi.org/community/lists/users/2014/06/24711.php>), 
>>>>>>>>>> I have been able to build a version of Open MPI for the Xeon Phi 
>>>>>>>>>> coprocessor that runs MPI jobs on the Phi coprocessor with no 
>>>>>>>>>> problem, but not SHMEM jobs.  Just at the point where I was about to 
>>>>>>>>>> document the problems I was having with SHMEM, my trivial SHMEM job 
>>>>>>>>>> worked. And then failed when I tried to run it again, immediately 
>>>>>>>>>> afterwards. I have a feeling I may be in uncharted  territory here.
>>>>>>>>>> 
>>>>>>>>>> Environment
>>>>>>>>>> RHEL 6.5
>>>>>>>>>> Intel Composer XE 2015
>>>>>>>>>> Xeon Phi/MIC
>>>>>>>>>> ----------------
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Configuration
>>>>>>>>>> 
>>>>>>>>>> $ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
>>>>>>>>>> $ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
>>>>>>>>>> $ ./configure --prefix=/home/ariebs/mic/mpi \
>>>>>>>>>>    CC="icc -mmic" CXX="icpc -mmic" \
>>>>>>>>>>    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>>>>>>>>>>     AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
>>>>>>>>>>     LD=x86_64-k1om-linux-ld \
>>>>>>>>>>     --enable-mpirun-prefix-by-default --disable-io-romio \
>>>>>>>>>>     --disable-vt --disable-mpi-fortran \
>>>>>>>>>>     --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
>>>>>>>>>> $ make
>>>>>>>>>> $ make install
>>>>>>>>>> 
>>>>>>>>>> ----------------
>>>>>>>>>> 
>>>>>>>>>> Test program
>>>>>>>>>> 
>>>>>>>>>> #include <stdio.h>
>>>>>>>>>> #include <stdlib.h>
>>>>>>>>>> #include <shmem.h>
>>>>>>>>>> int main(int argc, char **argv)
>>>>>>>>>> {
>>>>>>>>>>         int me, num_pe;
>>>>>>>>>>         shmem_init();
>>>>>>>>>>         num_pe = num_pes();
>>>>>>>>>>         me = my_pe();
>>>>>>>>>>         printf("Hello World from process %d of %d\n", me, num_pe);
>>>>>>>>>>         exit(0);
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> ----------------
>>>>>>>>>> 
>>>>>>>>>> Building the program
>>>>>>>>>> 
>>>>>>>>>> export PATH=/home/ariebs/mic/mpi/bin:$PATH
>>>>>>>>>> export PATH=/usr/linux-k1om-4.7/bin/:$PATH
>>>>>>>>>> source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
>>>>>>>>>> export 
>>>>>>>>>> LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
>>>>>>>>>> 
>>>>>>>>>> icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
>>>>>>>>>>         -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib 
>>>>>>>>>> -Wl,--enable-new-dtags \
>>>>>>>>>>         -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte 
>>>>>>>>>> -lopen-pal \
>>>>>>>>>>         -lm -ldl -lutil \
>>>>>>>>>>         -Wl,-rpath 
>>>>>>>>>> -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
>>>>>>>>>>         -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
>>>>>>>>>>         -o mic.out  shmem_hello.c
>>>>>>>>>> 
>>>>>>>>>> ----------------
>>>>>>>>>> 
>>>>>>>>>> Running the program
>>>>>>>>>> 
>>>>>>>>>> (Note that the program had been consistently failing. Then, when I 
>>>>>>>>>> logged back into the system to capture the results, it worked once,  
>>>>>>>>>> and then immediately failed when I tried again, as shown below. 
>>>>>>>>>> Logging in and out isn't sufficient to correct the problem. Overall, 
>>>>>>>>>> I think I had 3 successful runs in 30-40 attempts.)
>>>>>>>>>> 
>>>>>>>>>> $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
>>>>>>>>>> [atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in 
>>>>>>>>>> file base/plm_base_launch_support.c at line 426
>>>>>>>>>> Hello World from process 0 of 2
>>>>>>>>>> Hello World from process 1 of 2
>>>>>>>>>> $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
>>>>>>>>>> [atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in 
>>>>>>>>>> file base/plm_base_launch_support.c at line 426
>>>>>>>>>> [atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM 
>>>>>>>>>> failed to initialize - aborting
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> It looks like SHMEM_INIT failed for some reason; your parallel 
>>>>>>>>>> process is
>>>>>>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>>>>>>> fail during SHMEM_INIT; some of which are due to configuration or 
>>>>>>>>>> environment
>>>>>>>>>> problems.  This failure appears to be an internal failure; here's 
>>>>>>>>>> some
>>>>>>>>>> additional information (which may only be relevant to an Open SHMEM
>>>>>>>>>> developer):
>>>>>>>>>> 
>>>>>>>>>>   mca_memheap_base_select() failed
>>>>>>>>>>   --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0) 
>>>>>>>>>> with errorcode -1.
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> A SHMEM process is aborting at a time when it cannot guarantee that 
>>>>>>>>>> all
>>>>>>>>>> of its peer processes in the job will be killed properly.  You should
>>>>>>>>>> double check that everything has shut down cleanly.
>>>>>>>>>> 
>>>>>>>>>> Local host: atl1-01-mic0
>>>>>>>>>> PID:        189383
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>> Primary job  terminated normally, but 1 process returned
>>>>>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> shmemrun detected that one or more processes exited with non-zero 
>>>>>>>>>> status, thus causing
>>>>>>>>>> the job to be terminated. The first process to do so was:
>>>>>>>>>> 
>>>>>>>>>>   Process name: [[30881,1],0]
>>>>>>>>>>   Exit code:    255
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> 
>>>>>>>>>> Any thoughts about where to go from here?
>>>>>>>>>> 
>>>>>>>>>> Andy
>>>>>>>>>> 
>>>>>>>>>> -- 
>>>>>>>>>> Andy Riebs
>>>>>>>>>> Hewlett-Packard Company
>>>>>>>>>> High Performance Computing
>>>>>>>>>> +1 404 648 9024
>>>>>>>>>> My opinions are not necessarily those of HP