For talking between Phis on the same system, I recommend using the scif BTL, not tcp.
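The BTL is selected by name just like sm/tcp. Assuming the scif component actually got built (it needs the SCIF headers visible at configure time; ompi_info will tell you), a run between the two cards would look something like:

   $ ompi_info | grep -i scif
   $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0,mic1 -N 2 \
         --mca spml yoda --mca btl scif,sm,self $PWD/mic.out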
That said, it looks like the LD_LIBRARY_PATH is wrong on the remote system: orted can't find the Intel compiler libraries. -Nathan Hjelm, HPC-5, LANL On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote: > Progress! I can run my trivial program on the local PHI, but not the > other PHI, on the system. Here are the interesting parts: > > A pretty good recipe with last night's nightly master: > > $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" > CXX="icpc -mmic" \ > --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \ > AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib > LD=x86_64-k1om-linux-ld \ > --enable-mpirun-prefix-by-default --disable-io-romio > --disable-mpi-fortran \ > --enable-orterun-prefix-by-default \ > --enable-debug > $ make && make install > $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml > yoda --mca btl sm,self,tcp $PWD/mic.out > Hello World from process 0 of 2 > Hello World from process 1 of 2 > $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml > yoda --mca btl openib,sm,self $PWD/mic.out > Hello World from process 0 of 2 > Hello World from process 1 of 2 > $ > > However, I can't seem to cross the fabric. I can ssh freely back and forth > between mic0 and mic1. However, running the next 2 tests from mic0, it > certainly seems like the second one should work, too: > > $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda > --mca btl sm,self,tcp $PWD/mic.out > Hello World from process 0 of 2 > Hello World from process 1 of 2 > $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda > --mca btl sm,self,tcp $PWD/mic.out > /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared > libraries: libimf.so: cannot open shared object file: No such file or > directory > -------------------------------------------------------------------------- > ORTE was unable to reliably start one or more daemons. > This usually is caused by: > > * not finding the required libraries and/or binaries on > one or more nodes. Please check your PATH and LD_LIBRARY_PATH > settings, or configure OMPI with --enable-orterun-prefix-by-default > > * lack of authority to execute on one or more specified nodes. > Please verify your allocation and authorities. > > * the inability to write startup files into /tmp > (--tmpdir/orte_tmpdir_base). > Please check with your sys admin to determine the correct location to > use. > > * compilation of the orted with dynamic libraries when static are > required > (e.g., on Cray). Please check your configure cmd line and consider using > one of the contrib/platform definitions for your system type. > > * an inability to create a connection back to mpirun due to a > lack of common network interfaces and/or no route found between > them. Please check network connectivity (including firewalls > and network routing requirements). > ... > $ > > (Note that I get the same results with "--mca btl openib,sm,self"....)
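To be clear about why: the plm verbose output further down shows the rsh launcher starting orted over ssh with only /home/ariebs/mic/mpi-nightly/lib prepended to LD_LIBRARY_PATH, so the Intel runtime directory has to come from the remote shell's own startup environment. Two sketches, assuming the paths from your output and that bash on the MIC reads ~/.bashrc for non-interactive ssh commands:

   $ ssh mic1 'echo "export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:\$LD_LIBRARY_PATH" >> ~/.bashrc'

or take the shared-library dependency out of the picture entirely by linking the Intel runtime statically into Open MPI (icc supports -static-intel), i.e. add LDFLAGS=-static-intel to your configure line.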
> > $ ssh mic1 file > /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so > /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF > 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 > (SYSV), dynamically linked, not stripped > $ shmemrun -x > > LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so > -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out > /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared > libraries: libimf.so: cannot open shared object file: No such file or > directory > -------------------------------------------------------------------------- > ORTE was unable to reliably start one or more daemons. > This usually is caused by: > > * not finding the required libraries and/or binaries on > one or more nodes. Please check your PATH and LD_LIBRARY_PATH > settings, or configure OMPI with --enable-orterun-prefix-by-default > > * lack of authority to execute on one or more specified nodes. > Please verify your allocation and authorities. > > * the inability to write startup files into /tmp > (--tmpdir/orte_tmpdir_base). > Please check with your sys admin to determine the correct location to > use. > > * compilation of the orted with dynamic libraries when static are > required > (e.g., on Cray). Please check your configure cmd line and consider using > one of the contrib/platform definitions for your system type. > > * an inability to create a connection back to mpirun due to a > lack of common network interfaces and/or no route found between > them. Please check network connectivity (including firewalls > and network routing requirements). > > Following here is > - IB information > - Running the failing case with lots of debugging information. (As you > might imagine, I've tried 17 ways from Sunday to try to ensure that > libimf.so is found.) 
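That explains the 17 ways: -x only exports variables to the application processes after the daemons are already up, so LD_PRELOAD passed through it never reaches orted, which is the thing failing to load here. A quick way to see what the daemon will actually resolve in a non-interactive shell (path from your prefix):

   $ ssh mic1 'ldd /home/ariebs/mic/mpi-nightly/bin/orted | grep "not found"'

If libimf.so shows up as "not found" there, no shmemrun argument will fix it; the path has to be in the environment ssh hands the remote shell (or linked in statically, as above).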
> > $ ibv_devices > device node GUID > ------ ---------------- > mlx4_0 24be05ffffa57160 > scif0 4c79bafffe4402b6 > $ ibv_devinfo > hca_id: mlx4_0 > transport: InfiniBand (0) > fw_ver: 2.11.1250 > node_guid: 24be:05ff:ffa5:7160 > sys_image_guid: 24be:05ff:ffa5:7163 > vendor_id: 0x02c9 > vendor_part_id: 4099 > hw_ver: 0x0 > phys_port_cnt: 2 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 8 > port_lid: 86 > port_lmc: 0x00 > link_layer: InfiniBand > > port: 2 > state: PORT_DOWN (1) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 0 > port_lid: 0 > port_lmc: 0x00 > link_layer: InfiniBand > > hca_id: scif0 > transport: SCIF (2) > fw_ver: 0.0.1 > node_guid: 4c79:baff:fe44:02b6 > sys_image_guid: 4c79:baff:fe44:02b6 > vendor_id: 0x8086 > vendor_part_id: 0 > hw_ver: 0x1 > phys_port_cnt: 1 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 4096 (5) > active_mtu: 4096 (5) > sm_lid: 1 > port_lid: 1001 > port_lmc: 0x00 > link_layer: SCIF > > $ shmemrun -x > > LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so > -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca plm_base_verbose > 5 --mca memheap_base_verbose 100 $PWD/mic.out > [atl1-01-mic0:191024] mca:base:select:( plm) Querying component [rsh] > [atl1-01-mic0:191024] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : > rsh path NULL > [atl1-01-mic0:191024] mca:base:select:( plm) Query of component [rsh] set > priority to 10 > [atl1-01-mic0:191024] mca:base:select:( plm) Querying component > [isolated] > [atl1-01-mic0:191024] mca:base:select:( plm) Query of component > [isolated] set priority to 0 > [atl1-01-mic0:191024] mca:base:select:( plm) Querying component [slurm] > [atl1-01-mic0:191024] mca:base:select:( plm) Skipping component [slurm]. 
> Query failed to return a module > [atl1-01-mic0:191024] mca:base:select:( plm) Selected component [rsh] > [atl1-01-mic0:191024] plm:base:set_hnp_name: initial bias 191024 nodename > hash 4121194178 > [atl1-01-mic0:191024] plm:base:set_hnp_name: final jobfam 29012 > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh_setup on agent ssh : rsh path > NULL > [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive start comm > [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_job > [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm > [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm creating map > [atl1-01-mic0:191024] [[29012,0],0] setup:vm: working unmanaged allocation > [atl1-01-mic0:191024] [[29012,0],0] using dash_host > [atl1-01-mic0:191024] [[29012,0],0] checking node mic1 > [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm add new daemon > [[29012,0],1] > [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm assigning new daemon > [[29012,0],1] to node mic1 > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: launching vm > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: local shell: 0 (bash) > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: assuming same remote shell as > local shell > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: remote shell: 0 (bash) > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: final template argv: > /usr/bin/ssh <template> > PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; > LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export > LD_LIBRARY_PATH ; > DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; > export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted > --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca > orte_ess_jobid "1901330432" -mca orte_ess_vpid "<template>" -mca > orte_ess_num_procs "2" -mca orte_hnp_uri > > "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1" > --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca > plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca > rmaps_ppr_n_pernode "2" > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh:launch daemon 0 not a child of > mine > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: adding node mic1 to launch > list > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: activating launch event > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: recording launch of daemon > [[29012,0],1] > [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: executing: (/usr/bin/ssh) > [/usr/bin/ssh mic1 PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; > export PATH ; > LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export > LD_LIBRARY_PATH ; > DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; > export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted > --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca > orte_ess_jobid "1901330432" -mca orte_ess_vpid 1 -mca orte_ess_num_procs > "2" -mca orte_hnp_uri > > "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1" > --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca > plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca > rmaps_ppr_n_pernode "2"] > /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared > libraries: libimf.so: cannot open shared object file: No such file or > directory > [atl1-01-mic0:191024] [[29012,0],0] daemon 1 failed with status 127 > [atl1-01-mic0:191024] [[29012,0],0] plm:base:orted_cmd sending orted_exit > commands > 
-------------------------------------------------------------------------- > ORTE was unable to reliably start one or more daemons. > This usually is caused by: > > * not finding the required libraries and/or binaries on > one or more nodes. Please check your PATH and LD_LIBRARY_PATH > settings, or configure OMPI with --enable-orterun-prefix-by-default > > * lack of authority to execute on one or more specified nodes. > Please verify your allocation and authorities. > > * the inability to write startup files into /tmp > (--tmpdir/orte_tmpdir_base). > Please check with your sys admin to determine the correct location to > use. > > * compilation of the orted with dynamic libraries when static are > required > (e.g., on Cray). Please check your configure cmd line and consider using > one of the contrib/platform definitions for your system type. > > * an inability to create a connection back to mpirun due to a > lack of common network interfaces and/or no route found between > them. Please check network connectivity (including firewalls > and network routing requirements). > -------------------------------------------------------------------------- > [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive stop comm > > On 04/13/2015 08:50 AM, Andy Riebs wrote: > > Hi Ralph, > > Here are the results with last night's "master" nightly, > openmpi-dev-1487-g9c6d452.tar.bz2, and adding the memheap_base_verbose > option (yes, it looks like the "ERROR_LOG" problem has gone away): > > $ cat /proc/sys/kernel/shmmax > 33554432 > $ cat /proc/sys/kernel/shmall > 2097152 > $ cat /proc/sys/kernel/shmmni > 4096 > $ export SHMEM_SYMMETRIC_HEAP=1M > $ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 > --mca memheap_base_verbose 100 $PWD/mic.out > [atl1-01-mic0:190439] mca:base:select:( plm) Querying component [rsh] > [atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : > rsh path NULL > [atl1-01-mic0:190439] mca:base:select:( plm) Query of component [rsh] > set priority to 10 > [atl1-01-mic0:190439] mca:base:select:( plm) Querying component > [isolated] > [atl1-01-mic0:190439] mca:base:select:( plm) Query of component > [isolated] set priority to 0 > [atl1-01-mic0:190439] mca:base:select:( plm) Querying component [slurm] > [atl1-01-mic0:190439] mca:base:select:( plm) Skipping component > [slurm]. 
Query failed to return a module > [atl1-01-mic0:190439] mca:base:select:( plm) Selected component [rsh] > [atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439 > nodename hash 4121194178 > [atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875 > [atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : rsh > path NULL > [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm > [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job > [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm > [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map > [atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged > allocation > [atl1-01-mic0:190439] [[31875,0],0] using dash_host > [atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0 > [atl1-01-mic0:190439] [[31875,0],0] ignoring myself > [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in > allocation > [atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1] > [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job > [31875,1] > [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof for > job [31875,1] > [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1] registered > [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] is not > a dynamic spawn > [atl1-01-mic0:190441] mca: base: components_register: registering > memheap components > [atl1-01-mic0:190441] mca: base: components_register: found loaded > component buddy > [atl1-01-mic0:190441] mca: base: components_register: component buddy > has no register or open function > [atl1-01-mic0:190442] mca: base: components_register: registering > memheap components > [atl1-01-mic0:190442] mca: base: components_register: found loaded > component buddy > [atl1-01-mic0:190442] mca: base: components_register: component buddy > has no register or open function > [atl1-01-mic0:190442] mca: base: components_register: found loaded > component ptmalloc > [atl1-01-mic0:190442] mca: base: components_register: component ptmalloc > has no register or open function > [atl1-01-mic0:190441] mca: base: components_register: found loaded > component ptmalloc > [atl1-01-mic0:190441] mca: base: components_register: component ptmalloc > has no register or open function > [atl1-01-mic0:190441] mca: base: components_open: opening memheap > components > [atl1-01-mic0:190441] mca: base: components_open: found loaded component > buddy > [atl1-01-mic0:190441] mca: base: components_open: component buddy open > function successful > [atl1-01-mic0:190441] mca: base: components_open: found loaded component > ptmalloc > [atl1-01-mic0:190441] mca: base: components_open: component ptmalloc > open function successful > [atl1-01-mic0:190442] mca: base: components_open: opening memheap > components > [atl1-01-mic0:190442] mca: base: components_open: found loaded component > buddy > [atl1-01-mic0:190442] mca: base: components_open: component buddy open > function successful > [atl1-01-mic0:190442] mca: base: components_open: found loaded component > ptmalloc > [atl1-01-mic0:190442] mca: base: components_open: component ptmalloc > open function successful > [atl1-01-mic0:190442] base/memheap_base_alloc.c:38 - > mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 > segments by method: 1 > [atl1-01-mic0:190441] base/memheap_base_alloc.c:38 - > mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 > segments by method: 1 > [atl1-01-mic0:190442] base/memheap_base_static.c:205 - 
_load_segments() > add: 00600000-00601000 rw-p 00000000 00:11 > 6029314 /home/ariebs/bench/hello/mic.out > [atl1-01-mic0:190441] base/memheap_base_static.c:205 - _load_segments() > add: 00600000-00601000 rw-p 00000000 00:11 > 6029314 /home/ariebs/bench/hello/mic.out > [atl1-01-mic0:190442] base/memheap_base_static.c:75 - > mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 > segments > [atl1-01-mic0:190442] base/memheap_base_register.c:39 - > mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 > 270532608 bytes type=0x1 id=0xFFFFFFFF > [atl1-01-mic0:190441] base/memheap_base_static.c:75 - > mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 > segments > [atl1-01-mic0:190441] base/memheap_base_register.c:39 - > mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 > 270532608 bytes type=0x1 id=0xFFFFFFFF > [atl1-01-mic0:190442] Error base/memheap_base_register.c:130 - > _reg_segment() Failed to register segment > [atl1-01-mic0:190441] Error base/memheap_base_register.c:130 - > _reg_segment() Failed to register segment > [atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM > failed to initialize - aborting > [atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM > failed to initialize - aborting > > -------------------------------------------------------------------------- > It looks like SHMEM_INIT failed for some reason; your parallel process > is > likely to abort. There are many reasons that a parallel process can > fail during SHMEM_INIT; some of which are due to configuration or > environment > problems. This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open SHMEM > developer): > > mca_memheap_base_select() failed > --> Returned "Error" (-1) instead of "Success" (0) > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0) with > errorcode -1. > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > A SHMEM process is aborting at a time when it cannot guarantee that all > of its peer processes in the job will be killed properly. You should > double check that everything has shut down cleanly. > > Local host: atl1-01-mic0 > PID: 190441 > > -------------------------------------------------------------------------- > ------------------------------------------------------- > Primary job terminated normally, but 1 process returned > a non-zero exit code.. Per user-direction, the job has been aborted. > ------------------------------------------------------- > [atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending > orted_exit commands > > -------------------------------------------------------------------------- > shmemrun detected that one or more processes exited with non-zero > status, thus causing > the job to be terminated. 
The first process to do so was: > > Process name: [[31875,1],0] > Exit code: 255 > > -------------------------------------------------------------------------- > [atl1-01-mic0:190439] 1 more process has sent help message > help-shmem-runtime.txt / shmem_init:startup:internal-failure > [atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" to 0 > to see all help / error messages > [atl1-01-mic0:190439] 1 more process has sent help message > help-shmem-api.txt / shmem-abort > [atl1-01-mic0:190439] 1 more process has sent help message > help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed > [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm > > On 04/12/2015 03:09 PM, Ralph Castain wrote: > > Sorry about that - I hadn't brought it over to the 1.8 branch yet. > I've done so now, which means the ERROR_LOG shouldn't show up any > more. It won't fix the memheap problem, though. > You might try adding "--mca memheap_base_verbose 100" to your cmd line > so we can see why none of the memheap components are being selected. > > On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com> wrote: > Hi Ralph, > > Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2: > > $ shmemrun -H localhost -N 2 --mca sshmem mmap --mca > plm_base_verbose 5 $PWD/mic.out > [atl1-01-mic0:190189] mca:base:select:( plm) Querying component > [rsh] > [atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent > ssh : rsh path NULL > [atl1-01-mic0:190189] mca:base:select:( plm) Query of component > [rsh] set priority to 10 > [atl1-01-mic0:190189] mca:base:select:( plm) Querying component > [isolated] > [atl1-01-mic0:190189] mca:base:select:( plm) Query of component > [isolated] set priority to 0 > [atl1-01-mic0:190189] mca:base:select:( plm) Querying component > [slurm] > [atl1-01-mic0:190189] mca:base:select:( plm) Skipping component > [slurm]. Query failed to return a module > [atl1-01-mic0:190189] mca:base:select:( plm) Selected component > [rsh] > [atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189 > nodename hash 4121194178 > [atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137 > [atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh > path NULL > [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm > [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job > [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm > [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map > [atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged > allocation > [atl1-01-mic0:190189] [[32137,0],0] using dash_host > [atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0 > [atl1-01-mic0:190189] [[32137,0],0] ignoring myself > [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in > allocation > [atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1] > [atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in > file base/plm_base_launch_support.c at line 440 > [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job > [32137,1] > [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof > for job [32137,1] > [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1] > registered > [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is > not a dynamic spawn > > -------------------------------------------------------------------------- > It looks like SHMEM_INIT failed for some reason; your parallel > process is > likely to abort. 
There are many reasons that a parallel process can > fail during SHMEM_INIT; some of which are due to configuration or > environment > problems. This failure appears to be an internal failure; here's > some > additional information (which may only be relevant to an Open SHMEM > developer): > > mca_memheap_base_select() failed > --> Returned "Error" (-1) instead of "Success" (0) > > -------------------------------------------------------------------------- > [atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM > failed to initialize - aborting > [atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM > failed to initialize - aborting > > -------------------------------------------------------------------------- > SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0) > with errorcode -1. > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > A SHMEM process is aborting at a time when it cannot guarantee that > all > of its peer processes in the job will be killed properly. You > should > double check that everything has shut down cleanly. > > Local host: atl1-01-mic0 > PID: 190192 > > -------------------------------------------------------------------------- > ------------------------------------------------------- > Primary job terminated normally, but 1 process returned > a non-zero exit code.. Per user-direction, the job has been aborted. > ------------------------------------------------------- > [atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending > orted_exit commands > > -------------------------------------------------------------------------- > shmemrun detected that one or more processes exited with non-zero > status, thus causing > the job to be terminated. The first process to do so was: > > Process name: [[32137,1],0] > Exit code: 255 > > -------------------------------------------------------------------------- > [atl1-01-mic0:190189] 1 more process has sent help message > help-shmem-runtime.txt / shmem_init:startup:internal-failure > [atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate" > to 0 to see all help / error messages > [atl1-01-mic0:190189] 1 more process has sent help message > help-shmem-api.txt / shmem-abort > [atl1-01-mic0:190189] 1 more process has sent help message > help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all > killed > [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm > > On 04/11/2015 07:41 PM, Ralph Castain wrote: > > Got it - thanks. I fixed that ERROR_LOG issue (I think- please > verify). I suspect the memheap issue relates to something else, > but I probably need to let the OSHMEM folks comment on it > > On Apr 11, 2015, at 9:52 AM, Andy Riebs <andy.ri...@hp.com> > wrote: > Everything is built on the Xeon side, with the icc "-mmic" > switch. I then ssh into one of the PHIs, and run shmemrun from > there. > > On 04/11/2015 12:00 PM, Ralph Castain wrote: > > Let me try to understand the setup a little better. Are you > running shmemrun on the PHI itself? Or is it running on the > host processor, and you are trying to spawn a process onto the > Phi? > > On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com> > wrote: > Hi Ralph, > > Yes, this is attempting to get OSHMEM to run on the Phi. 
> > I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured > it with > > $ ./configure --prefix=/home/ariebs/mic/mpi-nightly > CC="icc -mmic" CXX="icpc -mmic" \ > --build=x86_64-unknown-linux-gnu > --host=x86_64-k1om-linux \ > AR=x86_64-k1om-linux-ar > RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \ > --enable-mpirun-prefix-by-default > --disable-io-romio --disable-mpi-fortran \ > --enable-debug > > --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud > > (Note that I had to add "oob-ud" to the > "--enable-mca-no-build" option, as the build complained that > mca oob/ud needed mca common-verbs.) > > With that configuration, here is what I am seeing now... > > $ export SHMEM_SYMMETRIC_HEAP_SIZE=1G > $ shmemrun -H localhost -N 2 --mca sshmem mmap --mca > plm_base_verbose 5 $PWD/mic.out > [atl1-01-mic0:189895] mca:base:select:( plm) Querying > component [rsh] > [atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on > agent ssh : rsh path NULL > [atl1-01-mic0:189895] mca:base:select:( plm) Query of > component [rsh] set priority to 10 > [atl1-01-mic0:189895] mca:base:select:( plm) Querying > component [isolated] > [atl1-01-mic0:189895] mca:base:select:( plm) Query of > component [isolated] set priority to 0 > [atl1-01-mic0:189895] mca:base:select:( plm) Querying > component [slurm] > [atl1-01-mic0:189895] mca:base:select:( plm) Skipping > component [slurm]. Query failed to return a module > [atl1-01-mic0:189895] mca:base:select:( plm) Selected > component [rsh] > [atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias > 189895 nodename hash 4121194178 > [atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam > 32419 > [atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent > ssh : rsh path NULL > [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start > comm > [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job > [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm > [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm > creating map > [atl1-01-mic0:189895] [[32419,0],0] setup:vm: working > unmanaged allocation > [atl1-01-mic0:189895] [[32419,0],0] using dash_host > [atl1-01-mic0:189895] [[32419,0],0] checking node > atl1-01-mic0 > [atl1-01-mic0:189895] [[32419,0],0] ignoring myself > [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only > HNP in allocation > [atl1-01-mic0:189895] [[32419,0],0] complete_setup on job > [32419,1] > [atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not > found in file base/plm_base_launch_support.c at line 440 > [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for > job [32419,1] > [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring > up iof for job [32419,1] > [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch > [32419,1] registered > [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job > [32419,1] is not a dynamic spawn > [atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() > SHMEM failed to initialize - aborting > [atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() > SHMEM failed to initialize - aborting > > -------------------------------------------------------------------------- > It looks like SHMEM_INIT failed for some reason; your > parallel process is > likely to abort. There are many reasons that a parallel > process can > fail during SHMEM_INIT; some of which are due to > configuration or environment > problems. 
> > This failure appears to be an internal failure; > here's some > additional information (which may only be relevant to an > Open SHMEM > developer): > > mca_memheap_base_select() failed > --> Returned "Error" (-1) instead of "Success" (0) > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > SHMEM_ABORT was invoked on rank 1 (pid 189899, > host=atl1-01-mic0) with errorcode -1. > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > A SHMEM process is aborting at a time when it cannot > guarantee that all > of its peer processes in the job will be killed properly. > You should > double check that everything has shut down cleanly. > > Local host: atl1-01-mic0 > PID: 189899 > > -------------------------------------------------------------------------- > ------------------------------------------------------- > Primary job terminated normally, but 1 process returned > a non-zero exit code.. Per user-direction, the job has been > aborted. > ------------------------------------------------------- > [atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd > sending orted_exit commands > > -------------------------------------------------------------------------- > shmemrun detected that one or more processes exited with > non-zero status, thus causing > the job to be terminated. The first process to do so was: > > Process name: [[32419,1],1] > Exit code: 255 > > -------------------------------------------------------------------------- > [atl1-01-mic0:189895] 1 more process has sent help message > help-shmem-runtime.txt / shmem_init:startup:internal-failure > [atl1-01-mic0:189895] Set MCA parameter > "orte_base_help_aggregate" to 0 to see all help / error > messages > [atl1-01-mic0:189895] 1 more process has sent help message > help-shmem-api.txt / shmem-abort > [atl1-01-mic0:189895] 1 more process has sent help message > help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee > all killed > [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop > comm > > On 04/10/2015 06:37 PM, Ralph Castain wrote: > > Andy - could you please try the current 1.8.5 nightly > tarball and see if it helps? The error log indicates that > it is failing to get the topology from some daemon, I'm > assuming the one on the Phi? > You might also add --enable-debug to that configure line > and then put -mca plm_base_verbose on the shmemrun cmd to > get more help > > On Apr 10, 2015, at 11:55 AM, Andy Riebs > <andy.ri...@hp.com> wrote: > Summary: MPI jobs work fine, SHMEM jobs work just often > enough to be tantalizing, on an Intel Xeon Phi/MIC > system. > > Longer version > > Thanks to the excellent write-up last June > > (<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>), > I have been able to build a version of Open MPI for the > Xeon Phi coprocessor that runs MPI jobs on the Phi > coprocessor with no problem, but not SHMEM jobs. Just > at the point where I was about to document the problems > I was having with SHMEM, my trivial SHMEM job worked. > And then failed when I tried to run it again, > immediately afterwards. I have a feeling I may be in > uncharted territory here. 
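One more observation on the memheap failure up-thread: that run exported SHMEM_SYMMETRIC_HEAP=1M, but the variable used successfully at the top of this thread is SHMEM_SYMMETRIC_HEAP_SIZE. The verbose log shows the heap actually coming out at 270532608 bytes (258 MB, i.e. a default-sized heap rather than 1 MB), so the setting never took effect; registering a 258 MB segment is far more likely to trip over kernel.shmmax (33554432, i.e. 32 MB, per the values you printed) or a low locked-memory limit than the 1 MB you intended. A cheap thing to rule out first (a guess, not a confirmed root cause):

   $ ulimit -l
   $ export SHMEM_SYMMETRIC_HEAP_SIZE=1M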
> > Environment > * RHEL 6.5 > * Intel Composer XE 2015 > * Xeon Phi/MIC > ---------------- > > Configuration > > $ export PATH=/usr/linux-k1om-4.7/bin/:$PATH > $ source > /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh > intel64 > $ ./configure --prefix=/home/ariebs/mic/mpi \ > CC="icc -mmic" CXX="icpc -mmic" \ > --build=x86_64-unknown-linux-gnu > --host=x86_64-k1om-linux \ > AR=x86_64-k1om-linux-ar > RANLIB=x86_64-k1om-linux-ranlib \ > LD=x86_64-k1om-linux-ld \ > --enable-mpirun-prefix-by-default --disable-io-romio > \ > --disable-vt --disable-mpi-fortran \ > > --enable-mca-no-build=btl-usnic,btl-openib,common-verbs > $ make > $ make install > > ---------------- > > Test program > > #include <stdio.h> > #include <stdlib.h> > #include <shmem.h> > int main(int argc, char **argv) > { > int me, num_pe; > shmem_init(); > num_pe = num_pes(); > me = my_pe(); > printf("Hello World from process %d of %d\n", > me, num_pe); > exit(0); > } > > ---------------- > > Building the program > > export PATH=/home/ariebs/mic/mpi/bin:$PATH > export PATH=/usr/linux-k1om-4.7/bin/:$PATH > source > /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh > intel64 > export > > LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH > > icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include > -pthread \ > -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib > -Wl,--enable-new-dtags \ > -L/home/ariebs/mic/mpi/lib -loshmem -lmpi > -lopen-rte -lopen-pal \ > -lm -ldl -lutil \ > -Wl,-rpath > > -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic > \ > > -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic > \ > -o mic.out shmem_hello.c > > ---------------- > > Running the program > > (Note that the program had been consistently failing. > Then, when I logged back into the system to capture the > results, it worked once, and then immediately failed > when I tried again, as shown below. Logging in and out > isn't sufficient to correct the problem. Overall, I > think I had 3 successful runs in 30-40 attempts.) > > $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out > [atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not > found in file base/plm_base_launch_support.c at line 426 > Hello World from process 0 of 2 > Hello World from process 1 of 2 > $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out > [atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not > found in file base/plm_base_launch_support.c at line 426 > [atl1-01-mic0:189383] Error: pshmem_init.c:61 - > shmem_init() SHMEM failed to initialize - aborting > > -------------------------------------------------------------------------- > It looks like SHMEM_INIT failed for some reason; your > parallel process is > likely to abort. There are many reasons that a parallel > process can > fail during SHMEM_INIT; some of which are due to > configuration or environment > problems. This failure appears to be an internal > failure; here's some > additional information (which may only be relevant to an > Open SHMEM > developer): > > mca_memheap_base_select() failed > --> Returned "Error" (-1) instead of "Success" (0) > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > SHMEM_ABORT was invoked on rank 0 (pid 189383, > host=atl1-01-mic0) with errorcode -1. 
> > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > A SHMEM process is aborting at a time when it cannot > guarantee that all > of its peer processes in the job will be killed > properly. You should > double check that everything has shut down cleanly. > > Local host: atl1-01-mic0 > PID: 189383 > > -------------------------------------------------------------------------- > ------------------------------------------------------- > Primary job terminated normally, but 1 process returned > a non-zero exit code.. Per user-direction, the job has > been aborted. > ------------------------------------------------------- > > -------------------------------------------------------------------------- > shmemrun detected that one or more processes exited with > non-zero status, thus causing > the job to be terminated. The first process to do so > was: > > Process name: [[30881,1],0] > Exit code: 255 > > -------------------------------------------------------------------------- > > Any thoughts about where to go from here? > > Andy > > -- > Andy Riebs > Hewlett-Packard Company > High Performance Computing > +1 404 648 9024 > My opinions are not necessarily those of HP
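For what it's worth, once orted comes up cleanly on both cards, a sanity check across the fabric with scif preferred over tcp (one process per card; hosts and heap size as used earlier in the thread) would look something like:

   $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0,mic1 -N 1 \
         --mca spml yoda --mca btl scif,sm,self $PWD/mic.out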