Hi Ralph,

Yes, this is attempting to get OSHMEM to run on the Phi. I grabbed
openmpi-dev-1484-g033418f.tar.bz2 and configured it with

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly \
      CC="icc -mmic" CXX="icpc -mmic" \
      --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
      AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
      LD=x86_64-k1om-linux-ld \
      --enable-mpirun-prefix-by-default --disable-io-romio \
      --disable-mpi-fortran --enable-debug \
      --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud

(Note that I had to add "oob-ud" to the "--enable-mca-no-build" option,
as the build complained that mca oob/ud needed mca common-verbs.)
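In case it helps, the test program itself is trivial -- essentially the
standard OSHMEM smoke test, along these lines (a sketch, not the exact
mic.out source):

    #include <stdio.h>
    #include <shmem.h>

    /* Minimal OSHMEM test: initialize, report PE identity, finalize.
     * A sketch of roughly what mic.out does; the exact source is not
     * reproduced here. */
    int main(void)
    {
        shmem_init();                  /* the call that fails below */
        int me   = shmem_my_pe();      /* this PE's rank */
        int npes = shmem_n_pes();      /* total number of PEs */
        printf("Hello from PE %d of %d\n", me, npes);
        shmem_finalize();
        return 0;
    }

It was built with the oshcc wrapper from the same install (plus -mmic
for the Phi).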
With that configuration, here is what I am seeing now...

$ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:189895] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:189895] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:189895] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 nodename hash 4121194178
[atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419
[atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start comm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:189895] [[32419,0],0] using dash_host
[atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0
[atl1-01-mic0:189895] [[32419,0],0] ignoring myself
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1] registered
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is not a dynamic spawn
[atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open SHMEM developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 1 (pid 189899, host=atl1-01-mic0)
with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Local host: atl1-01-mic0
  PID:        189899
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been
aborted.
-------------------------------------------------------
[atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[32419,1],1]
  Exit code:    255
--------------------------------------------------------------------------
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:189895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop comm
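Two things worth noting: the failure is in mca_memheap_base_select(),
i.e., in setting up the symmetric heap, and it happens before the
program touches that heap at all. Once shmem_init() works, a fuller test
would exercise the heap along these lines (again just a sketch, assuming
the standard OpenSHMEM 1.2 calls, not the actual test source):

    #include <stdio.h>
    #include <shmem.h>

    /* Sketch of a symmetric-heap exercise -- the kind of traffic that
     * SHMEM_SYMMETRIC_HEAP_SIZE=1G is sized for. Hypothetical example,
     * not the actual mic.out source. */
    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* shmem_malloc() draws from the symmetric heap managed by the
         * memheap framework -- the component failing to select above. */
        long *buf = (long *)shmem_malloc(sizeof(long));
        *buf = -1;
        shmem_barrier_all();

        /* Each PE writes its rank into its right neighbor's buffer. */
        shmem_long_p(buf, (long)me, (me + 1) % npes);
        shmem_barrier_all();

        printf("PE %d of %d received %ld\n", me, npes, *buf);
        shmem_free(buf);
        shmem_finalize();
        return 0;
    }

I'll also try re-running with "--mca memheap_base_verbose 100" to get
more detail out of the memheap selection (assuming the usual
<framework>_base_verbose MCA parameter applies to memheap as well).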
On 04/10/2015 06:37 PM, Ralph Castain wrote:
> Andy - could you please try the current 1.8.5 nightly tarball and see
> if it helps? The error log indicates that it is failing to get the
> topology from some daemon, I’m assuming the one on the Phi?