Hi Ralph,

Yes, this is attempting to get OSHMEM to run on the Phi.

I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
    --enable-debug --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud

(Note that I had to add "oob-ud" to the "--enable-mca-no-build" option, as the build complained that mca oob/ud needed mca common-verbs.)
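
For reference, the components that actually got built can be double-checked with ompi_info from the new install, run on the Phi since it is a MIC binary; a sketch:

$ /home/ariebs/mic/mpi-nightly/bin/ompi_info | grep -E "oob|verbs"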

With that configuration, here is what I am seeing now...

$ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
$ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [rsh]
[atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [isolated]
[atl1-01-mic0:189895] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [slurm]
[atl1-01-mic0:189895] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:189895] mca:base:select:(  plm) Selected component [rsh]
[atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 nodename hash 4121194178
[atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419
[atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start comm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:189895] [[32419,0],0] using dash_host
[atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0
[atl1-01-mic0:189895] [[32419,0],0] ignoring myself
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1] registered
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is not a dynamic spawn
[atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 1 (pid 189899, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

Local host: atl1-01-mic0
PID:        189899
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32419,1],1]
  Exit code:    255
--------------------------------------------------------------------------
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:189895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop comm
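
Since the failure is in mca_memheap_base_select(), the next attempt could also bump up the memheap/sshmem verbosity; a sketch, assuming the usual per-framework verbose MCA parameters apply to the OSHMEM frameworks as well:

$ shmemrun -H localhost -N 2 --mca sshmem mmap \
      --mca memheap_base_verbose 100 --mca sshmem_base_verbose 100 \
      --mca plm_base_verbose 5 $PWD/mic.out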




On 04/10/2015 06:37 PM, Ralph Castain wrote:
Andy - could you please try the current 1.8.5 nightly tarball and see if it helps? The error log indicates that it is failing to get the topology from some daemon; I'm assuming the one on the Phi?

You might also add --enable-debug to that configure line and then put --mca plm_base_verbose on the shmemrun cmd line to get more help.


On Apr 10, 2015, at 11:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:

Summary: MPI jobs work fine, SHMEM jobs work just often enough to be tantalizing, on an Intel Xeon Phi/MIC system.

Longer version

Thanks to the excellent write-up last June (<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>), I have been able to build a version of Open MPI for the Xeon Phi coprocessor that runs MPI jobs with no problem, but not SHMEM jobs. Just at the point where I was about to document the problems I was having with SHMEM, my trivial SHMEM job worked, and then failed when I tried to run it again immediately afterwards. I have a feeling I may be in uncharted territory here.

Environment
  • RHEL 6.5
  • Intel Composer XE 2015
  • Xeon Phi/MIC
----------------


Configuration

$ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
$ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
$ ./configure --prefix=/home/ariebs/mic/mpi \
   CC="icc -mmic" CXX="icpc -mmic" \
   --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
    LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio \
    --disable-vt --disable-mpi-fortran \
    --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
$ make
$ make install
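
To confirm which OSHMEM components made it into the build, oshmem_info from the new prefix can be run on the Phi; a sketch, assuming that binary works on the MIC:

$ /home/ariebs/mic/mpi/bin/oshmem_info | grep -E "memheap|sshmem|spml"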

----------------

Test program

#include <stdio.h>
#include <stdlib.h>
#include <shmem.h>
int main(int argc, char **argv)
{
        int me, num_pe;
        shmem_init();
        num_pe = num_pes();
        me = my_pe();
        printf("Hello World from process %ld of %ld\n", me, num_pe);
        exit(0);
}

----------------

Building the program

export PATH=/home/ariebs/mic/mpi/bin:$PATH
export PATH=/usr/linux-k1om-4.7/bin/:$PATH
source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH

icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
        -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
        -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
        -lm -ldl -lutil \
        -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
        -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
        -o mic.out  shmem_hello.c
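
The OSHMEM wrapper compiler should produce an equivalent binary with less typing; this is just a sketch, assuming oshcc works for the cross-build and passes -mmic through to icc:

oshcc -mmic -std=gnu99 -o mic.out shmem_hello.c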

----------------

Running the program

(Note that the program had been failing consistently. Then, when I logged back into the system to capture the results, it worked once, and then immediately failed when I tried again, as shown below. Logging in and out isn't sufficient to correct the problem. Overall, I think I had 3 successful runs in 30-40 attempts.)
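
A quick way to put a number on that failure rate is a simple loop (a sketch, using the same run command as below):

for i in $(seq 1 30); do
    shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out >/dev/null 2>&1 \
        && echo "run $i: PASS" || echo "run $i: FAIL"
done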

$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
[atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

Local host: atl1-01-mic0
PID:        189383
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[30881,1],0]
  Exit code:    255
--------------------------------------------------------------------------

Any thoughts about where to go from here?

Andy

-- 
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP




