Summary: MPI jobs work fine, but SHMEM jobs work just often enough to be
tantalizing, on an Intel Xeon Phi/MIC system.

Longer version: Thanks to the excellent write-up last June
(<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>), I have
been able to build a version of Open MPI for the Xeon Phi coprocessor that
runs MPI jobs on the coprocessor with no problem, but not SHMEM jobs. Just as
I was about to document the problems I was having with SHMEM, my trivial
SHMEM job worked -- and then failed when I tried to run it again immediately
afterwards. I have a feeling I may be in uncharted territory here.

Environment
Configuration

$ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
$ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
$ ./configure --prefix=/home/ariebs/mic/mpi \
      CC="icc -mmic" CXX="icpc -mmic" \
      --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
      AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
      LD=x86_64-k1om-linux-ld \
      --enable-mpirun-prefix-by-default --disable-io-romio \
      --disable-vt --disable-mpi-fortran \
      --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
$ make
$ make install

----------------
Test program

#include <stdio.h>
#include <stdlib.h>
#include <shmem.h>

int main(int argc, char **argv)
{
    int me, num_pe;

    shmem_init();
    num_pe = num_pes();
    me = my_pe();
    /* me and num_pe are ints, so %d is the matching conversion */
    printf("Hello World from process %d of %d\n", me, num_pe);
    exit(0);
}

----------------
Building the program

export PATH=/home/ariebs/mic/mpi/bin:$PATH
export PATH=/usr/linux-k1om-4.7/bin/:$PATH
source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH

icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
    -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
    -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
    -lm -ldl -lutil \
    -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
    -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
    -o mic.out shmem_hello.c

----------------
Running the program

(Note that the program had been failing consistently. Then, when I logged
back into the system to capture the results, it worked once, and immediately
failed when I tried again, as shown below. Logging in and out isn't
sufficient to correct the problem. Overall, I had perhaps 3 successful runs
in 30-40 attempts.)
$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
Hello World from process 0 of 2
Hello World from process 1 of 2

$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
[atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0)
with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Local host: atl1-01-mic0
  PID:        189383
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[30881,1],0]
  Exit code:    255
--------------------------------------------------------------------------

Any thoughts about where to go from here?

Andy

--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP
- [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel X... Andy Riebs