Andy - could you please try the current 1.8.5 nightly tarball and see if it
helps? The error log indicates that it is failing to get the topology from some
daemon - I'm assuming the one on the Phi.

You might also add --enable-debug to that configure line and then put -mca
plm_base_verbose on the shmemrun command line to get more help.
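Something along these lines - the verbosity levels here are my guesses, not gospel (any non-zero level should emit something), and memheap verbosity is worth a try too given where SHMEM_INIT is dying:

```shell
# Reconfigure with debug support: same cross-compile configure line as
# before, with --enable-debug added (flags elided here for brevity).
./configure --enable-debug ...

# Run with PLM verbosity turned up; 5 is an arbitrary-but-typical level.
# memheap_base_verbose may show which memheap component fails to select.
shmemrun -H localhost -N 2 --mca sshmem mmap \
    --mca plm_base_verbose 5 --mca memheap_base_verbose 100 ./mic.out
```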


> On Apr 10, 2015, at 11:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:
> 
> Summary: MPI jobs work fine, SHMEM jobs work just often enough to be 
> tantalizing, on an Intel Xeon Phi/MIC system.
> 
> Longer version
> 
> Thanks to the excellent write-up last June 
> (https://www.open-mpi.org/community/lists/users/2014/06/24711.php), I have
> been able to build a version of Open MPI for the Xeon Phi coprocessor that 
> runs MPI jobs on the Phi coprocessor with no problem, but not SHMEM jobs.  
> Just at the point where I was about to document the problems I was having 
> with SHMEM, my trivial SHMEM job worked. And then failed when I tried to run 
> it again, immediately afterwards. I have a feeling I may be in uncharted
> territory here.
> 
> Environment
> RHEL 6.5
> Intel Composer XE 2015
> Xeon Phi/MIC
> ----------------
> 
> 
> Configuration
> 
> $ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
> $ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
> $ ./configure --prefix=/home/ariebs/mic/mpi \
>    CC="icc -mmic" CXX="icpc -mmic" \
>    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>     AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
>     LD=x86_64-k1om-linux-ld \
>     --enable-mpirun-prefix-by-default --disable-io-romio \
>     --disable-vt --disable-mpi-fortran \
>     --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
> $ make
> $ make install
> 
> ----------------
> 
> Test program
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <shmem.h>
> int main(int argc, char **argv)
> {
>         int me, num_pe;
>         shmem_init();
>         num_pe = num_pes();
>         me = my_pe();
>         printf("Hello World from process %d of %d\n", me, num_pe);
>         exit(0);
> }
> 
> ----------------
> 
> Building the program
> 
> export PATH=/home/ariebs/mic/mpi/bin:$PATH
> export PATH=/usr/linux-k1om-4.7/bin/:$PATH
> source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
> export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
> 
> icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
>         -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
>         -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
>         -lm -ldl -lutil \
>         -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
>         -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
>         -o mic.out  shmem_hello.c
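(Incidentally - assuming the OpenSHMEM wrapper compilers were installed under the same prefix and picked up the cross-compile settings, `oshcc` should produce an equivalent link line with less typing. Untested on the MIC side, so treat this as a sketch:)

```shell
# oshcc is Open MPI's OpenSHMEM wrapper compiler; it supplies the
# -I/-L/-l and rpath flags spelled out manually above.
export PATH=/home/ariebs/mic/mpi/bin:$PATH
oshcc -mmic -std=gnu99 -o mic.out shmem_hello.c
```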
> 
> ----------------
> 
> Running the program
> 
> (Note that the program had been consistently failing. Then, when I logged 
> back into the system to capture the results, it worked once, and then
> immediately failed when I tried again, as shown below. Logging in and out 
> isn't sufficient to correct the problem. Overall, I think I had 3 successful 
> runs in 30-40 attempts.)
> 
> $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
> [atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in file 
> base/plm_base_launch_support.c at line 426
> Hello World from process 0 of 2
> Hello World from process 1 of 2
> $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
> [atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in file 
> base/plm_base_launch_support.c at line 426
> [atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to 
> initialize - aborting
> --------------------------------------------------------------------------
> It looks like SHMEM_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during SHMEM_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open SHMEM
> developer):
> 
>   mca_memheap_base_select() failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0) with 
> errorcode -1.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A SHMEM process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
> 
> Local host: atl1-01-mic0
> PID:        189383
> --------------------------------------------------------------------------
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> shmemrun detected that one or more processes exited with non-zero status, 
> thus causing
> the job to be terminated. The first process to do so was:
> 
>   Process name: [[30881,1],0]
>   Exit code:    255
> --------------------------------------------------------------------------
> 
> Any thoughts about where to go from here?
> 
> Andy
> 
> -- 
> Andy Riebs
> Hewlett-Packard Company
> High Performance Computing
> +1 404 648 9024
> My opinions are not necessarily those of HP
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/04/26670.php
