On Fri, Jun 1, 2012 at 5:00 AM, Jeff Squyres <jsquy...@cisco.com> wrote:

> Try running:
>
> which mpirun
> ssh cl2n022 which mpirun
> ssh cl2n010 which mpirun
>
> and
>
> ldd your_mpi_executable
> ssh cl2n022 ldd your_mpi_executable
> ssh cl2n010 ldd your_mpi_executable
>
> Compare the results and ensure that you're finding the same mpirun on all
> nodes, and the same libmpi.so on all nodes.  There may well be another Open
> MPI installed in some non-default location of which you're unaware.
>

I'll try that, Jeff (results given below). However, I suspect there is
something goofy about this (brand new) cluster itself, because among the
countless jobs that failed, one ran without error, and all I had changed was
the order of the echo and which commands. We've also observed some peculiar
behaviour on this cluster with Intel MPI that seemed to be tied to the
number of tasks requested. After more experimentation, the Open MPI version
of the program also seems to be sensitive to the task count (e.g., it works
with 48 but fails with 64).
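
To pin down the threshold, I'll probably sweep the job size with something
like this (just a sketch; it resubmits the same script at a few sizes):

for n in 48 56 64 128; do
    qsub -l procs=$n job.pbs
done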

Thanks for the feedback Jeff, but I think the ball is firmly in my court.
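
For subsequent runs, I'll also script your per-node comparison along these
lines (a sketch; it assumes Torque's $PBS_NODEFILE is available inside the
job and that ssh between the allocated nodes works):

for node in $(sort -u $PBS_NODEFILE); do
    echo "== $node =="
    ssh $node which mpirun
    ssh $node ldd $PBS_O_WORKDIR/test-ompi16 | grep libmpi
done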



I ran the following PBS script with "qsub -l procs=128 job.pbs".
Environment variables are set using the Environment Modules package.

echo $HOSTNAME
which mpiexec
module load library/openmpi/1.6-intel
which mpiexec
echo $PATH
echo $LD_LIBRARY_PATH
ldd test-ompi16
mpiexec --prefix /lustre/jasper/software/openmpi/openmpi-1.6-intel ./test-ompi16

Standard output gave

cl2n011

/lustre/jasper/software/openmpi/openmpi-1.6-intel/bin/mpiexec

/lustre/jasper/software/openmpi/openmpi-1.6-intel/bin:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/bin/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64:/home/esumbar/local/bin:/home/esumbar/bin:/usr/kerberos/bin:/bin:/usr/bin:/opt/sgi/sgimc/bin:/usr/local/torque/sbin:/usr/local/torque/bin

/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

    linux-vdso.so.1 =>  (0x00007fffb5358000)
    libmpi.so.1 => /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 (0x00002b3968d1d000)
    libdl.so.2 => /lib64/libdl.so.2 (0x000000329ce00000)
    libimf.so => /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libimf.so (0x00002b3969137000)
    libm.so.6 => /lib64/libm.so.6 (0x000000329d200000)
    librt.so.1 => /lib64/librt.so.1 (0x000000329da00000)
    libnsl.so.1 => /lib64/libnsl.so.1 (0x00000032a6400000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00000032a8400000)
    libsvml.so => /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libsvml.so (0x00002b3969504000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000032a4c00000)
    libintlc.so.5 => /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libintlc.so.5 (0x00002b3969c77000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x000000329d600000)
    libc.so.6 => /lib64/libc.so.6 (0x000000329ca00000)
    /lib64/ld-linux-x86-64.so.2 (0x000000329c200000)


Standard error gave

which: no mpiexec in (/home/esumbar/local/bin:/home/esumbar/bin:/usr/kerberos/bin:/bin:/usr/bin:/opt/sgi/sgimc/bin:/usr/local/torque/sbin:/usr/local/torque/bin)

[cl2n005:05142] *** Process received signal ***
[cl2n005:05142] Signal: Segmentation fault (11)
[cl2n005:05142] Signal code: Address not mapped (1)
[cl2n005:05142] Failing at address: 0x10
[cl2n005:05142] [ 0] /lib64/libpthread.so.0 [0x373180ebe0]
[cl2n005:05142] [ 1] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2aff9aad5113]
[cl2n005:05142] [ 2] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59) [0x2aff9aad78a9]
[cl2n005:05142] [ 3] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 [0x2aff9aad7596]
[cl2n005:05142] [ 4] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(ompi_free_list_grow+0x89) [0x2aff9aa0fa59]
[cl2n005:05142] [ 5] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(ompi_free_list_init_ex+0x9c) [0x2aff9aa0fd8c]
[cl2n005:05142] [ 6] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so [0x2aff9e94561c]
[cl2n005:05142] [ 7] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(mca_btl_base_select+0x130) [0x2aff9aa57930]
[cl2n005:05142] [ 8] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0xe) [0x2aff9e52bc1e]
[cl2n005:05142] [ 9] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(mca_bml_base_init+0x72) [0x2aff9aa570b2]
[cl2n005:05142] [10] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_pml_ob1.so [0x2aff9e1107e9]
[cl2n005:05142] [11] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(mca_pml_base_select+0x43e) [0x2aff9aa6592e]
[cl2n005:05142] [12] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(ompi_mpi_init+0x782) [0x2aff9aa276a2]
[cl2n005:05142] [13] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(MPI_Init+0xf4) [0x2aff9aa3f884]
[cl2n005:05142] [14] ./test-ompi16(main+0x4c) [0x400b5c]
[cl2n005:05142] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3730c1d994]
[cl2n005:05142] [16] ./test-ompi16 [0x400a59]
[cl2n005:05142] *** End of error message ***
[cl2n006:32362] [[58962,0],5] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n006:32362] [[58962,0],5] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n006:32362] [[58962,0],5] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n003:04157] [[58962,0],8] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n003:04157] [[58962,0],8] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n003:04157] [[58962,0],8] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
--------------------------------------------------------------------------
mpiexec noticed that process rank 77 with PID 5142 on node cl2n005 exited
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
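
Since the segfault happens while the openib BTL is setting up its free
lists, one isolation test I can try (not a fix, and just a sketch) is
forcing the run over shared memory and TCP to see whether the crash follows
the InfiniBand path:

mpiexec --prefix /lustre/jasper/software/openmpi/openmpi-1.6-intel \
    --mca btl self,sm,tcp ./test-ompi16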


-- 
Edmund Sumbar
University of Alberta
+1 780 492 9360
