On Fri, Jun 1, 2012 at 8:09 AM, Jeff Squyres <jsquy...@cisco.com> wrote:

> It's been a loooong time since I've run under PBS, so I don't remember if
> your script's environment is copied out to the remote nodes where your
> application actually runs.
>
> Can you verify that PATH and LD_LIBRARY_PATH are the same on all nodes in
> your PBS allocation after you module load?
>

I compiled the following program and invoked it with "mpiexec -bynode
./test-env" in a PBS script.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
  int rank, size;
  const char *ldpath;

  MPI_Init (&argc, &argv);

  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  /* getenv() returns NULL when the variable is unset, so guard the %s */
  ldpath = getenv ("LD_LIBRARY_PATH");
  printf ("rank: %d: ld_library_path: %s\n", rank,
          ldpath ? ldpath : "(unset)");

  MPI_Finalize ();

  return (0);
}

I submitted the script with "qsub -l procs=24 job.pbs", and got

rank: 4: ld_library_path:
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

rank: 3: ld_library_path:
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

...more of the same...

When I submitted it with -l procs=48, I got

[cl2n004:11617] *** Process received signal ***
[cl2n004:11617] Signal: Segmentation fault (11)
[cl2n004:11617] Signal code: Address not mapped (1)
[cl2n004:11617] Failing at address: 0x10
[cl2n004:11617] [ 0] /lib64/libpthread.so.0 [0x376ca0ebe0]
[cl2n004:11617] [ 1]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3)
[0x2af788a98113]
[cl2n004:11617] [ 2]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59)
[0x2af788a9a8a9]
[cl2n004:11617] [ 3]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1
[0x2af788a9a596]
[cl2n004:11617] [ 4]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so
[0x2af78c916654]
[cl2n004:11617] [ 5] /lib64/libpthread.so.0 [0x376ca0677d]
[cl2n004:11617] [ 6] /lib64/libc.so.6(clone+0x6d) [0x376bed325d]
[cl2n004:11617] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 4 with PID 11617 on node cl2n004 exited
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The failures seem to be intermittent. When I added a line to the PBS script
to print out the node allocation, the procs=24 case failed, but the same job
worked a few seconds later with the same list of allocated nodes. So there's
definitely something amiss with the cluster, although I wouldn't know where
to start investigating. Perhaps there is a pre-installed OMPI somewhere
that's interfering, but I'm doubtful.

By the way, thanks for all the support.

-- 
Edmund Sumbar
University of Alberta
+1 780 492 9360
