Hi Edmund

The [Torque/PBS] syntax '-l procs=48' is somewhat troublesome and may not be understood by the scheduler. [It doesn't work correctly with Maui, which is what we have here. I have read people saying it works with pbs_sched and with Moab, but that's hearsay.] This issue comes up very often on the Torque mailing list.
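One quick way to see how the server actually interpreted such a request is to inspect the job after submission (a sketch only; <jobid> here stands for whatever id qsub printed):

  qstat -f <jobid> | grep -E 'Resource_List|exec_host'

Resource_List shows the request as the server parsed it, and exec_host shows the slots the job was actually placed on.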
Have you tried this alternate syntax instead?

  -l nodes=2:ppn=24

[I am assuming here that your nodes have 24 cores, i.e. 24 'ppn', each.]

Then in the script:

  mpiexec -np 48 ./your_program

Also, in your PBS script you could print the contents of PBS_NODEFILE:

  cat $PBS_NODEFILE

A simple troubleshooting test is to launch 'hostname' with mpirun:

  mpirun -np 48 hostname

Finally, are you sure that the OpenMPI you are using was compiled with Torque support? If not, I wonder whether clauses like '-bynode' would work at all. Jeff may correct me if I am wrong, but if your OpenMPI lacks Torque support, you may need to pass $PBS_NODEFILE to mpirun as your hostfile.

I hope this helps,
Gus Correa
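Putting those pieces together, a minimal job script might look like the following sketch (the job name, walltime, and the name 'your_program' are placeholders, and the counts again assume two 24-core nodes):

  #!/bin/bash
  #PBS -N mpi-test
  #PBS -l nodes=2:ppn=24
  #PBS -l walltime=01:00:00

  cd $PBS_O_WORKDIR

  # Show what Torque actually allocated
  echo "Contents of PBS_NODEFILE:"
  cat $PBS_NODEFILE

  # Sanity check: one 'hostname' line per allocated slot
  mpirun -np 48 hostname

  # The real run
  mpiexec -np 48 ./your_program

Whether the Open MPI build has Torque (tm) support can be checked with 'ompi_info | grep tm'; if the tm components are not listed, adding '-hostfile $PBS_NODEFILE' to the mpirun/mpiexec line is the usual fallback.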
On 06/01/2012 11:26 AM, Edmund Sumbar wrote:

On Fri, Jun 1, 2012 at 8:09 AM, Jeff Squyres <jsquy...@cisco.com> wrote:

It's been a loooong time since I've run under PBS, so I don't remember if your script's environment is copied out to the remote nodes where your application actually runs. Can you verify that PATH and LD_LIBRARY_PATH are the same on all nodes in your PBS allocation after you module load?

I compiled the following program and invoked it with "mpiexec -bynode ./test-env" in a PBS script.

#include "mpi.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
  int i, rank, size, namelen;
  MPI_Status stat;

  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  printf("rank: %d: ld_library_path: %s\n", rank, getenv("LD_LIBRARY_PATH"));

  MPI_Finalize ();
  return (0);
}

I submitted the script with "qsub -l procs=24 job.pbs", and got

rank: 4: ld_library_path: /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64
rank: 3: ld_library_path: /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

...more of the same...
When I submitted it with -l procs=48, I got

[cl2n004:11617] *** Process received signal ***
[cl2n004:11617] Signal: Segmentation fault (11)
[cl2n004:11617] Signal code: Address not mapped (1)
[cl2n004:11617] Failing at address: 0x10
[cl2n004:11617] [ 0] /lib64/libpthread.so.0 [0x376ca0ebe0]
[cl2n004:11617] [ 1] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2af788a98113]
[cl2n004:11617] [ 2] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59) [0x2af788a9a8a9]
[cl2n004:11617] [ 3] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 [0x2af788a9a596]
[cl2n004:11617] [ 4] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so [0x2af78c916654]
[cl2n004:11617] [ 5] /lib64/libpthread.so.0 [0x376ca0677d]
[cl2n004:11617] [ 6] /lib64/libc.so.6(clone+0x6d) [0x376bed325d]
[cl2n004:11617] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 4 with PID 11617 on node cl2n004 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

It seems that failures happen for arbitrary reasons. When I added a line to the PBS script to print out the node allocation, the procs=24 case failed, but then it worked a few seconds later with the same list of allocated nodes. So there's definitely something amiss with the cluster, although I wouldn't know where to start investigating. Perhaps there is a pre-installed OMPI somewhere that's interfering, but I'm doubtful.

By the way, thanks for all the support.

--
Edmund Sumbar
University of Alberta
+1 780 492 9360

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
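Two quick follow-up checks that might narrow this down, sketched under the assumption that the compute nodes run a POSIX sh and that the test binary sits on the shared Lustre filesystem: print LD_LIBRARY_PATH as each node actually sees it, and check which libmpi the test binary resolves on each node:

  # One line per distinct (host, LD_LIBRARY_PATH) pair; differences stand out after sort -u
  mpirun -bynode -np 48 sh -c 'echo "$(hostname): $LD_LIBRARY_PATH"' | sort -u

  # Which libmpi.so.1 does each node resolve for the test binary?
  mpirun -bynode -np 48 sh -c 'echo "$(hostname): $(ldd ./test-env | grep libmpi.so)"' | sort -u

If the second command reports different libmpi.so.1 paths on different nodes, a stray system-wide Open MPI installation being picked up ahead of the openmpi-1.6-intel build would be the likely culprit.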