Hi,

I have recompiled openmpi with the --enable-debug and --with-tm=/usr/local flags, and submitted the job to torque 2.3.7:

#PBS -q cluster2
#PBS -l nodes=5:ppn=2
#PBS -N AlignImages
#PBS -j oe
/usr/local/bin/mpiexec -v -mca ras_base_verbose 5 -mca plm_base_verbose 5 --debug-daemons -machinefile $PBS_NODEFILE /pcs/programs/grip/bin/RunAlignmentMPI DoAlign /pcs/pc00/keegstra/work/hm/hemo-mix-psml.img /pcs/pc00/keegstra/work/hm/hemo-mix-psml-ali.img 4 9 14 1 2497 360.000 64.000 /pcs/pc00/keegstra/work/hm/hemo-mix-pref.img 1 7 0

and the job crashed almost immediately.

I have attached: tm.3.gz, the job output (AlignImages.o34.gz), and momlog-20090731.

I hope you can help me,
kind regards,
Wilko

Ralph Castain wrote:
> Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7
> allocation, and the man page for tm_spawn?
>
> My only guess would be that something changed in those areas as we don't
> really use anything else from Torque, and run on Torque-based clusters
> in production every day. Not sure what version we have here, though I
> believe it is pretty current (will check).
>
> You also might want to configure OMPI 1.3.3 with --enable-debug. You
> could then do a run with -mca ras_base_verbose 5 -mca plm_base_verbose 5
> --debug-daemons on your mpirun cmd line to get a step-by-step diagnostic
> output of the interaction with Torque. Should give us some idea of where
> the failure is occurring.
>
> Ralph
>
> On Jul 31, 2009, at 7:20 AM, Wilko Keegstra wrote:
>
>> hi,
>>
>> I have the following problem:
>>
>> I am using openmpi 1.3.3
>>
>> programs (directly and from scripts) submitted with mpiexec are running
>> fine.
>>
>> programs (directly and from scripts) submitted through Torque 2.3.7
>> with openmpi compiled with --with-tm (and torque-devel) installed
>> give segfaulting of the programs.
>>
>> programs submitted through Torque 2.3.7 directly with openmpi
>> compiled without --with-tm (and NO torque-devel installed) run fine
>> however mpiexec programs from script (script submitted through torque)
>> are only running on 1 node, so I need openmpi compiled with --with-tm
>>
>> We also have a cluster running with openmpi 1.2.9 compiled without
>> --with-tm in combination with torque 2.3.3 and everything is running
>> fine, so NO segfaults and mpiexec from script also running on the
>> nodes selected at submitting time.
>>
>> I don't have errors on log files only on the job log file:
>>
>> ---------------------------------------------------------------------------
>>
>> mpiexec noticed that process rank 7 with PID 3150 on node
>> rugem21.chem.rug.nl exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>>
>>
>> Could anyone please help me,
>> many thanks in advance
>> Wilko Keegstra
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
+-------------------------------------------------------------+
| Dr. Wilko Keegstra    priv.phone: +31594514153,+31610477915 |
| Groningen University               email: w.keegs...@rug.nl |
| Groningen Biomolecular Sciences and Biotechnology Institute |
| Nijenborgh 4                            phone: +31503634224 |
| 9747 AG GRONINGEN                        fax : +31503634800 |
| The Netherlands                                             |
+-------------------------------------------------------------+

tm.3.gz
Description: GNU Zip compressed data
AlignImages.o34.gz
Description: GNU Zip compressed data
momlog-20090731.gz
Description: GNU Zip compressed data
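
For the PBS_NODEFILE contents Ralph asked for, a minimal sketch along these lines could go into the job script just before the mpiexec line, so that the allocation ends up in the job output. The function name summarize_nodefile is hypothetical and not part of Torque or Open MPI. The sketch assumes Torque writes one hostname per allocated slot, so a nodes=5:ppn=2 request would be expected to show five hosts with two slots each.

```shell
#!/bin/sh
# Hypothetical helper: print one line per host with the number of slots
# the node file lists for it.
summarize_nodefile() {
    # $1: path to the node file (inside a job this would be "$PBS_NODEFILE")
    sort "$1" | uniq -c | awk '{ print $2 " slots=" $1 }'
}

# Only runs inside a Torque job, where PBS_NODEFILE is set and readable.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "PBS_NODEFILE=$PBS_NODEFILE"
    summarize_nodefile "$PBS_NODEFILE"
fi
```

If the summary shows fewer hosts or slots than requested, the problem would be in the allocation itself and not in the tm launch.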