Hi,

I have recompiled openmpi with the --enable-debug and --with-tm=/usr/local
flags, and submitted the job to torque 2.3.7:

#PBS -q cluster2
#PBS -l nodes=5:ppn=2
#PBS -N AlignImages
#PBS -j oe
/usr/local/bin/mpiexec -v -mca ras_base_verbose 5 -mca plm_base_verbose 5 \
    --debug-daemons -machinefile $PBS_NODEFILE \
    /pcs/programs/grip/bin/RunAlignmentMPI DoAlign \
    /pcs/pc00/keegstra/work/hm/hemo-mix-psml.img \
    /pcs/pc00/keegstra/work/hm/hemo-mix-psml-ali.img 4 9 14 1 2497 360.000 \
    64.000 /pcs/pc00/keegstra/work/hm/hemo-mix-pref.img 1 7 0

and the job crashed almost immediately. I have attached:
tm.3.gz, the job output AlignImages.o34.gz, and momlog-20090731.gz
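
Regarding the PBS_NODEFILE question: a minimal sketch of how the allocation
could be summarized from inside the job script, assuming Torque writes one
line per allocated slot to $PBS_NODEFILE (so nodes=5:ppn=2 should list five
hosts, twice each). The function name summarize_nodefile is only
illustrative and not part of Torque or Open MPI.

```shell
#!/bin/sh
# Print one line per host with the number of slots Torque allocated on it.
# Assumption: the node file has one hostname per line, repeated per slot.
summarize_nodefile() {
    sort "$1" | uniq -c | awk '{ printf "%s slots=%d\n", $2, $1 }'
}

# PBS_NODEFILE is set by Torque inside a running job; skip quietly otherwise.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    summarize_nodefile "$PBS_NODEFILE"
fi
```

Placed just before the mpiexec line, this would record in the job output
which hosts and slot counts the run was actually given.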

I hope you can help me,
kind regards,
Wilko


Ralph Castain wrote:
> Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7
> allocation, and the man page for tm_spawn?
> 
> My only guess would be that something changed in those areas as we don't
> really use anything else from Torque, and run on Torque-based clusters
> in production every day. Not sure what version we have here, though I
> believe it is pretty current (will check).
> 
> You also might want to configure OMPI 1.3.3 with --enable-debug. You
> could then do a run with -mca ras_base_verbose 5 -mca plm_base_verbose 5
> --debug-daemons on your mpirun cmd line to get a step-by-step diagnostic
> output of the interaction with Torque. Should give us some idea of where
> the failure is occurring.
> 
> Ralph
> 
> On Jul 31, 2009, at 7:20 AM, Wilko Keegstra wrote:
> 
>> hi,
>>
>> I have the following problem:
>>
>> I am using openmpi 1.3.3
>>
>> programs (directly and from scripts) submitted with mpiexec are running
>> fine.
>>
>> programs (directly and from scripts) submitted through Torque 2.3.7
>> with openmpi compiled with --with-tm (and torque-devel installed)
>> segfault.
>>
>> programs submitted through Torque 2.3.7 directly with openmpi
>> compiled without --with-tm (and NO torque-devel installed) run fine;
>> however, mpiexec programs started from a script (script submitted
>> through torque) run on only 1 node, so I need openmpi compiled with
>> --with-tm
>>
>> We also have a cluster running openmpi 1.2.9 compiled without
>> --with-tm in combination with torque 2.3.3, and everything runs
>> fine: NO segfaults, and mpiexec from a script also runs on the
>> nodes selected at submission time.
>>
>> I don't see errors in the log files, only in the job log file:
>>
>> ---------------------------------------------------------------------------
>>
>> mpiexec noticed that process rank 7 with PID 3150 on node
>> rugem21.chem.rug.nl exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>>
>>
>> Could anyone please help me,
>> many thanks in advance
>> Wilko Keegstra
>>
>> -- 
>> +-------------------------------------------------------------+
>> | Dr. Wilko Keegstra    priv.phone: +31594514153,+31610477915 |
>> | Groningen University       email: w.keegs...@rug.nl         |
>> | Groningen Biomolecular Sciences and Biotechnology Institute |
>> | Nijenborgh 4               phone: +31503634224              |
>> | 9747 AG GRONINGEN          fax  : +31503634800              |
>> | The Netherlands                                             |
>> +-------------------------------------------------------------+
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 

-- 
+-------------------------------------------------------------+
| Dr. Wilko Keegstra    priv.phone: +31594514153,+31610477915 |
| Groningen University       email: w.keegs...@rug.nl         |
| Groningen Biomolecular Sciences and Biotechnology Institute |
| Nijenborgh 4               phone: +31503634224              |
| 9747 AG GRONINGEN          fax  : +31503634800              |
| The Netherlands                                             |
+-------------------------------------------------------------+

Attachment: tm.3.gz
Description: GNU Zip compressed data

Attachment: AlignImages.o34.gz
Description: GNU Zip compressed data

Attachment: momlog-20090731.gz
Description: GNU Zip compressed data
