Ummm...this log indicates that OMPI ran perfectly - it is your
application that segfaulted.
Can you run gdb (or your favorite debugger) against a core file from
your app? It looks like something in your app is crashing.
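One way to do that, as a rough sketch: the helper below prints a backtrace from a core file in batch mode. The function name `backtrace_from_core` and the core file name in the example are hypothetical, and it assumes gdb is installed and core dumps are enabled on the compute nodes (core file naming and location depend on the system).

```shell
#!/bin/sh
# Sketch only: print a backtrace from a core file left by a crashed rank.
# Assumes gdb is installed and that core dumps are enabled, for example
# with "ulimit -c unlimited" in the job script before the mpiexec line.

backtrace_from_core() {
    exe=$1    # path to the application binary
    core=$2   # path to the core file (name depends on the system)
    if [ ! -r "$core" ]; then
        echo "no readable core file: $core" >&2
        return 1
    fi
    # -batch makes gdb exit after running the -ex commands
    gdb -batch -ex "bt" "$exe" "$core"
}

# Example call (the core file name here is hypothetical):
# backtrace_from_core /pcs/programs/grip/bin/RunAlignmentMPI core.3150
```

Building the application with -g should make the backtrace much easier to read.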
As far as I can tell, everything is working fine. We launch and wireup
just fine, then detect one of your processes has segfaulted - which
triggers us to kill the remaining processes and terminate the job.
On Jul 31, 2009, at 8:35 AM, Wilko Keegstra wrote:
Hi,
I have recompiled openmpi with the --enable-debug and
--with-tm=/usr/local flags, and submitted the job to torque 2.3.7:
#PBS -q cluster2
#PBS -l nodes=5:ppn=2
#PBS -N AlignImages
#PBS -j oe
/usr/local/bin/mpiexec -v -mca ras_base_verbose 5 -mca plm_base_verbose 5 \
    --debug-daemons -machinefile $PBS_NODEFILE \
    /pcs/programs/grip/bin/RunAlignmentMPI DoAlign \
    /pcs/pc00/keegstra/work/hm/hemo-mix-psml.img \
    /pcs/pc00/keegstra/work/hm/hemo-mix-psml-ali.img 4 9 14 1 2497 360.000 \
    64.000 /pcs/pc00/keegstra/work/hm/hemo-mix-pref.img 1 7 0
and the job crashed almost immediately. I have attached:
tm.3.gz, Job output: AlignImages.o34.gz, momlog-20090731
I hope you can help me,
kind regards,
Wilko
Ralph Castain wrote:
Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7
allocation, and the man page for tm_spawn?
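For reference, a small sketch of how the nodefile could be summarized inside a job script. Torque writes one hostname per allocated slot to the file named by $PBS_NODEFILE, so counting repeated lines gives the slots per host. The function name `summarize_nodefile` is made up for this example.

```shell
#!/bin/sh
# Sketch only: print "hostname slots" for each host in a Torque nodefile.
# With no argument it falls back to $PBS_NODEFILE, which Torque sets
# inside a running job.

summarize_nodefile() {
    nodefile=${1:-$PBS_NODEFILE}
    # One line per slot: count how many times each host appears.
    sort "$nodefile" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job script:
# summarize_nodefile
```

With nodes=5:ppn=2 one would expect five hosts with two slots each.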
My only guess would be that something changed in those areas as we don't
really use anything else from Torque, and run on Torque-based clusters
in production every day. Not sure what version we have here, though I
believe it is pretty current (will check).
You also might want to configure OMPI 1.3.3 with --enable-debug. You
could then do a run with -mca ras_base_verbose 5 -mca plm_base_verbose 5
--debug-daemons on your mpirun cmd line to get a step-by-step diagnostic
output of the interaction with Torque. Should give us some idea of where
the failure is occurring.
Ralph
On Jul 31, 2009, at 7:20 AM, Wilko Keegstra wrote:
hi,
I have the following problem:
I am using openmpi 1.3.3
Programs (run directly and from scripts) launched with mpiexec are
running fine.
Programs (run directly and from scripts) submitted through Torque 2.3.7,
with openmpi compiled with --with-tm (and torque-devel installed),
segfault.
Programs submitted directly through Torque 2.3.7, with openmpi compiled
without --with-tm (and NO torque-devel installed), run fine. However,
mpiexec programs started from a script (script submitted through torque)
only run on 1 node, so I need openmpi compiled with --with-tm.
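A quick way to check where the ranks actually land, independent of the application, is to launch hostname under mpiexec from the job script and count the results. This is only a sketch; the function name `count_ranks_per_host` is invented for the example, and any extra mpiexec options are passed through unchanged.

```shell
#!/bin/sh
# Sketch only: run "hostname" under mpiexec and print "hostname ranks"
# for each host that started at least one process.

count_ranks_per_host() {
    # "$@" carries any extra mpiexec options, e.g. -machinefile <file>
    mpiexec "$@" hostname | sort | uniq -c | awk '{ print $2, $1 }'
}

# Inside a Torque job script:
# count_ranks_per_host
# count_ranks_per_host -machinefile "$PBS_NODEFILE"
```

If only one host shows up, the launcher is not using the full allocation. If all hosts show up and hostname runs cleanly, the launch path is probably fine and the crash is more likely in the application.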
We also have a cluster running openmpi 1.2.9 compiled without
--with-tm in combination with torque 2.3.3, and everything runs fine
there: NO segfaults, and mpiexec from a script also runs on the
nodes selected at submission time.
I don't see errors in any log files, only in the job log file:
---------------------------------------------------------------------------
mpiexec noticed that process rank 7 with PID 3150 on node
rugem21.chem.rug.nl exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
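As a side note on reading such a failure: when a shell starts a program directly (not through mpiexec) and the program is killed by a signal, the shell reports an exit status of 128 plus the signal number, so a segfault shows up as 139. The sketch below decodes that convention. The function name `describe_status` is hypothetical, and this says nothing about the exit code mpiexec itself returns.

```shell
#!/bin/sh
# Sketch only: decode a shell exit status using the 128 + signal-number
# convention for processes killed by a signal.

describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name for that number
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with status $status"
    fi
}

# Typical use right after running a program directly:
# ./some_program; describe_status $?
```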
Could anyone please help me,
many thanks in advance
Wilko Keegstra
--
+-------------------------------------------------------------+
| Dr. Wilko Keegstra priv.phone: +31594514153,+31610477915 |
| Groningen University email: w.keegs...@rug.nl |
| Groningen Biomolecular Sciences and Biotechnology Institute |
| Nijenborgh 4 phone: +31503634224 |
| 9747 AG GRONINGEN fax : +31503634800 |
| The Netherlands |
+-------------------------------------------------------------+
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
<tm.3.gz> <AlignImages.o34.gz> <momlog-20090731.gz>