Hi,

So far I don't have a core file. The weird thing is that the same job runs fine when Open MPI is compiled without --with-tm. Could the amount of memory, or the number of open files, be different in the two cases? How can I force unlimited resources for the job? Only then will I get a core file.
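For concreteness, this is roughly what I would try, though I don't know whether limits set in the job script actually propagate to the ranks that are started through TM (the paths and arguments are just the ones from my job script quoted below):

  #PBS -q cluster2
  #PBS -l nodes=5:ppn=2
  #PBS -N AlignImages
  #PBS -j oe

  # Raise the core-file size limit in the job shell, and log all
  # effective limits so they can be compared against a non-TM run.
  ulimit -c unlimited
  ulimit -a

  /usr/local/bin/mpiexec -machinefile $PBS_NODEFILE \
      /pcs/programs/grip/bin/RunAlignmentMPI DoAlign ...  # same args as before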
kind regards,
Wilko

Ralph Castain wrote:
> Ummm...this log indicates that OMPI ran perfectly - it is your
> application that segfaulted.
>
> Can you run gdb (or your favorite debugger) against a core file from
> your app? It looks like something in your app is crashing.
>
> As far as I can tell, everything is working fine. We launch and wire up
> just fine, then detect that one of your processes has segfaulted - which
> triggers us to kill the remaining processes and terminate the job.
>
>
> On Jul 31, 2009, at 8:35 AM, Wilko Keegstra wrote:
>
>> Hi,
>>
>> I have recompiled Open MPI with the --enable-debug and
>> --with-tm=/usr/local flags, and submitted the job to Torque 2.3.7:
>>
>> #PBS -q cluster2
>> #PBS -l nodes=5:ppn=2
>> #PBS -N AlignImages
>> #PBS -j oe
>> /usr/local/bin/mpiexec -v -mca ras_base_verbose 5 \
>>   -mca plm_base_verbose 5 --debug-daemons -machinefile $PBS_NODEFILE \
>>   /pcs/programs/grip/bin/RunAlignmentMPI DoAlign \
>>   /pcs/pc00/keegstra/work/hm/hemo-mix-psml.img \
>>   /pcs/pc00/keegstra/work/hm/hemo-mix-psml-ali.img \
>>   4 9 14 1 2497 360.000 64.000 \
>>   /pcs/pc00/keegstra/work/hm/hemo-mix-pref.img 1 7 0
>>
>> The job crashed almost immediately. I have attached tm.3.gz, the job
>> output AlignImages.o34.gz, and momlog-20090731.
>>
>> I hope you can help me,
>> kind regards,
>> Wilko
>>
>>
>> Ralph Castain wrote:
>>> Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7
>>> allocation, and the man page for tm_spawn?
>>>
>>> My only guess would be that something changed in those areas, as we
>>> don't really use anything else from Torque, and we run on Torque-based
>>> clusters in production every day. Not sure what version we have here,
>>> though I believe it is pretty current (will check).
>>>
>>> You also might want to configure OMPI 1.3.3 with --enable-debug. You
>>> could then do a run with -mca ras_base_verbose 5 -mca plm_base_verbose 5
>>> --debug-daemons on your mpirun command line to get a step-by-step
>>> diagnostic output of the interaction with Torque. That should give us
>>> some idea of where the failure is occurring.
>>>
>>> Ralph
>>>
>>> On Jul 31, 2009, at 7:20 AM, Wilko Keegstra wrote:
>>>
>>>> Hi,
>>>>
>>>> I have the following problem. I am using Open MPI 1.3.3.
>>>>
>>>> Programs submitted with mpiexec (directly and from scripts) run
>>>> fine.
>>>>
>>>> Programs submitted through Torque 2.3.7 (directly and from
>>>> scripts), with Open MPI compiled with --with-tm (and torque-devel
>>>> installed), segfault.
>>>>
>>>> Programs submitted through Torque 2.3.7 directly, with Open MPI
>>>> compiled without --with-tm (and NO torque-devel installed), run
>>>> fine; however, mpiexec runs started from a script (the script
>>>> submitted through Torque) then only run on one node, so I need
>>>> Open MPI compiled with --with-tm.
>>>>
>>>> We also have a cluster running Open MPI 1.2.9 compiled without
>>>> --with-tm in combination with Torque 2.3.3, and everything runs
>>>> fine: NO segfaults, and mpiexec from a script also runs on the
>>>> nodes selected at submission time.
>>>>
>>>> There are no errors in the log files, only in the job output:
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpiexec noticed that process rank 7 with PID 3150 on node
>>>> rugem21.chem.rug.nl exited on signal 11 (Segmentation fault).
>>>> --------------------------------------------------------------------------
>>>>
>>>> Could anyone please help me,
>>>> many thanks in advance,
>>>> Wilko Keegstra

--
+-------------------------------------------------------------+
| Dr. Wilko Keegstra    priv.phone: +31594514153,+31610477915 |
| Groningen University  email: w.keegs...@rug.nl              |
| Groningen Biomolecular Sciences and Biotechnology Institute |
| Nijenborgh 4          phone: +31503634224                   |
| 9747 AG GRONINGEN     fax  : +31503634800                   |
| The Netherlands                                             |
+-------------------------------------------------------------+
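P.S. Once I do get a core file, my plan for inspecting it is roughly the
following (just a sketch; I am assuming the core is written to the job's
working directory and that the binary carries debug symbols):

  gdb /pcs/programs/grip/bin/RunAlignmentMPI core
  (gdb) bt              # backtrace of the crashing rank
  (gdb) info locals     # variables in the crashing frame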