Do you have something like valgrind on your machine? If so, why not launch your apps under valgrind, e.g., "mpirun .... valgrind my_app"?
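Along the same lines, if you want a record of how a particular rank ended, you could launch the app through a tiny wrapper, the same way you would with valgrind. Below is a minimal sketch in Python. The script name exit_report.py and the functions describe_exit and run_and_report are made up for illustration; they are not part of Open MPI or WRF. It assumes a POSIX system with Python 3 available on the compute nodes.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable reason."""
    if returncode == 0:
        return "exited normally (status 0)"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as -signum.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with error status {returncode}"


def run_and_report(cmd):
    """Run cmd (a list of program + args), wait, and log how it ended."""
    proc = subprocess.run(cmd)
    reason = describe_exit(proc.returncode)
    print(f"[exit_report] {cmd[0]}: {reason}", file=sys.stderr, flush=True)
    return proc.returncode


# Only act as a wrapper when a program to run was given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    code = run_and_report(sys.argv[1:])
    # Pass the result on; 128 + signum is the usual shell convention.
    sys.exit(code if code >= 0 else 128 - code)
```

You would launch it like valgrind, e.g., "mpirun .... python3 exit_report.py my_app". It only helps for a rank that dies on its own: if the wrapper itself gets killed (for example by the resource manager), it has no chance to report anything.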
If your app is segfaulting, there isn't much OMPI can do to tell you why. All we can do is tell you that your app was hit with a SIGTERM.

Did you talk to your sys admin? Like Jeff said, that probably means you hit some system-imposed limit and the resource manager killed you.

On Aug 5, 2011, at 11:55 PM, BasitAli Khan wrote:

> Hi David,
> Unfortunately there is no information about the error in the rsl.out.*,
> rsl.error and wrf.out files. The error message mentioned in the previous
> email appeared in the wrf.err file. Both rsl.out and rsl.error show the
> integration stopping at the time of the crash, and that is it. I am just
> wondering if there is a way to monitor processes and to know the reason if
> some process dies.
>
> Cheers,
> ---
>
> Basit A. Khan, Ph.D.
> Postdoctoral Fellow
> Division of Physical Sciences & Engineering
> Office# 3204, Level 3, Building 1,
> King Abdullah University of Science & Technology
> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955 6900,
> Kingdom of Saudi Arabia.
>
> Office: +966(0)2 808 0276, Mobile: +966(0)5 9538 7592
> E-mail: basitali.k...@kaust.edu.sa
> Skype name: basit.a.khan
>
>
> On 8/5/11 8:43 PM, "David Warren" <war...@atmos.washington.edu> wrote:
>
>> That error is from one of the processes that was working when another
>> one died. It is not an indication that MPI had problems, but that you
>> had one of the wrf processes (#45) crash. You need to look at what
>> happened to process 45. What do the rsl.out and rsl.error files for #45
>> say?
>>
>> On 08/04/11 16:18, Jeff Squyres wrote:
>>> Signal 15 is usually SIGTERM on Linux, meaning that some external
>>> entity probably killed the job.
>>>
>>> The OMPI error message you describe is also typical for that kind of
>>> scenario -- i.e., a process exited without calling MPI_Finalize could
>>> mean that it called exit() or some external process killed it.
>>>
>>>
>>> On Aug 3, 2011, at 7:24 AM, BasitAli Khan wrote:
>>>
>>>
>>>> I am trying to run a rather heavy wrf simulation with spectral nudging
>>>> but the simulation crashes after 1.8 minutes of integration.
>>>> The simulation has two domains with d01 = 601x601 and d02 =
>>>> 721x721 and 51 vertical levels. I tried this simulation on two
>>>> different systems but the result was more or less the same. For example
>>>>
>>>> On our Bluegene/P with SUSE Linux Enterprise Server 10 ppc and XLF
>>>> compiler I tried to run wrf on 2048 shared memory nodes (1 compute node
>>>> = 4 cores, 32 bit, 850 MHz). For the parallel run I used mpixlc,
>>>> mpixlcxx and mpixlf90. I got the following error message in the
>>>> wrf.err file
>>>>
>>>> <Aug 01 19:50:21.244540> BE_MPI (ERROR): The error message in the job
>>>> record is as follows:
>>>> <Aug 01 19:50:21.244657> BE_MPI (ERROR): "killed with signal 15"
>>>>
>>>> I also tried to run the same simulation on our linux cluster (Linux
>>>> Red Hat Enterprise 5.4m x86_64 and Intel compiler) with 8, 16 and 64
>>>> nodes (1 compute node=8 cores). For the parallel run I used
>>>> mpi/openmpi/1.4.2-intel-11. I got the following error message in the
>>>> error log after a couple of minutes of integration.
>>>>
>>>> "mpirun has exited due to process rank 45 with PID 19540 on
>>>> node ci118 exiting without calling "finalize". This may
>>>> have caused other processes in the application to be
>>>> terminated by signals sent by mpirun (as reported here)."
>>>>
>>>> I tried many things but nothing seems to be working. However, if I
>>>> reduce grid points below 200, the simulation goes fine. It appears
>>>> that probably OpenMP has a problem with a large number of grid points but I
>>>> have no idea how to fix it. I will greatly appreciate it if you could
>>>> suggest some solution.
>>>>
>>>> Best regards,
>>>> ---
>>>> Basit A. Khan, Ph.D.
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users