Do you have something like valgrind on your machine? If so, why not launch 
your apps under valgrind, e.g., "mpirun .... valgrind my_app"?

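If your app is crashing, there isn't much OMPI can do to tell you why. All 
we can report is that a process exited without calling MPI_Finalize, or which 
signal it was hit with (signal 15, SIGTERM, in your Blue Gene log; a segfault 
would show up as SIGSEGV instead).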
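
As a rough illustration of how little a launcher can see from the outside, 
here is a minimal sketch in plain Python (standard subprocess and signal 
modules only, not Open MPI code). The helper name describe_exit is 
hypothetical, and the child process just sends SIGTERM to itself to stand in 
for a job killed by some external entity.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable string.

    On POSIX, subprocess reports death-by-signal as a negative return code
    (-N for signal N). describe_exit is a hypothetical helper for this
    sketch, not part of Open MPI.
    """
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Child that sends SIGTERM to itself, standing in for a process that
    # was killed from outside (e.g., by a resource manager).
    child = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGTERM)",
    ]
    result = subprocess.run(child)
    print(describe_exit(result.returncode))
```

The parent learns the signal number and nothing more: who sent the signal, 
and why, has to come from the system logs or the resource manager.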

Did you talk to your sys admin? As Jeff said, a SIGTERM probably means you hit 
some system-imposed limit (memory or wall-clock time, for example) and the 
resource manager killed your job.
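
One quick thing to check before talking to the admin is which per-process 
limits your job actually runs with. The sketch below uses Python's standard 
resource module; the choice of limits and the helper names (LIMITS, fmt, 
report_limits) are assumptions for illustration. Limits enforced by a batch 
scheduler, such as wall-clock time or per-job memory, may not show up here 
at all.

```python
import resource

# Limits that commonly matter for large simulations. Which of these a given
# site actually enforces is an assumption; the sys admin is the authority.
LIMITS = {
    "address space (bytes)": resource.RLIMIT_AS,
    "data segment (bytes)": resource.RLIMIT_DATA,
    "stack (bytes)": resource.RLIMIT_STACK,
    "core file (bytes)": resource.RLIMIT_CORE,
    "cpu time (seconds)": resource.RLIMIT_CPU,
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report_limits():
    """Return {label: (soft, hard)} for each limit in LIMITS, as strings."""
    return {
        label: tuple(fmt(v) for v in resource.getrlimit(res))
        for label, res in LIMITS.items()
    }


if __name__ == "__main__":
    for label, (soft, hard) in report_limits().items():
        print(f"{label:24s} soft={soft:>14s} hard={hard:>14s}")
```

Limits on a login node can differ from those on the compute nodes, so it is 
worth running this through the same launcher and batch system as the real job.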


On Aug 5, 2011, at 11:55 PM, BasitAli Khan wrote:

> Hi David,
> Unfortunately there is no information about the error in the rsl.out.*,
> rsl.error and wrf.out files. The error message mentioned in the previous
> email appeared in the wrf.err file. Both rsl.out and rsl.error show
> integration stopping at the time of the crash, and that is it. I am just
> wondering if there is a way to monitor processes and find out the reason
> when a process dies.
> 
> Cheers,
> ---
> 
> Basit A. Khan, Ph.D.
> Postdoctoral Fellow
> Division of Physical Sciences & Engineering
> Office# 3204, Level 3, Building 1,
> King Abdullah University of Science & Technology
> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955-6900,
> Kingdom of Saudi Arabia.
> 
> Office: +966(0)2 808 0276,  Mobile: +966(0)5 9538 7592
> E-mail: basitali.k...@kaust.edu.sa
> Skype name: basit.a.khan
> 
> 
> 
> 
> On 8/5/11 8:43 PM, "David Warren" <war...@atmos.washington.edu> wrote:
> 
>> That error is from one of the processes that was working when another
>> one died. It is not an indication that MPI had problems, but that you
>> had one of the wrf processes (#45) crash. You need to look at what
>> happened to process 45. What do the rsl.out and rsl.error files for #45
>> say?
>> 
>> On 08/04/11 16:18, Jeff Squyres wrote:
>>> Signal 15 is usually SIGTERM on Linux, meaning that some external
>>> entity probably killed the job.
>>> 
>>> The OMPI error message you describe is also typical for that kind of
>>> scenario -- i.e., a process exited without calling MPI_Finalize could
>>> mean that it called exit() or some external process killed it.
>>> 
>>> 
>>> On Aug 3, 2011, at 7:24 AM, BasitAli Khan wrote:
>>> 
>>> 
>>>> I am trying to run a rather heavy wrf simulation with spectral nudging,
>>>> but the simulation crashes after 1.8 minutes of integration.
>>>> The simulation has two domains, with d01 = 601x601 and d02 =
>>>> 721x721 and 51 vertical levels. I tried this simulation on two
>>>> different systems but the result was more or less the same. For example:
>>>> 
>>>> On our Blue Gene/P with SUSE Linux Enterprise Server 10 ppc and the XLF
>>>> compiler, I tried to run wrf on 2048 shared memory nodes (1 compute node
>>>> = 4 cores, 32 bit, 850 MHz). For the parallel run I used mpixlc,
>>>> mpixlcxx and mpixlf90. I got the following error message in the
>>>> wrf.err file:
>>>> 
>>>> <Aug 01 19:50:21.244540>  BE_MPI (ERROR): The error message in the job
>>>> record is as follows:
>>>> <Aug 01 19:50:21.244657>  BE_MPI (ERROR):   "killed with signal 15"
>>>> 
>>>> I also tried to run the same simulation on our Linux cluster (Linux
>>>> Red Hat Enterprise 5.4m x86_64 and Intel compiler) with 8, 16 and 64
>>>> nodes (1 compute node = 8 cores). For the parallel run I used
>>>> mpi/openmpi/1.4.2-intel-11. I got the following error message in the
>>>> error log after a couple of minutes of integration:
>>>> 
>>>> "mpirun has exited due to process rank 45 with PID 19540 on
>>>> node ci118 exiting without calling "finalize". This may
>>>> have caused other processes in the application to be
>>>> terminated by signals sent by mpirun (as reported here)."
>>>> 
>>>> I tried many things but nothing seems to work. However, if I
>>>> reduce the grid points below 200, the simulation runs fine. It appears
>>>> that OpenMP probably has a problem with a large number of grid points,
>>>> but I have no idea how to fix it. I would greatly appreciate it if you
>>>> could suggest a solution.
>>>> 
>>>> Best regards,
>>>> ---
>>>> Basit A. Khan, Ph.D.
>>>> Postdoctoral Fellow
>>>> Division of Physical Sciences & Engineering
>>>> Office# 3204, Level 3, Building 1,
>>>> King Abdullah University of Science & Technology
>>>> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955-6900,
>>>> Kingdom of Saudi Arabia.
>>>> 
>>>> Office: +966(0)2 808 0276,  Mobile: +966(0)5 9538 7592
>>>> E-mail: basitali.k...@kaust.edu.sa
>>>> Skype name: basit.a.khan
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> 
>>> 
>>> 
> 
> 

