[OMPI users] OpenMPI causing WRF to crash

2011-08-03 Thread BasitAli Khan
I am trying to run a rather heavy WRF simulation with spectral nudging, but the
simulation crashes after 1.8 minutes of integration.
The simulation has two domains, with d01 = 601x601 and d02 = 721x721 and 51
vertical levels. I tried this simulation on two different systems, but the
result was more or less the same. For example:

On our Blue Gene/P with SUSE Linux Enterprise Server 10 (ppc) and the XLF compiler
I tried to run WRF on 2048 shared-memory nodes (1 compute node = 4 cores, 32-bit,
850 MHz). For the parallel run I used mpixlc, mpixlcxx and mpixlf90. I got the
following error message in the wrf.err file:

 BE_MPI (ERROR): The error message in the job
record is as follows:
 BE_MPI (ERROR):   "killed with signal 15"

I also tried to run the same simulation on our Linux cluster (Red Hat Enterprise
Linux 5.4, x86_64, Intel compiler) with 8, 16 and 64 nodes (1 compute node = 8
cores). For the parallel run I used mpi/openmpi/1.4.2-intel-11. I got the
following error message in the error log after a couple of minutes of
integration:

"mpirun has exited due to process rank 45 with PID 19540 on
node ci118 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here)."

I tried many things but nothing seems to be working. However, if I reduce the
grid points below 200, the simulation runs fine. It appears that Open MPI may
have a problem with a large number of grid points, but I have no idea how to
fix it. I would greatly appreciate it if you could suggest a solution.

Best regards,
---
Basit A. Khan, Ph.D.
Postdoctoral Fellow
Division of Physical Sciences & Engineering
Office# 3204, Level 3, Building 1,
King Abdullah University of Science & Technology
4700 King Abdullah Blvd, Box 2753, Thuwal 23955-6900,
Kingdom of Saudi Arabia.

Office: +966(0)2 808 0276,  Mobile: +966(0)5 9538 7592
E-mail: basitali.k...@kaust.edu.sa
Skype name: basit.a.khan


Re: [OMPI users] OpenMPI causing WRF to crash

2011-08-03 Thread BasitAli Khan
Hi Dmitry,
Thanks for the prompt and fairly detailed response. I have also forwarded
the email to the WRF community in the hope that somebody will have a
straightforward solution. I will try to debug the error as you suggested if
I do not have much luck with the WRF forum.

Cheers,
---

Basit A. Khan, Ph.D.
Postdoctoral Fellow
Division of Physical Sciences & Engineering
Office# 3204, Level 3, Building 1,
King Abdullah University of Science & Technology
4700 King Abdullah Blvd, Box 2753, Thuwal 23955-6900,
Kingdom of Saudi Arabia.

Office: +966(0)2 808 0276,  Mobile: +966(0)5 9538 7592
E-mail: basitali.k...@kaust.edu.sa
Skype name: basit.a.khan




On 8/3/11 2:46 PM, "Dmitry N. Mikushin"  wrote:

>Signal 15 apparently means one of WRF's MPI processes has been
>unexpectedly terminated, maybe by program decision. No matter if it
>is OpenMPI-specific




Re: [OMPI users] OpenMPI causing WRF to crash

2011-08-06 Thread BasitAli Khan
Hi David,
Unfortunately there is no information about the error in the rsl.out.*,
rsl.error.* and wrf.out files. The error message mentioned in the previous
email appeared in the wrf.err file. Both rsl.out and rsl.error show the
integration stopping at the time of the crash, and that is it. I am just
wondering if there is a way to monitor the processes and to find out why a
process dies.
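
(One possible way to see which rank dies first and from which signal, sketched
below as a small standalone C test rather than anything WRF-specific, and not
taken from the WRF or Open MPI documentation: have each MPI rank install signal
handlers that log the rank and the signal number before the process goes away,
so the logs on each node at least record what hit which rank.)

/* siginfo_rank.c -- hypothetical diagnostic sketch, not part of WRF.
 * Each rank installs handlers that log its rank and the signal it
 * receives before dying.  fprintf from a signal handler is not
 * strictly async-signal-safe; this is only a debugging aid. */
#include <mpi.h>
#include <signal.h>
#include <stdio.h>

static int my_rank = -1;

static void on_signal(int sig)
{
    fprintf(stderr, "rank %d caught signal %d\n", my_rank, sig);
    fflush(stderr);
    signal(sig, SIG_DFL);   /* restore default action and re-raise */
    raise(sig);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    signal(SIGTERM, on_signal);   /* killed by some external entity */
    signal(SIGSEGV, on_signal);   /* crashed on its own */
    signal(SIGBUS,  on_signal);

    /* ... work that may crash or be killed would go here ... */

    MPI_Finalize();
    return 0;
}

Run under the same mpirun as the real job, this kind of handler shows whether
ranks on a node are terminated by SIGTERM (i.e. something killed them) or die
with SIGSEGV/SIGBUS; grafting the same idea into WRF itself would need a small
C object linked into the Fortran build, which is untested here.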

Cheers,
---

Basit A. Khan, Ph.D.
Postdoctoral Fellow
Division of Physical Sciences & Engineering
Office# 3204, Level 3, Building 1,
King Abdullah University of Science & Technology
4700 King Abdullah Blvd, Box 2753, Thuwal 23955-6900,
Kingdom of Saudi Arabia.

Office: +966(0)2 808 0276,  Mobile: +966(0)5 9538 7592
E-mail: basitali.k...@kaust.edu.sa
Skype name: basit.a.khan




On 8/5/11 8:43 PM, "David Warren"  wrote:

>That error is from one of the processes that was working when another
>one died. It is not an indication that MPI had problems, but that you
>had one of the wrf processes (#45) crash. You need to look at what
>happened to process 45. What do the rsl.out and rsl.error files for #45
>say?
>
>On 08/04/11 16:18, Jeff Squyres wrote:
>> Signal 15 is usually SIGTERM on Linux, meaning that some external
>>entity probably killed the job.
>>
>> The OMPI error message you describe is also typical for that kind of
>>scenario -- i.e., a process exiting without calling MPI_Finalize could
>>mean that it called exit() or that some external process killed it.
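
As an illustration of the scenario described above (a toy example in plain C,
not WRF code and not taken from this thread): if any rank calls exit() or is
killed before reaching MPI_Finalize, mpirun prints the same kind of "exiting
without calling finalize" message, so the message only says that a rank went
away, not why.

/* exit_without_finalize.c -- toy illustration, not WRF code.
 * Rank 1 exits early without calling MPI_Finalize; mpirun should
 * then report that a process exited without calling "finalize",
 * the same class of message seen in the WRF run. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        fprintf(stderr, "rank %d: simulating a crash/early exit\n", rank);
        exit(1);                      /* leaves MPI without MPI_Finalize */
    }

    MPI_Barrier(MPI_COMM_WORLD);      /* other ranks get torn down by mpirun */
    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with a few ranks (e.g. mpirun -np 4 ./a.out),
this should reproduce the message; in the real run the question is what made
rank 45 exit, which a reproducer like this obviously does not answer.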