Hi Ahsan

A small stack size sometimes causes segmentation faults,
especially in large programs like WRF.
However, it is not the only possible cause, of course.

Are you sure you set the stack size to unlimited on *all* nodes
where WRF ran?
It can be tricky.
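One way to check (just a sketch; the hostfile name "hosts" and the
process count are assumptions, adjust them to your setup) is to launch
a plain shell command through mpirun, so it runs in the same
environment WRF would see on each node:

   mpirun -np 4 --hostfile hosts sh -c 'ulimit -s'

Each process should print "unlimited". Note that typing
"ulimit -s unlimited" in your interactive shell only affects that
shell and its children on the local node; processes started remotely
may not inherit it.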

Ask your system administrator to make the setting permanent
in /etc/security/limits.conf.
See 'man limits.conf'.
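For example, entries like these on every node would raise the stack
limit for all users (a sketch only; your admin may prefer a per-user
or per-group entry instead of "*"):

   *    soft    stack    unlimited
   *    hard    stack    unlimited

These are applied by pam_limits at login, so they take effect for new
sessions, not for ones that are already running.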

I hope it helps.
Gus Correa


Ahsan Ali wrote:
Hello Gus, Jody

The system has enough memory. I set the stack size to unlimited before running WRF with the command *ulimit -s unlimited*, but the problem still occurred.
Thanks

    Hi Ahsan, Jody

    Just a guess that this may be a stack size problem.
    Did you try to run WRF with unlimited stack size?
    Also, does your machine have enough memory to run WRF?

    I hope this helps,
    Gus Correa


    jody wrote:
     > Hi
     > At first glance I would say this is not an OpenMPI problem,
     > but a WRF problem (though I must admit I have no knowledge
     > whatsoever of WRF).
     >
     > Have you tried running a single instance of wrf.exe?
     > Have you tried to run a simple application (like a "hello world") on your nodes?
     >
     > Jody
     >
     >
     > On Tue, Feb 22, 2011 at 7:37 AM, Ahsan Ali <ahsansha...@gmail.com> wrote:
     >> Hello,
     >> I am stuck on a problem regarding running the Weather Research
     >> and Forecasting Model (WRF V3.2.1). I get the following error
     >> while running with mpirun. Any help would be highly appreciated.
     >>
     >> [pmdtest@pmd02 em_real]$ mpirun -np 4 wrf.exe
     >> starting wrf task 0 of 4
     >> starting wrf task 1 of 4
     >> starting wrf task 3 of 4
     >> starting wrf task 2 of 4
     >>
     >> --------------------------------------------------------------------------
     >> mpirun noticed that process rank 3 with PID 6044 on node pmd02.pakmet.com
     >> exited on signal 11 (Segmentation fault).
     >>
     >>
     >>
     >> --
     >> Syed Ahsan Ali Bokhari
     >> Electronic Engineer (EE)
     >> Research & Development Division
     >> Pakistan Meteorological Department H-8/4, Islamabad.
     >> Phone # off  +92518358714
     >> Cell # +923155145014
     >>
     >>
    Dear Jody,

    WRF runs well with the serial option (i.e., a single instance). I am
    running another application, HRM, using OpenMPI; there is no issue with
    that, and the application runs on a cluster of many nodes. The WRF manual
    says the following about an MPI run:

    *If you have run the model on multiple processors using MPI, you should
    have a number of rsl.out.* and rsl.error.* files. Type "tail rsl.out.0000"
    to see if you get "SUCCESS COMPLETE WRF". This is a good indication that
    the model has run successfully.*

    *Take a look at either the rsl.out.0000 file or another standard out file.
    This file logs the times taken to compute one model time step, and to
    write one history and restart output:*

    *Timing for main: time 2006-01-21_23:55:00 on domain  2:    4.91110 elapsed seconds.*

    *Timing for main: time 2006-01-21_23:56:00 on domain  2:    4.73350 elapsed seconds.*

    *Timing for main: time 2006-01-21_23:57:00 on domain  2:    4.72360 elapsed seconds.*

    *Timing for main: time 2006-01-21_23:57:00 on domain  1:   19.55880 elapsed seconds.*

    *and*

    *Timing for Writing wrfout_d02_2006-01-22_00:00:00 for domain 2: 1.17970 elapsed seconds.*

    *Timing for main: time 2006-01-22_00:00:00 on domain 1: 27.66230 elapsed seconds.*

    *Timing for Writing wrfout_d01_2006-01-22_00:00:00 for domain 1: 0.60250 elapsed seconds.*


    *If the model did not run to completion, take a look at these standard
    output/error files too. If the model has become numerically unstable, it
    may have violated the CFL criterion (for numerical stability). Check
    whether this is true by typing the following:*


    *grep cfl rsl.error.* or grep cfl wrf.out*

    *you might see something like these:*

    *5 points exceeded cfl=2 in domain            1 at time   4.200000*
    *  MAX AT i,j,k:          123          48          3 cfl,w,d(eta)=  4.165821*

    *21 points exceeded cfl=2 in domain            1 at time   4.200000*
    *  MAX AT i,j,k:          123          49          4 cfl,w,d(eta)=  10.66290*

    But when I check rsl.out.* or rsl.error.* there is no indication of any
    error having occurred. It seems that the application just didn't start.
    [pmdtest@pmd02 em_real]$ tail rsl.out.0000
     WRF NUMBER OF TILES FROM OMP_GET_MAX_THREADS =   8
     WRF TILE   1 IS      1 IE    360 JS      1 JE     25
     WRF TILE   2 IS      1 IE    360 JS     26 JE     50
     WRF TILE   3 IS      1 IE    360 JS     51 JE     74
     WRF TILE   4 IS      1 IE    360 JS     75 JE     98
     WRF TILE   5 IS      1 IE    360 JS     99 JE    122
     WRF TILE   6 IS      1 IE    360 JS    123 JE    146
     WRF TILE   7 IS      1 IE    360 JS    147 JE    170
     WRF TILE   8 IS      1 IE    360 JS    171 JE    195
     WRF NUMBER OF TILES =   8



Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014

