On 4/24/2012 6:19 AM, Syed Ahsan Ali wrote:
I am not familiar with attaching a debugger to the processes. The other things you asked about are as follows:
The easiest way is to get TotalView or Allinea (both are parallel debuggers) and attach them to the job. However, they cost money. Another option is to try padb; look at http://padb.pittman.org.uk (this is probably your best bet). Lastly, on a node that has a running process, find the pid of that process and attach gdb or dbx to it using "gdb -p <pid>" (or "dbx - <pid>"), where <pid> is the process id of one of the processes. Then, once in the debugger, do a "where" command (this will give you the stack of the process).
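
As a rough illustration of the gdb option, the Python sketch below lists the pids of a running binary on one node and builds a non-interactive gdb command for each of them. The binary name "hrm" and both helper names are assumptions made up for this example, and the pid lookup relies on the Linux /proc layout. "bt" is gdb's short form of "where".

```python
import os


def find_pids(name, proc_root="/proc"):
    """Return the pids whose command name (/proc/<pid>/comm) equals `name`.

    Note: the kernel truncates comm to 15 characters, so very long
    binary names would need a different lookup.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "cpuinfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited or is not readable
    return sorted(pids)


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to `pid`, prints the stack
    of every thread, and then exits (batch mode), detaching from the process."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]
```

For example, `for pid in find_pids("hrm"): subprocess.run(gdb_backtrace_command(pid))` would print one set of stacks per process on that node (with `import subprocess`, and run as the user who owns the job).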
Is this the first time you've run it (with Open MPI? with any MPI?)? *No. We have been running this and other models, but this problem has arisen now.*
Ok, so from the above, are you saying HRM has worked with Open MPI on the same cluster before? If so, what has changed?
How many processes is the job using? Are you oversubscribing your processors? *I have tried running on the cluster, which has 184 cores, as well as on 8 cores of the same server.*
So the hang happens even on a single server, without any network involved?
Does the job get past MPI_Init?
What version of Open MPI are you using? *openmpi 1.4.2*
Have you tested all network connections? *yes*
It might help us to know the size of the cluster you are running on and what type of network it uses. *The cluster has 32 nodes (Dell PowerEdge blade servers), and connectivity is Gigabit Ethernet and InfiniBand.*

--td


On Tue, Apr 24, 2012 at 3:02 PM, TERRY DONTJE <terry.don...@oracle.com <mailto:terry.don...@oracle.com>> wrote:

    To determine if an MPI process is waiting for a message, do what
    Rayson suggested: attach a debugger to the processes and see if
    any of them are stuck in MPI, either internally in an MPI_Recv or
    MPI_Wait call or looping on an MPI_Test call.
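
A small sketch of that check, assuming the stack of each process has already been captured as text from the debugger: the Python helper below (a made-up name, not part of any tool) reports which of the three calls named above show up as frames. It also accepts the PMPI_ spelling, since frames may appear under the profiling name.

```python
import re

# The MPI calls named in the message above as places where a process can sit waiting.
WAITING_CALLS = ("MPI_Recv", "MPI_Wait", "MPI_Test")


def waiting_mpi_calls(backtrace):
    """Return the names from WAITING_CALLS that appear as frames in `backtrace`."""
    found = []
    for call in WAITING_CALLS:
        # Whole-word match, so MPI_Wait does not match MPI_Waitall;
        # the optional leading "P" covers PMPI_Recv and similar.
        if re.search(r"\bP?" + call + r"\b", backtrace):
            found.append(call)
    return found
```

If every rank reports MPI_Recv or MPI_Wait and none is computing, that points to a message that is never sent or never matched.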

    Other things to consider.
      Is this the first time you've run it (with Open MPI? with any MPI?)?
      How many processes is the job using?  Are you oversubscribing
    your processors?
      What version of Open MPI are you using?
      Have you tested all network connections?
      It might help us to know the size of cluster you are running and
    what type of network?

    --td

    On 4/24/2012 2:42 AM, Syed Ahsan Ali wrote:
    Dear Rayson,

    That is a numerical model written by the national weather
    service of a country. The logs of the model show every detail
    about the simulation progress. I have checked on the remote nodes
    as well: the application binary is running, but the logs show no
    progress; it is just waiting at one point. The input data is
    correct and everything else is fine. How can I check if the MPI
    task is waiting for a message?
    Ahsan

    On Tue, Apr 24, 2012 at 11:03 AM, Rayson Ho
    <raysonlo...@gmail.com <mailto:raysonlo...@gmail.com>> wrote:

        Seems like there's a bug in the application. Did you or someone else
        write it, or did you get it from an ISV?

        You can log onto one of the nodes, attach a debugger, and see if the
        MPI task is waiting for a message (looping in one of the MPI receive
        functions)...

        Rayson

        =================================
        Open Grid Scheduler / Grid Engine
        http://gridscheduler.sourceforge.net/

        Scalable Grid Engine Support Program
        http://www.scalablelogic.com/


        On Tue, Apr 24, 2012 at 12:49 AM, Syed Ahsan Ali
        <ahsansha...@gmail.com <mailto:ahsansha...@gmail.com>> wrote:
        > Dear All,
        >
        > I am having a problem running an application on a Dell cluster. The model
        > starts well, but no further progress is shown. It is just stuck. I have checked
        > the systems; there is no apparent hardware error. Other Open MPI
        > applications are running well on the same cluster. I have tried running the
        > application on cores of a single server as well, but the problem is the same. The
        > application just doesn't move further. The same application is also running
        > well on a backup cluster. Please help.
        >
        >
        > Thanks and Best Regards
        >
        > Ahsan
        >
        > _______________________________________________
        > users mailing list
        > us...@open-mpi.org <mailto:us...@open-mpi.org>
        > http://www.open-mpi.org/mailman/listinfo.cgi/users



        --
        ==================================================
        Open Grid Scheduler - The Official Open Source Grid Engine
        http://gridscheduler.sourceforge.net/








    --
    Terry D. Dontje | Principal Software Engineer
    Developer Tools Engineering | +1.781.442.2631
    Oracle *- Performance Technologies*
    95 Network Drive, Burlington, MA 01803
    Email terry.don...@oracle.com <mailto:terry.don...@oracle.com>










--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com <mailto:terry.don...@oracle.com>


