Are you sure ulimit -c unlimited is *really* applied on all hosts?

Can you please run the simple program below and confirm that?


Cheers,


Gilles


#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    struct rlimit rlim;
    char *c = (char *)0;
    getrlimit(RLIMIT_CORE, &rlim);
    printf("before MPI_Init : %llu %llu\n",
           (unsigned long long)rlim.rlim_cur, (unsigned long long)rlim.rlim_max);
    MPI_Init(&argc, &argv);
    getrlimit(RLIMIT_CORE, &rlim);
    printf("after MPI_Init : %llu %llu\n",
           (unsigned long long)rlim.rlim_cur, (unsigned long long)rlim.rlim_max);
    *c = 0;   /* deliberate NULL dereference to trigger a core dump */
    MPI_Finalize();
    return 0;
}


On 5/12/2016 4:22 AM, dpchoudh . wrote:
Hello Gilles

Thank you for the advice. However, that did not seem to make any difference. Here is what I did (on the cluster that generates .btr files for core dumps):

[durga@smallMPI git]$ ompi_info --all | grep opal_signal
MCA opal base: parameter "opal_signal" (current value: "6,7,8,11", data source: default, level: 3 user/all, type: string)
[durga@smallMPI git]$


According to <bits/signum.h>, signals 6, 7, 8, 11 are these:

#define SIGABRT   6   /* Abort (ANSI).  */
#define SIGBUS    7   /* BUS error (4.2 BSD).  */
#define SIGFPE    8   /* Floating-point exception (ANSI).  */
#define SIGSEGV   11  /* Segmentation violation (ANSI).  */

And thus I added the following just after MPI_Init()

    MPI_Init(&argc, &argv);
    signal(SIGABRT, SIG_DFL);
    signal(SIGBUS, SIG_DFL);
    signal(SIGFPE, SIG_DFL);
    signal(SIGSEGV, SIG_DFL);
    signal(SIGTERM, SIG_DFL);

(I added the 'SIGTERM' part later, just in case it would make a difference; it didn't.)

The resulting code still generates .btr files instead of core files.

It looks like the 'execinfo' MCA component is being used as the backtrace mechanism:

[durga@smallMPI git]$ ompi_info | grep backtrace
MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v3.0.0)

However, I could not find any way to choose 'none' instead of 'execinfo'.

And the strange thing is, on the cluster where regular core dump is happening, the output of
$ ompi_info | grep backtrace
is identical to the above. (Which kind of makes sense because they were created from the same source with the same configure options.)

Sorry to harp on this, but without a core file it is hard to debug the application (e.g. examine stack variables).

Thank you
Durga


The surgeon general advises you to eat right, exercise regularly and quit ageing.

On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

    Durga,

    you might want to try restoring the default signal handlers for the
    other signals as well (SIGSEGV, SIGBUS, ...)
    ompi_info --all | grep opal_signal
    lists the signals whose handlers you should restore


    only one backtrace component is built (out of several candidates:
    execinfo, none, printstack)
    nm -l libopen-pal.so | grep backtrace
    will hint at which component was built

    your two similar distros might have different backtrace components



    Gus,

    a .btr file is a plain-text file with a backtrace à la gdb



    Nathan,

    I did a 'grep btr' and could not find anything :-(
    opal_backtrace_buffer and opal_backtrace_print are only used with
    stderr,
    so I am puzzled about who creates the trace file name, and where ...
    also, no stack is printed by default unless opal_abort_print_stack
    is true

    Cheers,

    Gilles


    On Wed, May 11, 2016 at 3:43 PM, dpchoudh . <dpcho...@gmail.com> wrote:
    > Hello Nathan
    >
    > Thank you for your response. Could you please be more specific?
    Adding the
    > following after MPI_Init() does not seem to make a difference.
    >
    >     MPI_Init(&argc, &argv);
    >     signal(SIGABRT, SIG_DFL);
    >     signal(SIGTERM, SIG_DFL);
    >
    > I also find it puzzling that nearly identical OMPI distro
    running on a
    > different machine shows different behaviour.
    >
    > Best regards
    > Durga
    >
    > The surgeon general advises you to eat right, exercise regularly
    and quit
    > ageing.
    >
    > On Tue, May 10, 2016 at 10:02 AM, Hjelm, Nathan Thomas
    > <hje...@lanl.gov>
    > wrote:
    >>
    >> btr files are indeed created by open mpi's backtrace mechanism.
    I think we
    >> should revisit it at some point but for now the only effective
    way i have
    >> found to prevent it is to restore the default signal handlers after
    >> MPI_Init.
    >>
    >> Excuse the quoting style. Good sucks.
    >>
    >>
    >> ________________________________________
    >> From: users on behalf of dpchoudh .
    >> Sent: Monday, May 09, 2016 2:59:37 PM
    >> To: Open MPI Users
    >> Subject: Re: [OMPI users] No core dump in some cases
    >>
    >> Hi Gus
    >>
    >> Thanks for your suggestion. But I am not using any resource
    manager (i.e.
    >> I am launching mpirun from the bash shell.). In fact, both of
    the two
    >> clusters I talked about run CentOS 7 and I launch the job the
    same way on
    >> both of these, yet one of them creates standard core files and
    the other
    >> creates the '.btr' files. Strange thing is, I could not find
    >> anything on the .btr (= Backtrace?) files on Google, which is
    >> why I asked on this forum.
    >>
    >> Best regards
    >> Durga
    >>
    >> The surgeon general advises you to eat right, exercise
    regularly and quit
    >> ageing.
    >>
    >> On Mon, May 9, 2016 at 12:04 PM, Gus Correa
    >> <g...@ldeo.columbia.edu> wrote:
    >> Hi Durga
    >>
    >> Just in case ...
    >> If you're using a resource manager to start the jobs (Torque, etc),
    >> you need to have them set the limits (for coredump size,
    stacksize, locked
    >> memory size, etc).
    >> This way the jobs will inherit the limits from the
    >> resource manager daemon.
    >> On Torque (which I use) I do this on the pbs_mom daemon
    >> init script (I am still before the systemd era, that lovely POS).
    >> And set the hard/soft limits on /etc/security/limits.conf as well.
    >>
    >> I hope this helps,
    >> Gus Correa
    >>
    >> On 05/07/2016 12:27 PM, Jeff Squyres (jsquyres) wrote:
    >> I'm afraid I don't know what a .btr file is -- that is not
    something that
    >> is controlled by Open MPI.
    >>
    >> You might want to look into your OS settings to see if it has
    some kind of
    >> alternate corefile mechanism...?
    >>
    >>
    >> On May 6, 2016, at 8:58 PM, dpchoudh .
    >> <dpcho...@gmail.com> wrote:
    >>
    >> Hello all
    >>
    >> I run MPI jobs (for test purpose only) on two different
    'clusters'. Both
    >> 'clusters' have two nodes only, connected back-to-back. The two
    are very
    >> similar, but not identical, both software and hardware wise.
    >>
    >> Both have ulimit -c set to unlimited. However, only one of the
    two creates
    >> core files when an MPI job crashes. The other creates a text
    file named
    >> something like
    >>
    >>
    
    >> <program_name_that_crashed>.80s-<a-number-that-looks-like-a-PID>,<hostname-where-the-crash-happened>.btr
    >>
    >> I'd much prefer a core file because that allows me to debug
    with a lot
    >> more options than a static text file with addresses. How do I
    get a core
    >> file in all situations? I am using MPI source from the master
    branch.
    >>
    >> Thanks in advance
    >> Durga
    >>
    >> The surgeon general advises you to eat right, exercise
    regularly and quit
    >> ageing.
    >> _______________________________________________
    >> users mailing list
    >> us...@open-mpi.org
    >> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
    >> Link to this post:
    >> http://www.open-mpi.org/community/lists/users/2016/05/29124.php
    >>
    >>
    >>
    >> _______________________________________________
    >> users mailing list
    >> us...@open-mpi.org
    >> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
    >> Link to this post:
    >> http://www.open-mpi.org/community/lists/users/2016/05/29141.php
    >>
    >> _______________________________________________
    >> users mailing list
    >> us...@open-mpi.org
    >> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
    >> Link to this post:
    >> http://www.open-mpi.org/community/lists/users/2016/05/29154.php
    >
    >
    >
    > _______________________________________________
    > users mailing list
    > us...@open-mpi.org
    > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
    > Link to this post:
    > http://www.open-mpi.org/community/lists/users/2016/05/29169.php
    _______________________________________________
    users mailing list
    us...@open-mpi.org
    Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
    Link to this post:
    http://www.open-mpi.org/community/lists/users/2016/05/29170.php




_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/05/29176.php
