Are you sure ulimit -c unlimited is *really* applied on all hosts?
Can you please run the simple program below and confirm that?
Cheers,
Gilles
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    struct rlimit rlim;
    char *c = (char *)0;

    /* core file size limits before MPI_Init */
    getrlimit(RLIMIT_CORE, &rlim);
    printf("before MPI_Init : %llu %llu\n",
           (unsigned long long)rlim.rlim_cur,
           (unsigned long long)rlim.rlim_max);

    MPI_Init(&argc, &argv);

    /* core file size limits after MPI_Init */
    getrlimit(RLIMIT_CORE, &rlim);
    printf("after MPI_Init : %llu %llu\n",
           (unsigned long long)rlim.rlim_cur,
           (unsigned long long)rlim.rlim_max);

    /* deliberately dereference a NULL pointer to raise SIGSEGV and
       (hopefully) produce a core dump */
    *c = 0;

    MPI_Finalize();
    return 0;
}
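A minimal way to build and run it on both nodes (the file name check_limits.c and the host names host1/host2 are just placeholders for your own):

mpicc check_limits.c -o check_limits
mpirun --host host1,host2 -np 2 ./check_limits

If ulimit -c unlimited is really in effect everywhere, every rank should print a very large value (RLIM_INFINITY) for both the soft and hard limits, before and after MPI_Init.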
On 5/12/2016 4:22 AM, dpchoudh . wrote:
Hello Gilles
Thank you for the advice. However, that did not seem to make any
difference. Here is what I did (on the cluster that generates .btr
files instead of core dumps):
[durga@smallMPI git]$ ompi_info --all | grep opal_signal
MCA opal base: parameter "opal_signal" (current value:
"6,7,8,11", data source: default, level: 3 user/all, type: string)
[durga@smallMPI git]$
According to <bits/signum.h>, signals 6, 7, 8 and 11 are the following:
#define SIGABRT 6 /* Abort (ANSI). */
#define SIGBUS 7 /* BUS error (4.2 BSD). */
#define SIGFPE 8 /* Floating-point exception (ANSI). */
#define SIGSEGV 11 /* Segmentation violation (ANSI). */
And thus I added the following just after MPI_Init()
MPI_Init(&argc, &argv);
signal(SIGABRT, SIG_DFL);
signal(SIGBUS, SIG_DFL);
signal(SIGFPE, SIG_DFL);
signal(SIGSEGV, SIG_DFL);
signal(SIGTERM, SIG_DFL);
(I added the 'SIGTERM' part later, just in case it would make a
difference; it didn't.)
The resulting code still generates .btr files instead of core files.
It looks like the 'execinfo' MCA component is being used as the
backtrace mechanism:
[durga@smallMPI git]$ ompi_info | grep backtrace
MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component
v3.0.0)
However, I could not find any way to choose 'none' instead of 'execinfo'.
And the strange thing is, on the cluster where regular core dumps are
generated, the output of
$ ompi_info | grep backtrace
is identical to the above. (Which kind of makes sense because they
were created from the same source with the same configure options.)
Sorry to harp on this, but without a core file it is hard to debug the
application (e.g. examine stack variables).
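For instance, with a real core file one can do something like (program and core file names here are hypothetical):

gdb ./my_mpi_app core.12345
(gdb) bt full

and inspect the local variables in every stack frame, which the static .btr text file does not allow.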
Thank you
Durga
The surgeon general advises you to eat right, exercise regularly and
quit ageing.
On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet
<gilles.gouaillar...@gmail.com> wrote:
Durga,
you might want to try restoring the signal handlers for other
signals as well (SIGSEGV, SIGBUS, ...)
ompi_info --all | grep opal_signal
lists the signals whose handlers you should restore.
Only one backtrace component is built (out of several candidates:
execinfo, none, printstack)
nm -l libopen-pal.so | grep backtrace
will hint at which component was built.
Your two similar distros might have different backtrace components.
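If the cluster that produces the .btr files turns out to have the execinfo component built, one thing you could try (just a sketch, using the standard --enable-mca-no-build configure option) is rebuilding Open MPI with that component excluded, so the 'none' component gets selected instead:

./configure --enable-mca-no-build=backtrace-execinfo <your usual configure options>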
Gus,
btr is a plain text file with a backtrace à la gdb.
Nathan,
I did a 'grep btr' and could not find anything :-(
opal_backtrace_buffer and opal_backtrace_print are only used with
stderr, so I am puzzled as to who creates the tracefile name and where ...
Also, no stack is printed by default unless opal_abort_print_stack
is true.
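(For what it is worth, and assuming that variable is settable at run time like most MCA parameters, the stack printing could be enabled with something like

mpirun --mca opal_abort_print_stack 1 ...

or by exporting OMPI_MCA_opal_abort_print_stack=1 in the environment.)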
Cheers,
Gilles
On Wed, May 11, 2016 at 3:43 PM, dpchoudh . <dpcho...@gmail.com> wrote:
> Hello Nathan
>
> Thank you for your response. Could you please be more specific? Adding the
> following after MPI_Init() does not seem to make a difference.
>
> MPI_Init(&argc, &argv);
> signal(SIGABRT, SIG_DFL);
> signal(SIGTERM, SIG_DFL);
>
> I also find it puzzling that a nearly identical OMPI distro running on a
> different machine shows different behaviour.
>
> Best regards
> Durga
>
> The surgeon general advises you to eat right, exercise regularly and quit
> ageing.
>
> On Tue, May 10, 2016 at 10:02 AM, Hjelm, Nathan Thomas
> <hje...@lanl.gov> wrote:
>>
>> btr files are indeed created by Open MPI's backtrace mechanism. I think we
>> should revisit it at some point, but for now the only effective way I have
>> found to prevent it is to restore the default signal handlers after
>> MPI_Init.
>>
>> Excuse the quoting style. Good sucks.
>>
>>
>> ________________________________________
>> From: users on behalf of dpchoudh .
>> Sent: Monday, May 09, 2016 2:59:37 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] No core dump in some cases
>>
>> Hi Gus
>>
>> Thanks for your suggestion. But I am not using any resource manager (i.e.
>> I am launching mpirun from the bash shell). In fact, both of the two
>> clusters I talked about run CentOS 7 and I launch the job the same way on
>> both of them, yet one of them creates standard core files and the other
>> creates the '.btr' files. The strange thing is, I could not find anything
>> on the .btr (= backtrace?) files on Google, which is why I asked on
>> this forum.
>>
>> Best regards
>> Durga
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.
>>
>> On Mon, May 9, 2016 at 12:04 PM, Gus Correa
>> <g...@ldeo.columbia.edu> wrote:
>> Hi Durga
>>
>> Just in case ...
>> If you're using a resource manager to start the jobs (Torque, etc.),
>> you need to have it set the limits (for coredump size, stacksize,
>> locked memory size, etc.).
>> This way the jobs will inherit the limits from the
>> resource manager daemon.
>> On Torque (which I use) I do this in the pbs_mom daemon
>> init script (I am still before the systemd era, that lovely POS),
>> and set the hard/soft limits in /etc/security/limits.conf as well.
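>> For example (a minimal sketch; '*' applies to all users, so adjust the
>> domain and value to your site's policy), lines like these in
>> /etc/security/limits.conf raise the core file size limit:
>>
>> *    soft    core    unlimited
>> *    hard    core    unlimited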
>>
>> I hope this helps,
>> Gus Correa
>>
>> On 05/07/2016 12:27 PM, Jeff Squyres (jsquyres) wrote:
>> I'm afraid I don't know what a .btr file is -- that is not something that
>> is controlled by Open MPI.
>>
>> You might want to look into your OS settings to see if it has some kind of
>> alternate corefile mechanism...?
>>
>>
>> On May 6, 2016, at 8:58 PM, dpchoudh .
>> <dpcho...@gmail.com> wrote:
>>
>> Hello all
>>
>> I run MPI jobs (for test purposes only) on two different 'clusters'. Both
>> 'clusters' have two nodes only, connected back-to-back. The two are very
>> similar, but not identical, both software- and hardware-wise.
>>
>> Both have ulimit -c set to unlimited. However, only one of the two creates
>> core files when an MPI job crashes. The other creates a text file named
>> something like
>>
>> <program_name_that_crashed>.80s-<a-number-that-looks-like-a-PID>,<hostname-where-the-crash-happened>.btr
>>
>> I'd much prefer a core file because that allows me to debug with a lot
>> more options than a static text file with addresses. How do I get a core
>> file in all situations? I am using MPI source from the master branch.
>>
>> Thanks in advance
>> Durga
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.