Hello Gilles,

Thank you for the advice. However, that did not seem to make any difference. Here is what I did (on the cluster that generates .btr files instead of core dumps):
[durga@smallMPI git]$ ompi_info --all | grep opal_signal
    MCA opal base: parameter "opal_signal" (current value: "6,7,8,11", data source: default, level: 3 user/all, type: string)
[durga@smallMPI git]$

According to <bits/signum.h>, signals 6, 7, 8 and 11 are these:

#define SIGABRT    6   /* Abort (ANSI).  */
#define SIGBUS     7   /* BUS error (4.2 BSD).  */
#define SIGFPE     8   /* Floating-point exception (ANSI).  */
#define SIGSEGV    11  /* Segmentation violation (ANSI).  */

And thus I added the following just after MPI_Init():

MPI_Init(&argc, &argv);
signal(SIGABRT, SIG_DFL);
signal(SIGBUS, SIG_DFL);
signal(SIGFPE, SIG_DFL);
signal(SIGSEGV, SIG_DFL);
signal(SIGTERM, SIG_DFL);

(I added the SIGTERM part later, just in case it would make a difference; it didn't.)

The resulting code still generates .btr files instead of core files. It looks like the 'execinfo' MCA component is being used as the backtrace mechanism:

[durga@smallMPI git]$ ompi_info | grep backtrace
        MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v3.0.0)

However, I could not find any way to choose 'none' instead of 'execinfo'. And the strange thing is, on the cluster where regular core dumps do happen, the output of

$ ompi_info | grep backtrace

is identical to the above. (Which kind of makes sense, because both installations were built from the same source with the same configure options.)

Sorry to harp on this, but without a core file it is hard to debug the application (e.g. to examine stack variables).

Thank you
Durga

The surgeon general advises you to eat right, exercise regularly and quit ageing.

On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> Durga,
>
> you might wanna try to restore the signal handler for other signals as well
> (SIGSEGV, SIGBUS, ...)
> ompi_info --all | grep opal_signal
> does list the signals you should restore the handler for
>
> only one backtrace component is built (out of several candidates:
> execinfo, none, printstack)
> nm -l libopen-pal.so | grep backtrace
> will hint you which component was built
>
> your two similar distros might have different backtrace components
>
>
> Gus,
>
> btr is a plain text file with a back trace "a la" gdb
>
>
> Nathan,
>
> i did a 'grep btr' and could not find anything :-(
> opal_backtrace_buffer and opal_backtrace_print are only used with stderr,
> so i am puzzled who creates the trace file name and where ...
> also, no stack is printed by default unless opal_abort_print_stack is true
>
> Cheers,
>
> Gilles
>
>
> On Wed, May 11, 2016 at 3:43 PM, dpchoudh . <dpcho...@gmail.com> wrote:
> > Hello Nathan
> >
> > Thank you for your response. Could you please be more specific? Adding the
> > following after MPI_Init() does not seem to make a difference.
> >
> > MPI_Init(&argc, &argv);
> > signal(SIGABRT, SIG_DFL);
> > signal(SIGTERM, SIG_DFL);
> >
> > I also find it puzzling that a nearly identical OMPI distro running on a
> > different machine shows different behaviour.
> >
> > Best regards
> > Durga
> >
> > The surgeon general advises you to eat right, exercise regularly and quit
> > ageing.
> >
> > On Tue, May 10, 2016 at 10:02 AM, Hjelm, Nathan Thomas <hje...@lanl.gov> wrote:
> >>
> >> btr files are indeed created by open mpi's backtrace mechanism. I think we
> >> should revisit it at some point, but for now the only effective way i have
> >> found to prevent it is to restore the default signal handlers after
> >> MPI_Init.
> >>
> >> Excuse the quoting style. Good sucks.
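For reference, a minimal self-contained test along these lines might look like the following. This is only a sketch, not code from the thread; the file name crash_test.c is hypothetical, and the deliberate NULL dereference is there purely to force a SIGSEGV so that the core-file vs. .btr behaviour can be observed.

/* crash_test.c -- hypothetical minimal reproducer (not from the original post).
 * Restores the default handlers for the signals listed in opal_signal
 * ("6,7,8,11" = SIGABRT, SIGBUS, SIGFPE, SIGSEGV) right after MPI_Init(),
 * then crashes rank 0 on purpose.
 *
 * Assumed build/run:  mpicc crash_test.c -o crash_test && mpirun -np 2 ./crash_test
 */
#include <mpi.h>
#include <signal.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* put back the default (core-dumping) handlers that OPAL replaced */
    signal(SIGABRT, SIG_DFL);
    signal(SIGBUS,  SIG_DFL);
    signal(SIGFPE,  SIG_DFL);
    signal(SIGSEGV, SIG_DFL);
    signal(SIGTERM, SIG_DFL);   /* not in opal_signal; added just in case */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        volatile int *p = 0;
        *p = 42;                /* deliberate SIGSEGV on rank 0 */
    }

    /* the surviving ranks simply wait here until the job is torn down */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}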
> >> ________________________________________
> >> From: users on behalf of dpchoudh .
> >> Sent: Monday, May 09, 2016 2:59:37 PM
> >> To: Open MPI Users
> >> Subject: Re: [OMPI users] No core dump in some cases
> >>
> >> Hi Gus
> >>
> >> Thanks for your suggestion. But I am not using any resource manager (i.e.
> >> I am launching mpirun from the bash shell). In fact, both of the two
> >> clusters I talked about run CentOS 7 and I launch the job the same way on
> >> both of them, yet one of them creates standard core files and the other
> >> creates the '.btr' files. The strange thing is, I could not find anything
> >> on the .btr (= backtrace?) files on Google, which is why I asked on this forum.
> >>
> >> Best regards
> >> Durga
> >>
> >> The surgeon general advises you to eat right, exercise regularly and quit
> >> ageing.
> >>
> >> On Mon, May 9, 2016 at 12:04 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
> >> Hi Durga
> >>
> >> Just in case ...
> >> If you're using a resource manager to start the jobs (Torque, etc.),
> >> you need to have it set the limits (for coredump size, stack size,
> >> locked memory size, etc.).
> >> This way the jobs will inherit the limits from the
> >> resource manager daemon.
> >> On Torque (which I use) I do this in the pbs_mom daemon
> >> init script (I am still before the systemd era, that lovely POS).
> >> And set the hard/soft limits in /etc/security/limits.conf as well.
> >>
> >> I hope this helps,
> >> Gus Correa
> >>
> >> On 05/07/2016 12:27 PM, Jeff Squyres (jsquyres) wrote:
> >> I'm afraid I don't know what a .btr file is -- that is not something that
> >> is controlled by Open MPI.
> >>
> >> You might want to look into your OS settings to see if it has some kind of
> >> alternate corefile mechanism...?
> >>
> >>
> >> On May 6, 2016, at 8:58 PM, dpchoudh . <dpcho...@gmail.com> wrote:
> >>
> >> Hello all
> >>
> >> I run MPI jobs (for test purposes only) on two different 'clusters'. Both
> >> 'clusters' have two nodes only, connected back-to-back. The two are very
> >> similar, but not identical, both software- and hardware-wise.
> >>
> >> Both have ulimit -c set to unlimited. However, only one of the two creates
> >> core files when an MPI job crashes. The other creates a text file named
> >> something like
> >>
> >> <program_name_that_crashed>.80s-<a-number-that-looks-like-a-PID>,<hostname-where-the-crash-happened>.btr
> >>
> >> I'd much prefer a core file because that allows me to debug with a lot
> >> more options than a static text file with addresses. How do I get a core
> >> file in all situations? I am using MPI source from the master branch.
> >>
> >> Thanks in advance
> >> Durga
> >>
> >> The surgeon general advises you to eat right, exercise regularly and quit
> >> ageing.
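As a side note on the ulimit point in the quoted messages above: since each rank inherits its limits from whatever process launched it (the shell, or a resource manager daemon), one way to see the core-dump limit the ranks actually received, and to raise the soft limit up to the hard limit where permitted, is getrlimit()/setrlimit() on RLIMIT_CORE from inside the program. A rough sketch, not taken from the thread; the file name core_limit_check.c is hypothetical.

/* core_limit_check.c -- hypothetical helper, not from the original thread.
 * Prints the RLIMIT_CORE soft/hard limits each rank inherited and tries to
 * raise the soft limit to the hard limit.
 *
 * Assumed build/run:  mpicc core_limit_check.c -o core_limit_check && mpirun -np 2 ./core_limit_check
 */
#include <mpi.h>
#include <stdio.h>
#include <sys/resource.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) == 0) {
        /* RLIM_INFINITY shows up as a very large number here */
        printf("rank %d: core limit soft=%llu hard=%llu\n", rank,
               (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

        /* raise the soft limit as far as the hard limit allows */
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_CORE, &rl) != 0) {
            perror("setrlimit(RLIMIT_CORE)");
        }
    } else {
        perror("getrlimit(RLIMIT_CORE)");
    }

    MPI_Finalize();
    return 0;
}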