Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
Sorry for belabouring this, but this (hopefully final!) piece of information might be of interest to the developers: There must be a reason why PSM is installing its signal handlers; often this is done to modify the permission of a page in response to a SEGV and attempt the access again. By disabli…
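
The preview is truncated, but the mechanism it alludes to is a standard one: a library claims SIGSEGV so it can fix up the faulting page and let the instruction retry. A minimal sketch of the idea, written for this digest (it is not PSM's actual code, and all names are illustrative):

    /* Sketch of the technique alluded to above: a SIGSEGV handler that
     * unprotects the faulting page and returns, so the faulting
     * instruction is retried.  Illustrative only; not PSM's code. */
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *guarded_page;       /* the one page we protect on purpose */

    static void segv_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        long pagesize = sysconf(_SC_PAGESIZE);
        void *fault_page =
            (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesize - 1));
        if (fault_page != guarded_page ||
            mprotect(fault_page, pagesize, PROT_READ | PROT_WRITE) != 0) {
            /* Not our page: put the default action back and re-raise,
             * so a genuine crash still produces a core dump. */
            signal(SIGSEGV, SIG_DFL);
            raise(SIGSEGV);
        }
    }

    int main(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        guarded_page = mmap(NULL, pagesize, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (guarded_page == MAP_FAILED)
            return 1;
        ((char *)guarded_page)[0] = 42;  /* faults once; handler repairs it */
        printf("recovered: %d\n", ((char *)guarded_page)[0]);
        return 0;
    }

The side effect, and the point of this thread: while such a handler is installed, a real crash never reaches the kernel's default core-dump action unless the handler re-raises the signal.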

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
Hello Gilles Mystery solved! In fact, this one line is exactly what was needed!! It turns out the OMPI signal handlers are irrelevant (i.e. they don't make any difference when this env variable is set). This explains: 1. The difference in behaviour between the two clusters (one has PSM, the other does…

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
Hello Ralph I have the latest from master, but I still see this old behaviour. Is your code available on master? Thank you, Durga The surgeon general advises you to eat right, exercise regularly and quit ageing. On Wed, May 11, 2016 at 10:53 PM, Ralph Castain wrote: > This is a known problem -…

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread Gilles Gouaillardet
Note the psm library sets its own signal handler, possibly after the Open MPI one. That can be disabled by: export IPATH_NO_BACKTRACE=1 Cheers, Gilles On 5/12/2016 11:34 AM, dpchoudh . wrote: Hello Gilles Thank you for your continued support. With your help, I have a better understanding…
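
For a self-contained test, the same variable can be set from inside the program before MPI_Init(), on the assumption (not confirmed in this thread) that PSM only reads it when it is initialized during MPI_Init(); a sketch, to be built with mpicc:

    /* Sketch: disable the PSM/ipath backtrace handler from the program
     * itself.  Assumes PSM reads the variable when MPI_Init()
     * initializes it, so the setenv must come first. */
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        setenv("IPATH_NO_BACKTRACE", "1", 1);   /* before MPI_Init */
        MPI_Init(&argc, &argv);
        /* ... code that may crash; the default SIGSEGV action can now
         * produce a core dump instead of a PSM backtrace ... */
        MPI_Finalize();
        return 0;
    }

Exporting IPATH_NO_BACKTRACE=1 in the shell before mpirun achieves the same thing, which is what the "mystery solved" reply above confirms.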

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
Hello Gilles and all, Here is an update: it looks like I have root-caused it. Disabling MPI's signal handlers after MPI_Init() AND mentioning both the PML AND the BTL explicitly does generate a core dump. (Note that just mentioning -mca pml ob1 alone does not do it, which I find strange.) I believe there…

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread Ralph Castain
This is a known problem - I committed the fix for PSM with a link down just today. > On May 11, 2016, at 7:34 PM, dpchoudh . wrote: > > Hello Gilles > > Thank you for your continued support. With your help, I have a better > understanding of what is happening. Here are the details. > > 1. Y…

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
Hello Gilles Thank you for your continued support. With your help, I have a better understanding of what is happening. Here are the details. 1. Yes, I am sure that ulimit -c is 'unlimited' (and for the test in question, I am running it on a single node, so there are no other nodes) 2. The comman…

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread Gilles Gouaillardet
Are you sure ulimit -c unlimited is *really* applied on all hosts? Can you please run the simple program below and confirm that? Cheers, Gilles #include <stdio.h> #include <stdlib.h> #include <sys/time.h> #include <sys/resource.h> int main(int argc, char *argv[]) { struct rlimit rlim; char * c = (char *)0; getrlimit(RLIMIT…
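
The program is cut off at getrlimit(RLIMIT, and the original #include targets were eaten by the archive's HTML stripping (the header names shown above are inferred); a full reconstruction of what the test presumably looked like, not copied from the original mail:

    /* Reconstruction: print the core-file size limit, then write
     * through a NULL pointer to force a SIGSEGV.  Everything past
     * "getrlimit(RLIMIT" in the preview is inferred. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int main(int argc, char *argv[])
    {
        struct rlimit rlim;
        char *c = (char *)0;

        getrlimit(RLIMIT_CORE, &rlim);
        /* RLIM_INFINITY typically prints as -1 with this cast */
        printf("core limit: soft=%lld hard=%lld\n",
               (long long)rlim.rlim_cur, (long long)rlim.rlim_max);
        *c = 0;   /* deliberate NULL write: should dump core if the
                     soft limit is really nonzero on this host */
        return 0;
    }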

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
Hello Gilles Thank you for the advice. However, that did not seem to make any difference. Here is what I did (on the cluster that generates .btr files for core dumps): [durga@smallMPI git]$ ompi_info --all | grep opal_signal MCA opal base: parameter "opal_signal" (current value: "6,7,8…

Re: [OMPI users] openib MTL not working via slurm after update

2016-05-11 Thread Nathan Smith
We determined that this issue was actually due to not having an unlimited memlock for the slurm user when the slurm service started. The workaround was simply to restart slurm after boot, at which point the new unlimited setting would allow InfiniBand usage. Moving the startup script to runlevel 3…
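
This class of problem (a daemon started at boot with a small RLIMIT_MEMLOCK, inherited by every job it launches) can be verified from inside a job, in the spirit of the core-limit test earlier in this digest. The check below is added here, not taken from the original mail, and the program name memlock_check is made up for the example:

    /* Print RLIMIT_MEMLOCK as a process launched by the resource
     * manager actually sees it, e.g.
     *   srun ./memlock_check   or   mpirun -np 1 ./memlock_check
     * RLIM_INFINITY typically prints as -1 with this cast. */
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rlim;
        getrlimit(RLIMIT_MEMLOCK, &rlim);
        printf("memlock: soft=%lld hard=%lld\n",
               (long long)rlim.rlim_cur, (long long)rlim.rlim_max);
        return 0;
    }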

Re: [OMPI users] mpirun command won't run unless the firewalld daemon is disabled

2016-05-11 Thread Llolsten Kaonga
Hello Gilles, Sorry, my last message was a bit garbled. I edited the original and hit the send button while distracted. We actually use one internal port and an external port for the RoCE traffic. I thank you. -- Llolsten From: users [mailto:users-boun...@open-mpi.org] On Behalf O…

Re: [OMPI users] mpirun command won't run unless the firewalld daemon is disabled

2016-05-11 Thread Gilles Gouaillardet
I am not sure I understand your last message. If MPI only needs the internal port, and there is no firewall protecting this port, then simply tell ompi to use it and only it: mpirun --mca oob_tcp_if_include ethxx --mca btl_tcp_if_include ethxx ... otherwise, it should work, but only after some inte…

Re: [OMPI users] mpirun command won't run unless the firewalld daemon is disabled

2016-05-11 Thread Llolsten Kaonga
Hello Gilles/Jeff, Thank you for clarifying this. We have three ports, but the RoCE traffic is supposed to use one of the internal ports. However, we do allow use of one of the external ports, to which we assign a static address. I thank you. -- Llolsten From: users [mailto:users-bou…

Re: [OMPI users] 'AINT' undeclared

2016-05-11 Thread Ilias Miroslav
Ad: https://www.open-mpi.org/community/lists/users/2016/05/29166.php Yes! The PGI stuff in the path was the reason: the OpenMPI build with GNU was picking up the PGI preprocessor. I unloaded the PGI module (and the Intel module, to be sure) and the OpenMPI 1.10.1 static build with GNU compilers went…

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread Gilles Gouaillardet
Durga, you might want to try restoring the signal handler for other signals as well (SIGSEGV, SIGBUS, ...). ompi_info --all | grep opal_signal does list the signals whose handlers you should restore. Only one backtrace component is built (out of several candidates: execinfo, none, printstack). nm -l li…
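
Putting this advice into compilable form: the signal list comes from the opal_signal value quoted earlier in this digest, which is cut off at "6,7,8"; treating 11 (SIGSEGV) as the usual fourth entry is an assumption here. On Linux/x86, 6=SIGABRT, 7=SIGBUS, 8=SIGFPE, 11=SIGSEGV.

    /* Sketch: restore default dispositions for the signals Open MPI
     * hooks, so a crash produces a core dump instead of a library
     * backtrace.  Build with mpicc. */
    #include <signal.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);  /* OMPI (and possibly PSM) hook signals here */

        signal(SIGABRT, SIG_DFL);
        signal(SIGBUS,  SIG_DFL);
        signal(SIGFPE,  SIG_DFL);
        signal(SIGSEGV, SIG_DFL);

        /* a crash after this point should take the default action
         * (core dump), unless another library re-installs handlers */
        MPI_Finalize();
        return 0;
    }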

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
Hello Nathan Thank you for your response. Could you please be more specific? Adding the following after MPI_Init() does not seem to make a difference: MPI_Init(&argc, &argv); signal(SIGABRT, SIG_DFL); signal(SIGTERM, SIG_DFL); I also find it puzzling that nearly identical OMPI distro ru…

Re: [OMPI users] Question about mpirun mca_oob_tcp_recv_handler error.

2016-05-11 Thread Gilles Gouaillardet
Hi, Where did you get the openmpi package from? fc20 ships openmpi 1.7.3 ... Does it work as expected if you do not use mpirun (e.g. ./hello_c)? If yes, then you can try: ldd hello_c; which mpirun; ldd mpirun; mpirun -np 1 ldd hello_c; and confirm both mpirun and hello_c use the same mpi…