Sorry for belabouring this, but this (hopefully final!) piece of
information might be of interest to the developers:
There must be a reason why PSM is installing its signal handlers; often
this is done to modify the permissions of a page in response to a SEGV and
attempt the access again. By disabli
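(For context, a minimal sketch of that technique -- purely an illustration of
the pattern, not PSM's actual code; the PROT_NONE demo page and the handler's
recovery policy are assumptions:)

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long pagesize;

/* On SIGSEGV, grant access to the faulting page; returning from the
   handler restarts the faulting instruction, which then succeeds. */
static void segv_handler(int sig, siginfo_t *si, void *unused)
{
    uintptr_t page = (uintptr_t)si->si_addr & ~((uintptr_t)pagesize - 1);
    mprotect((void *)page, pagesize, PROT_READ | PROT_WRITE);
}

int main(void)
{
    struct sigaction sa;

    pagesize = sysconf(_SC_PAGESIZE);
    memset(&sa, 0, sizeof(sa));
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = segv_handler;
    sigaction(SIGSEGV, &sa, NULL);

    /* A page mapped with no access: the first write faults, the
       handler opens the page, and the write is retried transparently. */
    char *p = mmap(NULL, pagesize, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    p[0] = 42;
    printf("recovered from the fault: p[0] = %d\n", p[0]);
    return 0;
}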
Hello Gilles
Mystery solved! In fact, this one line is exactly what was needed!! It
turns out the OMPI signal handlers are irrelevant (i.e. they don't make any
difference when this env variable is set).
This explains:
1. The difference in the behaviour of the two clusters (one has PSM, the
other does not)
Hello Ralph
I have the latest from master, but I still see this old behaviour. Is your
code available on master?
Thank you
Durga
The surgeon general advises you to eat right, exercise regularly and quit
ageing.
On Wed, May 11, 2016 at 10:53 PM, Ralph Castain wrote:
> This is a known problem -
Note the psm library sets its own signal handler, possibly after the
OpenMPI one. That can be disabled by:
export IPATH_NO_BACKTRACE=1
Cheers,
Gilles
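(To make sure the variable reaches every rank, including ones launched on
remote nodes, it can also be exported through mpirun; the program name is a
placeholder:)

export IPATH_NO_BACKTRACE=1
mpirun -x IPATH_NO_BACKTRACE -np 2 ./a.out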
On 5/12/2016 11:34 AM, dpchoudh . wrote:
Hello Gilles
Thank you for your continued support. With your help, I have a better
understanding
Hello Gilles and all
Here is an update: it looks like I have root-caused it.
Disabling MPI's signal handlers after MPI_Init() AND mentioning both the
PML AND the BTL explicitly does generate a core dump. (Note that mentioning
just -mca pml ob1 alone does not do it, which I find strange.) I believe there
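(Concretely, that combination of explicit PML and BTL selection would look
something like the following; the program name is a placeholder:)

mpirun -np 2 --mca pml ob1 --mca btl tcp,self ./crash_test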
This is a known problem - I committed the fix for PSM with a link down just
today.
> On May 11, 2016, at 7:34 PM, dpchoudh . wrote:
>
> Hello Gilles
>
> Thank you for your continued support. With your help, I have a better
> understanding of what is happening. Here are the details.
>
> 1. Y
Hello Gilles
Thank you for your continued support. With your help, I have a better
understanding of what is happening. Here are the details.
1. Yes, I am sure that ulimit -c is 'unlimited' (and for the test in
question, I am running it on a single node, so there are no other nodes)
2. The comman
Are you sure ulimit -c unlimited is *really* applied on all hosts?
Can you please run the simple program below and confirm that?
Cheers,
Gilles
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(int argc, char *argv[]) {
    struct rlimit rlim;
    char *c = (char *)0;
    /* print the core-size limit actually in effect on this host */
    getrlimit(RLIMIT_CORE, &rlim);
    printf("RLIMIT_CORE: soft=%llu hard=%llu\n",
           (unsigned long long)rlim.rlim_cur,
           (unsigned long long)rlim.rlim_max);
    /* then crash deliberately: a core file should appear if the limit allows */
    *c = 0;
    return 0;
}
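(To confirm the limit on every node in one shot, the same program can be
launched through mpirun; the hostfile and binary names are placeholders:)

mpirun --hostfile myhosts --pernode ./check_rlimit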
Hello Gilles
Thank you for the advice. However, that did not seem to make any
difference. Here is what I did (on the cluster that generates .btr files
for core dumps):
[durga@smallMPI git]$ ompi_info --all | grep opal_signal
MCA opal base: parameter "opal_signal" (current value:
"6,7,8
We determined that this issue was actually due to not having an unlimited
memlock limit for the slurm user when the slurm service started. The
work-around was simply to restart slurm after boot, so that the new unlimited
setting would allow InfiniBand usage. Moving the startup script to runlevel 3
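(For reference, a sketch of the limits configuration involved; note that
/etc/security/limits.conf only applies to PAM sessions, so on systemd-based
systems the slurmd unit needs LimitMEMLOCK=infinity instead:)

# /etc/security/limits.conf
*    soft    memlock    unlimited
*    hard    memlock    unlimited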
Hello Gilles,
Sorry, my last message was a bit garbled. I edited the original and hit the
send button while distracted.
We actually use one internal port and an external port for the RoCE traffic.
I thank you.
--
Llolsten
I am not sure I understand your last message.
If MPI only needs the internal port, and there is no firewall protecting
this port, then simply tell ompi to use it and only it:
mpirun --mca oob_tcp_if_include ethxx --mca btl_tcp_if_include ethxx ...
Otherwise, it should work, but only after some inte
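(Both parameters also accept CIDR subnets instead of interface names, which
helps when NIC names differ across nodes; the subnet here is an assumption:)

mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include 192.168.1.0/24 ...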
Hello Gilles/Jeff,
Thank you for clarifying this.
We have three ports, but the RoCE traffic is supposed to use one of the
internal ports. However, we do allow use of one of the external ports, to
which we assign a static address.
I thank you.
--
Llolsten
Ad: https://www.open-mpi.org/community/lists/users/2016/05/29166.php
Yes! The PGI stuff in the path was the reason ... the OpenMPI build with GNU
was picking up the PGI preprocessor.
I unloaded the PGI module (and Intel module to be sure) and the OpenMPI 1.10.1
static build with GNU compilers went
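(A sketch of one way to guard against that mix-up when rebuilding -- the
module names are assumptions; pinning the compilers explicitly keeps
configure from picking tools out of PATH:)

module unload pgi intel
type gcc g++ gfortran cpp    # confirm the GNU tools are found first
./configure CC=gcc CXX=g++ FC=gfortran --enable-static --disable-shared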
Durga,
you might want to try restoring the signal handlers for the other signals
as well (SIGSEGV, SIGBUS, ...)
ompi_info --all | grep opal_signal
lists the signals whose handlers you should restore (see the sketch below).
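(A minimal sketch, assuming the default opal_signal value of 6,7,8,11, i.e.
SIGABRT, SIGBUS, SIGFPE and SIGSEGV; substitute whatever ompi_info reports
on your build:)

#include <signal.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    /* restore the default handlers OMPI replaced, so a crash dumps core */
    signal(SIGABRT, SIG_DFL);   /* 6  */
    signal(SIGBUS,  SIG_DFL);   /* 7  */
    signal(SIGFPE,  SIG_DFL);   /* 8  */
    signal(SIGSEGV, SIG_DFL);   /* 11 */
    /* ... rest of the application ... */
    MPI_Finalize();
    return 0;
}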
only one backtrace component is built (out of several candidates:
execinfo, none, printstack)
nm -l li
Hello Nathan
Thank you for your response. Could you please be more specific? Adding the
following after MPI_Init() does not seem to make a difference.
MPI_Init(&argc, &argv);
signal(SIGABRT, SIG_DFL);
signal(SIGTERM, SIG_DFL);
I also find it puzzling that nearly identical OMPI distro ru
Hi,
Where did you get the openmpi package from?
fc20 ships openmpi 1.7.3 ...
does it work as expected if you do not use mpirun
(e.g. ./hello_c)?
if yes, then you can try
ldd hello_c
which mpirun
ldd mpirun
mpirun -np 1 ldd hello_c
and confirm both mpirun and hello_c use the same MPI library.