>
> Message: 1
> Date: Mon, 12 Dec 2016 09:32:25 +0900
> From: Gilles Gouaillardet
> To: users@lists.open-mpi.org
> Subject: Re: [OMPI users] Abort/ Deadlock issue in allreduce
> Message-ID: <8316882f-01a
Christof,
Ralph fixed the issue; meanwhile, the patch can be manually downloaded at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/2552.patch
Cheers,
Gilles
On 12/9/2016 5:39 PM, Christof Koehler wrote:
Hello,
our case is this: the libwannier.a is a "third party"
library
> On Dec 9, 2016, at 3:39 AM, Christof Koehler
> wrote:
>
> Hello,
>
> our case is this: the libwannier.a is a "third party"
> library which is built separately and then just linked in. So the vasp
> preprocessor never touches it. As far as I can see no preprocessing of
> the f90 source is involved in the libwannier build process.
Hello,
our case is this: the libwannier.a is a "third party"
library which is built separately and then just linked in. So the vasp
preprocessor never touches it. As far as I can see no preprocessing of
the f90 source is involved in the libwannier build process.
I finally managed to set a breakpoint a
Folks,
the problem is indeed pretty trivial to reproduce
I opened https://github.com/open-mpi/ompi/issues/2550 (and included a
reproducer)
Cheers,
Gilles
On Fri, Dec 9, 2016 at 5:15 AM, Noam Bernstein
wrote:
> On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet
> wrote:
>
> Christof,
>
>
> There is something really odd with this stack trace.
> On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet
> wrote:
>
> Christof,
>
>
> There is something really odd with this stack trace.
> count is zero, and some pointers do not point to valid addresses (!)
>
> in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
> the sta
To the best I can determine, mpirun catches SIGTERM just fine and will hit the
procs with SIGCONT, followed by SIGTERM and then SIGKILL. It will then wait to
see the remote daemons complete after they hit their procs with the same
sequence.
> On Dec 8, 2016, at 5:18 AM, Christof Koehler
> wrote:
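For anyone who wants to see that signal sequence first hand, a small stand-alone C program (an illustration, not something posted in the thread) that logs SIGCONT and SIGTERM as they arrive could look like this; start it under mpirun and send the mpirun process a SIGTERM from another shell:

#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Log signals with write() only; printf is not async-signal-safe. */
static void log_signal(int sig)
{
    if (sig == SIGTERM)
        write(STDERR_FILENO, "got SIGTERM\n", 12);
    else
        write(STDERR_FILENO, "got SIGCONT\n", 12);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = log_signal;
    sigaction(SIGCONT, &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);

    for (;;)            /* wait here; the final SIGKILL cannot be caught */
        pause();
    return 0;
}

Each rank should print "got SIGCONT" and then "got SIGTERM" before the daemons finish the job off with SIGKILL.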
Hello again,
I am still not sure about breakpoints, but I did a "catch signal" in
gdb; gdb instances were attached to the two vasp processes and to mpirun.
When the root rank exits I see in the gdb attaching to it
[Thread 0x2b2787df8700 (LWP 2457) exited]
[Thread 0x2b277f483180 (LWP 2455) exited]
[Inferior
Hello,
On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote:
> Christof,
>
>
> There is something really odd with this stack trace.
> count is zero, and some pointers do not point to valid addresses (!)
Yes, I assumed it was interesting :-) Note that the program is compiled
with
Christof,
There is something really odd with this stack trace.
count is zero, and some pointers do not point to valid addresses (!)
in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
the stack has been corrupted inside MPI_Allreduce(), or that you are not
using the libr
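For reference, a zero-count MPI_Allreduce is a legal call that has nothing to reduce, which is why a crash inside it points at stack corruption or a mismatched library rather than at the call itself. A minimal sketch (illustrative, not from the thread):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int in = 0, out = 0;
    MPI_Init(&argc, &argv);

    /* count == 0: there is nothing to reduce, so the call should return
       MPI_SUCCESS without touching either buffer. */
    int rc = MPI_Allreduce(&in, &out, 0, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("zero-count MPI_Allreduce returned %d\n", rc);

    MPI_Finalize();
    return 0;
}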
Hello everybody,
I tried it with the nightly and the direct 2.0.2 branch from git which
according to the log should contain that patch
commit d0b97d7a408b87425ca53523de369da405358ba2
Merge: ac8c019 b9420bb
Author: Jeff Squyres
Date: Wed Dec 7 18:24:46 2016 -0500
Merge pull request #2528 fr
> On Dec 7, 2016, at 12:37 PM, Christof Koehler
> wrote:
>
>
>> Presumably someone here can comment on what the standard says about the
>> validity of terminating without mpi_abort.
>
> Well, probably stop is not a good way to terminate then.
>
> My main point was the change relative to 1.10
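As a side note on the STOP-versus-MPI_Abort question above, a hedged C sketch (not taken from the thread) of why one rank terminating without MPI_Abort can leave the others stuck, and what the portable alternative looks like:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rough analogue of Fortran STOP: leave without telling MPI.
           Whether the remaining ranks get cleaned up is runtime-dependent. */
        exit(1);
        /* The portable way to kill the whole job instead:
           MPI_Abort(MPI_COMM_WORLD, 1); */
    }

    /* Every other rank waits in a collective that rank 0 never enters. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}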
Hi Christof
Sorry if I missed this, but it sounds like you are saying that one of your
procs abnormally terminates, and we are failing to kill the remaining job? Is
that correct?
If so, I just did some work, pending in PR #2528, that might relate to that
problem: https://github.com/open-mpi/ompi/pull/2528
Hello,
On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> > On Dec 7, 2016, at 10:07 AM, Christof Koehler
> > wrote:
> >>
> > I really think the hang is a consequence of
> > unclean termination (in the sense that the non-root ranks are not
> > terminated) and probably not the cause, in my interpretation of what I see.
> On Dec 7, 2016, at 10:07 AM, Christof Koehler
> wrote:
>>
> I really think the hang is a consequence of
> unclean termination (in the sense that the non-root ranks are not
> terminated) and probably not the cause, in my interpretation of what I
> see. Would you have any suggestion to catch sig
Hello,
On Wed, Dec 07, 2016 at 11:07:49PM +0900, Gilles Gouaillardet wrote:
> Christof,
>
> out of curiosity, can you run
> dmesg
> and see if you find some tasks killed by the oom-killer ?
Definitively not the oom-killer. It is a real tiny example. I checked
the machines logfile and dmesg.
>
>
Hello again,
attaching gdb to mpirun, the backtrace when it hangs is
(gdb) bt
#0 0x2b039f74169d in poll () from /usr/lib64/libc.so.6
#1 0x2b039e1a9c42 in poll_dispatch () from
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x2b039e1a2751 in opal_libevent2022_even
Hello,
thank you for the fast answer.
On Wed, Dec 07, 2016 at 08:23:43PM +0900, Gilles Gouaillardet wrote:
> Christoph,
>
> can you please try again with
>
> mpirun --mca btl tcp,self --mca pml ob1 ...
mpirun -n 20 --mca btl tcp,self --mca pml ob1
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin
Christoph,
can you please try again with
mpirun --mca btl tcp,self --mca pml ob1 ...
that will help figuring out whether pml/cm and/or mtl/psm2 is involved or not.
if that causes a crash, then can you please try
mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...
that will help fig
Hi Jeff,
I've reproduced your test here, with the same results. Moreover, if I
put the nodes with rank>0 into a blocking MPI call (MPI_Bcast or
MPI_Barrier) I still get the same behavior; namely, rank 0's calling
abort() generates a core file and leads to termination, which is the
behavior I want
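In outline, that reproducer is presumably something like the following sketch (a reconstruction from the description above, not the original source): rank 0 calls abort() while every other rank sits in a blocking collective.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        abort();    /* should drop a core (ulimit permitting) and kill the job */

    /* Ranks > 0 block in a collective that can never complete. */
    MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}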
FWIW, I'm unable to replicate your behavior. This is with Open MPI 1.4.2 on
RHEL5:
[9:52] svbu-mpi:~/mpi % cat abort.c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (0 == rank) {
I've tried both--as you said, MPI_Abort doesn't drop a core file, but
does kill off the entire MPI job. abort() drops core when I'm running
on 1 processor, but not in a multiprocessor run. In addition, a node
calling abort() doesn't lead to the entire run being killed off.
David
On Mon, 2010-0
On Aug 13, 2010, at 12:53 PM, David Ronis wrote:
> I'm using mpirun and the nodes are all on the same machine (an 8 cpu box
> with an intel i7). coresize is unlimited:
>
> ulimit -a
> core file size (blocks, -c) unlimited
That looks good.
In reviewing the email thread, it's not entirely
I'm using mpirun and the nodes are all on the same machine (an 8 cpu box
with an intel i7). coresize is unlimited:
ulimit -a
core file size (blocks, -c) unlimited
David
On Fri, 2010-08-13 at 13:47 -0400, Jeff Squyres wrote:
> On Aug 13, 2010, at 1:18 PM, David Ronis wrote:
>
> > Second
On Aug 13, 2010, at 1:18 PM, David Ronis wrote:
> Second coredumpsize is unlimited, and indeed I DO get core dumps when
> I'm running a single-processor version.
What launcher are you using underneath Open MPI?
You might want to make sure that the underlying launcher actually sets the
coredumpsize.
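One way to take the launcher out of the equation is to have the program itself report, and if possible raise, its core-file limit at startup; a small getrlimit/setrlimit sketch (illustrative, not from the thread) that could be dropped in right after MPI_Init:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* Report the core-file size limit this process actually inherited. */
    if (getrlimit(RLIMIT_CORE, &rl) == 0)
        printf("core limit: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

    /* Raise the soft limit to the hard limit; that is all an unprivileged
       process may do if the launcher lowered it. */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_CORE, &rl) != 0)
        perror("setrlimit");

    return 0;
}

If the soft limit printed there is 0 even though the interactive shell says unlimited, something in the launch path is resetting it.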
Thanks to all who replied.
First, I'm running openmpi 1.4.2.
Second coredumpsize is unlimited, and indeed I DO get core dumps when
I'm running a single-processor version. Third, the problem isn't
stopping the program (MPI_Abort does that just fine), rather it's getting
a coredump. According t
David Zhang wrote:
When my MPI code fails (seg fault), it usually causes the rest of the mpi
processes to abort as well. Rather than calling abort(), perhaps
you could do a divide-by-zero operation to halt the program?
David Zhang
University of California, San Diego
>
On Thu, Aug 12, 201
Sounds very strange - what OMPI version, on what type of machine, and how was
it configured?
On Aug 12, 2010, at 7:49 PM, David Ronis wrote:
> I've got an mpi program that is supposed to generate a core file if
> problems arise on any of the nodes. I tried to do this by adding a
> call to a
When my MPI code fails (seg fault), it usually causes the rest of the mpi
processes to abort as well. Rather than calling abort(), perhaps you
could do a divide-by-zero operation to halt the program?
On Thu, Aug 12, 2010 at 6:49 PM, David Ronis wrote:
> I've got an mpi program that is supposed to generate a core file if problems arise on any of the nodes.
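For completeness, the divide-by-zero trick suggested above boils down to forcing a fatal signal whose default action is to terminate and dump core; a hedged sketch (integer division by zero is technically undefined behaviour in C, so raise(SIGFPE) is the more deterministic way to get the same effect):

#include <signal.h>

int main(void)
{
    volatile int zero = 0;
    volatile int boom = 1 / zero;   /* usually traps with SIGFPE on x86 */
    (void)boom;

    raise(SIGFPE);                  /* reached only if the division did not trap */
    return 0;
}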