Hi,

You should definitely try everything the people before me have mentioned.
Also, try running a single process per node and see whether the hang still happens.
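
For example (just a sketch - the hostfile name and host names below are made up, so adjust them to your setup), you could copy your hostfile, set one slot per machine, and launch 10 processes instead of 20:

  # bhost.single - one slot per iMac
  #   imac01 slots=1
  #   imac02 slots=1
  #   ...
  mpirun -np 10 --hostfile bhost.single nice -19 <your usual run_torus.pl command line>

If the hang disappears with one process per node, that would at least suggest the problem involves having two processes on the same node (e.g., the sm btl).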

I don't have any great insight into this issue, but I did have a similar problem in March. Unfortunately it went away (I don't remember how - either because I stopped running those tests or... - so we never filed a bug). That won't help you directly, but maybe it will help Brian for debugging purposes.

Anyway, here is what I think was happening with our bug.
We would occasionally see a benchmark livelock in Barrier with a large(r) number of processes (80+), running two processes per node, over (pml ob1, btl mx,sm,self). However, the problem did not occur with (pml cm, mtl mx).
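
For reference, this is roughly how we switched between the two configurations (./benchmark stands in for whatever test you run; on your TCP cluster the analogous ob1 run would use the tcp btl rather than mx):

  # hangs occasionally: ob1 PML over the mx, sm and self BTLs
  mpirun --mca pml ob1 --mca btl mx,sm,self -np 80 ./benchmark

  # never hangs: cm PML over the mx MTL
  mpirun --mca pml cm --mca mtl mx -np 80 ./benchmark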

We ended up concluding that, on one node, the following would occur:
1) the send request from the barrier would still be marked as active
2) the pending queue was empty and the remaining data size was 0,
   implying that MCA_PML_BASE_REQUEST_MPI_COMPLETE was never called.

After quite a bit of debugging, we concluded that the sequence numbers of the packets in the requests on the receiver side were not correct. So the message was either sent with an incorrect sequence number, or received with an incorrect one (lower than the "correct" sequence number, and thus ignored). This conclusion may or may not be correct :-)

I am pretty sure that the problem is not in the collective itself. I would run 1 process per node until Brian is done with his stuff.

Cheers,
Jelena

On Tue, 19 Jun 2007, Jeff Squyres wrote:

On Jun 19, 2007, at 9:18 AM, Chris Reeves wrote:

I've had a look through the FAQ and searched the list archives and can't find any similar problems to this one.

I'm running OpenMPI 1.2.2 on 10 Intel iMacs (Intel Core2 Duo CPU). I am specifying two slots per machine and starting my job with:

/Network/Guanine/csr201/local-i386/opt/openmpi/bin/mpirun -np 20 --hostfile bhost.jobControl nice -19 /Network/Guanine/csr201/jobControl/run_torus.pl /Network/Guanine/csr201/models-gap/torus/torus.ompiosx-intel

The config.log and output of 'ompi_info --all' are attached.

Also attached is a small patch that I wrote to work around some firewall limitations on the nodes (I don't know if there's a better way to do this - suggestions are welcome). The patch may or may not be relevant, but I'm not ruling out network issues, and a bit of peer review never goes amiss in case I've done something very silly.

From the looks of the patch, you just want Open MPI to restrict itself to a specific range of ports, right?  If that's the case, we'd probably do this slightly differently (with MCA parameters -- we certainly wouldn't want to force everyone to use a hard-coded port range).  Brian's also re-working some TCP and OOB issues on a /tmp branch right now; we'd want to wait until he's done before applying a similar patch.
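
Just to illustrate the idea (these parameter names are hypothetical -- they don't exist in 1.2.2, as far as I know -- but this is the general shape an MCA-parameter-based approach would take):

  mpirun --mca btl_tcp_port_min 10000 --mca btl_tcp_port_range 100 ...

That way nobody is forced into a hard-coded range, and sites that need one can set it on the command line or in an MCA parameter file.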

The programme that I'm trying to run is fairly hefty, so I'm afraid that I can't provide you with a simple test case to highlight the problem. The best I can do is provide you with a description of where I'm at and then ask for some advice/suggestions...

The code itself has run in the past with various versions of MPI/LAM and OpenMPI and hasn't, to my knowledge, undergone any significant changes recently. I have noticed delays before, both on this system and on others, when MPI_BARRIER is called, but they don't always result in a permanent 'spinning' of the process.

My first question is: why are you calling MPI_BARRIER?  ;-)

Clearly, if we're getting stuck in there, it could be a bug.  Have you run your code through a memory-checking debugger?  It's hard to say exactly what the problem is without more information -- it could be your app, it could be OMPI, it could be the network, ...
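
For example, if a memory checker such as valgrind is available for your platform (it may or may not be for OS X), something along these lines is a reasonable first pass -- ./your_app is just a placeholder for however you normally launch your binary:

  mpirun -np 2 --hostfile bhost.jobControl valgrind --leak-check=full ./your_app

Even a small-scale run can flag memory errors that only bite at larger process counts.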

It's a good datapoint to run with other MPI implementations, but "it worked with MPI X" isn't always an iron-clad indication that the new MPI is at fault.  I'm not saying we don't have bugs in Open MPI :-) -- I'm just saying that I agree with you: more data is necessary.

The 20-node job that I'm running right now is using 90-100% of every CPU, but hasn't made any progress for around 14 hours. I've used GDB to attach to each of these processes and verified that every single one of them is sitting inside a call to MPI_BARRIER. My understanding is that once every process hits the barrier, they should then move on to the next part of the code.

That's correct.  FWIW, you shouldn't need to wait 14 hours to tell this; you can assume that if *all* processes are stopped in the same MPI_BARRIER for any length of time (including just a few seconds), the job is hung.

Here's an example of what I see when I attach to one of these processes:
------------------------------------------------------------------------------

Attaching to program: `/private/var/automount/Network/Guanine/csr201/models-gap/torus/torus.ompiosx-intel', process 29578.
Reading symbols for shared libraries ..+++++.................................................................... done
0x9000121c in sigprocmask ()
(gdb) where
#0  0x9000121c in sigprocmask ()
#1  0x01c46f96 in opal_evsignal_recalc ()
#2  0x01c458c2 in opal_event_base_loop ()
#3  0x01c45d32 in opal_event_loop ()
#4  0x01c3e6f2 in opal_progress ()
#5  0x01b6083e in ompi_request_wait_all ()
#6  0x01ec68d8 in ompi_coll_tuned_sendrecv_actual ()
#7  0x01ecbf64 in ompi_coll_tuned_barrier_intra_bruck ()
#8  0x01b75590 in MPI_Barrier ()

Just a quick sanity check: I assume the call stack is the same on all processes, right?  I.e., ompi_coll_tuned_barrier_intra_bruck() is the call right after MPI_BARRIER?
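
If it helps, here's one quick way to grab backtraces from several processes on a node without attaching interactively (a rough sketch -- substitute the actual PIDs and the real path to your binary):

  printf 'bt\ndetach\nquit\n' > /tmp/bt.gdb
  for pid in 29578 29579; do
    gdb -batch -x /tmp/bt.gdb /path/to/torus.ompiosx-intel $pid
  done

That makes it easy to confirm whether every rank really is parked in the same spot.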

Does anyone have any suggestions as to what might be happening here? Is there any way to 'tickle' the processes and get them to move on?

It is unlikely.  If everyone is waiting in the barrier, then something went wrong.  They're going to stay there until OMPI thinks that everyone has hit the barrier.

What if some packets went missing on the network? Surely TCP should take care of this and resend?

One would assume so, yes.  But the timeout may be very, very long.

What is the topology of the network that you're running on?

As implied by my line of questioning, my current thoughts are that some messages between nodes have somehow gone missing. Could this happen? What could cause this? All machines are on the same subnet.

Hmm.  On a single subnet, but you need the firewall capability -- are they physically remote from each other, or do you just have the local firewalling capabilities enabled on each node?

--
Jeff Squyres
Cisco Systems

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jelena Pjesivac-Grbovic, Pjesa
Graduate Research Assistant
Innovative Computing Laboratory
Computer Science Department, UTK
Claxton Complex 350
(865) 974 - 6722 (865) 974 - 6321
jpjes...@utk.edu

Murphy's Law of Research:
        Enough research will tend to support your theory.
