On Thu, 26 Oct 2006 15:11:46 -0600, George Bosilca <bosi...@cs.utk.edu> wrote:

> The Open MPI behavior is the same independently of the network used
> for the job. At least the behavior dictated by our internal message
> passing layer.

Which is one of the things I like about Open MPI.

> There is nothing (that has a reasonable cost) we can do about this.

Nor do I think anything should be done. In all honesty, I think it's a good thing that TCP & Myrinet have such long timeouts. It makes administration a bit less scary; if you accidentally unplug the network cable from the wrong node during maintenance, neither the MPI run nor the administrator loses a job.

I'm also confident that both TCP & Myrinet would throw an error when they time out; it's just that I haven't felt the need to verify it. (And with a some-odd-20-minute timeout for Myrinet, it takes a bit of attention span; the last time I tried it, I'd forgotten about it for 3-4 hours.)

> If none are available, then Open
> MPI is supposed to abort the job. For your particular run did you have
> Ethernet between the nodes? If yes, I'm quite sure the MPI run
> wasn't stopped ... it continued using the TCP device (if not disabled
> by hand at mpirun time).

This brings up an interesting question: the job was simply Intel's MPI benchmark (IMB), which is fairly chatty (i.e. lots of screen output).

On the first try, I used '--mca btl ^gm,^mx' to start the job. Ethernet was connected (eth0 = 10/100, eth1 = gigabit), but after the IB cable was disconnected, everything stopped. The link lights (Ethernet & IB) were not blinking, nor did any of the system monitors show much TCP traffic; certainly not the sort of traffic one would expect from an IMB run.

I've also tried using '--mca btl openib,sm,self,tcp' (specifically adding TCP) and didn't see any difference; the job still got 'stuck' as soon as the IB cable was removed. I'll let that job continue to run overnight (i.e. --mca btl tcp,openib,sm,self) to see if the job ever wakes up.
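
(For the archives, the full invocations look roughly like this; the process count, hostfile, and IMB binary path are just stand-ins from my setup, nothing significant:)

  # First try: exclude the Myrinet BTLs (a single leading ^ excludes
  # everything in the list), leaving openib and tcp available:
  mpirun -np 16 --hostfile ./hosts --mca btl ^gm,mx ./IMB-MPI1

  # Second try: name the BTLs explicitly, with tcp added on purpose:
  mpirun -np 16 --hostfile ./hosts --mca btl openib,sm,self,tcp ./IMB-MPI1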

With --mca btl ^tcp (or --mca btl openib,sm,self):

I get the messages that something is amiss with the IB fabric (as expected). However, the job does *not* abort. Every (MPI) process on every node in the job is still active and consuming 100% of its CPU (busy-waiting, I imagine).
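
(In case anyone wants to reproduce this, the quick-and-dirty way I've been confirming the spin is below; the binary name and the pid are of course placeholders:)

  # Show the surviving ranks and their CPU usage / state:
  ps -o pid,pcpu,stat,comm -C IMB-MPI1

  # Attach a debugger to one of them and run 'bt' to see where it is
  # spinning (then 'detach' and 'quit' to leave it running):
  gdb -p 12345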

> PS: There are several internal message passing modules available for
> Open MPI. The default one looks more for performance than
> reliability. If reliability is what you need, you should use the DR
> PML. For this, you can specify --mca pml dr at mpirun time. This (DR)
> PML has data reliability and timeouts (Open MPI internal timeouts that
> are configurable), allowing it to recover faster from a network failure.
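
Presumably the invocation would be something along these lines (same placeholder hostfile and binary as above):

  # Ask for the DR PML instead of the default, everything else unchanged:
  mpirun -np 16 --hostfile ./hosts --mca pml dr ./IMB-MPI1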

I don't have such a component. Hopefully it's just the version of Open MPI I'm using (1.1), or a ./configure option I didn't use. (If it should be in 1.1, I'll take a deeper look and can provide things like the config.log, etc.; I just don't want to flood the list at the moment.)
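
For the record, this is roughly how I looked for it ($MPI_PREFIX below is just a stand-in for wherever Open MPI is installed):

  # ompi_info lists the PML components it was built with; if DR were
  # present, there would be an "MCA pml: dr ..." line here:
  ompi_info | grep "MCA pml"

  # In a default (shared/DSO) build the components also show up in the
  # install tree as mca_pml_<name>.so:
  ls $MPI_PREFIX/lib/openmpi | grep pml
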
--
Troy Telford
