Hmm, that is the way that I expected it to work as well -
we see the warnings also, but closely followed by the
errors (I've been trying both 1.2.5 and a recent 1.3
snapshot with the same behavior). You don't have the
mx driver loaded on the nodes that do not have a myrinet
card, do you? Our mx is a touch behind yours (1.2.3),
but I agree that it appears to be something in the process
startup that is at fault, so it doesn't seem likely that
the mx version is to blame (perhaps just the fact that it
is not installed on those nodes?).
Matt
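Matt's question above (is the mx driver loaded on nodes without a Myrinet card?) can be checked quickly from a shell. This is a sketch, not from the thread: the kernel module name and the `/opt/mx` install prefix are assumptions and may differ on your cluster; `libmyriexpress` is MX's userspace library.

```shell
# On a node WITHOUT a Myrinet card, check whether an MX kernel
# module is loaded anyway (module name assumed to match "mx"):
lsmod | grep -i mx

# Also check whether the MX userspace library is installed, which
# can confuse Open MPI's transport selection at startup
# (/opt/mx is an assumed default install prefix):
ls /opt/mx/lib/libmyriexpress.so 2>/dev/null || echo "no MX library found"
```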
On Wed, 16 Jan 2008, 8mj6tc...@sneakemail.com wrote:
We also have a mixed myrinet/ip cluster, and maybe I'm missing some
nuance of your configuration, but openmpi seems to work fine for me "as
is" with no --mca options across mixed nodes (there's a bunch of
warnings at the beginning where the non-mx nodes realize they don't have
myrinet cards and the mx nodes realize they can't talk mx to the non-mx
nodes, but everything completes fine, so I assumed OpenMPI was working
out the transport details on its own (and was quite pleased
about that)).
I just did a quick test to confirm that it is in fact still using mx in
that situation, and it is. I'm running OpenMPI 1.2.4 and MX 1.2.3.
It sounds to me, based on those "PML add procs failed" messages, that
OpenMPI is dying at startup on the non-mx nodes unless you explicitly
disable mx at runtime (perhaps because they're expecting the mx library
to be there, but it's not?).
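One way to probe that hypothesis is to ask the Open MPI build itself what it expects. A hedged sketch: `ompi_info` is Open MPI's standard introspection tool, and `btl_base_verbose` is a real MCA verbosity parameter, but exact component listings vary by release, and `./hello_mpi` is a placeholder program.

```shell
# List the MX-related components this Open MPI build was compiled
# with. If "btl: mx" shows up here but the MX library is missing on
# a node, startup can fail exactly as described above:
ompi_info | grep -i mx

# Re-run with verbose transport selection to see where startup dies:
mpirun --mca btl_base_verbose 50 -np 2 ./hello_mpi
```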
users-request-at-open-mpi.org wrote:
Date: Tue, 15 Jan 2008 10:25:00 -0500 (EST)
From: M D Jones <jon...@ccr.buffalo.edu>
Subject: Re: [OMPI users] mixed myrinet/non-myrinet nodes
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <pine.lnx.4.64.0801151018430.18...@mail.ccr.buffalo.edu>
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Hmm, that combination seems to hang on me - but
'--mca pml ob1 --mca btl ^mx' does indeed do the trick.
Many thanks!
Matt
On Tue, 15 Jan 2008, George Bosilca wrote:
This case actually works. We ran into it a few days ago, when we discovered
that one of the compute nodes in a cluster didn't get its Myrinet card
installed properly ... The performance was horrible, but the application ran
to completion.
You will have to use the following flags: --mca pml ob1 --mca btl mx,tcp,self
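Putting the two working suggestions from this thread side by side, a sketch of the invocations (process count and `./my_app` are placeholders; both flag sets come straight from the messages above):

```shell
# George's suggestion: force the ob1 PML and let Open MPI choose
# per-peer among MX, TCP, and self (loopback):
mpirun --mca pml ob1 --mca btl mx,tcp,self -np 16 ./my_app

# Matt's confirmed-working alternative: same PML, but exclude MX
# entirely ("^" negates the list, allowing every transport but mx):
mpirun --mca pml ob1 --mca btl ^mx -np 16 ./my_app
```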