Well, it is getting better! :-)

On your cmd line, which BTLs are you specifying? You should try -mca btl
sm,tcp,self for this to work. Reason: some systems block TCP loopback
on the node. What I see below indicates that inter-node comm was fine, but
the two procs that share a node couldn't communicate. Including shared
memory (sm) should remove that problem.
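
For example, the full command would look something like this (the process
count and the ./connectivity test program are just placeholders for whatever
you are actually running):

  mpirun -np 3 -mca btl sm,tcp,self -mca btl_base_verbose 30 ./connectivity

With the verbose flag on, you should then also see a "select: init of
component sm returned success" line alongside the tcp one, confirming that
shared memory is actually being picked up.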

The port numbers are fine - they can be different or the same, since they are
chosen at random. The procs exchange their respective port info during wireup.


On Wed, Aug 12, 2009 at 12:51 PM, Jody Klymak <jkly...@uvic.ca> wrote:

> Hi Ralph,
> That gives me something more to work with...
>
>
> On Aug 12, 2009, at  9:44 AM, Ralph Castain wrote:
>
> I believe TCP works fine, Jody, as it is used on Macs fairly widely. I
> suspect this is something funny about your installation.
>
> One thing I have found is that you can get this error message when you have
> multiple NICs installed, each with a different subnet, and the procs try to
> connect across different ones. Do you by chance have multiple NICs?
>
>
> The head node has two active NICs:
> en0: public
> en1: private
>
> The server nodes only have one connection:
> en0: private
>
>
> Have you tried telling OMPI which TCP interface to use? You can do so with
> -mca btl_tcp_if_include eth0 (or whatever you want to use).
>
>
> If I try this, I get the same results (though I need to use "en0" on my
> machine)...
>
> If I include -mca btl_base_verbose 30, I get the following for n=2:
>
> ++++++++++
> [xserve03.local:00841] select: init of component tcp returned success
> Done MPI init
> checking connection between rank 0 on xserve02.local and rank 1
> Done MPI init
> [xserve02.local:01094] btl: tcp: attempting to connect() to address
> 192.168.2.103 on port 4
> Done checking connection between rank 0 on xserve02.local and rank 1
> Connectivity test on 2 processes PASSED.
> ++++++++++
>
> If I try n=3, the job hangs and I have to kill it:
>
> ++++++++++
> Done MPI init
> checking connection between rank 0 on xserve02.local and rank 1
> [xserve02.local:01110] btl: tcp: attempting to connect() to address
> 192.168.2.103 on port 4
> Done MPI init
> Done MPI init
> checking connection between rank 1 on xserve03.local and rank 2
> [xserve03.local:00860] btl: tcp: attempting to connect() to address
> 192.168.2.102 on port 4
> Done checking connection between rank 0 on xserve02.local and rank 1
> checking connection between rank 0 on xserve02.local and rank 2
> Done checking connection between rank 0 on xserve02.local and rank 2
> mpirun: killing job...
> ++++++++++
>
> Those IP addresses are correct; no idea if port 4 makes sense.  Sometimes I
> get port 260.  Should xserve03 and xserve02 be trying to use the same port
> for these comms?
>
>
> Thanks,  Jody
>
>
>
>
>
> On Wed, Aug 12, 2009 at 10:01 AM, Jody Klymak <jkly...@uvic.ca> wrote:
>
>>
>> On Aug 11, 2009, at  18:55 PM, Gus Correa wrote:
>>
>>
>>> Did you wipe off the old directories before reinstalling?
>>>
>>
>> Check.
>>
>>> I prefer to install on an NFS-mounted directory,
>>>
>>
>> Check
>>
>>
>>> Have you tried to ssh from node to node on all possible pairs?
>>>
>>
>> check - fixed this today, works fine with the spawning user...
>>
>>> How could you roll back to 1.1.5,
>>> now that you overwrote the directories?
>>>
>>
>> Oh, I still have it on another machine off the cluster in
>> /usr/local/openmpi.  Will take just 5 minutes to reinstall.
>>
>>> Launching jobs with Torque is way better than
>>> using barebones mpirun.
>>>
>>
>>> And you don't want to fall behind on the OpenMPI versions
>>> and improvements either.
>>>
>>
>> Sure, but I'd like the jobs to be able to run at all...
>>
>> Is there any sense in rolling back to 1.2.3, since that is known to work
>> with OS X (it's the one that comes with 10.5)?  My only guess at this point
>> is that other OS X users are using non-TCP communication, and the TCP stuff
>> just doesn't work in 1.3.3.
>>
>> Thanks,  Jody
>>
>> --
>> Jody Klymak
>> http://web.uvic.ca/~jklymak/
>>
>>
>>
>>
>
>
>
> --
> Jody Klymak
> http://web.uvic.ca/~jklymak/
>
>
>
>
>
>
