Date: Tue, 25 Feb 2014 20:07:31 -0500 (EST)
From: Doug Roberts
To: us...@open-mpi.org
Subject: Re: [OMPI users] Connection timed out with multiple nodes
Hello again. The "oob_stress" program runs cleanly on each of
the two test nodes bro127 and bro128:
[bro128:04462] [[23275,0],0] plm:base:receive stop comm
[bro128:04462] [[23275,0],0] plm:base:local:slave:finalize
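For reference, plm:base trace lines like the two above appear when PLM
verbosity is enabled; a sketch, with the verbosity level and oob_stress
path assumed rather than taken from the post:

  # log PLM state transitions (produces the plm:base:... lines seen above)
  mpirun --mca plm_base_verbose 5 -np 2 -host bro127,bro128 ./oob_stress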
-- Forwarded message --
Date: Fri, 31 Jan 2014 13:55:41 -0800
From: Ralph Castain
Reply-To: Open MPI Users
To: Open MPI Users
Subject: Re: [OMPI users] Connection timed out with multiple nodes
The only relevant parts are from the application procs - orterun and the orted
don't participate in this exchange and never see the BTLs anyway.
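Since only the application processes open the BTLs, their view can be
surfaced directly; a sketch, with an assumed verbosity level:

  # make the application procs log BTL endpoint/connection activity
  mpirun --mca btl_base_verbose 30 -np 2 -host bro127,bro128 ./mpi_test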
It looks like there is just something blocking data transfer across eth2 for
some reason. I'm afraid I have no idea why - can you run a standard (i.e.,
non-MPI) data transfer test across eth2?
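One way to run that standard, non-MPI transfer test is to push data at
bro128's eth2 address directly; a sketch, where the placeholder must be
replaced with the actual eth2 IP:

  # generate ~100 MB of test data, then copy it over the eth2 path;
  # a stall here implicates the network, not Open MPI
  dd if=/dev/zero of=/tmp/blob bs=1M count=100
  scp /tmp/blob <bro128-eth2-ip>:/tmp/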
It's the failure on readv that's the source of the trouble. What
happens if you only if_include eth2? Does it work then?
Still hangs, details follow ... tx!
o Using only eth2 with verbosity gives:
[roberpj@bro127:~/samples/mpi_test]
/opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 -
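The archived command is cut off above; an invocation of that shape,
restricting Open MPI's TCP traffic to eth2, would look roughly like the
following. The MCA flags are an assumption based on Ralph's suggestion,
not recovered from the post:

  # hypothetical reconstruction: force both OOB and BTL TCP traffic onto eth2
  /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 \
      --mca oob_tcp_if_include eth2 --mca btl_tcp_if_include eth2 \
      ./mpi_test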
On Jan 23, 2014, at 5:38 PM, Doug Roberts wrote:
>
>> Date: Fri, 17 Jan 2014 19:24:50 -0800
>> From: Ralph Castain
>>
>> The most common cause of this problem is a firewall between the
>> nodes - you can ssh across, but not communicate. Have you checked
>> to see that the firewall is turned off?
>
> Turns out some iptables rules (typical on our clusters) were active.
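For anyone hitting the same symptom, the iptables state can be inspected
and, for a quick test, cleared on both nodes (temporarily, and assuming
root access):

  # list the active filter rules; look for DROP/REJECT entries covering eth2
  iptables -L -n -v
  # temporary diagnostic only: flush all rules, rerun the job, then restore them
  iptables -F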
On Jan 17, 2014, at 4:59 PM, Doug Roberts wrote:
>
> 1) When openmpi programs run across multiple nodes they hang
> rather quickly as shown in the mpi_test example below. Note
> that I am assuming the initial topology error message is a
> separate issue since single node openmpi jobs run just fine.
>
> [roberpj@bro127:~/samples/mpi_test]
> /opt/sharcnet/op
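That command is truncated in the archive; based on the working directory
shown in the prompt, the two-node run was presumably of roughly this shape
(everything after the mpirun path is an assumption):

  # hypothetical reconstruction of the run that hangs across two nodes
  /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 -host bro127,bro128 ./mpi_test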