Re: [OMPI users] Connection timed out with multiple nodes

2014-02-26 Thread Doug Roberts
Date: Tue, 25 Feb 2014 20:07:31 -0500 (EST) From: Doug Roberts Hello again, The "oob_stress" program runs cleanly on each of the two test nodes bro127 and bro12

Re: [OMPI users] Connection timed out with multiple nodes

2014-02-25 Thread Doug Roberts
23275,0],0] plm:base:receive stop comm [bro128:04462] [[23275,0],0] plm:base:local:slave:finalize -- Forwarded message -- List-Post: users@lists.open-mpi.org Date: Fri, 31 Jan 2014 13:55:41 -0800 From: Ralph Castain Reply-To: Open MPI Users To: Open MPI Users Subject: Re: [OMPI users] Connection

Re: [OMPI users] Connection timed out with multiple nodes

2014-01-31 Thread Ralph Castain
The only relevant parts are from the application procs - orterun and the orted don't participate in this exchange and never see the BTLs anyway. It looks like there is just something blocking data transfer across eth2 for some reason. I'm afraid I have no idea why - can you run a standard (i.e.,

Re: [OMPI users] Connection timed out with multiple nodes

2014-01-31 Thread Doug Roberts
> It's the failure on readv that's the source of the trouble. What happens if you only if_include eth2? Does it work then?

Still hangs, details follow ... tx!

o Using only eth2 with verbosity gives:

[roberpj@bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 -

Re: [OMPI users] Connection timed out with multiple nodes

2014-01-23 Thread Ralph Castain
It's the failure on readv that's the source of the trouble. What happens if you only if_include eth2? Does it work then?

On Jan 23, 2014, at 5:38 PM, Doug Roberts wrote:

>> Date: Fri, 17 Jan 2014 19:24:50 -0800
>> From: Ralph Castain
>>
>> The most common cause of this problem is a firewa

Re: [OMPI users] Connection timed out with multiple nodes

2014-01-23 Thread Doug Roberts
> Date: Fri, 17 Jan 2014 19:24:50 -0800
> From: Ralph Castain
>
> The most common cause of this problem is a firewall between the nodes - you can ssh across, but not communicate. Have you checked to see that the firewall is turned off?

Turns out some iptables rules (typical on our clusters) were act
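The firewall symptom described above (ssh works, but other TCP traffic between the nodes does not) can be checked without MPI. The following is a minimal sketch, not part of the original thread: the helper names `open_listener` and `can_connect` are hypothetical, and it assumes plain IPv4 TCP on a port of your choosing. Passing `source_ip` binds the outgoing connection to one local address, which is one way to probe a single interface such as the address assigned to eth2.

```python
import socket


def open_listener(bind_ip="0.0.0.0", port=0):
    """Open a TCP listener; port=0 lets the OS pick a free port.

    Returns (socket, port). Hypothetical helper, run on the remote node.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind_ip, port))
    srv.listen(1)
    return srv, srv.getsockname()[1]


def can_connect(host, port, timeout=3.0, source_ip=None):
    """Return True if a TCP connection to host:port succeeds within timeout.

    source_ip (optional) forces the connection to originate from that
    local address, e.g. the address assigned to one interface.
    """
    source = (source_ip, 0) if source_ip else None
    try:
        with socket.create_connection(
            (host, port), timeout=timeout, source_address=source
        ):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A possible use: call `open_listener()` on one node, note the port it returns, then call `can_connect(other_node, port)` from the second node. A `False` result on a non-ssh port while ssh still works would be consistent with a firewall rule blocking the traffic, though it does not by itself identify the rule.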

Re: [OMPI users] Connection timed out with multiple nodes

2014-01-17 Thread Ralph Castain
The most common cause of this problem is a firewall between the nodes - you can ssh across, but not communicate. Have you checked to see that the firewall is turned off?

On Jan 17, 2014, at 4:59 PM, Doug Roberts wrote:

> 1) When openmpi programs run across multiple nodes they hang
> rather

[OMPI users] Connection timed out with multiple nodes

2014-01-17 Thread Doug Roberts
1) When openmpi programs run across multiple nodes they hang rather quickly as shown in the mpi_test example below. Note that I am assuming the initial topology error message is a separate issue since single node openmpi jobs run just fine. [roberpj@bro127:~/samples/mpi_test] /opt/sharcnet/op