You were right about iptables being very complex. It seems that uninstalling it 
completely did the trick. All my Send/Receive operations now complete as they 
should. Just one more question: will uninstalling iptables have any undesired 
effects on my Linux cluster? 
 
Thanks!
Adrian
 

________________________________
 From: Jeff Squyres <jsquy...@cisco.com>
To: adrian sabou <adrian.sa...@yahoo.com> 
Sent: Friday, February 3, 2012 12:30 PM
Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
  
On Feb 3, 2012, at 5:21 AM, adrian sabou wrote:

> There is no iptables in my /etc/init.d.

It might be different in different OS's -- my RedHat-based system has 
/etc/init.d/iptables.

Perhaps try uninstalling iptables using your local package manager (rpm, yum, 
apt, ...whatever).

> It's most probably a communication issue between the nodes. However, I have 
> no idea what it might be. It's weird, though, that the first Send/Receive 
> pair works and only subsequent pairs fail. Anyway, thank you for taking the 
> time to help me out. I am grateful!
>  
> Adrian
> 
> From: Jeff Squyres <jsquy...@cisco.com>
> To: adrian sabou <adrian.sa...@yahoo.com>; Open MPI Users 
> <us...@open-mpi.org> 
> Sent: Thursday, February 2, 2012 11:19 PM
> Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
> 
> When you run without a hostfile, you're likely only running on a single node 
> via shared memory (unless you're running inside a SLURM job, which is 
> unlikely, given the context of your mails).  
> 
> When you're running in SLURM, I'm guessing that you're running across 
> multiple nodes.  Are you using TCP as your MPI transport?
> 
> If so, I would still recommend trying to stop iptables altogether -- 
> /etc/init.d/iptables stop.  It might not make a difference, but I've found 
> iptables to be sufficiently complex that it's easier to take that variable 
> out of the equation entirely by stopping it, to really, really test whether 
> that's the problem.
> 
> 
> 
> On Feb 2, 2012, at 9:48 AM, adrian sabou wrote:
> 
> > Hi,
> >  
> > I have disabled iptables on all nodes using:
> >  
> > iptables -F
> > iptables -X
> > iptables -t nat -F
> > iptables -t nat -X
> > iptables -t mangle -F
> > iptables -t mangle -X
> > iptables -P INPUT ACCEPT
> > iptables -P FORWARD ACCEPT
> > iptables -P OUTPUT ACCEPT
> >  
> > My problem is still there. I have re-enabled iptables. The current output 
> > of the "iptables --list" command is:
> >  
> > Chain INPUT (policy ACCEPT)
> > target    prot opt source              destination
> > ACCEPT    udp  --  anywhere            anywhere            udp dpt:domain
> > ACCEPT    tcp  --  anywhere            anywhere            tcp dpt:domain
> > ACCEPT    udp  --  anywhere            anywhere            udp dpt:bootps
> > ACCEPT    tcp  --  anywhere            anywhere            tcp dpt:bootps
> >
> > Chain FORWARD (policy ACCEPT)
> > target    prot opt source              destination
> > ACCEPT    all  --  anywhere            192.168.122.0/24    state RELATED,ESTABLISHED
> > ACCEPT    all  --  192.168.122.0/24    anywhere
> > ACCEPT    all  --  anywhere            anywhere
> > REJECT    all  --  anywhere            anywhere            reject-with icmp-port-unreachable
> > REJECT    all  --  anywhere            anywhere            reject-with icmp-port-unreachable
> >
> > Chain OUTPUT (policy ACCEPT)
> > target    prot opt source              destination
> >
> > I don't think this is it. I have tried to run a simple ping-pong program 
> > that I found (it keeps bouncing a value between two processes) and I keep 
> > getting the same result: the first Send/Receive pair (p1 sends to p2, 
> > p2 receives and sends back to p1, p1 receives) works, and after that the 
> > program just blocks. However, like all the other examples, it works if 
> > I launch it with "mpirun -np 2 <ping-pong>" and bounces the value 100 times.
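> >
> > (The ping-pong is essentially the pattern below -- a rough sketch rather 
> > than the exact program I found, so the variable name, the increment, and 
> > the final print are only placeholders:)
> >
> > #include <mpi.h>
> > #include <stdio.h>
> >
> > int main(int argc, char **argv)
> > {
> >     int rank, value = 0, i;
> >
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >
> >     for (i = 0; i < 100; i++) {
> >         if (rank == 0) {
> >             /* rank 0 sends the value, then waits for it to come back */
> >             MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
> >             MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
> >                      MPI_STATUS_IGNORE);
> >         } else if (rank == 1) {
> >             /* rank 1 receives, bumps the value, and sends it back */
> >             MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
> >                      MPI_STATUS_IGNORE);
> >             value++;
> >             MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
> >         }
> >     }
> >
> >     if (rank == 0)
> >         printf("final value = %d\n", value);
> >
> >     MPI_Finalize();
> >     return 0;
> > }
> >
> > Under SLURM, the first iteration's Send/Receive completes and the second 
> > one blocks; with plain "mpirun -np 2" the loop runs all 100 iterations.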
> >  
> > Adrian
> > From: Jeff Squyres <jsquy...@cisco.com>
> > To: adrian sabou <adrian.sa...@yahoo.com>; Open MPI Users 
> > <us...@open-mpi.org> 
> > Sent: Thursday, February 2, 2012 3:09 PM
> > Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
> > 
> > Have you disabled iptables (firewalling) on your nodes?
> > 
> > Or, if you want to leave iptables enabled, set it such that all nodes in 
> > your cluster are allowed to open TCP connections from any port to any other 
> > port.
> > 
> > 
> > 
> > 
> > On Feb 2, 2012, at 4:49 AM, adrian sabou wrote:
> > 
> > > Hi,
> > > 
> > > The only example that works is hello_c.c. All the others that use MPI_Send 
> > > and MPI_Recv (connectivity_c.c and ring_c.c) block after the first 
> > > MPI_Send/MPI_Recv (although the first Send/Receive pair works well for 
> > > all processes, subsequent Send/Receive pairs block). My SLURM version is 
> > > 2.1.0. It is also worth mentioning that all examples work when not using 
> > > SLURM (launching with "mpirun -np 5 <example_app>"). Blocking occurs only 
> > > when I try to run on multiple hosts with SLURM ("salloc -N5 mpirun 
> > > <example_app>").
> > > 
> > > Adrian
> > > 
> > > From: Jeff Squyres <jsquy...@cisco.com>
> > > To: adrian sabou <adrian.sa...@yahoo.com>; Open MPI Users 
> > > <us...@open-mpi.org> 
> > > Sent: Wednesday, February 1, 2012 10:32 PM
> > > Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
> > > 
> > > On Jan 31, 2012, at 11:16 AM, adrian sabou wrote:
> > > 
> > > > Like I said, a very simple program.
> > > > When launching this application with SLURM (using "salloc -N2 mpirun 
> > > > ./<my_app>"), it hangs at the barrier.
> > > 
> > > Are you able to run the MPI example programs in examples/ ?
> > > 
> > > > However, it passes the barrier if I launch it without SLURM (using 
> > > > "mpirun -np 2 ./<my_app>"). I first noticed this problem when my 
> > > > application hung when I tried to send two successive messages from one 
> > > > process to another. Only the first MPI_Send would work; the second 
> > > > MPI_Send would block indefinitely. I was wondering whether any of you 
> > > > have encountered a similar problem, or might have an idea as to what is 
> > > > causing the Send/Receive pair to block when using SLURM. The exact 
> > > > output in my console is as follows:
> > > >  
> > > >        salloc: Granted job allocation 1138
> > > >        Process 0 - Sending...
> > > >        Process 1 - Receiving...
> > > >        Process 1 - Received.
> > > >        Process 1 - Barrier reached.
> > > >        Process 0 - Sent.
> > > >        Process 0 - Barrier reached.
> > > >        (it just hangs here)
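> > > >
> > > > (The program is essentially the following -- a rough sketch rather than 
> > > > the exact code, so the payload value, the tag, and the exact prints are 
> > > > only placeholders:)
> > > >
> > > > #include <mpi.h>
> > > > #include <stdio.h>
> > > >
> > > > int main(int argc, char **argv)
> > > > {
> > > >     int rank, value = 42;
> > > >
> > > >     MPI_Init(&argc, &argv);
> > > >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > > >
> > > >     if (rank == 0) {
> > > >         printf("Process 0 - Sending...\n");
> > > >         MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
> > > >         printf("Process 0 - Sent.\n");
> > > >     } else if (rank == 1) {
> > > >         printf("Process 1 - Receiving...\n");
> > > >         MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
> > > >                  MPI_STATUS_IGNORE);
> > > >         printf("Process 1 - Received.\n");
> > > >     }
> > > >
> > > >     printf("Process %d - Barrier reached.\n", rank);
> > > >     MPI_Barrier(MPI_COMM_WORLD);  /* under salloc, this never returns */
> > > >
> > > >     MPI_Finalize();
> > > >     return 0;
> > > > }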
> > > >  
> > > > I am new to MPI programming and to OpenMPI and would greatly appreciate 
> > > > any help. My OpenMPI version is 1.4.4 (although I have also tried it on 
> > > > 1.5.4), my SLURM version is 0.3.3-1 (slurm-llnl 2.1.0-1),
> > > 
> > > I'm not sure what SLURM version that is -- my "srun --version" shows 
> > > 2.2.4.  0.3.3 would be pretty ancient, no?
> > > 
> > > -- 
> > > Jeff Squyres
> > > jsquy...@cisco.com
> > > For corporate legal information go to:
> > > http://www.cisco.com/web/about/doing_business/legal/cri/
> > > 
> > > 
> > > 
> > > _______________________________________________
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > 
> > 
> > -- 
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> > 
> > 
> > 
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
