You were right about iptables being very complex. It seems that uninstalling it completely did the trick. All my Send / Receive operations now complete as they should. Just one more question: will uninstalling iptables have any undesired effects on my Linux cluster?

Thanks!

Adrian
________________________________
From: Jeff Squyres <jsquy...@cisco.com>
To: adrian sabou <adrian.sa...@yahoo.com>
Sent: Friday, February 3, 2012 12:30 PM
Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking

On Feb 3, 2012, at 5:21 AM, adrian sabou wrote:

> There is no iptables in my /etc/init.d.

It might be different in different OS's -- my RedHat-based system has /etc/init.d/iptables. Perhaps try uninstalling iptables using your local package manager (rpm, yum, apt, ...whatever).

> It's most probably a communication issue between the nodes. However, I have no idea what it might be. It's weird, though, that the first Send / Receive pair works and only subsequent pairs fail. Anyway, thank you for taking the time to help me out. I am grateful!
>
> Adrian
>
> From: Jeff Squyres <jsquy...@cisco.com>
> To: adrian sabou <adrian.sa...@yahoo.com>; Open MPI Users <us...@open-mpi.org>
> Sent: Thursday, February 2, 2012 11:19 PM
> Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
>
> When you run without a hostfile, you're likely only running on a single node via shared memory (unless you're running inside a SLURM job, which is unlikely, given the context of your mails).
>
> When you're running in SLURM, I'm guessing that you're running across multiple nodes. Are you using TCP as your MPI transport?
>
> If so, I would still recommend stopping iptables altogether -- /etc/init.d/iptables stop. It might not make a difference, but I've found iptables to be sufficiently complex that it's easier to take that variable out altogether by stopping it, to really, really test whether that's the problem.
>
> On Feb 2, 2012, at 9:48 AM, adrian sabou wrote:
>
> > Hi,
> >
> > I have disabled iptables on all nodes using:
> >
> > iptables -F
> > iptables -X
> > iptables -t nat -F
> > iptables -t nat -X
> > iptables -t mangle -F
> > iptables -t mangle -X
> > iptables -P INPUT ACCEPT
> > iptables -P FORWARD ACCEPT
> > iptables -P OUTPUT ACCEPT
> >
> > My problem is still there. I have re-enabled iptables. The current output of the "iptables --list" command is:
> >
> > Chain INPUT (policy ACCEPT)
> > target     prot opt source              destination
> > ACCEPT     udp  --  anywhere            anywhere            udp dpt:domain
> > ACCEPT     tcp  --  anywhere            anywhere            tcp dpt:domain
> > ACCEPT     udp  --  anywhere            anywhere            udp dpt:bootps
> > ACCEPT     tcp  --  anywhere            anywhere            tcp dpt:bootps
> >
> > Chain FORWARD (policy ACCEPT)
> > target     prot opt source              destination
> > ACCEPT     all  --  anywhere            192.168.122.0/24    state RELATED,ESTABLISHED
> > ACCEPT     all  --  192.168.122.0/24    anywhere
> > ACCEPT     all  --  anywhere            anywhere
> > REJECT     all  --  anywhere            anywhere            reject-with icmp-port-unreachable
> > REJECT     all  --  anywhere            anywhere            reject-with icmp-port-unreachable
> >
> > Chain OUTPUT (policy ACCEPT)
> > target     prot opt source              destination
> >
> > I don't think this is it. I have tried to run a simple ping-pong program that I found (it keeps bouncing a value between two processes) and I keep getting the same result: the first Send / Receive pair (p1 sends to p2, p2 receives and sends back to p1, p1 receives) works, and after that the program just blocks. However, like all other examples, this one works if I launch it with "mpirun -np 2 <ping-pong>" and bounces the value 100 times.
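For context, a minimal two-process ping-pong of the kind Adrian describes might look like the sketch below. This is an illustrative reconstruction, not the actual program he ran; the iteration count, tag, payload, and prints are assumptions. Launched as "mpirun -np 2 ./pingpong" (likely a single node over shared memory, per Jeff's earlier note) it completes all 100 bounces; the reported failure under SLURM is that only the first exchange completes before the program blocks.

    /* pingpong.c -- hypothetical sketch, assuming exactly 2 ranks.
     * Rank 0 sends an integer to rank 1, rank 1 increments it and
     * sends it back, repeated 100 times. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < 100; i++) {
            if (rank == 0) {
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("bounce %d: value = %d\n", i, value);
            } else if (rank == 1) {
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                value++;
                MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }

In the failing configuration only the first iteration's exchange completes, which matches Adrian's observation that the first Send / Receive pair works and every subsequent one blocks.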
> >
> > Adrian
> >
> > From: Jeff Squyres <jsquy...@cisco.com>
> > To: adrian sabou <adrian.sa...@yahoo.com>; Open MPI Users <us...@open-mpi.org>
> > Sent: Thursday, February 2, 2012 3:09 PM
> > Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
> >
> > Have you disabled iptables (firewalling) on your nodes?
> >
> > Or, if you want to leave iptables enabled, set it such that all nodes in your cluster are allowed to open TCP connections from any port to any other port.
> >
> > On Feb 2, 2012, at 4:49 AM, adrian sabou wrote:
> >
> > > Hi,
> > >
> > > The only example that works is hello_c.c. All others that use MPI_Send and MPI_Recv (connectivity_c.c and ring_c.c) block after the first MPI_Send / MPI_Recv (although the first Send/Receive pair works well for all processes, subsequent Send/Receive pairs block). My SLURM version is 2.1.0. It is also worth mentioning that all examples work when not using SLURM (launching with "mpirun -np 5 <example_app>"). Blocking occurs only when I try to run on multiple hosts with SLURM ("salloc -N5 mpirun <example_app>").
> > >
> > > Adrian
> > >
> > > From: Jeff Squyres <jsquy...@cisco.com>
> > > To: adrian sabou <adrian.sa...@yahoo.com>; Open MPI Users <us...@open-mpi.org>
> > > Sent: Wednesday, February 1, 2012 10:32 PM
> > > Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
> > >
> > > On Jan 31, 2012, at 11:16 AM, adrian sabou wrote:
> > >
> > > > Like I said, a very simple program.
> > > > When launching this application with SLURM (using "salloc -N2 mpirun ./<my_app>"), it hangs at the barrier.
> > >
> > > Are you able to run the MPI example programs in examples/ ?
> > >
> > > > However, it passes the barrier if I launch it without SLURM (using "mpirun -np 2 ./<my_app>"). I first noticed this problem when my application hung if I tried to send two successive messages from one process to another. Only the first MPI_Send would work; the second MPI_Send would block indefinitely. I was wondering whether any of you have encountered a similar problem, or may have an idea as to what is causing the Send/Receive pair to block when using SLURM. The exact output in my console is as follows:
> > > >
> > > > salloc: Granted job allocation 1138
> > > > Process 0 - Sending...
> > > > Process 1 - Receiving...
> > > > Process 1 - Received.
> > > > Process 1 - Barrier reached.
> > > > Process 0 - Sent.
> > > > Process 0 - Barrier reached.
> > > > (it just hangs here)
> > > >
> > > > I am new to MPI programming and to OpenMPI and would greatly appreciate any help. My OpenMPI version is 1.4.4 (although I have also tried it on 1.5.4), my SLURM version is 0.3.3-1 (slurm-llnl 2.1.0-1),
> > >
> > > I'm not sure what SLURM version that is -- my "srun --version" shows 2.2.4. 0.3.3 would be pretty ancient, no?
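The console output quoted above ("Process 0 - Sending..." through "Process 0 - Barrier reached.", then a hang) is consistent with a test program shaped roughly like the following sketch: one MPI_Send / MPI_Recv pair followed by an MPI_Barrier. This is a hypothetical reconstruction; the payload, tag, and exact source are assumptions, not Adrian's code.

    /* simple_send_recv.c -- hypothetical sketch, assuming exactly 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 42;   /* arbitrary payload -- an assumption */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            printf("Process 0 - Sending...\n");
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            printf("Process 0 - Sent.\n");
        } else if (rank == 1) {
            printf("Process 1 - Receiving...\n");
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Process 1 - Received.\n");
        }

        /* Both ranks print before entering the barrier, which matches the
         * quoted output: both "Barrier reached." lines appear and the job
         * then hangs inside the collective. */
        printf("Process %d - Barrier reached.\n", rank);
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

Under "mpirun -np 2 ./<my_app>" a program like this completes; under "salloc -N2 mpirun ./<my_app>" Adrian reports it hangs at the barrier, which, per the rest of the thread, turned out to be iptables interfering with the TCP traffic Open MPI uses between nodes.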