[OMPI users] [btl_openib_component.c:1373:btl_openib_component_progress] error polling HP CQ with -2 errno says Success
Hello,
I just got some new cluster hardware :) :( I can't seem to overcome an openib problem. I get this at run time:

   error polling HP CQ with -2 errno says Success

I've tried 2 different IB switches and multiple sets of nodes, all on one switch or the other, to try to eliminate the hardware. (IPoIB pings work and IB switches ree
I've tried both v1.3.3 and v1.2.9 and get the same errors. I'm not really sure what these errors mean or how to get rid of them. My MPI application works if all the CPUs are on the same node (self btl only, probably). Any advice would be appreciated. Thanks.

asnrcw@dmc:~> qsub -I -l nodes=32,partition=dmc,feature=qc226 -q sysadm
qsub: waiting for job 232035.mds1.asc.edu to start
qsub: job 232035.mds1.asc.edu ready

# Alabama Supercomputer Center - PBS Prologue
# Your job id is                : 232035
# Your job name is              : STDIN
# Your job's queue is           : sysadm
# Your username for this job is : asnrcw
# Your group for this job is    : analyst
# Your job used :
#   8 CPUs on dmc101
#   8 CPUs on dmc102
#   8 CPUs on dmc103
#   8 CPUs on dmc104
# Your job started at : Fri Sep 25 10:20:05 CDT 2009

asnrcw@dmc101:~> cd mpiprintrank
asnrcw@dmc101:~/mpiprintrank> which mpirun
/apps/openmpi-1.3.3-intel/bin/mpirun
asnrcw@dmc101:~/mpiprintrank> mpirun ./mpiprintrank-dmc-1.3.3-intel
[dmc103][[46071,1],19][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],16][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],17][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],18][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],20][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],21][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],23][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],6][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],7][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],22][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],3][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],4][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],8][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],0][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],1][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],5][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],2][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],10][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],10][btl_openib_component.c:3047:poll_device] error polling HP CQ wit
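[Editor's note, not part of the original message: a quick way to check whether the openib BTL is the component at fault is to exclude it and let the job fall back to TCP and shared memory. This is a minimal sketch using standard Open MPI MCA syntax; the binary name is the one from the session above.]

   # Rule out the openib BTL by excluding it (falls back to tcp/sm/self):
   mpirun --mca btl ^openib ./mpiprintrank-dmc-1.3.3-intel

   # Or name the BTLs explicitly, mimicking the single-node case plus TCP:
   mpirun --mca btl self,sm,tcp ./mpiprintrank-dmc-1.3.3-intel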
Re: [OMPI users] [btl_openib_component.c:1373:btl_openib_component_progress] error polling HP CQ with -2 errno says Success
I've verified that ulimit -l is unlimited everywhere. After further testing I think the errors are related to OFED, not Open MPI. I've uninstalled the OFED that comes with SLES (1.4.0) and installed OFED 1.4.2 and 1.5-beta, and I don't get the errors. I got the idea to swap out OFED after reading this:
http://kerneltrap.org/mailarchive/openfabrics-general/2008/11/3/3903184

Under OFED 1.4.0 (from SLES 11) I had to set "options mlx4_core msi_x=0" in /etc/modprobe.conf.local to even get the mlx4 module to load. I found that advice here:
http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1254161827534+28353475&threadId=1361415
(Under 1.4.2 and 1.5-beta the modules load fine without mlx4_core msi_x=0 being set.)

Now my problem is that with OFED 1.4.2 and 1.5-beta the systems hang, the GigE network stops working, and I have to power cycle nodes to log in. I'm going to try to get some help from the OFED mailing list now.

Pavel Shamis (Pasha) wrote:
> Very strange. MPI tries to access the CQ context and it gets an immediate error.
> Please make sure that your limits configuration is OK; take a look at this FAQ:
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> Pasha.

Charles Wright wrote:
> Hello, I just got some new cluster hardware :) :( I can't seem to overcome an
> openib problem. I get this at run time:
>    error polling HP CQ with -2 errno says Success
> I've tried 2 different IB switches and multiple sets of nodes, all on one switch
> or the other, to try to eliminate the hardware. (IPoIB pings work and IB switches ree
> I've tried both v1.3.3 and v1.2.9 and get the same errors. I'm not really sure
> what these errors mean or how to get rid of them. My MPI application works if all
> the CPUs are on the same node (self btl only, probably). Any advice would be
> appreciated. Thanks.
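[Editor's note, not part of the original message: the locked-pages check in the FAQ Pasha links and the module option mentioned above boil down to two small config snippets. This is a sketch only; the wildcard limits.conf entries are the usual FAQ defaults, not taken from this thread.]

   # /etc/security/limits.conf -- allow unlimited locked memory for all users
   # (see the openfabrics FAQ entry above); verify afterwards with `ulimit -l`.
   *   soft   memlock   unlimited
   *   hard   memlock   unlimited

   # /etc/modprobe.conf.local -- workaround that was needed here only under
   # the SLES 11 OFED 1.4.0 stack to get the mlx4 module to load.
   options mlx4_core msi_x=0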
[OMPI users] mca_oob_tcp_peer_complete_connect: connection failed
m (MCA v1.0, API v1.0, Component v1.0.1)
                 MCA rml: oob (MCA v1.0, API v1.0, Component v1.0.1)
                 MCA pls: daemon (MCA v1.0, API v1.0, Component v1.0.1)
                 MCA pls: proxy (MCA v1.0, API v1.0, Component v1.0.1)
                 MCA pls: fork (MCA v1.0, API v1.0, Component v1.0.1)
                 MCA pls: rsh (MCA v1.0, API v1.0, Component v1.0.1)
                 MCA pls: slurm (MCA v1.0, API v1.0, Component v1.0.1)
                 MCA pls: tm (MCA v1.0, API v1.0, Component v1.0.1)
                 MCA sds: env (MCA v1.0, API v1.0, Component v1.0.1)
                 MCA sds: seed (MCA v1.0, API v1.0, Component v1.0.1)
                 MCA sds: singleton (MCA v1.0, API v1.0, Component v1.0.1)
                 MCA sds: slurm (MCA v1.0, API v1.0, Component v1.0.1)
                 MCA sds: pipe (MCA v1.0, API v1.0, Component v1.0.1)

uahrcw@c275-6:~/mpi-benchmarks>

--
Charles Wright, HPC Systems Administrator
Alabama Research and Education Network
Computer Sciences Corporation

config.log.bz2
Description: BZip2 compressed data
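[Editor's note, not part of the original post: the interface-selection parameters discussed in the replies below can be listed with the same ompi_info binary. A sketch, assuming the `--param <framework> <component>` syntax of Open MPI 1.0.x:]

   # List the TCP parameters of the OOB and BTL frameworks, including the
   # *_include / *_exclude interface settings used later in this thread.
   ompi_info --param oob tcp
   ompi_info --param btl tcp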
Re: [OMPI users] mca_oob_tcp_peer_complete_connect: connection failed
Thanks for the tip. I see that both number 1 and 2 are true. Open MPI is insisting on using my eth0 (I know this by watching the firewall log on the node it is trying to reach). This is despite the fact that I have the first DNS entry go to eth1; normally that is all PBS would need to do the right thing and use the network I prefer.

OK, so I see there are some options to include/exclude interfaces. However, mpiexec is ignoring my requests. I tried it two ways. Neither worked. The firewall rejects traffic coming into the 1.0.x.x network in both cases.

/opt/asn/apps/openmpi-1.0.1/bin/mpiexec --gmca btl_tcp_if_include eth1 -n 2 $XD1LAUNCHER ./mpimeasure
/opt/asn/apps/openmpi-1.0.1/bin/mpiexec --gmca btl_tcp_if_exclude eth0 -n 2 $XD1LAUNCHER ./mpimeasure

(see, DNS works... not over eth0)

uahrcw@c344-6:~/mpi-benchmarks> /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0E:AB:01:58:60
          inet addr:1.0.21.134  Bcast:1.127.255.255  Mask:255.128.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:6596091 errors:0 dropped:0 overruns:0 frame:0
          TX packets:316165 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:560395541 (534.4 Mb)  TX bytes:34367848 (32.7 Mb)
          Interrupt:16

eth1      Link encap:Ethernet  HWaddr 00:0E:AB:01:58:61
          inet addr:1.128.21.134  Mask:255.128.0.0
          UP RUNNING NOARP  MTU:1500  Metric:1
          RX packets:5600487 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4863441 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:6203028277 (5915.6 Mb)  TX bytes:566471561 (540.2 Mb)
          Interrupt:25

eth2      Link encap:Ethernet  HWaddr 00:0E:AB:01:58:62
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:829064 errors:0 dropped:0 overruns:0 frame:0
          TX packets:181572 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:61216408 (58.3 Mb)  TX bytes:19079579 (18.1 Mb)
          Base address:0x2000 Memory:fea8-feaa

eth2:2    Link encap:Ethernet  HWaddr 00:0E:AB:01:58:62
          inet addr:129.66.9.146  Bcast:129.66.9.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Base address:0x2000 Memory:fea8-feaa

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:14259 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14259 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:879631 (859.0 Kb)  TX bytes:879631 (859.0 Kb)

uahrcw@c344-6:~/mpi-benchmarks> ping c344-5
PING c344-5.x.asc.edu (1.128.21.133) 56(84) bytes of data.
64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=1 ttl=64 time=0.067 ms
64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=2 ttl=64 time=0.037 ms
64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=3 ttl=64 time=0.022 ms

--- c344-5.x.asc.edu ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.022/0.042/0.067/0.018 ms

George Bosilca wrote:
> I see only 2 possibilities:
> 1. you're trying to run Open MPI on nodes having multiple IP addresses.
> 2. your nodes are behind firewalls and Open MPI is unable to pass through.
>
> Please check the FAQ on http://www.open-mpi.org/faq/ to find out the full
> answer to your question.
>
>   Thanks,
>     george.
>
> On Thu, 16 Mar 2006, Charles Wright wrote:
>
>> Hello,
>> I've just compiled open-mpi and tried to run my code, which just measures
>> bandwidth from one node to another. (The code compiles fine and runs under
>> other MPI implementations.)
>>
>> When I did, I got this:
>>
>> uahrcw@c275-6:~/mpi-benchmarks> cat openmpitcp.o15380
>> c317-6
>> c317-5
>> [c317-5:24979] [0,0,2]-[0,0,0] mca_oob_tcp_peer_complete_connect:
>> connection failed (errno=110) - retrying (pid=24979)
>> [c317-5:24979] mca_oob_tcp_peer_timer_handler
>> [c317-5:24997] [0,1,1]-[0,0,0] mca_oob_tcp_peer_complete_connect:
>> connection failed (errno=110) - retrying (pid=24997)
>> [c317-5:24997] mca_oob_tcp_peer_timer_handler
>>
>> [0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=110
>>
>> I compiled open-mpi with PBSPro 5.4-4 and I'm guessing that has something
>> to do with it.
>>
>> I've attached my config.log.
>>
>> Any help with this would be appreciated.
>>
>> uahrcw@c275-6:~/mpi-benchmarks> ompi_info
>>                 Open MPI: 1.0.1r8453
>>    Open MPI SVN revision: r8453
>>                 Open RTE: 1.0.1r8453
>>    Open RTE SVN revision: r
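[Editor's note, not part of the original exchange: errno 110 is ETIMEDOUT, which usually means a firewall is silently dropping the connection attempts. A rough sketch of one way to confirm that between two nodes; the port number and the peer address are placeholders, not values from this thread.]

   # On the receiving node, listen on an arbitrary high port
   # (traditional/GNU netcat shown; OpenBSD netcat omits the -p):
   nc -l -p 5000

   # From the sending node, connect to the peer's eth0 address; if the
   # firewall drops the traffic this hangs and eventually times out,
   # matching the errno=110 seen above:
   nc <peer-eth0-address> 5000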
Re: [OMPI users] mca_oob_tcp_peer_complete_connect: connection failed
That works!! Thanks!!

George Bosilca wrote:
> Sorry I wasn't clear enough in my previous post. The error messages that you
> get are coming from the OOB, which is the framework we're using to set up the
> MPI run. The options that you used (btl_tcp_if_include) are only used for MPI
> communications. Please add "--mca oob_tcp_include eth1" to force the OOB
> framework to use eth1. So that you don't have to type all these options every
> time, you can add them to the $HOME/.openmpi/mca-params.conf file. A file
> containing:
>
> oob_tcp_include=eth1
> btl_tcp_if_include=eth1
>
> should solve your problems, if the firewall is open on eth1 between these
> nodes.
>
>   Thanks,
>     george.
>
> On Thu, 16 Mar 2006, Charles Wright wrote:
>
>> Thanks for the tip.
>>
>> I see that both number 1 and 2 are true.
>> Open MPI is insisting on using my eth0 (I know this by watching the firewall
>> log on the node it is trying to reach).
>>
>> This is despite the fact that I have the first DNS entry go to eth1; normally
>> that is all PBS would need to do the right thing and use the network I prefer.
>>
>> OK, so I see there are some options to include/exclude interfaces.
>>
>> However, mpiexec is ignoring my requests.
>> I tried it two ways. Neither worked. The firewall rejects traffic coming into
>> the 1.0.x.x network in both cases.
>>
>> /opt/asn/apps/openmpi-1.0.1/bin/mpiexec --gmca btl_tcp_if_include eth1 -n 2 $XD1LAUNCHER ./mpimeasure
>> /opt/asn/apps/openmpi-1.0.1/bin/mpiexec --gmca btl_tcp_if_exclude eth0 -n 2 $XD1LAUNCHER ./mpimeasure
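[Editor's note, not part of the original exchange: to tie the thread together, here is a sketch of the two equivalent ways to apply George's fix. The installation path, launcher variable, and benchmark name are the ones used earlier in the thread; the rest is standard Open MPI MCA syntax.]

   # One-off, on the mpiexec command line:
   /opt/asn/apps/openmpi-1.0.1/bin/mpiexec --mca oob_tcp_include eth1 \
       --mca btl_tcp_if_include eth1 -n 2 $XD1LAUNCHER ./mpimeasure

   # Or persistently, in $HOME/.openmpi/mca-params.conf:
   oob_tcp_include=eth1
   btl_tcp_if_include=eth1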