[OMPI users] [btl_openib_component.c:1373:btl_openib_component_progress] error polling HP CQ with -2 errno says Success

2009-09-25 Thread Charles Wright

Hello,
   I just got some new cluster hardware :)  :(

I can't seem to overcome an openib problem.
I get this at run time:

error polling HP CQ with -2 errno says Success

I've tried two different IB switches and multiple sets of nodes, all on one 
switch or the other, to try to eliminate the hardware.   (IPoIB pings 
work and IB switches ree
I've tried both v1.3.3 and v1.2.9 and get the same errors. I'm not 
really sure what these errors mean or how to get rid of them.
My MPI application works if all the CPUs are on the same node (self btl 
only, probably).
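
One quick way to confirm that (a sketch, assuming the sm BTL was built in)
is to pin the BTL selection explicitly and compare:

mpirun --mca btl self,sm ./mpiprintrank-dmc-1.3.3-intel      (no InfiniBand; should work)
mpirun --mca btl openib,self ./mpiprintrank-dmc-1.3.3-intel  (forces openib; should reproduce the error)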


Any advice would be appreciated.  
Thanks.


asnrcw@dmc:~> qsub -I -l nodes=32,partition=dmc,feature=qc226 -q sysadm
qsub: waiting for job 232035.mds1.asc.edu to start
qsub: job 232035.mds1.asc.edu ready


# Alabama Supercomputer Center - PBS Prologue
# Your job id is : 232035
# Your job name is : STDIN
# Your job's queue is : sysadm
# Your username for this job is : asnrcw
# Your group for this job is : analyst
# Your job used : 
#   8 CPUs on dmc101
#   8 CPUs on dmc102
#   8 CPUs on dmc103
#   8 CPUs on dmc104
# Your job started at : Fri Sep 25 10:20:05 CDT 2009

asnrcw@dmc101:~> 
asnrcw@dmc101:~> 
asnrcw@dmc101:~> 
asnrcw@dmc101:~> 
asnrcw@dmc101:~> cd mpiprintrank

asnrcw@dmc101:~/mpiprintrank> which mpirun
/apps/openmpi-1.3.3-intel/bin/mpirun
asnrcw@dmc101:~/mpiprintrank> mpirun ./mpiprintrank-dmc-1.3.3-intel 
[dmc103][[46071,1],19][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],16][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],17][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],18][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],20][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],21][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],23][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],6][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],7][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],22][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],3][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],4][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],8][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],0][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],1][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],5][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],2][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],10][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],10][btl_openib_component.c:3047:poll_device] error polling HP CQ wit

Re: [OMPI users] [btl_openib_component.c:1373:btl_openib_component_progress] error polling HP CQ with -2 errno says Success

2009-09-28 Thread Charles Wright

I've verified that ulimit -l is unlimited everywhere.
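
For what it's worth, the check amounts to something like this (node names are
just the ones from the job above; non-interactive shells can report different
limits than interactive logins, which is what the FAQ below warns about):

for n in dmc101 dmc102 dmc103 dmc104; do ssh $n 'hostname; ulimit -l'; done

Every node should report "unlimited".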

After further testing I think the errors are related to OFED, not Open MPI.
I've uninstalled the OFED that comes with SLES (1.4.0) and installed 
OFED 1.4.2 and 1.5-beta and I don't get the errors.


I got the idea to swap out OFED after reading this:
http://kerneltrap.org/mailarchive/openfabrics-general/2008/11/3/3903184

Under OFED 1.4.0 (from SLES 11) I had to set options mlx4_core msi_x=0 
in /etc/modprobe.conf.local to even get the mlx4 module to load.

I found that advice here:
http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1254161827534+28353475&threadId=1361415
(Under 1.4.2 and 1.5-Beta the modules load fine without mlx4_core 
msi_x=0 being set)
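
(For anyone hitting the same thing, the workaround amounts to something like
this as root; treat it as a sketch, since the exact reload steps depend on the
setup:

echo "options mlx4_core msi_x=0" >> /etc/modprobe.conf.local

then reboot, or unload and reload the mlx4_core/mlx4_ib modules, for the
option to take effect.)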


Now my problem is that with OFED 1.4.2 and 1.5-beta the systems hang, the 
GigE network stops working, and I have to power-cycle nodes to log back in.


I'm going to try to get some help from the OFED mailing list now. 
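
For anyone chasing something similar, the verbs-level sanity checks I'd try
(assuming the standard OFED utilities are installed) are along these lines;
they exercise the HCA and fabric without Open MPI in the picture:

ibstat                      (port state should be Active)
ibv_devinfo                 (HCA and firmware details)
ibv_rc_pingpong             (run on dmc101)
ibv_rc_pingpong dmc101      (run on dmc102, pointing at the first node)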


Pavel Shamis (Pasha) wrote:

Very strange. MPI tries to access the CQ context and gets an immediate error.
Please make sure that your limits configuration is OK; take a look at 
this FAQ - 
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
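
For reference, the locked-memory settings that FAQ describes come down to
entries along these lines in /etc/security/limits.conf (or the distro's
equivalent) on every compute node, plus making sure whatever starts the MPI
processes (the PBS MOM, sshd, etc.) actually inherits them:

* soft memlock unlimited
* hard memlock unlimited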


Pasha.


Charles Wright wrote:

Hello,
   I just got some new cluster hardware :)  :(

I can't seem to overcome an openib problem.
I get this at run time:

error polling HP CQ with -2 errno says Success

I've tried two different IB switches and multiple sets of nodes, all on 
one switch or the other, to try to eliminate the hardware.   (IPoIB 
pings work and IB switches ree
I've tried both v1.3.3 and v1.2.9 and get the same errors. I'm not 
really sure what these errors mean or how to get rid of them.
My MPI application works if all the CPUs are on the same node (self 
btl only, probably).


Any advice would be appreciated.  Thanks.


[OMPI users] mca_oob_tcp_peer_complete_connect: connection failed

2006-03-16 Thread Charles Wright
m (MCA v1.0, API v1.0, Component v1.0.1)
 MCA rml: oob (MCA v1.0, API v1.0, Component v1.0.1)
 MCA pls: daemon (MCA v1.0, API v1.0, Component v1.0.1)
 MCA pls: proxy (MCA v1.0, API v1.0, Component v1.0.1)
 MCA pls: fork (MCA v1.0, API v1.0, Component v1.0.1)
 MCA pls: rsh (MCA v1.0, API v1.0, Component v1.0.1)
 MCA pls: slurm (MCA v1.0, API v1.0, Component v1.0.1)
 MCA pls: tm (MCA v1.0, API v1.0, Component v1.0.1)
 MCA sds: env (MCA v1.0, API v1.0, Component v1.0.1)
 MCA sds: seed (MCA v1.0, API v1.0, Component v1.0.1)
 MCA sds: singleton (MCA v1.0, API v1.0, Component v1.0.1)
 MCA sds: slurm (MCA v1.0, API v1.0, Component v1.0.1)
 MCA sds: pipe (MCA v1.0, API v1.0, Component v1.0.1)
uahrcw@c275-6:~/mpi-benchmarks>


-- 
Charles Wright, HPC Systems Administrator
Alabama Research and Education Network
Computer Sciences Corporation 



config.log.bz2
Description: BZip2 compressed data


Re: [OMPI users] mca_oob_tcp_peer_complete_connect: connection failed

2006-03-16 Thread Charles Wright
Thanks for the tip.

I see that both number 1 and 2 are true.
Open MPI is insisting on using my eth0 (I know this from watching the
firewall log on the node it is trying to reach).

This is despite the fact that the first DNS entry points to eth1;
normally that is all PBS would need to do the right thing and use the
network I prefer.

OK, so I see there are some options to include/exclude interfaces.

However, mpiexec is ignoring my requests.
I tried it two ways; neither worked. The firewall rejects traffic coming
into the 1.0.x.x network in both cases.

/opt/asn/apps/openmpi-1.0.1/bin/mpiexec --gmca btl_tcp_if_include eth1
-n 2 $XD1LAUNCHER ./mpimeasure
/opt/asn/apps/openmpi-1.0.1/bin/mpiexec --gmca btl_tcp_if_exclude eth0
-n 2 $XD1LAUNCHER ./mpimeasure

(See below: DNS works, and it resolves to the eth1 address, not eth0.)
uahrcw@c344-6:~/mpi-benchmarks> /sbin/ifconfig
eth0  Link encap:Ethernet  HWaddr 00:0E:AB:01:58:60
  inet addr:1.0.21.134  Bcast:1.127.255.255  Mask:255.128.0.0
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:6596091 errors:0 dropped:0 overruns:0 frame:0
  TX packets:316165 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:560395541 (534.4 Mb)  TX bytes:34367848 (32.7 Mb)
  Interrupt:16

eth1  Link encap:Ethernet  HWaddr 00:0E:AB:01:58:61
  inet addr:1.128.21.134  Mask:255.128.0.0
  UP RUNNING NOARP  MTU:1500  Metric:1
  RX packets:5600487 errors:0 dropped:0 overruns:0 frame:0
  TX packets:4863441 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:6203028277 (5915.6 Mb)  TX bytes:566471561 (540.2 Mb)
  Interrupt:25

eth2  Link encap:Ethernet  HWaddr 00:0E:AB:01:58:62
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:829064 errors:0 dropped:0 overruns:0 frame:0
  TX packets:181572 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:61216408 (58.3 Mb)  TX bytes:19079579 (18.1 Mb)
  Base address:0x2000 Memory:fea8-feaa

eth2:2Link encap:Ethernet  HWaddr 00:0E:AB:01:58:62
  inet addr:129.66.9.146  Bcast:129.66.9.255  Mask:255.255.255.0
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  Base address:0x2000 Memory:fea8-feaa

loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:14259 errors:0 dropped:0 overruns:0 frame:0
  TX packets:14259 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:879631 (859.0 Kb)  TX bytes:879631 (859.0 Kb)

uahrcw@c344-6:~/mpi-benchmarks> ping c344-5
PING c344-5.x.asc.edu (1.128.21.133) 56(84) bytes of data.
64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=1 ttl=64
time=0.067 ms
64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=2 ttl=64
time=0.037 ms
64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=3 ttl=64
time=0.022 ms

--- c344-5.x.asc.edu ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.022/0.042/0.067/0.018 ms



George Bosilca wrote:
>I see only 2 possibilities:
>1. you're trying to run Open MPI on nodes having multiple IP addresses.
>2. your nodes are behind firewalls and Open MPI is unable to pass through.
>
>Please check the FAQ on http://www.open-mpi.org/faq/ to find out the full 
>answer to your question.
>
>   Thanks,
> george.
>
>On Thu, 16 Mar 2006, Charles Wright wrote:
>
>  
>>Hello,
>>   I've just compiled Open MPI and tried to run my code, which just
>>measures bandwidth from one node to another.   (The code compiles fine and
>>runs under other MPI implementations.)
>>
>>When I did I got this.
>>
>>uahrcw@c275-6:~/mpi-benchmarks> cat openmpitcp.o15380
>>c317-6
>>c317-5
>>[c317-5:24979] [0,0,2]-[0,0,0] mca_oob_tcp_peer_complete_connect:
>>connection failed (errno=110) - retrying (pid=24979)
>>[c317-5:24979] mca_oob_tcp_peer_timer_handler
>>[c317-5:24997] [0,1,1]-[0,0,0] mca_oob_tcp_peer_complete_connect:
>>connection failed (errno=110) - retrying (pid=24997)
>>[c317-5:24997] mca_oob_tcp_peer_timer_handler
>>
>>[0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
>>connect() failed with errno=110
>>
>>
>>I compiled Open MPI with PBS Pro 5.4-4 and I'm guessing that has
>>something to do with it.
>>
>>I've attached my config.log
>>
>>Any help with this would be appreciated.
>>
>>uahrcw@c275-6:~/mpi-benchmarks> ompi_info
>>   Open MPI: 1.0.1r8453
>>  Open MPI SVN revision: r8453
>>   Open RTE: 1.0.1r8453
>>  Open RTE SVN revision: r

Re: [OMPI users] mca_oob_tcp_peer_complete_connect: connection failed

2006-03-16 Thread Charles Wright
That works!!
Thanks!!
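
For the record, the combination George describes below boils down to
something like this (a sketch; eth1 is simply our preferred interface),
either in $HOME/.openmpi/mca-params.conf:

oob_tcp_include=eth1
btl_tcp_if_include=eth1

or, equivalently, on the command line:

/opt/asn/apps/openmpi-1.0.1/bin/mpiexec --mca oob_tcp_include eth1 \
  --mca btl_tcp_if_include eth1 -n 2 $XD1LAUNCHER ./mpimeasure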

George Bosilca wrote:
>Sorry I wasn't clear enough in my previous post. The error messages that 
>you get are coming from the OOB, which is the framework we're using to 
>set up the MPI run. The options that you used (btl_tcp_if_include) are only 
>used for MPI communications. Please add "--mca oob_tcp_include eth1" to 
>force the OOB framework to use eth1. To avoid having to type all of 
>these options every time, you can add them to the 
>$HOME/.openmpi/mca-params.conf file. A file containing:
>
>oob_tcp_include=eth1
>btl_tcp_if_include=eth1
>
>should solve your problems, provided the firewall is open on eth1 between 
>these nodes.
>
>   Thanks,
> george.