Jeff Squyres wrote:
On Jan 7, 2009, at 6:28 PM, Biagio Lucini wrote:
[[5963,1],13][btl_openib_component.c:2893:handle_wc] from node24 to:
node11 error polling LP CQ with status RECEIVER NOT READY RETRY
EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
Ah! If we're dealing with an RNR retry exceeded error, this is *usually* a
physical layer problem on the IB fabric.
Have you run a complete layer 0 / physical set of diagnostics on the
fabric to know that it is completely working properly?
Once again, apologies for the delayed answer, but I always need to find
a free slot to perform checks without disrupting the activity of the
other users, who seem to be happy with the present status (this includes
the other users of InfiniBand).
What I have done is to run the Intel MPI Benchmark (IMB) in stress mode
over 40 nodes, and then run my code on exactly the same nodes. The errors
from my code are attached. I am not attaching the Intel benchmark output,
since it is about 100k and might upset someone, but I can send it on
request. If I pick a random test:
#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 40
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        19.70        20.37        19.87         0.00
            1         1000        12.80        13.61        13.25         0.28
            2         1000        12.94        13.73        13.39         0.56
            4         1000        12.93        13.24        13.14         1.15
            8         1000        12.46        12.89        12.65         2.37
           16         1000        14.59        15.35        15.00         3.98
           32         1000        12.83        13.42        13.26         9.09
           64         1000        13.17        13.49        13.31        18.10
          128         1000        13.83        14.40        14.20        33.90
          256         1000        16.47        17.34        16.89        56.33
          512         1000        22.72        23.29        22.99        83.85
         1024         1000        35.09        36.30        35.72       107.62
         2048         1000        71.28        72.46        71.91       107.81
         4096         1000       139.78       141.55       140.72       110.38
         8192         1000       237.86       240.13       239.10       130.14
        16384         1000       481.37       486.15       484.10       128.56
        32768         1000       864.89       872.48       869.35       143.27
        65536          640      1607.97      1629.53      1620.19       153.42
       131072          320      3106.92      3196.91      3160.10       156.40
       262144          160      5970.66      6333.02      6185.35       157.90
       524288           80     16322.10     18509.40     17627.17       108.05
      1048576           40     31194.17     40981.73     37056.97        97.60
      2097152           20     38023.90     77308.80     61021.08       103.48
      4194304           10     20423.82    143447.80     84832.93       111.54
------------------------------------------------------------------
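For reference, the two runs were launched roughly along these lines (the
hostfile name and binary paths below are placeholders, not the exact ones
I used):

   mpirun -np 40 --hostfile nodelist ./IMB-MPI1 Exchange   # the IMB Exchange run shown above
   mpirun -np 40 --hostfile nodelist ./mycode              # my own code on the same nodes
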
As you can see, the Intel benchmark runs fine on this set of nodes; I have
been running it for a few hours without any problem. On the other hand, my
job still has this problem. To recap: both are compiled with Open MPI; the
benchmark looks fine, whereas my job fails to establish communication among
its processes, giving no error message at all with OMPI 1.2.x (various x)
and giving the attached error message with 1.3rc2.
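In case it is useful for the diagnosis, I can also re-run the failing case
with verbose BTL output switched on; as far as I understand the MCA
parameters, that would be something like the following (not yet tried):

   mpirun -np 40 --hostfile nodelist \
          --mca btl openib,sm,self --mca btl_base_verbose 30 ./mycode
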
I have tried ibcheckerrors, which reports:
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkDowned = 20 (threshold 10)
#warn: counter XmtDiscards = 65535 (threshold 100)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies)
port all: FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies)
port 10: FAILED
# Checked Switch: nodeguid 0x000b8cffff002347 with failure
#warn: counter XmtDiscards = 65535 (threshold 100)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies)
port 1: FAILED
## Summary: 25 nodes checked, 0 bad nodes found
## 48 ports checked, 2 ports have errors beyond threshold
Admittedly, not encouraging. The output of ibnetdiscover is attached.
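Since the SymbolErrors and XmtDiscards counters seem to have saturated at
65535, I suppose the next step, once I can get a quiet window, is to clear
the counters and see whether they climb again during a run; roughly
(assuming the usual infiniband-diags scripts, which I have not run yet):

   ibclearerrors     # reset the error counters on all ports
   ibclearcounters   # reset the traffic counters as well
   # ... re-run the job (or the IMB stress test) ...
   ibcheckerrors     # see whether SymbolErrors / XmtDiscards grow again
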
I should add that the cluster (including the InfiniBand fabric) is currently
in use. Unfortunately, my experience with InfiniBand is not adequate to take
this much further on my own. Any further clue on possible problems is very
welcome.
Many thanks for your attention,
Biagio
--
=========================================================
Dr. Biagio Lucini
Department of Physics, Swansea University
Singleton Park, SA2 8PP Swansea (UK)
Tel. +44 (0)1792 602284
=========================================================
[node17:25443] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node17_0/10802/1/shared_mem_pool.node17
failed with errno=2
[node21:28610] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node21_0/10802/1/shared_mem_pool.node21
failed with errno=2
[node10:29396] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node10_0/10802/1/shared_mem_pool.node10
failed with errno=2
[node24:02084] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node24_0/10802/1/shared_mem_pool.node24
failed with errno=2
[node19:01502] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node19_0/10802/1/shared_mem_pool.node19
failed with errno=2
[node12:31509] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node12_0/10802/1/shared_mem_pool.node12
failed with errno=2
[node22:10933] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node22_0/10802/1/shared_mem_pool.node22
failed with errno=2
[node23:18518] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node23_0/10802/1/shared_mem_pool.node23
failed with errno=2
[node9:26098] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node9_0/10802/1/shared_mem_pool.node9
failed with errno=2
[node16:27655] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node16_0/10802/1/shared_mem_pool.node16
failed with errno=2
[node15:27478] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node15_0/10802/1/shared_mem_pool.node15
failed with errno=2
[node18:21742] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node18_0/10802/1/shared_mem_pool.node18
failed with errno=2
[[10802,1],32][btl_openib_component.c:2893:handle_wc] from node21 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
--------------------------------------------------------------------------
The OpenFabrics "receiver not ready" retry count on a per-peer
connection between two MPI processes has been exceeded. In general,
this should not happen because Open MPI uses flow control on per-peer
connections to ensure that receivers are always ready when data is
sent.
This error usually means one of two things:
1. There is something awry within the network fabric itself.
2. A bug in Open MPI has caused flow control to malfunction.
1 is usually more likely.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Below is some information about the host that raised the error and the
peer to which it was connected:
Local host: node21
Local device: mthca0
Peer host: node11
You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
[[10802,1],36][btl_openib_component.c:2893:handle_wc] from node19 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],1][btl_openib_component.c:2893:handle_wc] from node9 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],38][btl_openib_component.c:2893:handle_wc] from node20 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],35][btl_openib_component.c:2893:handle_wc] from node23 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],10][btl_openib_component.c:2893:handle_wc] from node15 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],14][btl_openib_component.c:2893:handle_wc] from node14 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],33][btl_openib_component.c:2893:handle_wc] from node21 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],12][btl_openib_component.c:2893:handle_wc] from node18 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],6][btl_openib_component.c:2893:handle_wc] from node13 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],3][btl_openib_component.c:2893:handle_wc] from node16 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],37][btl_openib_component.c:2893:handle_wc] from node19 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],18][btl_openib_component.c:2893:handle_wc] from node10 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
--------------------------------------------------------------------------
mpirun has exited due to process rank 32 with PID 28610 on
node node21.cluster exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[[10802,1],16][btl_openib_component.c:2893:handle_wc] from node12 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
[node11:28202] [[10802,0],0]-[[10802,0],2] mca_oob_tcp_msg_recv: readv failed:
Connection reset by peer (104)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
[node11:28202] [[10802,0],0]-[[10802,0],11] mca_oob_tcp_msg_recv: readv failed:
Connection reset by peer (104)
[node11:28202] [[10802,0],0]-[[10802,0],3] mca_oob_tcp_msg_recv: readv failed:
Connection reset by peer (104)
[node11:28202] [[10802,0],0]-[[10802,0],13] mca_oob_tcp_msg_recv: readv failed:
Connection reset by peer (104)
[node11:28202] [[10802,0],0]-[[10802,0],4] mca_oob_tcp_msg_recv: readv failed:
Connection reset by peer (104)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
[node11:28202] 13 more processes have sent help message help-mpi-btl-openib.txt
/ pp rnr retry exceeded
[node11:28202] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages
forrtl: error (78): process killed (SIGTERM)
#
# Topology file: generated on Thu Jan 15 14:53:53 2009
#
# Max of 2 hops discovered
# Initiated from node 0002c90200268638 port 0002c90200268639
vendid=0x2c9
devid=0xb924
sysimgguid=0xb8cffff002347
switchguid=0xb8cffff002347
Switch 24 "S-000b8cffff002347" # "MT47396 Infiniscale-III Mellanox
Technologies" base port 0 lid 1 lmc 0
[24] "H-0002c90200267924"[1] # "node24 HCA-1" lid 18
[23] "H-0002c9020026771c"[1] # "node23 HCA-1" lid 17
[22] "H-0002c90200268648"[1] # "node22 HCA-1" lid 16
[20] "H-0002c90200267774"[1] # "node20 HCA-1" lid 14
[19] "H-0002c9020026796c"[1] # "node19 HCA-1" lid 13
[18] "H-0002c90200230e48"[1] # "node18 HCA-1" lid 12
[17] "H-0002c9020021de24"[1] # "node17 HCA-1" lid 11
[16] "H-0002c90200230dd4"[1] # "node16 HCA-1" lid 10
[15] "H-0002c9020022b38c"[1] # "node15 HCA-1" lid 121
[14] "H-0002c9020022b3b8"[1] # "node14 HCA-1" lid 5
[13] "H-0002c9020025420c"[1] # "node13 HCA-1" lid 6
[12] "H-0002c9020022b398"[1] # "node12 HCA-1" lid 97
[11] "H-0002c9020022b3b0"[1] # "node11 HCA-1" lid 89
[10] "H-0002c9020022b3c8"[1] # "node10 HCA-1" lid 81
[9] "H-0002c9020022b330"[1] # "node9 HCA-1" lid 73
[8] "H-0002c9020022b3cc"[1] # "node8 HCA-1" lid 65
[7] "H-0002c9020022b3dc"[1] # "node7 HCA-1" lid 3
[6] "H-0002c9020022b334"[1] # "node6 HCA-1" lid 4
[5] "H-0002c9020022b3a4"[1] # "node5 HCA-1" lid 41
[4] "H-0002c9020022b380"[1] # "node4 HCA-1" lid 33
[3] "H-0002c9020020d75c"[1] # "node3 HCA-1" lid 25
[2] "H-0002c902002140ac"[1] # "node2 HCA-1" lid 2
[1] "H-0002c9020020d604"[1] # "node1 HCA-1" lid 9
[21] "H-0002c90200268638"[1] # "node21 HCA-1" lid 15
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c90200267927
caguid=0x2c90200267924
Ca 1 "H-0002c90200267924" # "node24 HCA-1"
[1] "S-000b8cffff002347"[24] # lid 18 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020026771f
caguid=0x2c9020026771c
Ca 1 "H-0002c9020026771c" # "node23 HCA-1"
[1] "S-000b8cffff002347"[23] # lid 17 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020026864b
caguid=0x2c90200268648
Ca 1 "H-0002c90200268648" # "node22 HCA-1"
[1] "S-000b8cffff002347"[22] # lid 16 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c90200267777
caguid=0x2c90200267774
Ca 1 "H-0002c90200267774" # "node20 HCA-1"
[1] "S-000b8cffff002347"[20] # lid 14 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020026796f
caguid=0x2c9020026796c
Ca 1 "H-0002c9020026796c" # "node19 HCA-1"
[1] "S-000b8cffff002347"[19] # lid 13 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c90200230e4b
caguid=0x2c90200230e48
Ca 1 "H-0002c90200230e48" # "node18 HCA-1"
[1] "S-000b8cffff002347"[18] # lid 12 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020021de27
caguid=0x2c9020021de24
Ca 1 "H-0002c9020021de24" # "node17 HCA-1"
[1] "S-000b8cffff002347"[17] # lid 11 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c90200230dd7
caguid=0x2c90200230dd4
Ca 1 "H-0002c90200230dd4" # "node16 HCA-1"
[1] "S-000b8cffff002347"[16] # lid 10 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b38f
caguid=0x2c9020022b38c
Ca 1 "H-0002c9020022b38c" # "node15 HCA-1"
[1] "S-000b8cffff002347"[15] # lid 121 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b3bb
caguid=0x2c9020022b3b8
Ca 1 "H-0002c9020022b3b8" # "node14 HCA-1"
[1] "S-000b8cffff002347"[14] # lid 5 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020025420f
caguid=0x2c9020025420c
Ca 1 "H-0002c9020025420c" # "node13 HCA-1"
[1] "S-000b8cffff002347"[13] # lid 6 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b39b
caguid=0x2c9020022b398
Ca 1 "H-0002c9020022b398" # "node12 HCA-1"
[1] "S-000b8cffff002347"[12] # lid 97 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b3b3
caguid=0x2c9020022b3b0
Ca 1 "H-0002c9020022b3b0" # "node11 HCA-1"
[1] "S-000b8cffff002347"[11] # lid 89 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b3cb
caguid=0x2c9020022b3c8
Ca 1 "H-0002c9020022b3c8" # "node10 HCA-1"
[1] "S-000b8cffff002347"[10] # lid 81 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b333
caguid=0x2c9020022b330
Ca 1 "H-0002c9020022b330" # "node9 HCA-1"
[1] "S-000b8cffff002347"[9] # lid 73 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b3cf
caguid=0x2c9020022b3cc
Ca 1 "H-0002c9020022b3cc" # "node8 HCA-1"
[1] "S-000b8cffff002347"[8] # lid 65 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b3df
caguid=0x2c9020022b3dc
Ca 1 "H-0002c9020022b3dc" # "node7 HCA-1"
[1] "S-000b8cffff002347"[7] # lid 3 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b337
caguid=0x2c9020022b334
Ca 1 "H-0002c9020022b334" # "node6 HCA-1"
[1] "S-000b8cffff002347"[6] # lid 4 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b3a7
caguid=0x2c9020022b3a4
Ca 1 "H-0002c9020022b3a4" # "node5 HCA-1"
[1] "S-000b8cffff002347"[5] # lid 41 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b383
caguid=0x2c9020022b380
Ca 1 "H-0002c9020022b380" # "node4 HCA-1"
[1] "S-000b8cffff002347"[4] # lid 33 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020020d75f
caguid=0x2c9020020d75c
Ca 1 "H-0002c9020020d75c" # "node3 HCA-1"
[1] "S-000b8cffff002347"[3] # lid 25 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c902002140af
caguid=0x2c902002140ac
Ca 1 "H-0002c902002140ac" # "node2 HCA-1"
[1] "S-000b8cffff002347"[2] # lid 2 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020020d607
caguid=0x2c9020020d604
Ca 1 "H-0002c9020020d604" # "node1 HCA-1"
[1] "S-000b8cffff002347"[1] # lid 9 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020026863b
caguid=0x2c90200268638
Ca 1 "H-0002c90200268638" # "node21 HCA-1"
[1] "S-000b8cffff002347"[21] # lid 15 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1