Jeff Squyres wrote:
On Jan 7, 2009, at 6:28 PM, Biagio Lucini wrote:
[[5963,1],13][btl_openib_component.c:2893:handle_wc] from node24 to:
node11 error polling LP CQ with status RECEIVER NOT READY RETRY
EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
Ah! If we're dealing with an RNR retry exceeded error, this is *usually* a
physical layer problem on the IB fabric.
Have you run a complete layer 0 / physical set of diagnostics on the
fabric to know that it is completely working properly?
Once again, apologies for the delayed answer, but I always need to find
a free slot to perform checks without disrupting the activity of the
other users, who seem to be happy with the present status (this includes
the other users of InfiniBand).
What I have done is to run the Intel MPI Benchmark (IMB) in stress mode
over 40 nodes, and then run my code on exactly the same nodes. The errors
from my code are attached. I am not attaching the Intel benchmark output,
since it is about 100k and might upset someone, but I can send it on
request. If I pick a random test:
#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 40
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        19.70        20.37        19.87         0.00
            1         1000        12.80        13.61        13.25         0.28
            2         1000        12.94        13.73        13.39         0.56
            4         1000        12.93        13.24        13.14         1.15
            8         1000        12.46        12.89        12.65         2.37
           16         1000        14.59        15.35        15.00         3.98
           32         1000        12.83        13.42        13.26         9.09
           64         1000        13.17        13.49        13.31        18.10
          128         1000        13.83        14.40        14.20        33.90
          256         1000        16.47        17.34        16.89        56.33
          512         1000        22.72        23.29        22.99        83.85
         1024         1000        35.09        36.30        35.72       107.62
         2048         1000        71.28        72.46        71.91       107.81
         4096         1000       139.78       141.55       140.72       110.38
         8192         1000       237.86       240.13       239.10       130.14
        16384         1000       481.37       486.15       484.10       128.56
        32768         1000       864.89       872.48       869.35       143.27
        65536          640      1607.97      1629.53      1620.19       153.42
       131072          320      3106.92      3196.91      3160.10       156.40
       262144          160      5970.66      6333.02      6185.35       157.90
       524288           80     16322.10     18509.40     17627.17       108.05
      1048576           40     31194.17     40981.73     37056.97        97.60
      2097152           20     38023.90     77308.80     61021.08       103.48
      4194304           10     20423.82    143447.80     84832.93       111.54
------------------------------------------------------------------
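For reference, the two runs were launched roughly along these lines (the
hostfile name and binary paths below are placeholders, not the exact ones
I used):

   mpirun -np 40 --hostfile nodelist ./IMB-MPI1 Exchange   # the IMB Exchange run shown above
   mpirun -np 40 --hostfile nodelist ./mycode              # my own code on the same nodes
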
As you can see, the Intel benchmark runs fine on this set of nodes; I have
been running it for a few hours without any problem. On the other hand, my
job still has this problem. To recap: both are compiled with Open MPI; the
benchmark looks fine, whereas my job fails to establish communication among
its processes, giving no error message at all with OMPI 1.2.x (various x)
and giving the attached error message with 1.3rc2.
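In case it is useful for the diagnosis, I can also re-run the failing case
with verbose BTL output switched on; as far as I understand the MCA
parameters, that would be something like the following (not yet tried):

   mpirun -np 40 --hostfile nodelist \
          --mca btl openib,sm,self --mca btl_base_verbose 30 ./mycode
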
I have tried ibcheckerrors, which reports:
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkDowned = 20 (threshold 10)
#warn: counter XmtDiscards = 65535 (threshold 100)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies)
port all: FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies)
port 10: FAILED
# Checked Switch: nodeguid 0x000b8cffff002347 with failure
#warn: counter XmtDiscards = 65535 (threshold 100)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies)
port 1: FAILED
## Summary: 25 nodes checked, 0 bad nodes found
## 48 ports checked, 2 ports have errors beyond threshold
Admittedly, not encouraging. The output of ibnetdiscover is attached.
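Since the SymbolErrors and XmtDiscards counters seem to have saturated at
65535, I suppose the next step, once I can get a quiet window, is to clear
the counters and see whether they climb again during a run; roughly
(assuming the usual infiniband-diags scripts, which I have not run yet):

   ibclearerrors     # reset the error counters on all ports
   ibclearcounters   # reset the traffic counters as well
   # ... re-run the job (or the IMB stress test) ...
   ibcheckerrors     # see whether SymbolErrors / XmtDiscards grow again
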
I should add that the cluster (including the InfiniBand fabric) is currently
in use. Unfortunately, my experience with InfiniBand is not adequate to take
this much further on my own. Any further clue on possible problems is very
welcome.
Many thanks for your attention,
Biagio
--
=========================================================
Dr. Biagio Lucini
Department of Physics, Swansea University
Singleton Park, SA2 8PP Swansea (UK)
Tel. +44 (0)1792 602284
=========================================================
[node17:25443] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node17_0/10802/1/shared_mem_pool.node17
failed with errno=2
[node21:28610] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node21_0/10802/1/shared_mem_pool.node21
failed with errno=2
[node10:29396] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node10_0/10802/1/shared_mem_pool.node10
failed with errno=2
[node24:02084] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node24_0/10802/1/shared_mem_pool.node24
failed with errno=2
[node19:01502] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node19_0/10802/1/shared_mem_pool.node19
failed with errno=2
[node12:31509] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node12_0/10802/1/shared_mem_pool.node12
failed with errno=2
[node22:10933] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node22_0/10802/1/shared_mem_pool.node22
failed with errno=2
[node23:18518] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node23_0/10802/1/shared_mem_pool.node23
failed with errno=2
[node9:26098] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node9_0/10802/1/shared_mem_pool.node9
failed with errno=2
[node16:27655] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node16_0/10802/1/shared_mem_pool.node16
failed with errno=2
[node15:27478] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node15_0/10802/1/shared_mem_pool.node15
failed with errno=2
[node18:21742] mca_common_sm_mmap_init: open
/tmp/10955.1.gold/openmpi-sessions-kstrings@node18_0/10802/1/shared_mem_pool.node18
failed with errno=2
[[10802,1],32][btl_openib_component.c:2893:handle_wc] from node21 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
--------------------------------------------------------------------------
The OpenFabrics "receiver not ready" retry count on a per-peer
connection between two MPI processes has been exceeded. In general,
this should not happen because Open MPI uses flow control on per-peer
connections to ensure that receivers are always ready when data is
sent.
This error usually means one of two things:
1. There is something awry within the network fabric itself.
2. A bug in Open MPI has caused flow control to malfunction.
1 is usually more likely.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Below is some information about the host that raised the error and the
peer to which it was connected:
Local host: node21
Local device: mthca0
Peer host: node11
You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
[[10802,1],36][btl_openib_component.c:2893:handle_wc] from node19 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],1][btl_openib_component.c:2893:handle_wc] from node9 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],38][btl_openib_component.c:2893:handle_wc] from node20 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],35][btl_openib_component.c:2893:handle_wc] from node23 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],10][btl_openib_component.c:2893:handle_wc] from node15 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],14][btl_openib_component.c:2893:handle_wc] from node14 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],33][btl_openib_component.c:2893:handle_wc] from node21 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],12][btl_openib_component.c:2893:handle_wc] from node18 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],6][btl_openib_component.c:2893:handle_wc] from node13 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],3][btl_openib_component.c:2893:handle_wc] from node16 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],37][btl_openib_component.c:2893:handle_wc] from node19 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
[[10802,1],18][btl_openib_component.c:2893:handle_wc] from node10 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
--------------------------------------------------------------------------
mpirun has exited due to process rank 32 with PID 28610 on
node node21.cluster exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[[10802,1],16][btl_openib_component.c:2893:handle_wc] from node12 to: node11
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status
number 13 for wr_id 37784704 opcode 0 qp_idx 0
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
[node11:28202] [[10802,0],0]-[[10802,0],2] mca_oob_tcp_msg_recv: readv failed:
Connection reset by peer (104)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
[node11:28202] [[10802,0],0]-[[10802,0],11] mca_oob_tcp_msg_recv: readv failed:
Connection reset by peer (104)
[node11:28202] [[10802,0],0]-[[10802,0],3] mca_oob_tcp_msg_recv: readv failed:
Connection reset by peer (104)
[node11:28202] [[10802,0],0]-[[10802,0],13] mca_oob_tcp_msg_recv: readv failed:
Connection reset by peer (104)
[node11:28202] [[10802,0],0]-[[10802,0],4] mca_oob_tcp_msg_recv: readv failed:
Connection reset by peer (104)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
[node11:28202] 13 more processes have sent help message help-mpi-btl-openib.txt
/ pp rnr retry exceeded
[node11:28202] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages
forrtl: error (78): process killed (SIGTERM)
#
# Topology file: generated on Thu Jan 15 14:53:53 2009
#
# Max of 2 hops discovered
# Initiated from node 0002c90200268638 port 0002c90200268639
vendid=0x2c9
devid=0xb924
sysimgguid=0xb8cffff002347
switchguid=0xb8cffff002347
Switch 24 "S-000b8cffff002347" # "MT47396 Infiniscale-III Mellanox
Technologies" base port 0 lid 1 lmc 0
[24] "H-0002c90200267924"[1] # "node24 HCA-1" lid 18
[23] "H-0002c9020026771c"[1] # "node23 HCA-1" lid 17
[22] "H-0002c90200268648"[1] # "node22 HCA-1" lid 16
[20] "H-0002c90200267774"[1] # "node20 HCA-1" lid 14
[19] "H-0002c9020026796c"[1] # "node19 HCA-1" lid 13
[18] "H-0002c90200230e48"[1] # "node18 HCA-1" lid 12
[17] "H-0002c9020021de24"[1] # "node17 HCA-1" lid 11
[16] "H-0002c90200230dd4"[1] # "node16 HCA-1" lid 10
[15] "H-0002c9020022b38c"[1] # "node15 HCA-1" lid 121
[14] "H-0002c9020022b3b8"[1] # "node14 HCA-1" lid 5
[13] "H-0002c9020025420c"[1] # "node13 HCA-1" lid 6
[12] "H-0002c9020022b398"[1] # "node12 HCA-1" lid 97
[11] "H-0002c9020022b3b0"[1] # "node11 HCA-1" lid 89
[10] "H-0002c9020022b3c8"[1] # "node10 HCA-1" lid 81
[9] "H-0002c9020022b330"[1] # "node9 HCA-1" lid 73
[8] "H-0002c9020022b3cc"[1] # "node8 HCA-1" lid 65
[7] "H-0002c9020022b3dc"[1] # "node7 HCA-1" lid 3
[6] "H-0002c9020022b334"[1] # "node6 HCA-1" lid 4
[5] "H-0002c9020022b3a4"[1] # "node5 HCA-1" lid 41
[4] "H-0002c9020022b380"[1] # "node4 HCA-1" lid 33
[3] "H-0002c9020020d75c"[1] # "node3 HCA-1" lid 25
[2] "H-0002c902002140ac"[1] # "node2 HCA-1" lid 2
[1] "H-0002c9020020d604"[1] # "node1 HCA-1" lid 9
[21] "H-0002c90200268638"[1] # "node21 HCA-1" lid 15
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c90200267927
caguid=0x2c90200267924
Ca 1 "H-0002c90200267924" # "node24 HCA-1"
[1] "S-000b8cffff002347"[24] # lid 18 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020026771f
caguid=0x2c9020026771c
Ca 1 "H-0002c9020026771c" # "node23 HCA-1"
[1] "S-000b8cffff002347"[23] # lid 17 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020026864b
caguid=0x2c90200268648
Ca 1 "H-0002c90200268648" # "node22 HCA-1"
[1] "S-000b8cffff002347"[22] # lid 16 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c90200267777
caguid=0x2c90200267774
Ca 1 "H-0002c90200267774" # "node20 HCA-1"
[1] "S-000b8cffff002347"[20] # lid 14 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020026796f
caguid=0x2c9020026796c
Ca 1 "H-0002c9020026796c" # "node19 HCA-1"
[1] "S-000b8cffff002347"[19] # lid 13 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c90200230e4b
caguid=0x2c90200230e48
Ca 1 "H-0002c90200230e48" # "node18 HCA-1"
[1] "S-000b8cffff002347"[18] # lid 12 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020021de27
caguid=0x2c9020021de24
Ca 1 "H-0002c9020021de24" # "node17 HCA-1"
[1] "S-000b8cffff002347"[17] # lid 11 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c90200230dd7
caguid=0x2c90200230dd4
Ca 1 "H-0002c90200230dd4" # "node16 HCA-1"
[1] "S-000b8cffff002347"[16] # lid 10 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b38f
caguid=0x2c9020022b38c
Ca 1 "H-0002c9020022b38c" # "node15 HCA-1"
[1] "S-000b8cffff002347"[15] # lid 121 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b3bb
caguid=0x2c9020022b3b8
Ca 1 "H-0002c9020022b3b8" # "node14 HCA-1"
[1] "S-000b8cffff002347"[14] # lid 5 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020025420f
caguid=0x2c9020025420c
Ca 1 "H-0002c9020025420c" # "node13 HCA-1"
[1] "S-000b8cffff002347"[13] # lid 6 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b39b
caguid=0x2c9020022b398
Ca 1 "H-0002c9020022b398" # "node12 HCA-1"
[1] "S-000b8cffff002347"[12] # lid 97 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b3b3
caguid=0x2c9020022b3b0
Ca 1 "H-0002c9020022b3b0" # "node11 HCA-1"
[1] "S-000b8cffff002347"[11] # lid 89 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b3cb
caguid=0x2c9020022b3c8
Ca 1 "H-0002c9020022b3c8" # "node10 HCA-1"
[1] "S-000b8cffff002347"[10] # lid 81 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b333
caguid=0x2c9020022b330
Ca 1 "H-0002c9020022b330" # "node9 HCA-1"
[1] "S-000b8cffff002347"[9] # lid 73 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b3cf
caguid=0x2c9020022b3cc
Ca 1 "H-0002c9020022b3cc" # "node8 HCA-1"
[1] "S-000b8cffff002347"[8] # lid 65 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b3df
caguid=0x2c9020022b3dc
Ca 1 "H-0002c9020022b3dc" # "node7 HCA-1"
[1] "S-000b8cffff002347"[7] # lid 3 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b337
caguid=0x2c9020022b334
Ca 1 "H-0002c9020022b334" # "node6 HCA-1"
[1] "S-000b8cffff002347"[6] # lid 4 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b3a7
caguid=0x2c9020022b3a4
Ca 1 "H-0002c9020022b3a4" # "node5 HCA-1"
[1] "S-000b8cffff002347"[5] # lid 41 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020022b383
caguid=0x2c9020022b380
Ca 1 "H-0002c9020022b380" # "node4 HCA-1"
[1] "S-000b8cffff002347"[4] # lid 33 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020020d75f
caguid=0x2c9020020d75c
Ca 1 "H-0002c9020020d75c" # "node3 HCA-1"
[1] "S-000b8cffff002347"[3] # lid 25 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c902002140af
caguid=0x2c902002140ac
Ca 1 "H-0002c902002140ac" # "node2 HCA-1"
[1] "S-000b8cffff002347"[2] # lid 2 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020020d607
caguid=0x2c9020020d604
Ca 1 "H-0002c9020020d604" # "node1 HCA-1"
[1] "S-000b8cffff002347"[1] # lid 9 lmc 0 "MT47396 Infiniscale-III
Mellanox Technologies" lid 1
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c9020026863b
caguid=0x2c90200268638
Ca 1 "H-0002c90200268638" # "node21 HCA-1"
[1] "S-000b8cffff002347"[21] # lid 15 lmc 0 "MT47396
Infiniscale-III Mellanox Technologies" lid 1