I searched the FAQ and Google but couldn't find a solution to this
problem.
When one MPI execution host dies, or its network connection goes down,
the job is not aborted. Instead, the remaining processes continue to
consume 100% CPU indefinitely. How can I make jobs abort in these
cases?
I use Open MPI 1.3.2. We have a Myrinet network and I use mtl/mx for
MPI communication. We also use Grid Engine 6.2u3. The output from the
running job (below) indicates that the remaining processes detect a
timeout while trying to communicate with the (dead) host cl120.foi.se.
But why do they not terminate after this failure?
Thanks.
Max retransmit retries reached (1000) for message
type (1): send_small
state (0x14): buffered dead
requeued: 1000 (timeout=501000ms)
dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
partner: peer_index=1, endpoint=1, seqnum=0x2b8f
matched_val: 0x0004000d_fffffff4
slength=48, xfer_length=48
seg: 0x7fffe11ff830,48
caller: 0xdb
Was trying to contact
00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
type (1): send_small
state (0x14): buffered dead
requeued: 1000 (timeout=501000ms)
dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
partner: peer_index=116, endpoint=1, seqnum=0x3726
matched_val: 0x00040001_fffffff4
slength=48, xfer_length=48
seg: 0x7ffff124b7b0,48
caller: 0x9b
Was trying to contact
00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
type (1): send_small
state (0x14): buffered dead
requeued: 1000 (timeout=501000ms)
dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
partner: peer_index=1, endpoint=0, seqnum=0x1048
matched_val: 0x00040006_fffffff4
slength=48, xfer_length=48
seg: 0x7fffc6470eb0,48
caller: 0x70
Was trying to contact
00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
type (1): send_small
state (0x14): buffered dead
requeued: 1000 (timeout=501000ms)
dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
partner: peer_index=1, endpoint=1, seqnum=0xd53
matched_val: 0x00040007_fffffff4
slength=48, xfer_length=48
seg: 0x1f54360,48
caller: 0xda
Was trying to contact
00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
type (1): send_small
state (0x14): buffered dead
requeued: 1000 (timeout=501000ms)
dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
partner: peer_index=116, endpoint=0, seqnum=0x376c
matched_val: 0x00040000_fffffff4
slength=48, xfer_length=48
seg: 0x82ec040,48
caller: 0x12
Was trying to contact
00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
type (1): send_small
state (0x14): buffered dead
requeued: 1000 (timeout=501000ms)
dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
partner: peer_index=1, endpoint=0, seqnum=0x2746
matched_val: 0x0004000c_fffffff4
slength=48, xfer_length=48
seg: 0x1116f410,48
caller: 0x30
Was trying to contact
00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
type (1): send_small
state (0x14): buffered dead
requeued: 1000 (timeout=501000ms)
dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
partner: peer_index=1, endpoint=1, seqnum=0x18de
matched_val: 0x00250001_fffffff4
slength=104, xfer_length=104
seg: 0x181c3100,104
caller: 0x18
Was trying to contact
00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
type (2): send_medium
state (0x14): buffered dead
requeued: 1000 (timeout=501000ms)
dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
partner: peer_index=116, endpoint=0, seqnum=0x3361
matched_val: 0x0004000f_00000010
slength=7168, xfer_length=7168
seg: 0x23e8a838,7168
caller: 0x7e
Was trying to contact
00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
type (2): send_medium
state (0x14): buffered dead
requeued: 1000 (timeout=501000ms)
dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
partner: peer_index=116, endpoint=1, seqnum=0x3361
matched_val: 0x0004000f_00000010
slength=560, xfer_length=560
seg: 0x23ec9fe0,560
caller: 0x2d
Was trying to contact
00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
type (2): send_medium
state (0x14): buffered dead
requeued: 1000 (timeout=501000ms)
dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
partner: peer_index=1, endpoint=1, seqnum=0x3361
matched_val: 0x0004000c_0000000d
slength=840, xfer_length=840
seg: 0x1a471a90,840
caller: 0xf9
Was trying to contact
00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
type (3): send_large
state (0x0):
requeued: 1000 (timeout=501000ms)
dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
partner: peer_index=1, endpoint=1, seqnum=0xad1
matched_val: 0x00040006_00000007
slength=133504, xfer_length=79352
seg: 0x1b0daae0,133504
local_rdma_id: 6e
caller: 0xe6
Was trying to contact
00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
type (2): send_medium
state (0x14): buffered dead
requeued: 1000 (timeout=501000ms)
dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
partner: peer_index=116, endpoint=0, seqnum=0x3361
matched_val: 0x00040001_00000002
slength=5992, xfer_length=5992
seg: 0x1b136890,5992
caller: 0x9f
Was trying to contact
00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
type (3): send_large
state (0x0):
requeued: 1000 (timeout=501000ms)
dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
partner: peer_index=1, endpoint=0, seqnum=0xad1
matched_val: 0x00040007_00000008
slength=134400, xfer_length=134400
seg: 0xb1d5600,134400
local_rdma_id: 82
caller: 0xc4
Was trying to contact
00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
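
The MX messages above have a regular shape, so one possible stopgap is a
watchdog that scans the job's output for them. What follows is only a
minimal sketch in Python, assuming the output looks exactly like the
lines shown above. The function name find_dead_peers is hypothetical,
and the sketch does not address why the processes keep running. What to
do once a peer is flagged (for example, deleting the job through the
batch system) is deliberately left out.

```python
import re
from collections import Counter

# Matches the MX line that reports aborted sends, e.g.
# "Aborted 2 send requests due to remote peer 00:60:dd:49:78:59"
# Assumption: the peer is always printed as a 17-character MAC address.
ABORT_RE = re.compile(
    r"Aborted (\d+) send requests? due to remote peer ([0-9a-f:]{17})"
)


def find_dead_peers(log_text):
    """Return a Counter mapping peer MAC address -> total aborted sends."""
    dead = Counter()
    for match in ABORT_RE.finditer(log_text):
        count, mac = match.groups()
        dead[mac] += int(count)
    return dead


if __name__ == "__main__":
    # Two records copied from the shape of the output above.
    sample = (
        "Aborted 2 send requests due to remote peer 00:60:dd:49:78:59\n"
        "(cl120.foi.se:0) disconnected\n"
        "Aborted 1 send requests due to remote peer 00:60:dd:49:78:59\n"
        "(cl120.foi.se:0) disconnected\n"
    )
    for mac, aborted in find_dead_peers(sample).items():
        print(f"peer {mac}: {aborted} aborted send requests")
```

A wrapper around the job could call find_dead_peers on the output file
periodically and treat any non-empty result as a signal to stop the job.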