I searched the FAQ and Google but couldn't find a solution to this problem.

My problem is that when one MPI execution host dies or its network connection goes down, the job is not aborted. Instead, the remaining processes continue to consume 100% CPU indefinitely. How can I make jobs abort in these cases?

I use Open MPI 1.3.2. We have a Myrinet network, and I use mtl/mx for MPI communication. We also use Grid Engine 6.2u3. The output from the running job (included after this message) indicates that the remaining processes detect a timeout when trying to communicate with the dead host, cl120.foi.se. But why do they not terminate after this failure?

Thanks.

Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=1, seqnum=0x2b8f
       matched_val: 0x0004000d_fffffff4
       slength=48, xfer_length=48
       seg: 0x7fffe11ff830,48
       caller: 0xdb

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=116, endpoint=1, seqnum=0x3726
       matched_val: 0x00040001_fffffff4
       slength=48, xfer_length=48
       seg: 0x7ffff124b7b0,48
       caller: 0x9b

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=0, seqnum=0x1048
       matched_val: 0x00040006_fffffff4
       slength=48, xfer_length=48
       seg: 0x7fffc6470eb0,48
       caller: 0x70

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=1, seqnum=0xd53
       matched_val: 0x00040007_fffffff4
       slength=48, xfer_length=48
       seg: 0x1f54360,48
       caller: 0xda

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=116, endpoint=0, seqnum=0x376c
       matched_val: 0x00040000_fffffff4
       slength=48, xfer_length=48
       seg: 0x82ec040,48
       caller: 0x12

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=0, seqnum=0x2746
       matched_val: 0x0004000c_fffffff4
       slength=48, xfer_length=48
       seg: 0x1116f410,48
       caller: 0x30

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=1, seqnum=0x18de
       matched_val: 0x00250001_fffffff4
       slength=104, xfer_length=104
       seg: 0x181c3100,104
       caller: 0x18

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (2): send_medium
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=116, endpoint=0, seqnum=0x3361
       matched_val: 0x0004000f_00000010
       slength=7168, xfer_length=7168
       seg: 0x23e8a838,7168
       caller: 0x7e

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (2): send_medium
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=116, endpoint=1, seqnum=0x3361
       matched_val: 0x0004000f_00000010
       slength=560, xfer_length=560
       seg: 0x23ec9fe0,560
       caller: 0x2d

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (2): send_medium
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=1, seqnum=0x3361
       matched_val: 0x0004000c_0000000d
       slength=840, xfer_length=840
       seg: 0x1a471a90,840
       caller: 0xf9

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (3): send_large
       state (0x0):
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=1, seqnum=0xad1
       matched_val: 0x00040006_00000007
       slength=133504, xfer_length=79352
       seg: 0x1b0daae0,133504
       local_rdma_id: 6e
       caller: 0xe6

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (2): send_medium
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=116, endpoint=0, seqnum=0x3361
       matched_val: 0x00040001_00000002
       slength=5992, xfer_length=5992
       seg: 0x1b136890,5992
       caller: 0x9f

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (3): send_large
       state (0x0):
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=0, seqnum=0xad1
       matched_val: 0x00040007_00000008
       slength=134400, xfer_length=134400
       seg: 0xb1d5600,134400
       local_rdma_id: 82
       caller: 0xc4

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
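
Since the output above is long and repetitive, here is a minimal sketch of how it could be summarized per host and message type. It is only an illustration: the function name summarize_mx_failures and the regular expressions are my own assumptions based on the line shapes shown above, and other MX versions may format these lines differently.

```python
import re
from collections import Counter

# Patterns are guesses based on the line shapes in the output above
# (assumption: other MX versions may print these lines differently).
TYPE_RE = re.compile(r"^\s*type \((\d+)\): (\w+)")
DEST_RE = re.compile(r"^\s*dest: (\S+) \(([^):]+):(\d+)\)")
ABORT_RE = re.compile(
    r"^Aborted (\d+) send requests due to remote peer (\S+) \(([^):]+):(\d+)\)"
)

# One block copied from the output above, used as sample input.
SAMPLE = """\
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=1, seqnum=0x2b8f
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59 (cl120.foi.se:0) disconnected
"""


def summarize_mx_failures(text):
    """Return (timed_out, aborted) counters.

    timed_out counts messages that hit the retry limit, keyed by
    (host, message type); aborted sums aborted send requests per host.
    """
    timed_out = Counter()
    aborted = Counter()
    current_type = None
    for line in text.splitlines():
        m = TYPE_RE.match(line)
        if m:
            # Remember the type until the matching "dest:" line appears.
            current_type = m.group(2)
            continue
        m = DEST_RE.match(line)
        if m and current_type is not None:
            timed_out[(m.group(2), current_type)] += 1
            current_type = None
            continue
        m = ABORT_RE.match(line)
        if m:
            aborted[m.group(3)] += int(m.group(1))
    return timed_out, aborted


if __name__ == "__main__":
    timed_out, aborted = summarize_mx_failures(SAMPLE)
    for (host, msg_type), count in sorted(timed_out.items()):
        print(f"{host}: {count} {msg_type} message(s) hit the retry limit")
    for host, count in sorted(aborted.items()):
        print(f"{host}: {count} send request(s) aborted")
```

This only condenses the log so the dead peer stands out; it does not make the job abort.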
