On Aug 17, 2009, at 2:43 PM, Jeff Squyres wrote:

George / Myricom --

Does the MX MTL abort if it gets a "disconnected" error back from libmyriexpress?

Short answer: yes.

Long answer:

The messages below indicate that these processes were all trying to send to cl120. It did not ack their messages within 1000 resend attempts. Retries are made at 0.5-second intervals, so that is 500 seconds, or about 8.3 minutes.
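As a quick check of that figure, the arithmetic can be written out as a small Python sketch. The retry count and interval are the ones stated above; the variable names are only illustrative. The log itself reports timeout=501000ms, which is close to this value.

```python
# Retry budget described above: 1000 resends, 0.5 seconds apart.
RETRIES = 1000
INTERVAL_S = 0.5

total_s = RETRIES * INTERVAL_S
print(f"{total_s:.0f} s = {total_s / 60:.1f} min")  # prints "500 s = 8.3 min"
```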

The messages also indicate that the message was a send_small, which means it was 128 bytes or less. MX has MPI-like semantics and allows a send to complete once the message has been either buffered or delivered. In this case, it was buffered, and OMPI was most likely able to complete it successfully. The message could not be delivered, however, and its timeout caused MX to fail all future sends to that host. OMPI will detect the failure on its next mx_isend() to that host.

Since OMPI has not detected a failure, my guess is that the processes have not tried to send to that host again. They then end up waiting forever.
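To see which timed-out messages had already been buffered (and so may have completed locally), output like the log quoted below can be tallied with a short script. This is only a sketch: the sample is abbreviated from that log, parse_records is a hypothetical helper, and treating "buffered" in the state line as "buffered by MX" is an assumption based on the explanation above.

```python
import re

# Two records abbreviated from the quoted log output.
SAMPLE = """\
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       slength=48, xfer_length=48
Max retransmit retries reached (1000) for message
       type (3): send_large
       state (0x0):
       slength=133504, xfer_length=79352
"""

def parse_records(log_text):
    """Return one dict per timed-out message: its type and whether the
    state line mentions "buffered" (assumed to mean MX buffered it)."""
    records = []
    for chunk in log_text.split("Max retransmit retries reached")[1:]:
        mtype = re.search(r"type \(\d+\): (\w+)", chunk)
        state = re.search(r"state \(0x[0-9a-fA-F]+\):[ \t]*(.*)", chunk)
        records.append({
            "type": mtype.group(1) if mtype else None,
            "buffered": bool(state and "buffered" in state.group(1)),
        })
    return records

for rec in parse_records(SAMPLE):
    print(rec["type"], "buffered" if rec["buffered"] else "not buffered")
```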

They can change MX's behavior so that it does not complete a send until the receiver has acked it by exporting:

MX_ZOMBIE_SEND=0

This will hurt benchmark performance, but real application performance should not be affected.
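One way to get the variable to every rank is mpirun's -x option, which exports an environment variable to the launched processes. The sketch below only builds the command line; the application name ./my_app, the process count, and the helper mpirun_command are placeholders, not anything from the original job.

```python
def mpirun_command(app, nprocs, env):
    """Build an mpirun argv that exports each variable in env with -x."""
    cmd = ["mpirun", "-np", str(nprocs)]
    for name, value in env.items():
        # mpirun takes one variable per -x option.
        cmd += ["-x", f"{name}={value}"]
    cmd.append(app)
    return cmd

# Hypothetical application and process count.
cmd = mpirun_command("./my_app", 4, {"MX_ZOMBIE_SEND": "0"})
print(" ".join(cmd))  # prints "mpirun -np 4 -x MX_ZOMBIE_SEND=0 ./my_app"
```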

The question is, however, why is cl120 not acking messages? What is the application? What MPI calls does this application use?

Scott

On Aug 11, 2009, at 7:07 AM, Oskar Enoksson wrote:

I searched the FAQ and google but couldn't come up with a solution to
this problem.

My problem is that when one MPI execution host dies or the network
connection goes down the job is not aborted. Instead the remaining
processes continue to eat 100% CPU indefinitely. How can I make jobs
abort in these cases?

I use Open MPI 1.3.2. We have a Myrinet network and I use mtl/mx for MPI
communication. We also use gridengine 6.2u3. The output from the running
job indicates that the remaining processes detect a timeout trying to
communicate with the (dead) host cl120.foi.se. But why do they not
terminate after this failure?

Thanks.

Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=1, seqnum=0x2b8f
       matched_val: 0x0004000d_fffffff4
       slength=48, xfer_length=48
       seg: 0x7fffe11ff830,48
       caller: 0xdb

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=116, endpoint=1, seqnum=0x3726
       matched_val: 0x00040001_fffffff4
       slength=48, xfer_length=48
       seg: 0x7ffff124b7b0,48
       caller: 0x9b

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=0, seqnum=0x1048
       matched_val: 0x00040006_fffffff4
       slength=48, xfer_length=48
       seg: 0x7fffc6470eb0,48
       caller: 0x70

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=1, seqnum=0xd53
       matched_val: 0x00040007_fffffff4
       slength=48, xfer_length=48
       seg: 0x1f54360,48
       caller: 0xda

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=116, endpoint=0, seqnum=0x376c
       matched_val: 0x00040000_fffffff4
       slength=48, xfer_length=48
       seg: 0x82ec040,48
       caller: 0x12

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=0, seqnum=0x2746
       matched_val: 0x0004000c_fffffff4
       slength=48, xfer_length=48
       seg: 0x1116f410,48
       caller: 0x30

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (1): send_small
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=1, seqnum=0x18de
       matched_val: 0x00250001_fffffff4
       slength=104, xfer_length=104
       seg: 0x181c3100,104
       caller: 0x18

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (2): send_medium
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=116, endpoint=0, seqnum=0x3361
       matched_val: 0x0004000f_00000010
       slength=7168, xfer_length=7168
       seg: 0x23e8a838,7168
       caller: 0x7e

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (2): send_medium
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=116, endpoint=1, seqnum=0x3361
       matched_val: 0x0004000f_00000010
       slength=560, xfer_length=560
       seg: 0x23ec9fe0,560
       caller: 0x2d

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (2): send_medium
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=1, seqnum=0x3361
       matched_val: 0x0004000c_0000000d
       slength=840, xfer_length=840
       seg: 0x1a471a90,840
       caller: 0xf9

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (3): send_large
       state (0x0):
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=1, seqnum=0xad1
       matched_val: 0x00040006_00000007
       slength=133504, xfer_length=79352
       seg: 0x1b0daae0,133504
       local_rdma_id: 6e
       caller: 0xe6

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (2): send_medium
       state (0x14): buffered dead
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=116, endpoint=0, seqnum=0x3361
       matched_val: 0x00040001_00000002
       slength=5992, xfer_length=5992
       seg: 0x1b136890,5992
       caller: 0x9f

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
       type (3): send_large
       state (0x0):
       requeued: 1000 (timeout=501000ms)
       dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
       partner: peer_index=1, endpoint=0, seqnum=0xad1
       matched_val: 0x00040007_00000008
       slength=134400, xfer_length=134400
       seg: 0xb1d5600,134400
       local_rdma_id: 82
       caller: 0xc4

Was trying to contact
       00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
jsquy...@cisco.com
