[OMPI users] How to make a job abort when one host dies?

2009-08-11 Thread Oskar Enoksson
I searched the FAQ and Google but couldn't come up with a solution to this problem. My problem is that when one MPI execution host dies, or the network connection goes down, the job is not aborted. Instead, the remaining processes continue to eat 100% CPU indefinitely. How can I make jobs abort

Re: [OMPI users] How to make a job abort when one host dies?

2009-08-18 Thread Oskar Enoksson
Scott Atchley wrote: Long answer: The messages below indicate that these processes were all trying to send to cl120. It did not ack their messages after 1000 resend attempts (each retry is attempted with a 0.5 second interval) which is about 8.3 minutes (500 seconds). The messages also
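The timeout arithmetic quoted above can be checked with a minimal sketch. The two constants are the figures from the message (1000 resend attempts, 0.5 seconds apart); the function and variable names are hypothetical and are not part of MX or Open MPI.

```python
# Figures quoted in the message above; the names are illustrative only.
RESEND_ATTEMPTS = 1000
RETRY_INTERVAL_S = 0.5


def resend_timeout_seconds(attempts: int, interval_s: float) -> float:
    """Total time spent retrying before the sender gives up."""
    return attempts * interval_s


timeout_s = resend_timeout_seconds(RESEND_ATTEMPTS, RETRY_INTERVAL_S)
print(f"{timeout_s:.0f} s = {timeout_s / 60:.1f} min")
```

This reproduces the 500 seconds (about 8.3 minutes) given in the message.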

Re: [OMPI users] users Digest, Vol 1321, Issue 6

2009-08-18 Thread Oskar Enoksson
Patrick Geoffray wrote:
> Hi Oskar,
>
> Oskar Enoksson wrote:
>> The reason in this case was that cl120 had some kind of hardware
>> problem, perhaps memory error or myrinet NIC hardware error. The system
>> hung.
>>
>> I will try MX_ZOMBIE_SEND=0

[OMPI users] Very poor performance with btl sm on twin nehalem servers with Mellanox Technologies MT26428 (ConnectX)

2010-05-11 Thread Oskar Enoksson
I have a cluster with two Intel Xeon Nehalem E5520 CPUs per server (quad-core, 2.27GHz). The interconnect is 4xQDR InfiniBand (Mellanox ConnectX). I have compiled and installed Open MPI 1.4.2. The kernel is 2.6.32.2, which I compiled myself. I use Grid Engine 6.2u5. Open MPI was compiled

Re: [OMPI users] Very poor performance with btl sm on twin nehalem servers with Mellanox Technologies MT26428 (ConnectX)

2010-05-11 Thread Oskar Enoksson
Sorry, the kernel is 2.6.32.12, not 2.6.32.2. And I forgot to mention the system is CentOS 5.4. And further ... 25MB/s is after tweaking btl_sm_num_fifos=8 and btl_sm_eager_limit=65536. Without those the rate is 9MB/s for 1MB packets and 1.5MB/s for 10kB packets :-( On 05/11/2010 08:19 PM, Oskar
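
For readers who want to reproduce the tuning mentioned above, here is a minimal sketch of how the two parameters could be passed to `mpirun` through its `--mca` option. The helper name, the process count, and the program name `./bandwidth_test` are assumptions for illustration; the message does not show the actual command line that was used.

```python
import shlex


def mpirun_command(mca_params, num_procs, program):
    """Build an mpirun argument list with one --mca pair per parameter."""
    argv = ["mpirun", "-np", str(num_procs)]
    for name, value in mca_params.items():
        argv += ["--mca", name, str(value)]
    argv.append(program)
    return argv


# Values quoted in the message; process count and program are hypothetical.
sm_tuning = {"btl_sm_num_fifos": 8, "btl_sm_eager_limit": 65536}
cmd = mpirun_command(sm_tuning, num_procs=8, program="./bandwidth_test")
print(shlex.join(cmd))
```

Building the command as a list keeps each parameter name and value a separate argument, so it can be handed to `subprocess.run` without shell quoting issues.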