I searched the FAQ and Google but couldn't come up with a solution to
this problem.
My problem is that when one MPI execution host dies or the network
connection goes down, the job is not aborted. Instead, the remaining
processes continue to eat 100% CPU indefinitely. How can I make the job
abort when this happens?
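As a coarse application-level workaround (a minimal sketch, not Open MPI's own fault-handling mechanism; the 60-second limit and the rank numbers are arbitrary choices for illustration), a rank can poll a pending receive and call MPI_Abort to tear down the whole job if the peer never answers:

/* Watchdog sketch: abort the whole job if a peer stops responding.
 * The 60-second limit is an arbitrary example value. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Request req;
    int payload = 0;

    if (rank == 0) {
        /* Wait for a message from rank 1; if that host has died, the
         * request never completes and we give up after 60 seconds. */
        MPI_Irecv(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);

        int flag = 0;
        for (int waited = 0; waited < 60 && !flag; waited++) {
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
            sleep(1);
        }
        if (!flag) {
            fprintf(stderr, "peer unresponsive, aborting job\n");
            MPI_Abort(MPI_COMM_WORLD, 1);   /* kills every rank */
        }
    } else if (rank == 1) {
        payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}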
Scott Atchley wrote:
Long answer:
The messages below indicate that these processes were all trying to
send to cl120. It did not ack their messages after 1000 resend
attempts; with a 0.5-second interval between retries, that is about
500 seconds (roughly 8.3 minutes).
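For reference, the timeout follows directly from those two retry parameters; the constants in this small sketch simply mirror the numbers quoted above and are not taken from the MX source:

/* Sketch of the timeout arithmetic described above (not MX code):
 * a message is retransmitted up to MAX_RESENDS times, RETRY_INTERVAL
 * seconds apart, before the peer is declared dead. */
#include <stdio.h>

#define MAX_RESENDS    1000     /* resend attempts before giving up */
#define RETRY_INTERVAL 0.5      /* seconds between attempts         */

int main(void)
{
    double timeout_s = MAX_RESENDS * RETRY_INTERVAL;    /* 500 s       */
    printf("peer declared dead after %.0f s (%.1f min)\n",
           timeout_s, timeout_s / 60.0);                /* ~8.3 min    */
    return 0;
}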
The messages also
Patrick Geoffray wrote:
> Hi Oskar,
>
> Oskar Enoksson wrote:
>> The reason in this case was that cl120 had some kind of hardware
>> problem, perhaps a memory error or a Myrinet NIC hardware error. The
>> system hung.
>>
>> I will try MX_ZOMBIE_SEND=0.
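One way to try that setting from inside the application is to export the variable before MPI_Init brings up the MX layer; this is a minimal sketch, assuming MX reads its environment at initialization time. Exporting MX_ZOMBIE_SEND=0 in the job script, or passing it through with mpirun's -x option, is the more common and more reliable route.

/* Sketch: set the MX parameter suggested above before MPI_Init
 * initializes the interconnect layer. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Must be in the environment before MPI_Init (and hence MX) starts. */
    setenv("MX_ZOMBIE_SEND", "0", 1);

    MPI_Init(&argc, &argv);
    /* ... application code ... */
    MPI_Finalize();
    return 0;
}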
I have a cluster with two Intel Xeon Nehalem E5520 CPUs per server
(quad-core, 2.27 GHz). The interconnect is 4x QDR InfiniBand (Mellanox
ConnectX).
I have compiled and installed Open MPI 1.4.2. The kernel is 2.6.32.2,
which I compiled myself. I use gridengine 6.2u5. Open MPI was
compiled
Sorry, the kernel is 2.6.32.12, not 2.6.32.2. I also forgot to mention
that the system is CentOS 5.4.
And further ... 25 MB/s is after tweaking btl_sm_num_fifos=8 and
btl_sm_eager_limit=65536. Without those the rate is 9 MB/s for 1 MB
packets and 1.5 MB/s for 10 kB packets :-(
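For anyone who wants to reproduce those figures: the MCA parameters are normally passed on the command line (for example, mpirun --mca btl_sm_num_fifos 8 --mca btl_sm_eager_limit 65536 ...), and the per-message-size bandwidth can be measured with a simple ping-pong between two ranks on the same node. The sketch below is my own minimal benchmark, not the one behind the numbers quoted above; run it with two ranks and pass the message size in bytes as the first argument.

/* Minimal ping-pong bandwidth sketch: time round trips between two
 * ranks for a given message size (e.g. 1 MB or 10 kB). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int    iters = 1000;
    const size_t bytes = (argc > 1) ? (size_t)atol(argv[1]) : (1 << 20);
    char *buf = malloc(bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, (int)bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, (int)bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, (int)bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, (int)bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        double elapsed = MPI_Wtime() - t0;
        /* Each iteration moves the message twice (out and back). */
        double rate = (2.0 * iters * bytes) / elapsed / 1e6;
        printf("%zu-byte messages: %.1f MB/s\n", bytes, rate);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}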