On 2013-01-29 21:02, Ralph Castain wrote:
On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault
<maxime.boissonnea...@calculquebec.ca> wrote:
While our filesystem and management nodes are on UPS, our compute
nodes are not. With on average one generic (mostly power/cooling)
failure every one or two months, running for weeks is just asking for
trouble. Add to that typical DIMM/CPU/networking failures: I estimate
about 1 node goes down per day because of some sort of hardware
failure, for a cluster of 960 nodes. With these numbers, a job running
on 32 nodes for 7 days has a ~35% chance of failing before it is done.
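(A back-of-the-envelope sketch of how a figure in that range can be obtained. The rates are the ones quoted in this thread; treating hardware and power failures as independent, Poisson-like events is a simplifying assumption and may not match the exact calculation behind the ~35%.)

    import math

    nodes_in_job = 32
    days = 7

    # ~1 node lost per day out of 960 => per-node, per-day failure probability
    p_node_day = 1.0 / 960
    p_hw = 1.0 - (1.0 - p_node_day) ** (nodes_in_job * days)

    # one power/cooling failure every 1-2 months, modelled as a Poisson process
    for mtbf_days in (30, 60):
        p_power = 1.0 - math.exp(-days / mtbf_days)
        p_total = 1.0 - (1.0 - p_hw) * (1.0 - p_power)
        print(f"MTBF {mtbf_days} d: hw {p_hw:.0%}, power {p_power:.0%}, total {p_total:.0%}")

    # prints roughly: hw ~21%, power ~11-21%, total ~30-37%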
I've been running this in my head all day - it just doesn't fit
experience, which really bothered me. So I spent a little time running
the calculation, and I came up with a number much lower (more like
around 5%). I'm not saying my rough number is correct, but it is at
least a little closer to what we see in the field.
Given that there are a lot of assumptions required when doing these
calculations, I would like to suggest you conduct a very simple and
quick experiment before investing tons of time on FT solutions. All
you have to do is:
Thanks for the calculation. However, this is a cluster that I manage,
not one I use per se, and running such statistical jobs on a large part
of the cluster for a long period of time is impossible. We do have the
numbers, however. The cluster has 960 nodes. We experience roughly one
power or cooling failure every month or two. Assuming one such failure
every two months, if you run for 1 month you have a 50% chance your job
will be killed before it ends; if you run for 2 weeks, 25%, etc. These
are very rough estimates, obviously, but it is way more than 5%.
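(For what it is worth, modelling the power/cooling failures as a Poisson process with a mean time between failures of T = 2 months gives P(killed within t) = 1 - exp(-t/T), i.e. roughly 39% for a 1-month run and 21% for a 2-week run; the linear t/T scaling above is a slightly pessimistic rough estimate, but either way it is well above 5%.)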
In addition to that, we have a node failure rate of ~0.1%/day, meaning
that out of 960 nodes, on average, one will have a hardware failure
every day. Most of the time, this is a failure of one of the DIMMs.
Considering each node has 12 DIMMs of 2 GB of memory, that works out to
a per-DIMM failure rate of ~0.0001 per day. I don't know if that's bad
or not, but this is roughly what we have.
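(A quick sanity check on that per-DIMM figure, under the rough assumption that essentially every node failure is a DIMM failure:)

    nodes = 960
    dimms_per_node = 12

    node_fail_per_day = 1.0 / nodes                  # ~0.1%/day per node, ~1 node/day cluster-wide
    dimm_fail_per_day = node_fail_per_day / dimms_per_node

    print(f"per-DIMM daily failure rate ~ {dimm_fail_per_day:.1e}")   # ~8.7e-05, i.e. ~1e-4/day
    print(f"expected DIMM failures/day, cluster-wide ~ {nodes * dimms_per_node * dimm_fail_per_day:.0f}")  # ~1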
If it turns out you see power failure problems, then a simple,
low-cost, ride-thru power stabilizer might be a good solution.
Flywheels and capacitor-based systems can provide support for
momentary power quality issues at reasonably low costs for a cluster
of your size.
I doubt there is anything low-cost for a 330 kW system, and in any case,
a hardware upgrade is not an option since this is a mid-life cluster.
Again, as I said, the filesystem (2 x 500 TB Lustre partitions) and the
management nodes are on UPS, but there is no way to put the compute
nodes on UPS.
If your node hardware is the problem, or you decide you do want/need
to pursue an FT solution, then you might look at the OMPI-based
solutions from parties such as http://fault-tolerance.org or the
MPICH2 folks.
Thanks for the tip.
Best regards,
Maxime