On 15 April 2010 17:50, guy keren <c...@actcom.co.il> wrote:
>
> and - test test test.....
>
> many people fail to test their "highly-available" setup, and as a result,
> think they are 'highly available" when they are not.

Great point! Earlier in my current position we failed to test that fail-over still worked on one of our servers after an upgrade, and got bitten the next time the primary went south.

Since then it's our standard practice to:

1. Update the current stand-by.
2. Switch over (this effectively updates the service to the new version).
3. Wait a while and see that everything is fine (the current stand-by is still ready with the old, proven version).
4. Update the current stand-by (the former primary).
5. Switch back to make sure HA works right.
6. (For good measure, switch again.)

We also do all of this through an elaborate set of scripts around xen, kickstart and puppet. Keeping the xen guest images in LVs came in handy when we had to roll back. The whole deployment procedure is completely automated and repeatable, so we exercise not just the in-house built software and its configuration but also the deployment procedure itself.

We reached a stage where a single operations engineer upgrades our entire production system of about 50 virtual servers across 18 physical servers in three mornings of running automatic scripts (i.e. no need for manual configuration changes).

(We stick to mornings because of a rule not to schedule production changes in the afternoons or before a weekend, unless absolutely necessary.)

--Amos

_______________________________________________
Linux-il mailing list
Linux-il@cs.huji.ac.il
http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il
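
P.S. For anyone who wants to see the ordering of the six steps spelled out, here is a minimal Python sketch. It is a toy model, not our actual xen/kickstart/puppet scripts: the `Node` and `HAPair` classes, the `rolling_upgrade` function and the caller-supplied `healthy` check are all hypothetical names made up for illustration, and the fall-back on a failed check is just one assumed way of using the stand-by that still holds the old version.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    version: str


@dataclass
class HAPair:
    """Toy primary/stand-by pair (hypothetical; stands in for real servers)."""
    primary: Node
    standby: Node
    log: list = field(default_factory=list)

    def update_standby(self, version):
        # Only the node that is NOT serving traffic gets touched.
        self.standby.version = version
        self.log.append(f"update {self.standby.name} -> {version}")

    def switch_over(self):
        # Swap roles; in real life this is the fail-over being exercised.
        self.primary, self.standby = self.standby, self.primary
        self.log.append(f"switch over, primary is now {self.primary.name}")


def rolling_upgrade(pair, new_version, healthy):
    """Run steps 1-6. `healthy` is a caller-supplied check (an assumption)."""
    old_version = pair.primary.version
    pair.update_standby(new_version)   # 1. update the current stand-by
    pair.switch_over()                 # 2. service now runs the new version
    if not healthy(pair):              # 3. wait a while and watch
        pair.switch_over()             # fall back to the old, proven version
        pair.log.append(f"fell back to {old_version}")
        return False
    pair.update_standby(new_version)   # 4. update the other node
    pair.switch_over()                 # 5. switch back: proves HA still works
    pair.switch_over()                 # 6. for good measure, switch again
    return True
```

Calling `rolling_upgrade(HAPair(Node("a", "1.0"), Node("b", "1.0")), "2.0", lambda p: True)` walks through all six steps; passing a check that returns `False` stops after step 3 with the old version serving again.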