On Oct 29, 2013, at 7:14 PM, Mathew Snyder <mathew.sny...@gmail.com> wrote:

Unfortunately, this is a contract and while we, as the integrator, should be the prime, we are left as a sub to the data services provider. It's backassward and has caused innumerable headaches, but it is what it is and we are unable to divert some of our compute to other providers. This is, no doubt, one reason why heads will be rolling.
-Mathew

On Tue, Oct 29, 2013 at 2:18 PM, Tracy Reed <tr...@ultraviolet.org> wrote:

...When the machine comes back up sometimes it has fsck errors. These are usually resolved with fsck -y although I hesitate to make that happen automatically. And then there are always the mysql tables in need of repair, services which did not come back up automatically, etc. Finding a non-manual way to handle this when you have thousands of VMs is a very hard problem.

It sounds like you already have some kind of mechanism to hit all 1400 servers and run a script (perhaps parallel ssh with pre-established keys). You may want to do some more data collection prior to taking the drastic route of rebooting them "blind." For example, you don't want to discover after the fact that some of these systems are using DHCP-assigned network addresses, DNS server settings, etc. If you have the scripting capability that it sounds like you do, it would be worth spending a little time capturing /etc/sysconfig/{network,network-scripts/*} (and a few other things like /etc/fstab and a current 'df' output) before trying this. Perhaps even a 'ps' list from all of them so you can grep them all for database processes that will require more hands-on recovery later.

Many of the systems will come up fine; some may stop on boot awaiting manual intervention. Some may decide that their disk device is not present, or perhaps even that their network interfaces are not the ones you thought they were. So there is a lot of risk to this, and you are going to need to keep score somehow so you know which ones to come back to later (and which ones you can live without for a while).

Unless you have huge swaths of identical machines within that pool of 1400 VMs, you're probably going to want to manually observe the first several dozen before you consider trusting any kind of automated mass reboot-and-fsck approach using 'shutdown -rF now'.
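As a rough illustration of the pre-reboot data collection described above, here is a minimal Python sketch that triages files already captured from each host. It assumes the ifcfg-* files and the 'ps' output have been pulled back as text by whatever mechanism you already use; the function names and the list of database process names are made up for the example, not taken from any particular tool.

```python
import re

# Hypothetical list of process names that suggest a database which may
# need hands-on recovery after a reboot. Adjust to what you actually run.
DB_PROCESS_PATTERN = re.compile(r"\b(mysqld|postgres|oracle|mongod)\b")


def uses_dhcp(ifcfg_text):
    """Return True if an ifcfg-* file sets BOOTPROTO to dhcp."""
    for line in ifcfg_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip().strip("\"'").lower()
        if key.strip() == "BOOTPROTO" and value == "dhcp":
            return True
    return False


def database_processes(ps_text):
    """Return the sorted, de-duplicated database process names in 'ps' output."""
    return sorted(set(DB_PROCESS_PATTERN.findall(ps_text)))


def triage(host, ifcfg_texts, ps_text):
    """Summarize the risk flags for one host from its captured files."""
    return {
        "host": host,
        "dhcp": any(uses_dhcp(text) for text in ifcfg_texts),
        "databases": database_processes(ps_text),
    }
```

Running triage() over every host's captured files gives you one record per VM, which you can sort or grep to decide which machines should not be rebooted without someone watching.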
And you're of course going to want to prioritize VMs housing common services before the rest of the population (e.g. ntp, dns, ldap, common NFS shares, databases, etc.).

You certainly have my sympathies; several of us have been through similar war stories. Don't rush into this; you can do a whole lot more damage trying to automate your recovery procedure and missing something.

Good luck,
- Dave
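For the prioritizing and score-keeping mentioned above, a small Python sketch might look like the following. The role names, the priority numbers and the status strings are all assumptions for the sake of the example; the relative order among ntp, dns, ldap, NFS and databases is a guess and should be set to match your own dependencies.

```python
# Hypothetical role-to-priority map; a lower number means "bring back and
# verify earlier". The exact ordering here is an assumption.
ROLE_PRIORITY = {"ntp": 0, "dns": 0, "ldap": 1, "nfs": 1, "database": 2}
DEFAULT_PRIORITY = 9


def host_priority(roles):
    """Return the best (lowest) priority among a host's roles."""
    return min(
        (ROLE_PRIORITY.get(role, DEFAULT_PRIORITY) for role in roles),
        default=DEFAULT_PRIORITY,
    )


def reboot_order(inventory):
    """Sort hosts by priority, then by name.

    inventory maps a host name to the list of roles it serves.
    """
    return sorted(inventory, key=lambda h: (host_priority(inventory[h]), h))


def needs_attention(scoreboard):
    """Return the hosts whose recorded status is anything other than 'ok'."""
    return sorted(h for h, status in scoreboard.items() if status != "ok")
```

For example, with an inventory of {"web07": ["web"], "ns1": ["dns"], "db3": ["database"]}, reboot_order() puts ns1 first, then db3, then web07. The scoreboard is just a dict of host to status that you update as each machine is checked, so needs_attention() tells you which ones to come back to.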
_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/