On Oct 29, 2013, at 7:14 PM, Mathew Snyder 
<mathew.sny...@gmail.com> wrote:
Unfortunately, this is a contract and while we, as the integrator, should be 
the prime, we are left as a sub to the data services provider. It's 
backassward and has caused innumerable headaches, but it is what it is and we 
are unable to divert some of our compute to other providers. This is, no doubt, 
one reason why heads will be rolling.

-Mathew

On Tue, Oct 29, 2013 at 2:18 PM, Tracy Reed 
<tr...@ultraviolet.org> wrote:
...When the machine comes back up sometimes it has fsck errors. These are
usually resolved with fsck -y although I hesitate to make that happen
automatically.

And then there are always the mysql tables in need of repair, services which
did not come back up automatically, etc.

Finding a non-manual way to handle this when you have thousands of VMs is a
very hard problem.

It sounds like you already have some kind of mechanism to hit all 1400 servers 
and run a script (perhaps parallel ssh with pre-established keys).  You may 
want to do some more data collection prior to going to the drastic route of 
rebooting them "blind."  For example, you don't want to discover that some of 
these systems are using DHCP-assigned network addresses, DNS server settings, 
etc.  If you have the scripting capability that it sounds like you do, it would 
be worth spending a little time capturing 
/etc/sysconfig/{network,network-scripts/*} (and a few other things like 
/etc/fstab and a current 'df' output) before trying this.  Perhaps even a 'ps' 
list from all of them so you can grep them all for database processes that will 
require more hands-on recovery later.
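
As a rough sketch of that triage step, assuming the ifcfg files and 'ps' 
listings have already been copied back to one machine: the function names 
and the list of database process names below are illustrative assumptions, 
not part of any existing tool, and the pattern would need extending for 
whatever actually runs in your environment.

```python
import re

# Assumed process names worth flagging; adjust for your own environment.
DB_PATTERN = re.compile(r"(mysqld|postgres|oracle|mongod)")


def uses_dhcp(ifcfg_text):
    """Return True if an ifcfg-* file sets BOOTPROTO=dhcp.

    Comment lines are skipped; quotes and case around the value are ignored.
    """
    for line in ifcfg_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "BOOTPROTO":
            return value.strip().strip("\"'").lower() == "dhcp"
    return False


def hosts_with_databases(ps_by_host):
    """Return the sorted host names whose captured 'ps' output mentions a database."""
    return sorted(host for host, ps in ps_by_host.items() if DB_PATTERN.search(ps))
```

Feeding it a dictionary of host name to captured 'ps' text gives you the 
short list of machines that will probably need hands-on database recovery, 
and running uses_dhcp() over each saved ifcfg file tells you which hosts 
depend on DHCP before you reboot anything.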

Many of the systems will come up fine; some may stop on boot awaiting manual 
intervention.  Some may decide that their disk device is not present, or 
perhaps even that their network interfaces are not the ones you thought they 
were.  So there is a lot of risk to this, and you are going to need to keep 
score somehow so you know which ones to come back to later (and which ones you 
can live without for a while.)  Unless you have huge swaths of identical 
machines within that pool of 1400 VMs, you're probably going to want to 
manually observe the first several dozen before considering trusting any kind 
of automated mass reboot-and-fsck with 'shutdown -rF now' approach.  And you're 
of course going to want to prioritize VMs housing common services before the 
rest of the population (e.g. ntp, dns, ldap, common NFS shares, databases, etc.)
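
One way to keep score is sketched below, assuming you can tag each VM with 
the roles it serves and record a status string for it after its reboot.  The 
role names, the "ok" status value and both function names are made up for 
illustration.  It only decides ordering and follow-up; it does not reboot 
anything.

```python
# Assumed role labels for the common services that should go first.
COMMON_SERVICES = {"ntp", "dns", "ldap", "nfs", "database"}


def reboot_order(roles_by_host):
    """Order hosts so those providing common services come before the rest.

    Within each group the hosts are sorted by name so the plan is repeatable.
    """
    def tier(host):
        return 0 if COMMON_SERVICES & set(roles_by_host[host]) else 1

    return sorted(roles_by_host, key=lambda host: (tier(host), host))


def needs_followup(status_by_host):
    """Return the sorted hosts whose recorded status is anything other than 'ok'."""
    return sorted(h for h, status in status_by_host.items() if status != "ok")
```

Take the first few dozen entries from reboot_order() as the batch you watch 
by hand, and use needs_followup() as the list of machines to come back to.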

You certainly have my sympathies; several of us have been through similar war 
stories.  Don't rush into this; you can do a whole lot more damage trying to 
automate your recovery procedure and missing something.

Good luck,

- Dave
_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/