[lopsa-tech] Fwd: Need ideas/suggestions for bringing several VMs back online after an outage

Mathew Snyder Tue, 29 Oct 2013 17:55:51 -0700

None of them use DHCP. We statically assign IPs to all of our systems and
hard-code DNS and other network settings.


We also always remediate our own servers first, since they provide services
to all of the other VMs (DNS, NTP, SMTP, etc). We have no NFS shares that
we provide. Any NFS configuration is done by each customer at their own
risk and prerogative.

As for the scripting, we don't push out any commands such as shutdown. The
scripts simply report which servers have mounts in an unexpected RO state.
With that information, the shutdowns and remediations are conducted in a
controlled and deliberate manner, focusing on higher-priority items first
(the aforementioned services-providing VMs, followed by Production systems).
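
For illustration only, a report-only check of this kind could be sketched
roughly as below. This is not our actual script. It assumes that
"unexpected" means a filesystem listed in /etc/fstab without the "ro"
option is currently mounted read-only according to /proc/mounts. The
function names are made up for the example, and nothing here issues a
shutdown or changes the host.

```python
def parse_mounts(mounts_text):
    """Return {mount_point: set_of_options} from /proc/mounts-style text."""
    mounts = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        # Fields: device, mount point, fs type, options, dump, pass
        mounts[fields[1]] = set(fields[3].split(","))
    return mounts


def parse_fstab(fstab_text):
    """Return {mount_point: set_of_options} from /etc/fstab-style text."""
    entries = {}
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        entries[fields[1]] = set(fields[3].split(","))
    return entries


def unexpected_ro_mounts(mounts_text, fstab_text):
    """List mount points that are live read-only but not 'ro' in fstab.

    Assumption: mounts that do not appear in fstab at all are ignored.
    """
    live = parse_mounts(mounts_text)
    expected = parse_fstab(fstab_text)
    flagged = []
    for mount_point, options in live.items():
        if "ro" not in options:
            continue
        wanted = expected.get(mount_point)
        if wanted is not None and "ro" not in wanted:
            flagged.append(mount_point)
    return sorted(flagged)
```

On a host, the two arguments would be the contents of /proc/mounts and
/etc/fstab. The returned list is only reported back, so the decision to
reboot stays with a person.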

Finding systems in need of manual intervention after a reboot is not an
issue. Based on previous experience with this issue from this provider, we
rarely see systems that fail to recover after a reboot with fsck or,
failing that, after booting into rescue mode and running fsck from there.
Essentially, any manual intervention will be limited, if not nonexistent.

-Mathew

"When you do things right, people won't be sure you've done anything at
all." - God; Futurama

"We'll get along much better once you accept that you're wrong and neither
am I." - Me


On Tue, Oct 29, 2013 at 5:42 PM, Dave Caplinger <
[email protected]> wrote:

> On Oct 29, 2013, at 7:14 PM, Mathew Snyder <[email protected]>
> wrote:
>
> Unfortunately, this is a contract and while we, as the integrator, should
> be the prime, we are left as a sub to the data services provider. It's
> backassward and has caused innumerable headaches, but it is what it is and
> we are unable to divert some of our compute to other providers. This is, no
> doubt, one reason why heads will be rolling.
>
> -Mathew
>
> On Tue, Oct 29, 2013 at 2:18 PM, Tracy Reed <[email protected]> wrote:
>
>> ...When the machine comes back up sometimes it has fsck errors. These are
>> usually resolved with fsck -y although I hesitate to make that happen
>> automatically.
>>
>> And then there are always the mysql tables in need of repair, services
>> which did not come back up automatically, etc.
>>
>> Finding a non-manual way to handle this when you have thousands of VMs is
>> a very hard problem.
>>
>
> It sounds like you already have some kind of mechanism to hit all 1400
> servers and run a script (perhaps parallel ssh with pre-established keys).
> You may want to do some more data collection prior to going to the drastic
> route of rebooting them "blind."  For example, you don't want to discover
> after the fact that some of these systems are using DHCP-assigned network
> addresses, DNS server settings, etc.  If you have the scripting capability
> that it sounds like you do, it would be worth spending a little time
> capturing /etc/sysconfig/{network,network-scripts/*} (and a few other
> things like /etc/fstab and a current 'df' output) before trying this.
> Perhaps even a 'ps' list from all of them so you can grep them all for
> database processes that will require more hands-on recovery later.
>
> Many of the systems will come up fine; some may stop on boot awaiting
> manual intervention.  Some may decide that their disk device is not
> present, or perhaps even that their network interfaces are not the ones you
> thought they were.  So there is a lot of risk to this, and you are going to
> need to keep score somehow so you know which ones to come back to later
> (and which ones you can live without for a while.)  Unless you have huge
> swaths of identical machines within that pool of 1400 VMs, you're probably
> going to want to manually observe the first several dozen before
> considering trusting any kind of automated mass reboot-and-fsck approach
> (e.g. 'shutdown -rF now').  And you're of course going to want to
> prioritize VMs housing common services before the rest of the population
> (e.g. ntp, dns, ldap, common NFS shares, databases, etc.)
>
> You certainly have my sympathies; several of us have been through similar
> war stories.  Don't rush into this; you can do a whole lot more damage
> trying to automate your recovery procedure and missing something.
>
> Good luck,
>
> - Dave
>
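
As a rough sketch of the "keep score" and prioritization ideas in the
quoted message, the collected 'ps' listings could be scanned to flag
database hosts and to put shared-service hosts at the front of the queue.
Everything specific here is an assumption for illustration: the process
name lists, the two-tier scheme, and the function names are hypothetical,
and plain substring matching is only a crude stand-in for grep.

```python
# Assumed process names; adjust to whatever actually runs in the environment.
INFRA_HINTS = ("named", "ntpd", "slapd", "sendmail", "postfix")
DB_HINTS = ("mysqld", "postgres", "oracle")


def classify(ps_output):
    """Return (tier, has_database) for one host's captured 'ps' listing.

    Tier 0 = host appears to run a shared service, tier 1 = everything else.
    """
    text = ps_output.lower()
    has_db = any(name in text for name in DB_HINTS)
    tier = 0 if any(name in text for name in INFRA_HINTS) else 1
    return tier, has_db


def remediation_order(ps_by_host):
    """Order hosts by tier, then hostname, keeping the database flag."""
    rows = []
    for host, ps_output in ps_by_host.items():
        tier, has_db = classify(ps_output)
        rows.append((tier, host, has_db))
    rows.sort()
    return [(host, tier, has_db) for tier, host, has_db in rows]
```

The resulting list can double as the score sheet: work through it top to
bottom, and treat any host with the database flag set as one that will
probably need hands-on checks after it comes back.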

_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
