Sorry to hear of your troubles ... I trust it doesn't fall squarely (politically) on you and your co-workers' shoulders.

May I enquire as to the nature of the filesystems on these VMs?
It surprises me that a sudden inability to write to the block device beneath is causing such hassle at the FS layer; ext3 and upward (as is standard under RH) have a pretty robust journal.

Maybe I've just been extremely lucky :)
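
For what it's worth, one quick sanity check is what each filesystem is configured to do when the block device disappears (the ext "errors=" behaviour); with remount-ro, a yanked LUN should normally just flip the mount read-only and usually come back with a journal replay on the next mount. A minimal sketch in Python, assuming tune2fs is available and this runs as root (the "Errors behavior" label may differ between e2fsprogs versions):

    import subprocess

    # Rough sketch: list ext3/ext4 mounts and report the configured
    # "errors=" behaviour for each, plus the live mount options.
    def ext_mounts():
        with open("/proc/mounts") as f:
            for line in f:
                dev, mountpoint, fstype, opts = line.split()[:4]
                if fstype in ("ext3", "ext4"):
                    yield dev, mountpoint, opts

    for dev, mountpoint, opts in ext_mounts():
        out = subprocess.check_output(["tune2fs", "-l", dev],
                                      universal_newlines=True)
        behaviour = "unknown"
        for l in out.splitlines():
            if l.startswith("Errors behavior"):
                behaviour = l.split(":", 1)[1].strip()
        print("{0:20} {1:25} default errors={2} (mounted with: {3})".format(
            mountpoint, dev, behaviour, opts))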



On 29.10.2013 18:32, Mathew Snyder wrote:
We recently went through a very difficult situation (both technically
and politically) as a result of poor infrastructure design and
implementation by our data services provider. On Saturday, a network
issue caused our entire environment to be offline and we are still
dealing with straggler issues while our customers verify their
applications are online and functioning correctly. All told, we were
offline for well over a day and a half.

Not only should this never have happened, but it was the third or fourth
time this year. After the last outage the provider was given a strict
mandate from our contracting agency to ensure it never happened again.


The impact, while obvious on the surface, goes even deeper. We have
over 1400 servers, the vast majority of which are Red Hat. All but about
50 of them are VMs. The VMs are stored and run from backend storage.
The backend storage is connected to the compute nodes via the
aforementioned, poorly designed and implemented infrastructure. When
the network goes out, the compute nodes lose their storage and the servers
are left in a very precarious state.

We end up having to run reports against all of our VMs to determine
which have been affected and left in a read-only state. This is simple
enough. My colleague has written a script which is executed remotely
against all of our systems and provides this feedback.
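
Conceptually the check is simple; stripped down, it amounts to something like this (a simplified Python sketch of the idea, not the actual script), run on each host and collected centrally:

    import socket
    import sys

    # Flag any normally-writable filesystem that is currently mounted "ro".
    SKIP_TYPES = ("proc", "sysfs", "devpts", "tmpfs", "iso9660", "squashfs")

    def readonly_mounts():
        hits = []
        with open("/proc/mounts") as f:
            for line in f:
                dev, mountpoint, fstype, opts = line.split()[:4]
                if fstype in SKIP_TYPES:
                    continue
                if "ro" in opts.split(","):
                    hits.append((dev, mountpoint, fstype))
        return hits

    if __name__ == "__main__":
        hits = readonly_mounts()
        for dev, mountpoint, fstype in hits:
            print("{0}: {1} ({2}) mounted read-only on {3}".format(
                socket.gethostname(), dev, fstype, mountpoint))
        # Non-zero exit so a remote runner can tally affected hosts.
        sys.exit(1 if hits else 0)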

We are then left with the worst part of it: we must log in to each
system via the console, reboot into our provisioning network, and load
up the rescue environment to perform manual filesystem checks and
repairs. Doing this for 1400 servers is, needless to say, a chore when
there isn't a more robust solution in place.
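
(For illustration, the per-server step amounts to roughly the following, except that today every bit of it is done by hand from the console. This is only a sketch: the fstab path is hypothetical, just wherever the rescue image exposes the target system, and it must only ever run against unmounted filesystems.)

    import subprocess

    FSTAB = "/mnt/sysimage/etc/fstab"   # hypothetical: where the rescue env mounts the target root

    def ext_devices(path=FSTAB):
        devs = []
        with open(path) as f:
            for line in f:
                fields = line.split()
                if line.lstrip().startswith("#") or len(fields) < 3:
                    continue
                dev, fstype = fields[0], fields[2]
                # LABEL=/UUID= entries would need resolving (e.g. via blkid)
                # first; keep the sketch to plain device paths.
                if fstype in ("ext3", "ext4") and dev.startswith("/dev/"):
                    devs.append(dev)
        return devs

    for dev in ext_devices():
        print("checking " + dev)
        # -f forces a check even if the fs is marked clean,
        # -y answers yes to every repair prompt (standard e2fsck flags).
        subprocess.call(["e2fsck", "-f", "-y", dev])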

What makes this worse is the fact that we don't have access to
vCenter, vSphere, or any of the infrastructure/storage/etc.

I'm at a loss as to how to make recovery from such an outage more
expeditious, so I'm hoping someone here can provide some guidance.

Has anyone else dealt with a similar situation, or does anyone have
insight into steps we can take and tools we can implement to make our
lives easier?

-Mathew

"When you do things right, people wont be sure youve done anything at
all." - God; Futurama

"Well get along much better once you accept that youre wrong and
neither am I." - Me
_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
