Sorry to hear of your troubles ... I trust it doesn't fall squarely (politically) on you and your co-workers' shoulders.

May I enquire as to the nature of the filesystems on these VMs?
It surprises me that a sudden inability to write to the block device beneath is causing such hassle at the FS layer; ext3 and upward (as is standard under RH) have a pretty robust journal.

Maybe I've just been extremely lucky :)
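
For what it's worth, one quick sanity check is what each filesystem is configured to do when the block device disappears (the ext "errors=" behaviour); with remount-ro, a yanked LUN should normally just flip the mount read-only and usually come back with a journal replay on the next mount. A minimal sketch in Python, assuming tune2fs is available and this runs as root (the "Errors behavior" label may differ between e2fsprogs versions):

    import subprocess

    # Rough sketch: list ext3/ext4 mounts and report the configured
    # "errors=" behaviour for each, plus the live mount options.
    def ext_mounts():
        with open("/proc/mounts") as f:
            for line in f:
                dev, mountpoint, fstype, opts = line.split()[:4]
                if fstype in ("ext3", "ext4"):
                    yield dev, mountpoint, opts

    for dev, mountpoint, opts in ext_mounts():
        out = subprocess.check_output(["tune2fs", "-l", dev],
                                      universal_newlines=True)
        behaviour = "unknown"
        for l in out.splitlines():
            if l.startswith("Errors behavior"):
                behaviour = l.split(":", 1)[1].strip()
        print("{0:20} {1:25} default errors={2} (mounted with: {3})".format(
            mountpoint, dev, behaviour, opts))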



On 29.10.2013 18:32, Mathew Snyder wrote:
We recently went through a very difficult situation (both technically
and politically) as a result of poor infrastructure design and
implementation by our data services provider. On Saturday, a network
issue caused our entire environment to be offline and we are still
dealing with straggler issues while our customers verify their
applications are online and functioning correctly. All told, we were
offline for well over a day and a half.

Not only should this never have happened, but it was the third or fourth
time this year. After the last outage the provider was given a strict
mandate from our contracting agency to ensure it never happened again.


The impact, while obvious on the surface, goes even deeper. We have
over 1400 servers, the vast majority of which are Red Hat. All but about
50 of them are VMs. The VMs are stored and run from backend storage.
The backend storage is connected to the compute nodes via the
aforementioned, poorly designed and implemented infrastructure. When
the network goes out, the compute nodes lose their storage and the servers
are left in a very precarious state.

We end up having to run reports against all of our VMs to determine
which have been affected and left in a read-only state. This is simple
enough. My colleague has written a script which is executed remotely
against all of our systems and provides this feedback.
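
Conceptually the check is simple; stripped down, it amounts to something like this (a simplified Python sketch of the idea, not the actual script), run on each host and collected centrally:

    import socket
    import sys

    # Flag any normally-writable filesystem that is currently mounted "ro".
    SKIP_TYPES = ("proc", "sysfs", "devpts", "tmpfs", "iso9660", "squashfs")

    def readonly_mounts():
        hits = []
        with open("/proc/mounts") as f:
            for line in f:
                dev, mountpoint, fstype, opts = line.split()[:4]
                if fstype in SKIP_TYPES:
                    continue
                if "ro" in opts.split(","):
                    hits.append((dev, mountpoint, fstype))
        return hits

    if __name__ == "__main__":
        hits = readonly_mounts()
        for dev, mountpoint, fstype in hits:
            print("{0}: {1} ({2}) mounted read-only on {3}".format(
                socket.gethostname(), dev, fstype, mountpoint))
        # Non-zero exit so a remote runner can tally affected hosts.
        sys.exit(1 if hits else 0)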

We are then left with the worst part of it: we must log in to each
system via the console, reboot into our provisioning network, and load
up the rescue environment to perform manual filesystem checks and
repairs. Doing this for 1400 servers is, needless to say, a chore when
there isn't a more robust solution in place.
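
(For illustration, the per-server step amounts to roughly the following, except that today every bit of it is done by hand from the console. This is only a sketch: the fstab path is hypothetical, just wherever the rescue image exposes the target system, and it must only ever run against unmounted filesystems.)

    import subprocess

    FSTAB = "/mnt/sysimage/etc/fstab"   # hypothetical: where the rescue env mounts the target root

    def ext_devices(path=FSTAB):
        devs = []
        with open(path) as f:
            for line in f:
                fields = line.split()
                if line.lstrip().startswith("#") or len(fields) < 3:
                    continue
                dev, fstype = fields[0], fields[2]
                # LABEL=/UUID= entries would need resolving (e.g. via blkid)
                # first; keep the sketch to plain device paths.
                if fstype in ("ext3", "ext4") and dev.startswith("/dev/"):
                    devs.append(dev)
        return devs

    for dev in ext_devices():
        print("checking " + dev)
        # -f forces a check even if the fs is marked clean,
        # -y answers yes to every repair prompt (standard e2fsck flags).
        subprocess.call(["e2fsck", "-f", "-y", dev])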

What makes this worse is the fact that we don't have access to
vCenter, vSphere, or any of the infrastructure/storage/etc.

I'm at a loss as to how to make recovery from such an outage more
expeditious, so I'm hoping someone here can provide some guidance.

Has anyone else dealt with a similar situation, or does anyone have
insight into steps we can take and tools we can implement to make our
lives easier?

-Mathew

"When you do things right, people wont be sure youve done anything at
all." - God; Futurama

"Well get along much better once you accept that youre wrong and
neither am I." - Me
_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
