Our first UPS, many years ago, could power all the equipment in the
computer room and the building chilled-water pumps for only 15 minutes,
with no generator backup. This was still a big improvement over no UPS,
because 99% of our utility glitches at the time lasted at most a few
seconds, and if an outage ever ran more than 2 minutes you could be
pretty sure it would outlast the UPS battery. We had documented
procedures and automation in place with NETVIEW and the freebie
NETINIT/NETSTOP CBT tool for mainframe system shut down in under 10
minutes, and if the power still wasn't back by then, to proceed to power
down the mainframe and various other equipment. There wasn't any
documented procedure for orderly non-mainframe server shut down other
than to locate every server still making noise or showing lights, find
its power button, and press.
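None of our tooling would apply outside our shop, but the escalation logic above - run the scripted shutdown if battery runtime allows, otherwise go straight to powering down - can be sketched generically. This is a hypothetical illustration, not the actual NETVIEW/NETINIT automation; the function and action names are made up:

```python
def shutdown_plan(battery_minutes, scripted_minutes=10):
    """Choose a shutdown strategy from remaining UPS battery runtime.

    Hypothetical sketch of the escalation described above: if the
    battery can cover the scripted software shutdown (about 10 minutes
    in our case), run it and then power down; otherwise skip straight
    to stopping the processor and cutting power.
    """
    if battery_minutes > scripted_minutes:
        return ["run scripted system shutdown",
                "power down mainframe and remaining equipment"]
    return ["stop processor",
            "power down immediately"]
```

In practice the hard part is not the branch but keeping the scripted path's time estimate honest, since battery runtime degrades as cells age.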
The same basic procedures were used in the event of loss of cooling to
the computer room, although typically there was more time to react as
long as the chilled water was still flowing. But with the old water
cooled beasts, if the building chilled water pumps also failed, there
was at best a minute or two - insufficient time to attempt orderly
system shut down. In those cases, just stopping the processor and then
powering it down was always deemed preferable to testing whether the
processor's thermal cut-off would prevent damage.
Those procedures and the system shut-down automation were kept in place
even after we eventually got an emergency generator capable of
supplying the entire building, because there was still the possibility
of an environmental-system failure, or a failure in bringing the
generator on line. The same automation that was originally created for emergency
shut down is used regularly for shut-downs before scheduled IPLs. We do
not test or train for actual hardware power down, so this now only gets
tested on very rare occasions when disruptive maintenance must be done
on the building power or environmental systems.
Power-up procedures are also documented, but again seldom tested.
Someone from Tech Services is always on call and would be involved if a
power down or power up were ever required. Tech Services is also
involved in the physical planning and installation of all hardware in
the computer room, and so is in a good position to know how to handle
anything too recent to have made it into the formal power-down and
power-up documentation.
Joel C Ewing
On 12/03/2010 11:45 AM, Darth Keller wrote:
An interesting question came up this morning - all your multiple power
sources have just failed. Your generator(s) started but, for whatever
reason, have also failed. You're now on battery power and have 23 minutes
to power everything off as gracefully as possible. Do you have procedures
in place to do it?
I don't even want to think about all the open-systems stuff, my head would
explode. But I don't think this is a trivial exercise even from the
mainframe side. I'm thinking you almost have to think about this in the
same way you would approach planning a DR event. Maybe you have a couple
of scenarios -
1. I know I'm going to lose power in 3 hours.
2. I know I've only got 23 minutes & the clock is already running.
Do you get as much of the software shut off as possible & just let the
hardware take care of itself?
Do other companies have plans in place for this? Is it reviewed with some
frequency? Pretty hard to test unless you have your own DR site, but do
you at least do periodic walk-throughs as an exercise? Do you have or
need a procedure on how to restart after an EPO of any duration?
Have I found a new career path or should I just ask for my medications to
be adjusted?
...
--
Joel C. Ewing, Fort Smith, AR [email protected]