Our first UPS, many years ago, could power all the equipment in the computer room and the building chilled water pumps for only 15 minutes, and it had no generator backup. That was still a big improvement over no UPS, because 99% of our utility glitches at the time lasted at most a few seconds, and if an outage ever ran more than 2 minutes you could be pretty sure it would outlast the UPS batteries. We had documented procedures and automation in place with NETVIEW and the freebie NETINIT/NETSTOP CBT tool to shut the mainframe systems down in under 10 minutes and, if the power still wasn't back by then, to proceed to power down the mainframe and various other equipment. There wasn't any documented procedure for orderly non-mainframe server shut down other than to locate all servers still making noise or light, locate the power button, and press.
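That last gap could be closed with surprisingly little scripting if the UPS can signal its state to the servers. A minimal sketch, assuming a monitoring daemon (NUT, apcupsd, or similar) that invokes a handler script on power events; the host names, grace period, and ssh-based shutdown are all illustrative assumptions, not a description of our actual setup:

```shell
#!/bin/sh
# Sketch of a UPS-event handler.  The monitoring daemon would call this
# script with "ONBATT" when utility power fails and "ONLINE" when it
# returns.  Host names, the ssh shutdown command, and the grace period
# are illustrative assumptions.

GRACE="${GRACE:-300}"                 # seconds on battery before acting
SERVERS="${SERVERS:-app1 app2 db1}"   # hypothetical server host names
SHUTDOWN_CMD="${SHUTDOWN_CMD:-ssh}"   # overridable for a dry run
FLAG="${FLAG:-/var/run/onbatt.flag}"

on_batt() {
    touch "$FLAG"
    sleep "$GRACE"
    # If power returned during the grace period, on_line removed the flag.
    [ -f "$FLAG" ] || return 0
    for h in $SERVERS; do
        "$SHUTDOWN_CMD" "root@$h" shutdown -h now
    done
}

on_line() {
    rm -f "$FLAG"                     # power is back; cancel the shutdown
}

case "${1:-}" in
    ONBATT) on_batt ;;
    ONLINE) on_line ;;
esac
```

Setting SHUTDOWN_CMD=echo and GRACE=0 gives a harmless dry run that just lists which hosts would be told to shut down.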

The same basic procedures were used in the event of loss of cooling to the computer room, although typically there was more time to react as long as the chilled water was still flowing. But with the old water-cooled beasts, if the building chilled water pumps also failed, there was at best a minute or two - insufficient time to attempt an orderly system shut down. In those cases just stopping the processor and then powering down was always deemed preferable to testing whether the processor's thermal cutoff would prevent damage.

Those procedures and the system shut down automation were kept in place even after we eventually got an emergency generator capable of supplying the entire building, because there is still the possibility of an environmental system failure or a failure in bringing the generator on line. The same automation that was originally created for emergency shut down is used regularly for shut down before scheduled IPLs. We do not test or train for actual hardware power down, so that now only gets tested on the very rare occasions when disruptive maintenance must be done on the building power or environmental systems.

Power-up procedures are also documented, but again seldom tested.

Someone from Tech Services is always on call and would be involved if a power down or power up were ever required. Tech Services is also involved in the physical planning and installation of all hardware in the computer room, and so is in a good position to know how to handle anything too recent to have made it into the formal power-down, power-up documentation.
   Joel C Ewing

On 12/03/2010 11:45 AM, Darth Keller wrote:
An interesting question came up this morning -  all your multiple power
sources have just failed.  Your generator(s) started but, for whatever
reason, have also failed.  You're now on battery power and have 23 minutes
to power everything off as gracefully as possible.  Do you have procedures
in place to do it?

I don't even want to think about all the open-systems stuff, my head would
explode.  But I don't think this is a trivial exercise even from the
mainframe side.   I'm thinking you almost have to think about this in the
same way you would approach planning a DR event.  Maybe you have a couple
of scenarios -

1.  I know I'm going to lose power in 3 hours.

2. I know I've only got 23 minutes & the clock is already running.

Do you get as much of the software shut off as possible & just let the
hardware take care of itself?

Do other companies have plans in place for this?  Is it reviewed with some
frequency?  Pretty hard to test unless you have your own DR site, but do
you at least do periodic walk-throughs as an exercise?  Do you have or
need a procedure on how to restart after an EPO of any duration?


Have I found a new career path or should I just ask for my medications to
be adjusted?


...

--
Joel C. Ewing, Fort Smith, AR        [email protected]

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html
