Re: Dummy query on processor failover

Timothy Sipples Tue, 18 Dec 2018 00:44:57 -0800

Radoslaw Skorupka:
>Let's say a CPU returns false results like 2x2=5. How to recognize
>the result is false?

The IBM Z (and LinuxONE) system handles all that for you, without
operating system involvement. Nowadays, thanks to the wonders of
microelectronic miniaturization, it does so through intensive, thorough
integrity checking at all critical instruction execution steps, baked deep
into every processor, with tons of "transistor budget" spent on integrity
checking and other RAS (reliability, availability, serviceability)
characteristics. The design philosophy is to push error handling as far
down in the "stack" as possible, and that's what actually happens.
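
To make the "2x2=5" question concrete, here is a minimal sketch of one
well-known arithmetic integrity check, residue checking. It is an
illustration only, not a description of how any particular IBM Z processor
implements its checking; the function names and the choice of modulus 3 are
assumptions made for this example.

```python
def residue(n, modulus=3):
    """Return the small check value carried alongside a number."""
    return n % modulus


def product_passes_residue_check(a, b, claimed_product, modulus=3):
    """Check a claimed product against the residues of its operands.

    For any correct product, residue(a * b) equals
    (residue(a) * residue(b)) % modulus, so a mismatch proves an error.
    """
    expected = (residue(a, modulus) * residue(b, modulus)) % modulus
    return residue(claimed_product, modulus) == expected


print(product_passes_residue_check(2, 2, 5))  # False: the bad result is caught
print(product_passes_residue_check(2, 2, 4))  # True: the correct result passes
print(product_passes_residue_check(2, 2, 7))  # True: an error of exactly 3 escapes
```

The last line shows why a single check is never the whole story: a mod-3
residue cannot see an error that is a multiple of 3, so a design that relies
on this kind of check would combine it with others.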

Yes, z/OS has an amazing amount of wonderful error handling and recovery
logic, but the design philosophy (and reality) is "never" to invoke it.
Issues such as exceedingly rare core failures are handled without z/OS
having to do anything, or even necessarily being aware that anything
happened. It's a defense-in-depth strategy: multiple very long-tail risks
must occur simultaneously before any error surfaces to the OS for
handling.

Moreover, the system doesn't even necessarily bother notifying you that
something happened that was automatically handled with aplomb. If a stray
cosmic ray flipped a bit, triggered an integrity violation, caused an
instruction retry, and then everything continued normally for an eternity
(in processor terms) without the operating system having to do anything,
should alarm bells ring so that you can spend (waste) your time chasing
that ghost (cosmic ray)? Probably not. So there are certain categories of
anomalous, infrequent, handled, and inconsequential events that don't even
raise any system eyebrows, as it were. I don't know exactly what they are;
it probably varies by model, and IBM might not even tell you. And there's
tremendous design sense in that approach, too, because invoking some sort
of notification logic for inconsequential events could, all by itself,
cause consequential errors. There's a lot of care and long-term field
experience that goes into making these design decisions, as I understand
it. The basic analogy here is that you shouldn't yell "Fire!" in a crowded
theater (or even an uncrowded one) unless there really could be a fire,
because the very act of yelling "Fire!" could cause more harm than good.
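
The "retry quietly, escalate only if it persists" policy described above
can be sketched in a few lines. This is a software analogy under stated
assumptions, not IBM's implementation: `RecoveryStats`, `run_with_retry`,
and the retry limit are hypothetical names and values chosen for
illustration.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStats:
    silent_recoveries: int = 0  # transient faults cleared by a retry, never reported
    escalations: int = 0        # faults that survived every retry


def run_with_retry(operation, is_valid, stats, max_retries=2):
    """Run operation() and retry while is_valid(result) is False.

    A fault that clears on retry only bumps a counter. Only a fault that
    persists through every retry is raised to the caller.
    """
    for attempt in range(max_retries + 1):
        result = operation()
        if is_valid(result):
            if attempt > 0:
                stats.silent_recoveries += 1
            return result
    stats.escalations += 1
    raise RuntimeError("fault persisted after retries")


# Simulated "cosmic ray": the first attempt is corrupted, the second is clean.
attempts = iter([5, 4])
stats = RecoveryStats()
value = run_with_retry(lambda: next(attempts), lambda r: r == 4, stats)
print(value, stats.silent_recoveries, stats.escalations)
```

The caller sees only the correct value. The one-off fault shows up as a
counter increment and nothing else, which is the software equivalent of not
yelling "Fire!".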

The only currently marketed (as I write this) IBM Z or LinuxONE machine
models that can be (but certainly don't have to be) ordered and configured
without spare main processor cores are the IBM z13s (2965-N10 only) and the
IBM LinuxONE Rockhopper (2965-L10 only). However, every uncharacterized
core is a spare. You can order a single machine with 169 spare cores if you
wish. To do that you'd order an IBM z14 or IBM LinuxONE Emperor II with one
characterized core and 169 physically present but uncharacterized cores.
That's probably not a configuration you should order, but you can if you
insist.

--------------------------------------------------------------------------------------------------------
Timothy Sipples
IT Architect Executive, Industry Solutions, IBM Z & LinuxONE
E-Mail: sipp...@sg.ibm.com

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
