David:

This goes back to the late '70s. We had a 168 MP that after installation got intermittent S0C4s. It seemed related to system load.
The CE diagnostics couldn't find a thing.
One of our sysprogs wrote a reasonably S0C4-proof program (it did a LOAD and DELETE of IEF21WSD repeatedly), so IBM considered it reasonable proof, and the skies darkened, as they say, with IBM types. After a day they found that one of the tri-leads was too long (sorry, I don't remember how much too long), but the signal to the high-speed buffer wasn't right.
Damnedest hardware bug I ever saw.

Ed

On Jan 7, 2014, at 7:59 PM, David Crayford wrote:

On 8/01/2014 12:05 AM, Scott Ford wrote:
I agree with Joel. PC-based platforms in my experience have been very hardware-error-prone, maybe due to the components. Like Joel, I haven't seen a hardware failure in the z/OS world since the '70s.

I've seen quite a few hardware failures on mainframes, they happen quite frequently. They almost never cause an outage because there is redundancy. Most of the time we didn't even know we had a failure until IBM contacted us to let us know they had dispatched an engineer. Almost all enterprise systems are the same, even x86. They have n+1 redundancy for hardware components and clustering for HA. Your friendly IBM salesman will be only too happy to talk to you about an x86 high availability hardware/software platform.

Of course, the data center behemoths like Google, Facebook, Amazon et al. choose to buy the cheapest bare-metal commodity components, with redundancy done by the software. At that scale it's the only model that makes economic sense.


Scott Ford
www.identityforge.com
from my iPad




On Jan 7, 2014, at 9:59 AM, David Crayford <[email protected]> wrote:

On 07/01/2014, at 6:57 AM, "Joel C. Ewing" <[email protected]> wrote:

The first step to successful diagnosis and repair of a software failure is to be certain it IS a software issue and not some random hardware
glitch.  This is made more difficult in the Intel world by the very
thing that makes these platforms affordable: a multitude of
manufacturers of motherboards, memory, hardware interface cards and
peripherals, all applying their own concept of "acceptable" engineering
design while trying to make fast and cheap hardware.
Is that still the case today? Even cheap x86 blades have a machine check architecture which can signal software on hardware failures. It must be over a decade or so since IBM started stuffing mainframe-quality RAM modules into x86 servers (chipkill, etc.); 90% of server failures were due to RAM errors. You don't have to search too far to find 99.999% platforms running Intel. You'll pay for it, though.

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN


