David:
This goes back to the late '70s. We had a 168 MP that developed intermittent S0C4s after installation. It seemed related to the load on the system. The CE diagnostics couldn't find a thing.
One of our sysprogs wrote a program that reliably reproduced the S0C4 (it did a LOAD and DELETE of IEF21WSD repeatedly), so IBM considered it reasonable proof, and the skies darkened with IBM types, as they say. After a day they found that one of the tri-leads was too long (sorry, I don't remember how much too long it was), so the signal to the high-speed buffer wasn't right.
Damnedest hardware bug I ever saw.
Ed
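[Editorially, the stress technique Ed describes can be sketched in modern terms. This is an analogue only: the original was an MVS assembler program issuing LOAD and DELETE macros against IEF21WSD in a tight loop; the Python below repeatedly imports and discards a module, with "json" as a stand-in target, purely to illustrate the load/unload hammering idea.]

```python
# Hedged analogue of the sysprog's test: the original issued MVS LOAD
# and DELETE macros against IEF21WSD in a loop; here we repeatedly
# import and discard a Python module instead. "json" is a placeholder
# target, not anything from the original story.
import importlib
import sys

def load_delete_stress(module_name: str, iterations: int) -> int:
    """Import and discard a module repeatedly; return the loop count."""
    for _ in range(iterations):
        mod = importlib.import_module(module_name)  # analogue of LOAD
        del sys.modules[module_name]                # analogue of DELETE
        del mod                                     # drop our reference
    return iterations

print(load_delete_stress("json", 1000))
```

On real hardware, a loop like this exercises the same fetch/release path over and over, which is exactly why an intermittent, load-dependent fault shows up under it when ordinary diagnostics pass.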
On Jan 7, 2014, at 7:59 PM, David Crayford wrote:
On 8/01/2014 12:05 AM, Scott Ford wrote:
I agree with Joel. PC-based platforms in my experience have been very error-prone on the hardware side, maybe due to the components. Like Joel, I haven't seen a hardware failure in the z/OS world since the '70s.
I've seen quite a few hardware failures on mainframes; they happen quite frequently. They almost never cause an outage because there is redundancy. Most of the time we didn't even know we had a failure until IBM contacted us to let us know they had dispatched an engineer. Almost all enterprise systems are the same, even x86: they have n+1 redundancy for hardware components and clustering for HA. Your friendly IBM salesman will be only too happy to talk to you about an x86 high-availability hardware/software platform.
Of course, the data center behemoths like Google, Facebook, Amazon et al. choose to buy the cheapest bare-metal commodity components, with redundancy done by the software. At that scale it's the only model that makes economic sense.
Scott Ford
www.identityforge.com
from my iPad
On Jan 7, 2014, at 9:59 AM, David Crayford <[email protected]>
wrote:
On 07/01/2014, at 6:57 AM, "Joel C. Ewing" <[email protected]> wrote:
The first step to successfully diagnosing and repairing a software failure is to be certain it IS a software issue and not some random hardware glitch. This is made more difficult in the Intel world by the very thing that makes these platforms affordable: a multitude of manufacturers of motherboards, memory, hardware interface cards, and peripherals, all applying their own concept of "acceptable" engineering design while trying to make fast and cheap hardware.
Is that still the case today? Even cheap x86 blades have a machine-check architecture that can signal software about hardware failures. It must be a decade or more since IBM started putting mainframe-quality RAM modules into x86 servers, Chipkill etc., when around 90% of server failures were due to RAM errors. You don't have to search too far to find 99.999% platforms running Intel. You'll pay for it, though.
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN