Tomas Bodzar wrote:
> You just think that it's running perfectly under Linux ;-) See eg. this post
> http://marc.info/?l=openbsd-misc&m=125783114503531&w=2

I've been waiting for an excuse to update that story... :)

First of all, I want you to note that was posted in November.  It is
now March, almost four months later, and it had been going on for
quite some time back in November.

Recap:
Bad firmware -> locking system.
New firmware -> rebooting system.
Newer firmware -> still reboots, now trashes file systems.
Newer firmware -> still reboots, trashes file systems less often.
At the time of that posting: new firmware with diagnostic code in it
to capture critical info so Adaptec can figure out why their cards are
crashing my system.

So, for a couple months, things were going pretty well.  We got a few
crashes out of the system and data to the vendor to pass up to
Adaptec, but no really big events.  Then one weekend, one of the
machines falls over and can't get back up.  I figure "surprise", VPN
into work, remove it from the cluster, and I'll worry about it Monday.


Ok, now look at this from Adaptec's perspective...  You have pissed
off your customer and your customer's customer. You can't find the
problem, so you have asked them to run special diagnostic firmware to
have them help you do your job.  What can you possibly do to further
impress them with your incompetence now?


So Monday, I go into work, cable up the machine and...it's hung in the
RAID controller boot (not the system boot, but since HW manufacturers
think it is so f*ing cool that OSs boot, of course they want their
RAID controller to have a well advertised boot process too).  And it
hangs.  Not even trying to read an OS off the disks, just hung.  Power
off, back on, still hangs.  Reseat card, still hangs.

I call our vendor, tell 'em the symptoms, they agree that it is the
RAID controller that failed.  I start thinking, well, maybe I was a
little hard on Adaptec, publicly bashing them like this and in
reality, maybe I just had a defective RAID card all along.  It might
explain why a large majority (though certainly not all!) of the
crashes happened on this one machine...and now the card is totally
dead.  Hm.  Maybe just bad hardware.  I'm starting to consider how
I'll word my semi-retraction.

Then the phone rings, it's my regular contact at the system vendor.
He's telling me there's something really strange going on, as these
cards are popping all over the country, all at sites that have been
running the diagnostic firmware.  They can't believe the conclusion,
but it seems like there's a time bomb in the diagnostic firmware.
They have a call in to Adaptec, but the guy responsible for the
diagnostic firmware is on vacation, and it takes 'em a while to track
the guy down, "but it is possible".  Sure enough, a couple hours
later, I get a call back that confirms the firmware is actively
killing our cards, and thank goodness that I upgraded them over a
period of days and not all in a short period of time, and I do an
emergency reversion of all the other systems.

How do you top your past levels of incompetence now?  Thank your
victims..er..customers who are helping you debug your product by
time-bombing the device so that sixty days after install, your adapter
breaks.  Can you top that?  Yeah.  Don't tell anyone about the time
bomb -- don't tell the VAR, or the end user, "if you help us debug our
crappy product, don't let it run this way for 60 days, or your
computer will start doing space heater imitations".

(One could argue that they topped that one step further by actually
locking the boot process so one could not even boot up the firmware
update disk and downgrade the firmware to something that sucks less,
but I am willing to pass that off as a bug, not deliberate).


Think about this a bit.  These people DELIBERATELY put a feature in
their firmware to STOP me (and a lot of other people) from using this
card.  Legit user, but they felt that I was entitled to help them
debug their shit for no more than sixty days.  They worked hard at
putting this feature in.  This isn't a piece of software that has
access to the resources of a computer, like real-time clocks and
writable disks.  This is a fucking RAID controller, which they managed
to build a persistent time bomb into so that after 60 days of
operation, it destroyed itself!! (and again, note: it didn't just
crash and need to be power cycled, it DAMAGED THE CARD).  This took
some effort -- I can't think of any other reason to have a RTC in a
RAID card.  I also somehow doubt that the coder who did this sat down
and wrote the time bomb AFTER he was charged with coming up with the
diagnostic firmware.  No, I rather suspect he grabbed some
off-the-shelf code, something they put routinely into their diagnostic
and troubleshooting systems, but wasn't intended to get out into the
general public.  They obviously care more about things OTHER than your
system integrity and reliability.  This coder made an error in
judgment, but they obviously had the tools lying around for some reason.


Now, tell me again how horrible it is that OpenBSD doesn't let you
trust your data (and OpenBSD's reputation) to these incompetent assholes?

(and compare this to the VMware August Surprise...another company who
was more afraid of you running their software against their will than
about time bombs escaping into the wild.  I'm STUNNED that people
still consider VMware a business-grade product rather than a cute
development toy after that event exposed their thought process so
publicly)


Current status: system is running on non-diagnostic firmware.  Adaptec
and our mail system vendor can produce this problem in the lab
with a couple days' work, so they are closer to a good solution(?), but
not fixed yet.  Vendor has come out with a new version of their mail
system software which handles corrupted file systems much better than
the old versions did (and works on a new line of hardware which uses
a different RAID vendor).  But we are still about six months into this
problem and it still exists.

It does bring one part of the OpenBSD stance on aac(4) into question,
though.  The real reason they aren't giving us the errata for these
products may not be that they don't wish to or are embarrassed by how
bad it is, but that they don't even understand the problems in the
product themselves.  Doesn't change the conclusion: Adaptec products
can't be trusted (though might be suitable for VMware servers).

Nick.
