Tomas Bodzar wrote:
> You just think that it's running perfectly under Linux ;-) See eg. this post
> http://marc.info/?l=openbsd-misc&m=125783114503531&w=2
I've been waiting for an excuse to update that story... :)

First of all, I want you to note that this was posted in November. It is now March, almost four months later, and it had been going on for quite some time back in November.

Recap:
Bad firmware -> locking system.
New firmware -> rebooting system.
Newer firmware -> still reboots, now trashes file systems.
Newer firmware -> still reboots, trashes file systems less often.
At the time of that posting: new firmware with diagnostic code in it to capture critical info so Adaptec can figure out why their cards are crashing my system.

So, for a couple months, things were going pretty well. We got a few crashes out of the system and data to the vendor to pass up to Adaptec, but no really big events.

Then one weekend, one of the machines falls over and can't get back up. I figure "surprise", VPN into work, remove it from the cluster, and I'll worry about it Monday.

Ok, now look at this from Adaptec's perspective... You have pissed off your customer and your customer's customer. You can't find the problem, so you have asked them to run special diagnostic firmware to have them help you do your job. What can you possibly do to further impress them with your incompetence now?

So Monday, I go into work, cable up the machine and...it's hung in the RAID controller boot (not the system boot, but since HW manufacturers think it is so f*ing cool that OSs boot, of course they want their RAID controller to have a well advertised boot process too). And it hangs. Not even trying to read an OS off the disks, just hung. Power off, back on, still hangs. Reseat card, still hangs. I call our vendor, tell 'em the symptoms, they agree that it is the RAID controller that failed.

I start thinking, well, maybe I was a little hard on Adaptec, publicly bashing them like this and in reality, maybe I just had a defective RAID card all along. It might explain why a large majority (though certainly not all!)
of the crashes happened on this one machine...and now the card is totally dead. Hm. Maybe just bad hardware. I'm starting to consider how I'll word my semi-retraction.

Then the phone rings, it's my regular contact at the system vendor. He's telling me there's something really strange going on, as these cards are popping all over the country, all belonging to people who have been running the diagnostic firmware. They can't believe the conclusion, but it seems like there's a time bomb in the diagnostic firmware. They have a call in to Adaptec, but the guy responsible for the diagnostic firmware is on vacation, and it takes 'em a while to track the guy down, "but it is possible".

Sure enough, a couple hours later, I get a call back that confirms the firmware is actively killing our cards. Thank goodness I upgraded them over a period of days and not all at once. I do an emergency reversion of all the other systems.

How do you top your past levels of incompetence now? Thank your victim..er..customers who are helping you debug your product by time-bombing the device so that sixty days after install, your adapter breaks.

Can you top that? Yeah. Don't tell anyone about the time bomb -- don't tell the VAR, or the end user, "if you help us debug our crappy product, don't let it run this way for 60 days, or your computer will start doing space heater imitations".

(One could argue that they topped that one step further by actually locking the boot process so one could not even boot up the firmware update disk and downgrade the firmware to something that sucks less, but I am willing to pass that off as a bug, not deliberate.)

Think about this a bit. These people DELIBERATELY put a feature in their firmware to STOP me (and a lot of other people) from using this card. I'm a legit user, but they felt that I was entitled to help them debug their shit for no more than sixty days. They worked hard at putting this feature in.
This isn't a piece of software that has access to the resources of a computer, like real-time clocks and writable disks. This is a fucking RAID controller, which they managed to build a persistent time bomb into so that after 60 days of operation, it destroyed itself!! (And again, note: it didn't just crash and need to be power cycled, it DAMAGED THE CARD.) This took some effort -- I can't think of any other reason to have an RTC in a RAID card.

I also somehow doubt that the coder who did this sat down and wrote the time bomb AFTER he was charged with coming up with the diagnostic firmware. No, I rather suspect he grabbed some off-the-shelf code, something they put routinely into their diagnostic and troubleshooting systems, but which wasn't intended to get out into the general public. They obviously care more about things OTHER than your system integrity and reliability. This coder made an error in judgment, but they obviously had the tools lying around for some reason.

Now, tell me again how horrible it is that OpenBSD doesn't let you trust your data (and OpenBSD's reputation) to these incompetent assholes?

(And compare this to the VMware August Surprise...another company who was more afraid of you running their software against their will than about time bombs escaping into the wild. I'm STUNNED that people still consider VMware a business-grade product rather than a cute development toy after that event exposed their thought process so publicly.)

Current status: system is running on non-diagnostic firmware. Adaptec and our mail system vendor can reproduce this problem in the lab with a couple days' work, so they are closer to a good solution(?), but not fixed yet. Vendor has come out with a new version of their mail system software which handles corrupted file systems much better than the old versions did (and works on a new line of hardware which uses a different RAID vendor). But we are still about six months into this problem and it still exists.
It does bring one part of the OpenBSD stance on aac(4) into question, though. The real reason they aren't giving us the errata for these products may not be that they don't wish to or are embarrassed by how bad it is, but that they don't even understand the problems in the product themselves.

Doesn't change the conclusion: Adaptec products can't be trusted (though might be suitable for VMware servers).

Nick.