Wietse Venema put forth on 10/22/2009 8:03 PM:
> Stan Hoeppner:
>> The point I was attempting to make is that, even with todays fast disks,
>> on a heavily loaded Postfix server, a 6 fold decrease in disk throughput
>> due to an obscure bug like this would likely wreak havoc for a few
>> hours, if not days, depending on the skill and experience of the OP,
>> before the problem were found and fixed. Ergo, we should never rule out
>> the rare/obscure/unlikely possible causes of problems that pop up.
>
> I guess that the lesson from this is: don't install bleeding-edge
> kernels on servers that people depend on. Pretty much every OS
> distribution has a QA process that catches such anomalies before
> too many people suffer.
I wholeheartedly agree. But I'd add the caveat that even stable point
releases, not just bleeding-edge kernels, can come with 'hidden' changes
that break things, as in the case I mentioned above, which involved a
stable point release well behind the bleeding-edge Linux kernel of the
time. Thus I've always feared, and never used, things like Red Hat's
'up2date' or SuSE's AutoYaST for updating critical components such as
the kernel and libc.

> I have been doing UNIX since 1985. I have learned to be careful.

I've only been using *nix since around 2000, and I tend to be very
conservative in this regard as well. Unfortunately there is always the
potential that we might get burned, as we rely so heavily on software
written and bug-checked by others, and sometimes bugs don't 'surface'
until the software gets pounded on by the greater user base.

One such bug was another LSI Logic SCSI driver problem that only showed
up on Linux kernels running as VMware ESX guests. IIRC, a SCSI 'busy'
return was changed to a SCSI 'wait' for no logical reason by the LSI
driver team, which didn't cause problems on real hardware. But once
Linux was virtualized on the VMware hypervisor, the excessive SCSI
'waits' during Fibre Channel SAN transactions caused the Linux kernel
to remount filesystems read-only, wreaking tons of havoc.

I ran into this issue on a SLES 10 LAMP server. Luckily I was still in
testing, not production. It took me a few days to figure it out, and I
succeeded only because other end users had already gone through this
and posted on the VMware forum. The bug coincided with the Linux kernel
version that initially shipped with SLES 10. And this was supposedly a
'stable' kernel that had been thoroughly tested.

From what I heard at the time, Ford Motor Company wasn't so lucky. They
had upgraded all their SLES 9 VMware guests to SLES 10 and discovered
this bug in production. Ouch!

-- 
Stan