Wietse Venema put forth on 10/22/2009 8:03 PM:
> Stan Hoeppner:
>> The point I was attempting to make is that, even with today's fast disks,
>> on a heavily loaded Postfix server, a sixfold decrease in disk throughput
>> due to an obscure bug like this would likely wreak havoc for a few
>> hours, if not days, depending on the skill and experience of the OP,
>> before the problem was found and fixed.  Ergo, we should never rule out
>> the rare/obscure/unlikely possible causes of problems that pop up.
> 
> I guess that the lesson from this is: don't install bleeding-edge
> kernels on servers that people depend on. Pretty much every OS
> distribution has a QA process that catches such anomalies before
> too many people suffer.

I wholeheartedly agree.  But I'd add the caveat that even stable point
releases, not just bleeding-edge kernels, can come with 'hidden' changes
that break things.  The case I mentioned above was a stable point
release, well behind the bleeding-edge Linux kernel of the time.  Thus,
I've always feared, and never used, things like Red Hat's 'up2date' or
SuSE's AutoYaST for updating critical components like kernels and libc.

> I have been doing UNIX since 1985. I have learned to be careful.

I've only been using *nix since around 2000.  I tend to be very
conservative in this regard as well.  Unfortunately there is always the
potential that we might get burned as we rely so heavily on software
written and bug checked by others, and sometimes bugs don't 'surface'
until the software gets pounded on by the greater user base.

One such bug was another LSI Logic SCSI driver problem that only showed
up on Linux kernels running as VMware ESX guests.  IIRC, a SCSI 'busy'
was changed to a SCSI 'wait' for no logical reason by the LSI driver
team, which didn't cause problems on real hardware.  But once Linux was
virtualized on the VMware hypervisor, the kernel received excessive SCSI
'waits' during Fibre Channel SAN transactions and remounted filesystems
read-only, wreaking havoc.
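
A minimal sketch of how one might spot that symptom early, assuming a
Linux guest that exposes /proc/mounts.  The function name
read_only_mounts and the list of ignored filesystem types are
illustrative assumptions, not part of any LSI or VMware fix:

```python
from pathlib import Path

# Filesystem types that are often read-only by design (an assumed,
# illustrative list; adjust for the host being monitored).
IGNORED_FSTYPES = ("proc", "sysfs", "tmpfs", "squashfs", "iso9660")


def read_only_mounts(mounts_text, ignore_fstypes=IGNORED_FSTYPES):
    """Return (device, mountpoint) pairs whose mount options include 'ro'."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        if fstype in ignore_fstypes:
            continue
        # Match the exact 'ro' flag, not options such as errors=remount-ro.
        if "ro" in options.split(","):
            found.append((device, mountpoint))
    return found


if __name__ == "__main__":
    mounts = Path("/proc/mounts")
    if mounts.exists():
        for device, mountpoint in read_only_mounts(mounts.read_text()):
            print(f"WARNING: {mountpoint} ({device}) is mounted read-only")
```

Run from cron, something like this would only report the remount after
the fact; it does nothing about the underlying driver behaviour.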

I ran into this issue on a SLES 10 LAMP server.  Luckily I was still in
testing, not production. Took me a few days to figure it out, and I was
successful only because other end users had already gone through this
and posted on the VMware forum.  This coincided with the Linux kernel
version that initially shipped with SLES 10.  And this was supposedly a
'stable' kernel that had been thoroughly tested.  From what I heard at
the time, Ford Motor Company wasn't so lucky.  They had upgraded all
their SLES 9 VMware guests to SLES 10 and discovered this bug in
production.  Ouch!

--
Stan
