On Tue, 2 May 2006, Kris Kennaway wrote:
Ditto, same thing with the recent nve fixes. Why release known broken
code when there are tested patches available? What's the worst that will
happen? It won't work? That's already the case...
<...>
OK, I can't speak to that issue specifically.
Generally, though, the worst that can happen is "you fix one problem
affecting a subset of users and replace it with a larger problem affecting a
larger subset of users".
If there's doubt about the impact of a change, 10 seconds before the release
is not the appropriate time to cram it in.
<...>
I just want to comment a bit on this issue, because I've seen a number of
posts on FreeBSD mailing lists over the last few years that suggest that there
may be some misunderstandings about software development and release
processes.
The invariant that needs to be understood is that all software is buggy;
arguments have been made that the number of bugs increases linearly with code
size, and there have also been arguments made that the number of bugs
increases with code complexity, so you can see a non-linear increase in bugs
with code growth. This means that you're talking about several bugs per
thousand lines of code in most software, and for code that contains millions
of lines of code (such as the FreeBSD kernel, Linux kernel, Apache, PHP,
MySQL, PostgreSQL, Windows, Word, iTunes, etc), you're talking thousands or
tens of thousands of bugs. And that's in a static version of the code, not
even taking into account new features in an active code base that are still
being "debugged"!
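To put rough numbers on that, here is a back-of-the-envelope sketch of the linear model. The function name, code sizes, and bug densities are hypothetical values picked for illustration, not measurements of FreeBSD or any other project.

```python
# Back-of-the-envelope estimate only; the densities and code sizes below
# are hypothetical, not measurements of any real project.
def estimated_bugs(lines_of_code, bugs_per_kloc):
    """Linear model: bug count scales with code size."""
    return lines_of_code * bugs_per_kloc / 1000

for lines in (1_000_000, 5_000_000):
    for density in (1, 3):
        print(f"{lines:>9} lines at {density}/KLOC: "
              f"~{estimated_bugs(lines, density):,.0f} bugs")
```

Even at the low end of these assumed densities, a multi-million-line code base lands in the thousands, and a non-linear (complexity-driven) model would only push the estimate higher.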
Bugs fall into a lot of different categories, but from the perspective of risk
management, it's useful to think of them in two categories: latent bugs, which
are unreported, unobserved, or occur only in exceptional or generally
untriggered circumstances, and non-latent bugs, which have been reported, are
triggered in practice, etc. The tricky ones are the latent bugs, because you
may not know that they are there, or you may know that they are there but
they trigger so infrequently, or in such unusual edge cases, that they might
almost as well not be there.
Release engineering is really about two things: structuring/nurturing the
process of developing releases (tracking issues, identifying people to fix
them, testing, branch management, building, etc), and risk management. The
risk management aspect is that you want to improve the quality of the release
by taking actions, typically adopting source changes, which may improve
testing results. Each change potentially affects both visible and latent
bugs. Bug fixes in one piece of code may change the timing of the code, the
side effects, undocumented assumptions, or simply allow access to code
previously not executed because the bug prevented it. If you allow a bug fix
into the tree, you risk uncovering new bugs. So the choice isn't "Accept a
bug fix or not", it's "Will accepting this bug fix generally improve or reduce
the quality of the release" -- i.e., will the change fix the bug it is claimed to
fix, and will it result in lots of latent bugs suddenly becoming visible.
Particularly with hardware drivers like nve, this is non-trivial, because the
behavior of the hardware is very subtle, there's lots of variety in the
shipped hardware, and the vendor is (or appears) highly unsupportive. The
result is that if you tweak a register or minor piece of behavior, it may
dramatically improve support for a particular piece of hardware, but break all
the rest. The only way to mitigate this risk is through extensive testing,
and extensive testing takes a lot of time. And by a lot of time, I mean, a
long release cycle. So if we want to adopt a fix that is high risk -- i.e.,
one believed likely to interact in subtle ways that affect different machines
differently -- we need to make the change early in the release cycle, not at
the end. If we make it at the end, we are shipping code that is effectively
untested on a large number of systems. Sure, it will fix one, but if it
breaks the rest, is it worth it? The only alternative is to restart the
testing process, which in the case of high-risk drivers, means adding months
to the release cycle.
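The trade-off can be sketched as a toy expected-value calculation. Everything here is an assumption made up for illustration: the function name, the user counts, and the regression probabilities are hypothetical, and real release decisions are not reduced to a single formula. The only point is that the same fix can be a net win early in the cycle and a net loss at the end, because testing time is what drives the regression probability down.

```python
# Toy model; all numbers are hypothetical and chosen only to illustrate
# why the timing of a risky fix matters.
def expected_net_benefit(users_fixed, users_at_risk, p_regression):
    """Expected change in the number of working systems from one fix.

    users_fixed   -- systems the fix is known to repair
    users_at_risk -- systems running the same code that might break
    p_regression  -- estimated chance the fix breaks an at-risk system
    """
    return users_fixed - p_regression * users_at_risk

# Same fix, same user base; only the amount of testing differs.
early = expected_net_benefit(users_fixed=500, users_at_risk=10_000,
                             p_regression=0.01)  # months of testing
late = expected_net_benefit(users_fixed=500, users_at_risk=10_000,
                            p_regression=0.20)   # crammed in at the end

print("early in cycle:", early)
print("just before release:", late)
```

With these made-up inputs the early merge comes out positive and the last-minute merge comes out negative, which is the "fix one, break the rest" case described above.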
And you can see where this is leading: if you significantly delay the release
cycle for each minor bug, you will never release. At some point, you have to
make the decision "although this release isn't perfect, we'll never release if
we don't ship now". I know that sounds like a bad thing, but you'll find that
that practice is not only found in every part of the software industry, but
it's also impossible to avoid, since bug-free software is impossible to achieve.
When you look at the RC2 release notes Scott recently sent, he identifies four
bugs that he believes won't be fixed in time for the release. He decided that
this was the case using risk management: each bug actually likely represents
several bugs with the same features, in highly complex code. This means that
they will take a significant amount of time to fix, and that each fix is high
risk, as it is likely to reveal latent bugs. This means that each fix will
require a lot of testing -- months of testing, in fact. So the choice is
really: do we release 6.1, or do we skip it and do a 6.2 in a few months? As
the release engineer, Scott has concluded that releasing now offers a great
benefit to many people, although the bugs present may penalize some. Mind
you, in some cases the bugs also exist in 6.0, so they don't represent
regressions, so much as bugs that continue to persist. I agree with his
conclusion: things like locking interactions in VFS are incredibly
complicated, requiring extensive analysis and work to fix and test. Trying to
fix them for 6.1 is unrealistic. They can be fixed in the next few weeks,
tested for a month or two, and then merged to the RELENG_6_1 branch as errata
fixes, similar to security advisories.
It's all about trade-offs. People are welcome to (and frequently do) disagree
with our analysis and choice on the trade-offs, but what I'm trying to
emphasize in this e-mail is that these trade-offs are a reality. They can't
be ignored: bug-free releases of software can't be shipped because they don't
exist, and therefore the argument (decision) is always about where the right
balance is. Arguing for waiting to ship until every last bug is fixed is
arguing never to release software -- bugs are present in all software, and not
all latent either -- that's why products have errata notes (as does FreeBSD),
patch levels, etc. Don't take this to mean that we don't think fixing bugs is
important, or that we don't spend long days and nights (and more days and
more nights) working on it.
FWIW, if you look at the release process of any other commercial or open
source software product, you'll see the same thing. Either there's no bug
database, or there's a very large database. If there's no database, it's
because the developer isn't being honest about there being bugs, or they have
no testing. If there's a huge database, they are being honest, and the bugs
are not all going to get fixed before shipping. Software authors select bugs
to fix based on the impact of
the bugs and their ability to fix them. I'd like to think we care more than
some, but caring isn't enough to make computer software development perfect,
or it would have happened a long time ago :-).
Thanks,
Robert N M Watson
_______________________________________________
freebsd-stable@freebsd.org mailing list