Re: quota deadlock on 6.1-RC1

Robert Watson Wed, 03 May 2006 03:26:32 -0700


On Tue, 2 May 2006, Kris Kennaway wrote:

Ditto, same thing with the recent nve fixes. Why release known broken
code when there are tested patches available? What's the worst that will
happen? It won't work? That's already the case...

<...>

OK, I can't speak to that issue specifically.
Generally, though, the worst that can happen is "you fix one problem affecting a subset of users and replace it with a larger problem affecting a larger subset of users".
If there's doubt about the impact of a change, 10 seconds before the release is not the appropriate time to cram it in.

<...>

I just want to comment a bit on this issue, because I've seen a number of posts on FreeBSD mailing lists over the last few years that suggest that there may be some misunderstandings about software development and release processes.

The invariant that needs to be understood is that all software is buggy; arguments have been made that the number of bugs increases linearly with code size, and there have also been arguments made that the number of bugs increases with code complexity, so you can see a non-linear increase in bugs with code growth. This means that you're talking about several bugs per thousand lines of code in most software, and for code that contains millions of lines of code (such as the FreeBSD kernel, Linux kernel, Apache, PHP, MySQL, PostgreSQL, Windows, Word, iTunes, etc.), you're talking thousands or tens of thousands of bugs. And that's in a static version of the code, not even taking into account new features in an active code base that are still being "debugged"!
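To make the arithmetic above concrete, here is a minimal back-of-the-envelope sketch. The defect densities and the code size are illustrative assumptions, not measurements of FreeBSD or any other project, and the linear model is only the simpler of the two arguments mentioned above.

```python
# Rough estimate: bugs = (lines of code / 1000) * (bugs per thousand lines).
# All figures are illustrative assumptions, not measured values.

def estimated_bugs(lines_of_code: int, bugs_per_kloc: float) -> float:
    """Linear estimate of the number of bugs in a code base."""
    return lines_of_code / 1000 * bugs_per_kloc


if __name__ == "__main__":
    assumed_lines = 2_000_000  # hypothetical "millions of lines" code base
    for density in (1, 5, 10):  # hypothetical bugs per thousand lines
        bugs = estimated_bugs(assumed_lines, density)
        print(f"{density} bugs/KLOC over {assumed_lines} lines: about {bugs:.0f} bugs")
```

Even at the low end of these assumed densities the estimate is in the thousands, which is the point of the paragraph above.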

Bugs fall into a lot of different categories, but from the perspective of risk management, it's useful to think of them in two categories: latent bugs, which are unreported, unobserved, or occur only in exceptional or generally untriggered circumstances, and non-latent bugs, which have been reported, are triggered in practice, etc. The tricky ones are the latent bugs, because you may not know that they are there, or you may know that they are there but they trigger so infrequently or in such unusual edge cases that they almost might as well not be there.

Release engineering is really about two things: structuring/nurturing the process of developing releases (tracking issues, identifying people to fix them, testing, branch management, building, etc), and risk management. The risk management aspect is that you want to improve the quality of the release by taking actions, typically adopting source changes, which may improve testing results. Each change potentially affects both visible and latent bugs. Bug fixes in one piece of code may change the timing of the code, the side effects, undocumented assumptions, or simply allow access to code previously not executed because the bug prevented it. If you allow a bug fix into the tree, you risk uncovering new bugs. So the choice isn't "Accept a bug fix or not", it's "Will accepting this bug fix generally improve or reduce quality of the release" -- i.e., will the change fix the bug it is claimed to fix, and will it result in lots of latent bugs suddenly becoming visible.
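One hedged way to picture that question is as an expected-value comparison. The sketch below is a toy model, not a description of how FreeBSD release engineering actually decides anything; the function name, the probabilities, and the user counts are all hypothetical.

```python
# Toy model of the trade-off: does accepting a fix help more users than it hurts?
# Every name and number here is a hypothetical illustration.

def expected_net_users_helped(p_fix_works: float, users_helped: int,
                              p_exposes_latent_bug: float, users_harmed: int) -> float:
    """Expected users helped minus expected users harmed by one change."""
    return p_fix_works * users_helped - p_exposes_latent_bug * users_harmed


if __name__ == "__main__":
    # A fix that very likely works for a small group, but has a modest chance
    # of exposing a latent bug that affects a much larger group.
    net = expected_net_users_helped(0.9, 100, 0.2, 1000)
    verdict = "improves" if net > 0 else "reduces"
    print(f"expected net effect: {net:.0f} users, so the change {verdict} quality")
```

In this model, testing is what lowers the second probability, which is why the same fix can be acceptable early in a release cycle and unacceptable at the end of it.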

Particularly with hardware drivers like nve, this is non-trivial, because the behavior of the hardware is very subtle, there's lots of variety in the shipped hardware, and the vendor is (or appears) highly unsupportive. The result is that if you tweak a register or minor piece of behavior, it may dramatically improve support for a particular piece of hardware, but break all the rest. The only way to mitigate this risk is through extensive testing, and extensive testing takes a lot of time. And by a lot of time, I mean, a long release cycle. So if we want to adopt a fix that is high risk -- i.e., is believed will interact in subtle ways that affect different machines differently -- we need to make the change early in the release cycle, not at the end. If we make it at the end, we are shipping code that is effectively untested on a large number of systems. Sure, it will fix one, but if it breaks the rest, is it worth it? The only alternative is to restart the testing process, which in the case of high-risk drivers, means adding months to the release cycle.

And you can see where this is leading: if you significantly delay the release cycle for each minor bug, you will never release. At some point, you have to make the decision "although this release isn't perfect, we'll never release if we don't ship now". I know that sounds like a bad thing, but you'll find that that practice is not only found in every part of the software industry, but it's also impossible to avoid, since bug-free software is impossible to achieve.

When you look at the RC2 release notes Scott recently sent, he identifies four bugs that he believes won't be fixed in time for the release. He decided that this was the case using risk management: each bug actually likely represents several bugs with the same features, in highly complex code. This means that they will take a significant amount of time to fix, and that each fix is high risk, as it is likely to reveal latent bugs. This means that each fix will require a lot of testing -- months of testing, in fact. So the choice is really, do we release 6.1, or do we skip it and do a 6.2 in a few months? As the release engineer, Scott has concluded that releasing now offers a great benefit to many people, although the bugs present may penalize some. Mind you, in some cases the bugs also exist in 6.0, so they don't represent regressions, so much as bugs that continue to persist. I agree with his conclusion: things like locking interactions in VFS are incredibly complicated, requiring extensive analysis and work to fix and test. Trying to fix them for 6.1 is unrealistic. They can be fixed in the next few weeks, tested for a month or two, and then merged to the RELENG_6_1 branch as errata fixes, similar to security advisories.

It's all about trade-offs. People are welcome to (and frequently do) disagree with our analysis and choice on the trade-offs, but what I'm trying to emphasize in this e-mail is that these trade-offs are a reality. They can't be ignored: bug-free releases of software can't be shipped because they don't exist, and therefore the argument (decision) is always about where the right balance is. Arguing for waiting to ship until every last bug is fixed is arguing never to release software -- bugs are present in all software, and not all latent either -- that's why products have errata notes (as does FreeBSD), patch levels, etc. Don't believe this means we don't think fixing bugs is important, or that we don't spend long days and nights (and more days and more nights) working on it.

FWIW, if you look at the release process of any other commercial or open source software product, you'll see the same thing. Either there's no bug database, or there's a very large database. If there's no database, it's because the developer isn't being honest about there being bugs, or they have no testing. If there's a huge database, they are being honest, and the bugs in it are not all going to get fixed before a release ships. Software authors select bugs to fix based on the impact of the bugs and their ability to fix them. I'd like to think we care more than some, but caring isn't enough to make computer software development perfect, or it would have happened a long time ago :-).


Thanks,

Robert N M Watson
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
