Re: PCI DMA lockups in 3.2 (3.3 maybe?)

Matthew Dillon Sat, 4 Dec 1999 22:47:30 -0800

:> 
:>     He didn't say this until after the situation had started to degrade.
:> 
:>     Besides, he's right.  3.x has serious problems.
:
:All running software has serious problems, that's why it is never considered
:done.  Taking the time to enumerate specific problems that are currently 
:plaguing an installation is the only way anyone can possibly hope to help.
:Problem reports of "It don't work" are helpful to absolutely no one.

    This simply isn't true.  I have written plenty of software (large
    projects) that does not have serious problems and, in fact, some of it
    does not have any known problems at all.  I have written several
    operating systems, one of which is at least as complex as the FreeBSD
    core (but not as complex if you count drivers), and they are bug-free
    (that is, there have been no recorded crashes and, except for feature
    updates, they have never been rebooted).

    FreeBSD can become 'bug free' insofar as it is possible to become bug
    free.  You have to believe that it can happen or it won't.   I believe
    it can -- my personal goal for the project is to make the core bug free 
    and uncrashable (and here I mean only with a network and disk driver, 
    and not all the other drivers out there which would be an impossible
    task).  Since I've actually *written* bug-free and uncrashable OS cores
    I am confident that it is possible to do with FreeBSD.

    Many of the issues relating to FreeBSD's instability and the many bugs
    in the core have nothing to do with continuing development work
    per se, but instead have to do with an attitude that allows major
    pollution to be introduced into the code to optimize very specific
    cases (which destabilizes the source at the same time), and with the
    lack of proper documentation within the source code.  It is precisely
    these two
    things which I have concentrated on the most - by rewriting where 
    necessary, generalizing optimizations (and ripping quite a few out of
    the VM system entirely), and documenting the hell out of any procedure
    I modify with succinct comments.

    There are two good examples of code pollution and, needless to say, they
    have been responsible for a huge number of bugs over the years.  Hundreds
    of bugs at least.  The first example is all the VM hacking that was
    done to accommodate partial cache instantiation and, most notably,
    partial byte-range writes for NFS.  So far this year I have managed to
    rip about half of those hacks out at relatively little cost (a few
    esoteric NFS write cases will be slower, is all, and buffer cache writing
    is slightly slower due to the extra system process, but that is hopefully
    made up for by the move from an O(N^2) algorithm to an O(1) algorithm).

    The second example is the VFS layer implementation and, most especially,
    VOP_LOOKUP().  VOP_LOOKUP() has caused no end of trouble, but the VFS
    layer implementation, with all of its locking assumptions and return
    requirements, has made filesystem design problematic at best.  There is
    enormous complexity in the lookup, directory scanning, and VFS cache
    code that hides bugs and that could be removed with a rewrite.

    In general, it is possible to fix these problems but some of those fixes
    require significant rewriting.  You have to be willing to rewrite and
    take your lumps up front or you may be faced with a situation where
    new problems are found with a subsystem for years to come.  The best 
    example of this in my case is the getnewbuf() code.  The code was 
    originally optimized with so many 'hacks' that it created at least half
    a dozen serious bugs in the system.  When I first rewrote it I encountered
    a huge amount of resistance from certain people who believed (wrongly) that
    rewriting would create more bugs than it fixed.  While a few bugs were
    introduced (that's the 'taking your lumps' part), the generalization of
    the code made finding and fixing them much, much easier and this will
    ultimately lead to a better track record down the road.

    I applaud the removal of dead code that has been going on, though I have
    major problems with the way some of it has been done.  Compared
    to what some committers have been doing recently, the dead code removal
    that Alan and I had done to the VM system earlier in the year was a walk
    in the park.  I am dead set against 'hiding' bugs by trying to cache
    around them instead of fixing them, which is essentially the category 
    in which I put most of the recent changes to procfs and /bin/ps.

    It may seem counter-productive, but in order to fix bugs and make the
    system stable we actually need to cause the bugs to come to light
    more quickly and in a manner that is so blazingly obvious that we can
    fix them more quickly.  Hence the reason for putting KASSERT()s all
    throughout the VM system (which led to the discovery that VM pages were
    being put on the cache queue while still dirty and led to a fix for
    a serious filesystem corruption bug, among other things).  When I did
    that some people screamed at me because they thought it would make the
    system unstable, but how many panics have we ever seen from it?  

    I am happy to see other people start to do the same thing.
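
    As a rough illustration of the assertion approach described above,
    here is a minimal user-space sketch in C.  It is not the FreeBSD VM
    code: SKETCH_ASSERT(), sketch_panic(), struct sketch_page and
    sketch_page_cache() are all hypothetical names invented for this
    example, and the only idea taken from the text is that a page must
    not be placed on the cache queue while it is still dirty.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical panic routine: report the broken invariant loudly. */
static int  sketch_panic_count;          /* violations seen so far */
static bool sketch_panic_aborts = true;  /* stop the program by default */

static void sketch_panic(const char *file, int line, const char *msg)
{
    fprintf(stderr, "panic: %s (%s:%d)\n", msg, file, line);
    sketch_panic_count++;
    if (sketch_panic_aborts)
        abort();
}

/* Hypothetical stand-in for a kernel-style assertion macro. */
#define SKETCH_ASSERT(cond, msg)                          \
    do {                                                  \
        if (!(cond))                                      \
            sketch_panic(__FILE__, __LINE__, (msg));      \
    } while (0)

/* Hypothetical, heavily simplified page state. */
enum sketch_queue { SKETCH_Q_NONE, SKETCH_Q_ACTIVE, SKETCH_Q_CACHE };

struct sketch_page {
    bool              dirty;   /* contents not yet written back */
    enum sketch_queue queue;   /* queue the page currently sits on */
};

/*
 * Move a page to the cache queue.  The invariant is checked at the
 * point where it could be broken, so a violation shows up here and
 * not later as silent data corruption.
 */
static void sketch_page_cache(struct sketch_page *p)
{
    SKETCH_ASSERT(!p->dirty, "sketch_page_cache: page still dirty");
    p->queue = SKETCH_Q_CACHE;
}
```

    Calling sketch_page_cache() on a clean page simply moves it to the
    cache queue.  Calling it on a dirty page prints a panic message and,
    with the default setting, aborts immediately, which is the 'blazingly
    obvious' failure the text argues for.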

    So, I think it *IS* possible to make FreeBSD sufficiently bug-free that
    people become 'surprised' when they are able to crash a box running it.

                                                -Matt


