At semi-regular intervals over the last twelve years I have run through the man directories of entire full-boat Linux distributions, running doclifter on every page and kicking fix patches upstream to clean up markup that cannot be structurally lifted to DocBook.

Some individual portions of the cleanup were quite large. Last year I fixed up the X man pages. Yes, *all* of them.

You can look at statistics from this effort here:

http://www.catb.org/~esr/doclifter/bugs.html

Note that they understate the number of fixes I have shipped - I wasn't keeping exact records from the beginning, and there were some groups (of which the largest was the X pages) that I fixed in place in their repositories; those didn't get counted in my patch statistics.

One way or another, I think it is now a safe bet that I have fixed more broken man pages than any other single person ever.

Here are some facts about the defect patterns that I think are interesting:

* There's a perceptible correlation between the origin date of a page and the (closely correlated) complexity and defect density of its markup. Older pages have more complex and more troff-aware markup, with more bugs. Newer pages use fewer troff-level requests and have fewer bugs.

* The single most wretched hive of scum and villainy throughout the corpus is the markup for command synopses. There are no semantic macros for these, so people come up with endlessly inventive and perverse ways to make them come out right presentationally. Bugs in the resulting tangle are very, very common. Semantic-lifting it was the last major victory of doclifter's parser, and that cost more complexity than the entire rest of man markup (and ms, and me, and mm) put together.

* No matter which distro you choose or how many packages you add, the percentage of man pages that pass strict validation by doclifter in their unaltered form now hovers at around 93%.

* The corresponding percentage after patching by me is 99.85%. Yes, only 0.15% of man pages are so broken or over-complex that they can't be patched into validating strictly. (This is much better than I was expecting when I started.)

* The first number, 93%, was about 10% lower when I started shipping patches (before that I dealt with exceptions solely by enhancing doclifter). Each time I successfully halved the failure rate (by enhancing doclifter or successfully pushing patches upstream), it took about as much effort as all previous halvings combined.

* About two years ago I reached the point where essentially all the gains from making doclifter's parser better were captured. Now I can *only* reduce the failure percentage further by jawboning man page maintainers into fixing their crappy markup.

One respect in which you need to be careful about interpreting these numbers is that doclifter's validation criteria do not correspond exactly to "defects" in a groff-centric view of the corpus. They are different in two ways:

1. There is a class of 'recoverable' common markup defects which silently produce minor misbehavior from groff, but not a fatal error. An example is landing a string-enclosing single quote at the left margin - this will be interpreted as a request leader. These are 'recoverable' in that doclifter knows how to fix them automatically before the stage where it generates XML, and does not necessarily issue a warning on them.

2. Some presentation-level constructs that are perfectly valid groff throw warnings in doclifter because there is no way to lift them into structural markup. An example is ".ce". I have to treat these as errors and try to patch them out.

However, these cases are not actually all that common. I'd say they affect 4-5% of pages at most.

The next interesting thing is what the corpus looks like once you patch out unnecessary low-level troff markup - replacing, for example, tables kludged together inside .nf/.fi blocks with proper TBL markup. (I think I've spent hundreds of hours doing that one thing!)

What's interesting is that the resulting cleaned-up man markup uses a remarkably small set of low-level groff requests, and employs those in very stereotyped ways.
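
For anyone who wants to check the "small set of low-level requests" claim against their own pages, here is a minimal sketch in Python. It is not how doclifter works; the function name tally_requests and the upper-case/lower-case rule for telling man macros from raw troff requests are assumptions made for illustration only.

```python
import re
from collections import Counter

# A control line starts with '.' or "'" (the two troff control characters),
# then optional blanks, then a request or macro name. Comment lines such as
# '.\"' and the '..' that ends a macro definition do not match.
CONTROL_LINE = re.compile(r"^[.'][ \t]*([A-Za-z][A-Za-z0-9]*)")


def tally_requests(source):
    """Count macro calls and low-level requests in man page source.

    Heuristic (an assumption, not a groff rule): names in upper case
    (TH, SH, TP, EX ...) are treated as macros, anything else (nf, fi,
    ce, de ...) as a low-level troff request.
    """
    macros = Counter()
    requests = Counter()
    for line in source.splitlines():
        match = CONTROL_LINE.match(line)
        if match is None:
            continue  # ordinary text line
        name = match.group(1)
        if name.isupper():
            macros[name] += 1
        else:
            requests[name] += 1
    return macros, requests


if __name__ == "__main__":
    import sys

    # Usage: python tally.py page.1 [more pages ...]
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            page_macros, page_requests = tally_requests(handle.read())
        print(path, dict(page_requests))
```

Note that a quoted string starting with a single quote at the left margin gets counted as a request here too, which is the same misreading described in item 1 above.
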
(This is a good thing, or my grand plan to webify everything would fail.)

Here's what I mean by stereotyped: overwhelmingly the most common extension macro pair declared in man pages is used to set a code example or screen shot, unfilled in fixed-width text. This is usually called .EX/.EE or (less often) .DS/.DE.

In fact, if .EX/.EE were in the standard macros, the real-world use cases of .de would almost disappear. With only a small handful of exceptions (few enough to count on fingers and toes) that is *all* that real-world man-page authors use macro definitions for.

Note, by the way, that groff's own pages are the single worst and nastiest exceptions to this general rule of simplicity. There are impacted layers of hideous macrology in there, only some of which I have successfully simplified to something reasonable.

Most of the other few exceptions are either old-time GNU projects or venerable tools from BSD-land. None of them are quite as bad as the groff pages.

Now I hope it is clearer why I have begun to think in terms of actually enforcing hygienic macro use and blocking out low-level requests. What can be done by social-engineering page maintainers into cleaning up their acts voluntarily has in fact been done already, with considerable success. It only took me more than ten years of grinding at the problem!

The remaining holdouts are going to need a boot up the butt to overcome their inertia.

-- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

No matter how one approaches the figures, one is forced to the rather startling conclusion that the use of firearms in crime was very much less when there were no controls of any sort and when anyone, convicted criminal or lunatic, could buy any type of firearm without restriction. Half a century of strict controls on pistols has ended, perversely, with a far greater use of this weapon in crime than ever before.
-- Colin Greenwood, in the study "Firearms Control", 1972