[Groff] Defect patterns in real-world man-pages

Eric S. Raymond Tue, 04 Mar 2014 21:56:30 -0800

At semi-regular intervals through the last twelve years I have run
through the man directories of entire full-boat Linux distributions
running doclifter on every page and kicking fix patches upstream to
clean up markup that cannot be structurally lifted to DocBook.


Some individual portions of the cleanup were quite large.  Last year I
fixed up the X man pages.  Yes, *all* of them.

You can look at statistics from this effort here:

    http://www.catb.org/~esr/doclifter/bugs.html

Note that they understate the number of fixes I have shipped - I
wasn't keeping exact records from the beginning, and there were some
groups (of which the largest was the X pages) that I fixed in place
in their repositories; those didn't get counted in my patch
statistics.

One way or another, I think it is now a safe bet that I have fixed
more broken man pages than any other single person ever.

Here are some facts about the defect patterns that I think are interesting:

* There's a perceptible correlation between the origin date of a
  page and the (closely correlated) complexity and defect density of
  its markup.  Older pages have more complex and more troff-aware 
  markup, with more bugs. Newer pages use fewer troff-level requests
  and have fewer bugs.

* The single most wretched hive of scum and villainy throughout the
  corpus is the markup for command synopses.  There are no semantic
  macros for these, so people come up with endlessly inventive and
  perverse ways to make them come out right presentationally.  Bugs in
  the resulting tangle are very common.  Semantic-lifting it was the
  last major victory of doclifter's parser, and that cost more
  complexity than the entire rest of man markup (and ms, and me, and
  mm) put together.

* No matter which distro you choose or how many packages you add, the 
  percentage of man pages that pass strict validation by doclifter 
  in their unaltered form now hovers at around 93%.

* The corresponding percentage after patching by me is 99.85%.  Yes,
  only 0.15% of man pages are so broken or over-complex that they
  can't be patched into validating strictly.  (This is much better
  than I was expecting when I started.)

* The first number, 93%, was about 10% lower when I started shipping
  patches (before that I dealt with exceptions solely by enhancing
  doclifter).  Each time I successfully halved the failure rate (by
  enhancing doclifter or successfully pushing patches upstream) was
  about as much effort as all previous halvings combined.

* About two years ago I reached the point where essentially all the gains
  from making doclifter's parser better were captured.  Now I can *only*
  reduce that percentage further by jawboning man page maintainers into
  fixing their crappy markup.

One respect in which you need to be careful about interpreting these
numbers is that doclifter's validation criteria do not correspond
exactly to "defects" in a groff-centric view of the corpus.  They
are different in two ways:

1. There is a class of 'recoverable' common markup defects which
silently produce minor misbehavior from groff, but not a fatal
error. An example is landing a string-enclosing single quote at the
left margin - this will be interpreted as a request leader.  These are
'recoverable' in that doclifter knows how to fix them automatically
before the stage where it generates XML, and does not necessarily
issue a warning on them.

2. Some presentation-level constructs that are perfectly valid groff 
throw warnings in doclifter because there is no way to lift them into
structural markup.  An example is ".ce".  I have to treat these as
errors and try to patch them out.

However, these cases are not actually all that common.  I'd say they
affect 4-5% of pages at most.
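
To make the two categories concrete, here is a rough sketch of a
scanner for one example of each.  It is not doclifter's code; the
function name and both heuristics are assumptions made up for
illustration, and a real checker would need far more context than a
line-at-a-time match.

```python
import re

# A troff comment can legitimately start with '\" so it must not be flagged.
COMMENT_LEADER = "'\\\""

# .ce (center lines) is valid groff but has no structural equivalent.
CENTER_REQUEST = re.compile(r"^[.'][ \t]*ce\b")


def scan_page(text):
    """Return (line_number, kind) pairs for two illustrative patterns."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if CENTER_REQUEST.match(line):
            findings.append((number, "presentation-only request"))
        elif (
            line.startswith("'")
            and not line.startswith(COMMENT_LEADER)
            and "'" in line[1:]
        ):
            # Assumed heuristic: an opening quote at the margin with a
            # closing quote later on the line is probably a quoted string
            # that wrapped, not a real no-break request such as 'br.
            findings.append((number, "quote at left margin"))
    return findings
```

The first kind is what doclifter would repair silently; the second is
what has to be patched out of the page by hand.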

The next interesting thing is what the corpus looks like once you
patch out unnecessary low-level troff markup - replacing, for example,
tables kludged together inside .nf/.fi blocks with proper TBL markup.
(I think I've spent hundreds of hours doing that one thing!)
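
For the easiest version of that cleanup, a minimal sketch follows.  It
assumes the kludged table already uses tab characters between columns
(many real ones are aligned with spaces and need hand editing), and
the function name is hypothetical.  It emits the plainest TBL layout
there is: every column left-adjusted, default tab separator.

```python
def nf_block_to_tbl(block):
    """Rewrite one .nf/.fi block of tab-separated rows as TBL markup."""
    lines = block.strip("\n").split("\n")
    if lines[0].strip() != ".nf" or lines[-1].strip() != ".fi":
        raise ValueError("expected a block delimited by .nf and .fi")

    # Keep only the non-blank data rows between the two requests.
    rows = [line for line in lines[1:-1] if line.strip()]
    if not rows:
        raise ValueError("block has no rows")

    # The widest row decides how many columns the format line declares.
    columns = max(row.count("\t") for row in rows) + 1
    if columns < 2:
        raise ValueError("no tab-separated columns found")

    # One 'l' (left-adjusted) per column; the period ends the format section.
    layout = " ".join(["l"] * columns) + "."
    return "\n".join([".TS", layout, *rows, ".TE"]) + "\n"
```

Anything fancier (headings, centered columns, boxes) is a judgment
call about what the original author meant, which is where the hours go.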

What's interesting is that the resulting cleaned-up man markup uses a 
remarkably small set of low-level groff requests, and employs those
in very stereotyped ways.  (This is a good thing, or my grand plan to
webify everything would fail.)

Here's what I mean by stereotyped: overwhelmingly the most common 
extension macro pair declared in man pages is used to set 
a code example or screen shot, unfilled in fixed-width text. This
is usually called .EX/.EE or (less often) .DS/.DE.

In fact, if .EX/.EE were in the standard macros, the real-world use
cases of .de would almost disappear.  With only a small handful of
exceptions (few enough to count on fingers and toes) that is *all*
that real-world man-page authors use macro definitions for.
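
A claim like this is easy to check against a corpus.  The sketch below
is not the tooling behind the numbers in this post, just one assumed
way to take such a census: it counts, for each macro name, how many
pages define it with .de.  The function name is hypothetical, and the
pattern deliberately ignores groff variants such as .de1.

```python
import re
from collections import Counter

# Match ".de NAME" at the start of a line and capture NAME.
DEFINE = re.compile(r"^\.[ \t]*de[ \t]+(\S+)", re.MULTILINE)


def macro_census(pages):
    """Count, per macro name, how many pages define it with .de."""
    census = Counter()
    for text in pages:
        # Use a set so a page that defines a name twice counts once.
        census.update(set(DEFINE.findall(text)))
    return census
```

Feeding it the text of every page in a man directory and calling
most_common() on the result gives a ranked list of locally defined
macros.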

Note, by the way, that groff's own pages are the single worst and
nastiest exceptions to this general rule of simplicity.  There are
impacted layers of hideous macrology in there, only some of which I
have successfully simplified to something reasonable.

Most of the other few exceptions are either old-time GNU projects 
or venerable tools from BSD-land. None of them are quite as bad 
as the groff pages.

Now I hope it is clearer why I have begun to think in terms of
actually enforcing hygienic macro use and blocking out low-level
requests.  What can be done by social-engineering page maintainers
into cleaning up their acts voluntarily has in fact been done already,
with considerable success.  It only took me more than ten years of
grinding at the problem!

The remaining holdouts are going to need a boot up the butt to
overcome their inertia.
-- 
                Eric S. Raymond <http://www.catb.org/~esr/>

No matter how one approaches the figures, one is forced to the rather
startling conclusion that the use of firearms in crime was very much
less when there were no controls of any sort and when anyone,
convicted criminal or lunatic, could buy any type of firearm without
restriction.  Half a century of strict controls on pistols has ended,
perversely, with a far greater use of this weapon in crime than ever
before.
        -- Colin Greenwood, in the study "Firearms Control", 1972
