Re: mini-book manual pages through multi-.so pages (i.e., the old proc(5) page)

Ingo Schwarze Thu, 25 Sep 2025 09:21:15 -0700

Hello Branden,

G. Branden Robinson wrote on Thu, Sep 25, 2025 at 04:15:02AM -0500:
> At 2025-09-25T02:02:24+0200, Ingo Schwarze wrote:


>> On the other hand, for mdoc(7), the situation is much worse than
>> for man(7) in so far as the macro order .Dd .Dt .Os used to be
>> mere convention, and any other order of these three macros used
>> to be equally valid.  Groff-1.23 utterly broke that and now always
>> starts a new manual page at .Dd, so every manual page with a different
>> macro order is now totally broken with groff.

> I broke it, and I broke it for a reason.  When formatting for paginated
> output devices (anything that isn't a terminal or HTML--the only output
> formats _mandoc_ natively supports[0]),
> [0] As I understand it, _mandoc_(1)'s PDF support comes from using an
>     external tool to generate it from HTML.  That approach has
>     significant limitations from a typesetting perspective.

Absolutely not.  The mandoc -Tps and -Tpdf output modes are implemented
as a submodule "term_ps.c" of the terminal-output module "term.c".
The module "term_ps.c" directly generates valid PostScript and PDF
code from the abstract man(7) and mdoc(7) syntax trees using knowledge
about the syntax and semantics of the PostScript and PDF stack-based
Turing-complete programming languages.  HTML is not involved in any
way, and the only program that mandoc(1) ever runs execve(2) on
is the pager, and only in man(1) = "mandoc -a" mode.

I totally agree that generating PDF from HTML would be a bad idea.
You can be forgiven for the misunderstanding insofar as your final
conclusion is not that far from the truth: while -T ps and -T pdf
generate syntactically valid and superficially acceptable code, the
quality of that code is rather low from a typesetting perspective.

While mandoc(1) also natively supports the -T tree, -T man,
and -T markdown output modes, those are not typesetting modes either,
so you are right that mandoc(1) has poor typesetting support,
much poorer than its development goals would require.  Even though
it will likely never reach groff(1) or Heirloom levels of
typesetting quality, it could do much better when given some love.
It does support paginated output in -T ps and -T pdf though,
including page headers and footers on every page (as opposed
to only for each *manual* page like in terminal output -
mandoc terminal output, in groff terminology, is always
continuous, but that does not apply to -T ps nor to -T pdf).

> when the formatter starts a new
> _man_(7) or _mdoc_(7) document, it must break the page.
> What happens at a page break?  The page footer gets populated.
> What populates the page footer?
> In _man_, various arguments to the `TH` call populate it.
> In _mdoc_, this same information is spread over multiple macro calls.
> 
> Before the macro package can break the page and write the footer, the
> data that populate the page footer must be in a well-defined state.
> In other words, you don't want some of it to come from document A and
> other bits of it to come from document B.

All true.

> If `Dd`, `Dt`, and `Os` can appear in arbitrary order, you risk
> producing an incorrect page footer, sticking some of document n+1's data
> at the bottom of the last page of document n.  I know this because I saw
> it happen.

Only if you concatenate sources.

> Possibly I could have added support for some kind of transitional state
> to _groff_'s _mdoc_ package, and deferred the page break until all 3
> macros had appeared regardless of ordering,

That creates new, different problems: what if one of the macros is
missing?  Then you would never start the new page at all?
In particular, it is easy to imagine a page where .Dd is missing, if
a page author (unwisely) decided displaying a date doesn't matter.

> but that would have added
> complicated logic.  My impression is that you're not a fan of
> complicated logic, as a rule.

Yes!  :)

I think the whole idea of formatting multiple pages in one go
is misguided because it creates the problem you describe of finding
page starts, a problem that is both intractable and entirely
unnecessary.  Also note that in a manual page, .TH is not necessarily
the first roff(7) request.  The trouble is unnecessary because you
*do* actually know where the pages start and end - otherwise you
could not concatenate them in the first place.  You are artificially
wiping out information that you actually have and then jumping
through no end of hoops attempting to recover the lost information
that in fact can no longer be recovered.

Heck, even something as simple as inserting an undocumented,
implementation-detail private macro

  .page_start_private_macro_to_please_branden

between pages instead of just recklessly cat(1)ing them might
mitigate some of the trouble (though i admit i didn't spend too much
thought on the idea and could be missing something).
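
As a rough illustration of that marker idea, here is a minimal Python
sketch.  It assumes the page sources are already available as strings
and simply writes the marker line before each page.  The macro name
is the hypothetical one from the text above; no roff implementation
defines it, and the function name join_pages is made up for this
sketch.

```python
# Sketch only: PAGE_START is the hypothetical private macro named in the
# text above; it is not defined by groff, mandoc, or any macro package.
PAGE_START = ".page_start_private_macro_to_please_branden"


def join_pages(sources):
    """Concatenate manual page sources, marking where each page starts."""
    parts = []
    for src in sources:
        parts.append(PAGE_START + "\n")
        # Make sure a page without a trailing newline cannot run into
        # the marker line of the next page.
        parts.append(src if src.endswith("\n") else src + "\n")
    return "".join(parts)
```

With such markers in place, the formatter would no longer have to
guess page boundaries from the order of preamble macros.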

The proper way to create a book from manual pages is to generate
each manual page separately and then concatenate the resulting
PostScript or PDF documents with an appropriate external tool other
than *roff(1).  There is no problem with page numbers or tables of
contents because formatting manual pages does not print page numbers
or tables of contents anyway, not even when (unwisely) formatted
in one go.
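
A minimal Python sketch of that workflow follows.  It assumes
mandoc(1) as the formatter and pdfunite(1) from Poppler as the
external merge tool; the latter is only one possible choice and is
not named anywhere in this thread.  The function names and the
default output name book.pdf are made up for the sketch.

```python
import os.path
import subprocess


def book_commands(pages, book="book.pdf"):
    """Return (argv, stdout_file) pairs: one formatter run per page,
    followed by a single merge step."""
    cmds, pdfs = [], []
    for page in pages:
        pdf = os.path.basename(page) + ".pdf"
        # mandoc writes the PDF to stdout, so remember where to put it.
        cmds.append((["mandoc", "-T", "pdf", page], pdf))
        pdfs.append(pdf)
    # Assumption: pdfunite takes the input files first, the output last.
    cmds.append((["pdfunite", *pdfs, book], None))
    return cmds


def run(cmds):
    """Execute the commands, redirecting stdout where requested."""
    for argv, stdout_file in cmds:
        if stdout_file is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_file, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)
```

Because every page goes through its own formatter run, no state can
leak from one page into the footer of another.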

> In my opinion, the segregation of `Dd`, `Dt`, and `Os` was a blunder in
> _mdoc_'s design

Yes, i mostly agree, though i would weigh the reasons why it was
a blunder slightly differently.  The worst part is that .Os turned
out to not be particularly useful at all, for any purpose, and is
very hard to make useful in any context.  And then, it was a blunder
that Cynthia did not specify a hard requirement on the order of
these macros.  Had such an order been documented and uncompromisingly
enforced by the code, the segregation would have done little harm,
even though not being particularly useful - it would have become
a mere bikeshed whether you prefer one macro with several positional
arguments or a group of macros with a mnemonic name each.

> for precisely the reason above.  The siren call of
> "semantic markup" was so loud that, in this case, it drowned out the
> murmur of practical typesetting considerations.  _mdoc_ should have had
> a `Th`.  There was no reason to spread this information over multiple
> calls; the macros are not "parsed" or "callable".
> 
> And as we've seen, the semantics of `Os` are readily distinguishable
> from the mnemonic its name suggestively dangles.
> 
> Furthermore, _mdoc_ documents that deviate from the canonical/
> (conventional?) order seem rare.  In a FreeBSD bug report raising this
> issue,[1] Wolfram Schneider identified only 15 pages in the
> base/core/whatever system (all from 1 package, I think: krb5), and 371
> out of about 15,000 in the ports collection.

Wow.  That's _way_ more than i would have expected - almost 400
real-world pages that got broken in FreeBSD alone?  That is, if it is
really true and not merely a miscounting of some kind.  For example,
FreeBSD uses MLINKS, i.e. ln(1) hard links between manual pages,
to add additional names (or "topics", as you would say) because
in FreeBSD, man(1) is a home-grown shell script rather than the
full mandoc man(1) implementation, and MLINKS could result in
counting the same page multiple times if the counting is done
too naively.  Beware, i'm not calling Wolfram naive; to the
contrary, he is usually fairly good at finding and reporting
bugs and at doing analysis.
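
To illustrate how hard links can inflate such a count, here is a
minimal Python sketch that counts the files below a directory twice:
once by name and once by (device, inode) pair, which hard links
share.  It says nothing about how Wolfram actually counted; the
function name count_pages is made up for the sketch.

```python
import os
import stat


def count_pages(root):
    """Return (naive, unique) counts of regular files below root."""
    naive = 0
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks and other non-regular files
            naive += 1
            # All hard links to one file share device and inode number.
            seen.add((st.st_dev, st.st_ino))
    return naive, len(seen)
```

If the two numbers differ for a manual page tree, a count by file
name includes some pages more than once.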

Since mandoc(1) has been warning about unconventional preamble
macro orderings for many years, i'm pretty sure there isn't a
single instance in the OpenBSD base system (i didn't check
recently, but i almost certainly checked many years ago).

> That's 2.4% of all _mdoc_
> pages in the ports.  (Since the ports will have a lot of _man_
> pages--I'll wager _significantly_ more than they do _mdoc_ pages--the
> proportion of affected pages is, if not negligible, then nearly so.[3])
> [1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274132
> [2] https://lists.gnu.org/archive/html/bug-groff/2025-09/msg00122.html
> [3] It's possible Wolfram counted _all_ man pages in the ports,
>     regardless of macro language, in which case 2.4% is likely an
>     accurate figure.  He didn't share his method, and I don't have an
>     easy way to crawl the entire FreeBSD ports collection.  I once
>     started to download a Git repository of it.  I interrupted it
>     because it looked like it was going to take all day, and too much
>     disk space.

Heh.  You only downloaded the ports Git repo, and even that seemed
large to you?  That repo doesn't even contain *any* of the code
nor any of the manuals nor any of the build systems - it purely
consists of meta instructions how to download the actual code
including the actual build systems, and it contains build system
wrappers explaining how to run the diverse build systems.  Do not
attempt a bulk build at home, unless you have a powerful cluster
of fast, modern machines, several days of time, and know exactly
what you are doing.  It is akin to running a full build of Debian,
including a full build of *all* Debian packages, including all
optional packages.

Even i never attempted an OpenBSD bulk build, and i have no access
to any build cluster that would even be remotely adequate for trying
it.  And the FreeBSD ports tree is significantly larger than the
OpenBSD one, probably at least twice the size.

> If someone does actually regard this as a defect in _groff_, they can
> say so.  I have not yet seen anyone make this claim.

I dimly recall complaining about the mdoc(7) preamble regressions
years ago, and i dimly recall your reply as something along the
lines of "that would be too hard to fix", so i mostly gave up on
it - and recently marked the related tests in the mandoc regression
suite as "broken in groff-1.23", to help me move on with other tasks.

>>> I find recent groff(1) being quite able to handle multi-.TH pages

>> Branden has invested massive effort into making it kind-of work,

> It should _totally_ work.  I have confidence in my automated tests.
> I urge you to file bug reports if you identify defects.

>> in fact so massive that i have totally lost track of what is going on.

> Have you read the code?

No.  Why should i?  I have not even read the documentation how it
is all supposed to work because the user interface design (before
even starting to think about the implementation) is so complicated
that i gave up on even reading the documentation, or the discussions
how it should be designed.

Even merely stubbing out and deactivating only those parts of the
new API that cause undesirable behaviour already cost me
non-negligible effort, even while ignoring the (likely much
larger) parts that are merely needless for OpenBSD purposes
and do not trigger any obvious harm.

> Where would explanatory comments be helpful?
> In my assessment, anything we would have to do to unwind inline font
> family or type size changes in _mdoc_ documents is going to be more
> intrusive and complex than support for "PDF booking".
> 
> > If i remember correctly, he has invented lots of new registers
> > along with lots of novel rules how to use them to make it work,
> > wrapping himself into elaborate nets of overengineering and
> > resulting in long discussions in various bug tracker tickets
> > about how it is all supposed to work.  I refrained from reading
> > most of that - too hard to understand and not really relevant for
> > any practical purpose that i care about.
> 
> Defect reports have been made in the past and, when confirmed, they
> often lead to discussion.  Is that unusual in your experience?
> 
> I'm open to proposed refactorings that keep all the tests passing.  If a
> "simplification" or "right-sizing" of the "overengineering" causes tests
> to fail, then the simplification is illusory--assuming you accept the
> premise that formatting a collection of man pages as a PDF document, or
> in printed and bound form, is not a crazy thing to do.
> 
> What it is, is outside of _mandoc_(1)'s mission.  But as you've quite
> recently noted, it's not outside of _groff_'s.[2]

Actually, with mandoc man(1),

   $ man -Tpdf true false > tmp.pdf

does result in an output file that both xpdf(1) and gv(1) display
just fine as a two-page document (with each manual page on one
page of the "book").  I'm not entirely sure the syntax is valid:

   $ grep -n PDF- tmp.pdf
  1:%PDF-1.1
  447:%PDF-1.1

I'm too lazy to check right now whether that is valid syntax;
reading the 750 page PDF specification is always a bit of a challenge.
But in case it is not valid syntax, it can certainly be fixed
*without* requiring concatenation of the input files, making
sure that mandoc continues to process every input file entirely
separately, without any spillover from one file to the next,
and without any ambiguity where one manual page ends and the
next begins.
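
The grep(1) check above can also be done from a script.  The
following minimal Python sketch reports the line numbers on which a
"%PDF-" header occurs in a byte string.  It only detects repeated
headers and makes no statement about whether such a file is valid
PDF; the function name pdf_header_lines is made up for the sketch.

```python
def pdf_header_lines(data):
    """Return the 1-based numbers of all lines containing b"%PDF-"."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if b"%PDF-" in line
    ]
```

For a file on disk, pass the result of reading it in binary mode,
for example open("tmp.pdf", "rb").read().  A result with more than
one entry means the header was emitted more than once.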

Now mandoc(1) does all of that by itself, following a badly non-Unix-style
monolithic software architecture.  The traditional way roff
operates is in Unix-style through cooperation of many small tools
that each do one particular job well, so the concatenation should
almost certainly be done *after* the troff(1) and *after* the
postprocessor stage.  In fact, in an internal sense, that is true
even on the inside of mandoc(1): the concatenation happens in main.c
after all the parsers - broadly similar to the troff and macro
processing stages - and the formatter - broadly similar to the
postprocessing stage - have run.  But even teaching term_ps.c
to not spew duplicate %PDF- headers and duplicate object directories(?),
if that should be invalid syntax, would likely not be all that hard.

> To address the gripe you raise above about `Dd`, `Dt`, and `Os` would
> require--guess what?--more registers (and/or strings) and more
> complexity.  Is that what you want?

No.  Then again, what mandoc(1) does - not by its own choice, but
for compatibility with pre-1.23 groff - is not *that* complicated:
Mandoc distinguishes (and groff used to distinguish) two parsing
phases: a preamble phase and a content phase.  The phase transition
is triggered when the first output happens - in mandoc, that's
modeled by encountering the first non-preamble mdoc(7) or man(7)
macro or the first text (i.e. non-request non-macro) input line -
but that's an implementation detail, roff would likely use some
concept of traps instead.  When the phase transition happens, the
first header line is printed, with whatever content is available
at that time.  The content of footer lines, and of header lines
on subsequent pages, can still be changed by (some) preamble macros
occurring late, i.e. when already in the body phase - the details
are not particularly consistent but have been implemented for
compatibility with pre-1.23 groff.  I would certainly be open to
making all this more consistent and even simpler.
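
The two-phase model described above can be sketched in a few lines
of Python.  This is a simplified toy model, not the mandoc(1)
implementation: the class name Page and its attributes are made up,
every control line other than a comment is treated as a macro (so
low-level roff requests are not told apart), and every late preamble
macro is allowed to update the metadata, which is more uniform than
the real behaviour.

```python
PREAMBLE = {"Dd", "Dt", "Os"}


class Page:
    """Toy model of the preamble and content phases of one mdoc(7) page."""

    def __init__(self):
        self.meta = {}            # latest arguments seen for Dd, Dt, Os
        self.in_body = False      # False: preamble phase; True: content phase
        self.first_header = None  # snapshot taken at the phase transition

    def feed(self, line):
        if line.startswith('.\\"'):
            return  # roff comment: no output, no phase change
        words = line[1:].split(None, 1) if line.startswith(".") else []
        name = words[0] if words else None
        if name in PREAMBLE:
            # Preamble macros update the metadata in either phase, so a
            # late one can still affect footers and later headers.
            self.meta[name] = words[1] if len(words) > 1 else ""
            return
        if not self.in_body:
            # First body macro or text line: switch phases and print the
            # first header with whatever has been collected so far.
            self.in_body = True
            self.first_header = dict(self.meta)
```

Feeding the lines ".Dt TRUE 1", ".Os", ".Sh NAME" and then a late
".Dd" line leaves the date out of first_header but records it in
meta, which mirrors the idea that later headers and footers can
still pick it up.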

By the way, purely internal, undocumented strings and registers
are not a problem, instead they are mere implementation details
(well, simple code is a virtue, too, but not nearly as crucial
as simple UI and documentation).  Documented strings and registers
and instructions on how to use them are what i call "complexity".

Yours,
  Ingo
