Hello Branden,

G. Branden Robinson wrote on Thu, Sep 25, 2025 at 04:15:02AM -0500:
> At 2025-09-25T02:02:24+0200, Ingo Schwarze wrote:
>> On the other hand, for mdoc(7), the situation is much worse than
>> for man(7) in so far as the macro order .Dd .Dt .Os used to be
>> mere convention, and any other order of these three macros used
>> to be equally valid. Groff-1.23 utterly broke that and now always
>> starts a new manual page at .Dd, so every manual page with a different
>> macro order is now totally broken with groff.

> I broke it, and I broke it for a reason. When formatting for paginated
> output devices (anything that isn't a terminal or HTML--the only output
> formats _mandoc_ natively supports[0]),

> [0] As I understand it, _mandoc_(1)'s PDF support comes from using an
>     external tool to generate it from HTML. That approach has
>     significant limitations from a typesetting perspective.

Absolutely not. The mandoc -Tps and -Tpdf output modes are implemented
as a submodule "term_ps.c" of the terminal-output module "term.c".
The module "term_ps.c" directly generates valid PostScript and PDF
code from the abstract man(7) and mdoc(7) syntax trees using knowledge
about the syntax and semantics of the PostScript and PDF stack-based
Turing-complete programming languages. HTML is not involved in any
way, and the only program that mandoc(1) ever runs execve(2) on is
the pager, and only in man(1) = "mandoc -a" mode.

I totally agree that generating PDF from HTML would be a bad idea.

You can be forgiven for the misunderstanding in so far as your final
conclusion is not that far from the truth: while -T ps and -T pdf
generate syntactically valid and superficially acceptable code, the
quality of that code is rather low from a typesetting perspective.
While mandoc(1) also natively supports the -T tree, -T man, and
-T markdown output modes, those are not typesetting modes either,
so you are right that mandoc(1) has poor typesetting support, much
poorer than it needs to have as a consequence of its development
goals - even though it will likely never reach groff(1) or Heirloom
levels of typesetting quality, it could do much better when given
some love.

It does support paginated output in -T ps and -T pdf though, including
page headers and footers on every page (as opposed to only for each
*manual* page like in terminal output - mandoc terminal output, in
groff terminology, is always continuous, but that does not apply
to -T ps nor to -T pdf).

> when the formatter starts a new
> _man_(7) or _mdoc_(7) document, it must break the page.
> What happens at a page break? The page footer gets populated.
> What populates the page footer?
> In _man_, various arguments to the `TH` call populate it.
> In _mdoc_, this same information is spread over multiple macro calls.
>
> Before the macro package can break the page and write the footer, the
> data that populate the page footer must be in a well-defined state.
> In other words, you don't want some of it to come from document A and
> other bits of it to come from document B.

All true.

> If `Dd`, `Dt`, and `Os` can appear in arbitrary order, you risk
> producing an incorrect page footer, sticking some of document n+1's data
> at the bottom of the last page of document n. I know this because I saw
> it happen.

Only if you concatenate sources.

> Possibly I could have added support for some kind of transitional state
> to _groff_'s _mdoc_ package, and deferred the page break until all 3
> macros had appeared regardless of ordering,

That creates new, different problems: what if one of the macros
is missing? Then you would never start the new page at all?
In particular, it is easy to imagine a page where .Dd is missing,
if a page author (unwisely) decided displaying a date doesn't matter.

> but that would have added
> complicated logic. My impression is that you're not a fan of
> complicated logic, as a rule.

Yes! :)

I think the whole idea of formatting multiple pages in one go is
misguided because it creates the problem that you describe of
finding page starts, a problem that is both intractable and entirely
unnecessary - also note that in a manual page, .TH is not necessarily
the first roff(7) request.

The trouble is unnecessary because you *do* actually know where the
pages start and end - otherwise you could not concatenate them in
the first place. You are artificially wiping out information that
you actually have and then jumping through no end of hoops attempting
to recover the lost information that in fact can no longer be
recovered.

Heck, even something as simple as inserting an undocumented,
implementation-detail private macro
.page_start_private_macro_to_please_branden between pages instead
of just recklessly cat(1)ing them might mitigate some of the trouble
(though i admit i didn't spend too much thought on the idea and
could be missing something).

The proper way to create a book from manual pages is to generate
each manual page separately and then concatenate the resulting
PostScript or PDF documents with an appropriate external tool
other than *roff(1). There is no problem with page numbers or
tables of contents because formatting manual pages does not print
page numbers or tables of contents anyway, not even when (unwisely)
formatted in one go.

> In my opinion, the segregation of `Dd`, `Dt`, and `Os` was a blunder in
> _mdoc_'s design

Yes, i mostly agree, though i would weigh the reasons why it was
a blunder slightly differently. The worst part is that .Os turned
out to not be particularly useful at all, for any purpose, and is
very hard to make useful in any context.
And then, it was a blunder that Cynthia did not specify a hard
requirement on the order of these macros. Had such an order been
documented and uncompromisingly enforced by the code, the segregation
would have done little harm, even though not being particularly
useful - it would have become a mere bikeshed whether you prefer
one macro with several positional arguments or a group of macros
with a mnemonic name each.

> for precisely the reason above. The siren call of
> "semantic markup" was so loud that, in this case, it drowned out the
> murmur of practical typesetting considerations. _mdoc_ should have had
> a `Th`. There was no reason to spread this information over multiple
> calls; the macros are not "parsed" or "callable".
>
> And as we've seen, the semantics of `Os` are readily distinguishable
> from the mnemonic its name suggestively dangles.
>
> Furthermore, _mdoc_ documents that deviate from the canonical/
> (conventional?) order seem rare. In a FreeBSD bug report raising this
> issue,[1] Wolfram Schneider identified only 15 pages in the
> base/core/whatever system (all from 1 package, I think: krb5), and 371
> out of about 15,000 in the ports collection.

Wow. That's _way_ more than i would have expected - almost 400
real-world pages that got broken in FreeBSD alone?

That is, if it is really true and not merely a miscounting of some
kind - for example, FreeBSD uses MLINKS, i.e. ln(1) hard links
between manual pages to add additional names (or "topics", as you
would say) because in FreeBSD, man(1) is a home-grown shell script
rather than the full mandoc man(1) implementation, and MLINKS could
result in counting the same page multiple times if the counting is
done too naively. Beware, i'm not calling Wolfram naive, to the
contrary, he is usually fairly good at finding and reporting bugs
and at doing analysis.
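
To illustrate the MLINKS pitfall: a count that is not fooled by hard
links can key on device and inode numbers rather than on file names.
The following is only a minimal sketch in Python; the function name
is made up for this example, and nothing here reflects how Wolfram
actually counted.

```python
import os


def count_unique_pages(paths):
    """Count manual page files, treating hard links as one page.

    Hypothetical helper: two names that share a device and inode
    number are the same file and are counted only once.
    """
    seen = set()
    for path in paths:
        st = os.stat(path)
        seen.add((st.st_dev, st.st_ino))
    return len(seen)
```

Counting file names instead would report every MLINKS alias as a
separate page and so inflate the number of affected pages.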
Since mandoc(1) has been warning about unconventional preamble macro
orderings for many years, i'm pretty sure there isn't a single
instance in the OpenBSD base system (i didn't check recently, but
i almost certainly checked many years ago).

> That's 2.4% of all _mdoc_
> pages in the ports. (Since the ports will have a lot of _man_
> pages--I'll wager _significantly_ more than they do _mdoc_ pages--the
> proportion of affected pages is, if not negligible, then nearly so.[3])

> [1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274132
> [2] https://lists.gnu.org/archive/html/bug-groff/2025-09/msg00122.html
> [3] It's possible Wolfram counted _all_ man pages in the ports,
>     regardless of macro language, in which case 2.4% is likely an
>     accurate figure. He didn't share his method, and I don't have an
>     easy way to crawl the entire FreeBSD ports collection. I once
>     started to download a Git repository of it. I interrupted it
>     because it looked like it was going to take all day, and too much
>     disk space.

Heh. You only downloaded the ports Git repo, and even that seemed
large to you? That repo doesn't even contain *any* of the code nor
any of the manuals nor any of the build systems - it purely consists
of meta instructions how to download the actual code including the
actual build systems, and it contains build system wrappers explaining
how to run the diverse build systems.

Do not attempt a bulk build at home, unless you have a powerful
cluster of fast, modern machines, several days of time, and know
exactly what you are doing. It is akin to running a full build of
Debian, including a full build of *all* Debian packages, including
all optional packages. Even i never attempted an OpenBSD bulk build,
and i have no access to any build cluster that would even be remotely
adequate for trying it. And the FreeBSD ports tree is significantly
larger than the OpenBSD one, probably at least twice the size.
> If someone does actually regard this as a defect in _groff_, they can
> say so. I have not yet seen anyone make this claim.

I dimly recall complaining about the mdoc(7) preamble regressions
years ago, and i dimly recall your reply as something along the
lines of "that would be too hard to fix", so i mostly gave up on
it - and recently marked the related tests in the mandoc regression
suite as "broken in groff-1.23", to help me move on with other tasks.

>>> I find recent groff(1) being quite able to handle multi-.TH pages

>> Branden has invested massive effort into making it kind-of work,

> It should _totally_ work. I have confidence in my automated tests.
> I urge you to file bug reports if you identify defects.

>> in fact so massive that i have totally lost track of what is going on.

> Have you read the code?

No. Why should i? I have not even read the documentation of how it
is all supposed to work because the user interface design (before
even starting to think about the implementation) is so complicated
that i gave up on even reading the documentation, or the discussions
of how it should be designed. Even merely stubbing out and deactivating
only those parts of the new API that cause undesirable behaviour
already caused non-negligible effort for me, even while ignoring
the (likely much larger) parts that are merely needless for OpenBSD
purposes but without triggering any obvious harm.

> Where would explanatory comments be helpful?
> In my assessment, anything we would have to do to unwind inline font
> family or type size changes in _mdoc_ documents is going to be more
> intrusive and complex than support for "PDF booking".
>
> > If i remember correctly, he has invented lots of new registers
> > along with lots of novel rules how to use them to make it work,
> > wrapping himself into elaborate nets of overengineering and
> > resulting in long discussions in various bug tracker tickets
> > about how it is all supposed to work.
> > I refrained from reading
> > most of that - too hard to understand and not really relevant for
> > any practical purpose that i care about.
>
> Defect reports have been made in the past and, when confirmed, they
> often lead to discussion. Is that unusual in your experience?
>
> I'm open to proposed refactorings that keep all the tests passing. If a
> "simplification" or "right-sizing" of the "overengineering" causes tests
> to fail, then the simplification is illusory--assuming you accept the
> premise that formatting a collection of man pages as a PDF document, or
> in printed and bound form, is not a crazy thing to do.
>
> What it is, is outside of _mandoc_(1)'s mission. But as you've quite
> recently noted, it's not outside of _groff_'s.[2]

Actually, with mandoc man(1),

   $ man -Tpdf true false > tmp.pdf

does result in an output file that both xpdf(1) and gv(1) display
just fine as a two-page document (with each manual page on one
page of the "book"). I'm not entirely sure the syntax is valid:

   $ grep -n PDF- tmp.pdf
   1:%PDF-1.1
   447:%PDF-1.1

I'm too lazy to check right now whether that is valid syntax, reading
the 750 page PDF specification is always a bit of a challenge, but
in case it is not valid syntax, it can certainly be fixed *without*
requiring concatenation of the input files, making sure that mandoc
continues to process every input file entirely separately, without
any spillover from one file to the next, and without any ambiguity
where one manual page ends and the next begins.

Now mandoc(1) does all of that by itself, having a badly
non-Unix-style monolithic software architecture approach. The
traditional way roff operates is in Unix-style through cooperation
of many small tools that each do one particular job well, so the
concatenation should almost certainly be done *after* the troff(1)
and *after* the postprocessor stage.
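
As a rough illustration of "format each page separately, join the
results afterwards", here is a small Python sketch. The function
names are invented for this example, and Ghostscript as the joining
tool is merely an assumption - any external tool able to concatenate
PDF files would serve. Only "mandoc -Tpdf" is taken from the
discussion above.

```python
import subprocess


def book_commands(pages, out_pdf):
    """Return (format_cmds, merge_cmd) for building a book.

    Each page is formatted in its own formatter run, so no state can
    leak from one page into the next; only finished PDFs are joined.
    """
    format_cmds = []
    pdfs = []
    for src in pages:
        pdf = src + ".pdf"
        # mandoc writes the formatted document to standard output,
        # so the caller has to redirect it into the file `pdf`.
        format_cmds.append((["mandoc", "-Tpdf", src], pdf))
        pdfs.append(pdf)
    # Assumption: Ghostscript's pdfwrite device does the joining.
    merge_cmd = ["gs", "-q", "-dBATCH", "-dNOPAUSE",
                 "-sDEVICE=pdfwrite", "-sOutputFile=" + out_pdf] + pdfs
    return format_cmds, merge_cmd


def build_book(pages, out_pdf):
    """Run the commands; requires mandoc and gs to be installed."""
    format_cmds, merge_cmd = book_commands(pages, out_pdf)
    for argv, pdf in format_cmds:
        with open(pdf, "wb") as f:
            subprocess.run(argv, stdout=f, check=True)
    subprocess.run(merge_cmd, check=True)
```

The point of the split is that page boundaries are never lost: every
formatter run sees exactly one input file.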
In fact, in an internal sense, that is true even on the inside of
mandoc(1): the concatenation happens in main.c after all the parsers -
broadly similar to the troff and macro processing stages - and the
formatter - broadly similar to the postprocessing stage - have run.
But even teaching term_ps.c to not spew duplicate %PDF- headers and
duplicate object directories(?), if that should be invalid syntax,
would likely not be all that hard.

> To address the gripe you raise above about `Dd`, `Dt`, and `Os` would
> require--guess what?--more registers (and/or strings) and more
> complexity. Is that what you want?

No.

Then again, what mandoc(1) does - not by its own choice, but for
compatibility with pre-1.23 groff - is not *that* complicated:

Mandoc distinguishes (and groff used to distinguish) two parsing
phases: a preamble phase and a content phase. The phase transition
is triggered when the first output happens - in mandoc, that's
modeled by encountering the first non-preamble mdoc(7) or man(7)
macro or the first text (i.e. non-request non-macro) input line -
but that's an implementation detail, roff would likely use some
concept of traps instead. When the phase transition happens, the
first header line is printed, with whatever content is available
at that time. The content of footer lines, and of header lines on
subsequent pages, can still be changed by (some) preamble macros
occurring late, i.e. when already in the body phase - the details
are not particularly consistent but have been implemented for
compatibility with pre-1.23 groff.

I would certainly be open to making all this more consistent
and even simpler.

By the way, purely internal, undocumented strings and registers
are not a problem, instead they are mere implementation details
(well, simple code is a virtue, too, but not nearly as crucial as
simple UI and documentation). Documented strings and registers
and instructions how to use them are what i call "complexity".

Yours,
  Ingo
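
For illustration, the two-phase model described above can be sketched
as a toy in Python. This is not mandoc's code: the names are invented
for the example, and it simplifies by letting *every* late preamble
macro update the footer data, whereas the real behaviour applies
only to some of them and is less consistent.

```python
PREAMBLE_MACROS = ("Dd", "Dt", "Os")  # mdoc(7) preamble macros


def macro_and_args(line):
    """Split a macro line into (name, args); (None, "") for text."""
    if not line.startswith("."):
        return None, ""
    fields = line[1:].split(None, 1)
    if not fields:
        return None, ""
    return fields[0], (fields[1] if len(fields) > 1 else "")


def scan(lines):
    """Return (header, footer) metadata for one manual page.

    header: what is known when the first output happens,
            or None if the input never produces output.
    footer: what is known at the end of the input.
    """
    meta = {}
    header = None  # None means: still in the preamble phase
    for line in lines:
        if line.startswith('.\\"'):
            continue  # comment: no output, no phase change
        name, args = macro_and_args(line)
        if name in PREAMBLE_MACROS:
            meta[name] = args  # may also arrive late, in the body
        elif header is None:
            header = dict(meta)  # first output: snapshot the preamble
    return header, meta
```

With the input lines ".Dt TRUE 1", ".Dd September 25, 2025",
".Sh NAME", ".Os", the header snapshot lacks the Os entry because
.Sh triggers the phase transition first, while the footer data still
picks up the late .Os.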