Hi,

G. Branden Robinson wrote on Thu, Aug 17, 2023 at 06:44:14PM -0500:
> At 2023-08-17T21:12:35+0200, Alejandro Colomar wrote:

>> The problem is that at no point you can have the .roff source, after
>> the man(7) macros have been expanded. Would it be possible to split
>> the groff(1) pipeline to have one more preprocessor, let's call it
>> woman(1) (because man(1) is already taken), so that it translates
>> man(7) to roff(7)?

> In other words, you want to see what a *roff document looks like after
> all macro expansions have been (recursively) performed.
>
> I wanted this, too, back in 2017 when I first started working on groff.
>
> The short answer is "no".
>
> The longer answer is that this is hard because GNU troff, like AT&T
> troff, never builds a complete syntax tree for the document the way
> "modern" document formatters do. nroff and troff were written and
> deployed on DEC PDP-11 machines that are today considered embedded
> microcontroller environments. Therefore they handled as little input at
> one time as possible. Roughly, this meant that input was collected,
> macro-expanded as soon as it was seen, and then as soon as it was time
> to break an output line, a lot of formatter state related to parsing was
> flushed, and it started reading input again.
>
> Understanding *roff a little better 6 years later, I can more easily
> imagine ways to run AT&T troff out of memory on a PDP-11. Ultra-long
> diversions would be one way,[1]

> [1] Nobody _except_ mandoc(1) seems to handle this well. Credit where
>     it's due. https://savannah.gnu.org/bugs/?64229

Praise is usually nice to have, but i must admit this particular praise
surprises me on more than one level. :-)

https://man.openbsd.org/roff.7#di says:

  di divname
    Begin a diversion. Currently unsupported. [by mandoc(1)]

I'm not completely convinced not supporting a particular request
at all amounts to "handling it well". Besides,

   $ time { printf '.di foo\n.nf\n'; yes abcdefghijklm; } | mandoc
  mandoc: Cannot allocate memory
    0m07.61s real     0m05.67s user     0m01.81s system

i.e.
infinite input crashes mandoc - admittedly via err(3) after malloc(3)
returns NULL, which is relatively controlled, but still a crash.

But GNU troff isn't actually *that* much worse:

   $ time { printf '.di foo\n.nf\n'; yes abcdefghijklm; } | troff
  Abort trap (core dumped)
    0m24.72s real     0m04.43s user     0m03.82s system

with this backtrace:

  _libc_abort at /usr/src/lib/libc/stdlib/abort.c:51
  abort_message at .../llvm/libcxxabi/src/abort_message.cpp:78
  demangling_terminate_handler
    at .../libcxxabi/src/cxa_default_handlers.cpp:66
  std::__terminate at .../llvm/libcxxabi/src/cxa_handlers.cpp:59
  __cxxabiv1::failed_throw (exception_header=0x61079458300)
    at .../llvm/libcxxabi/src/cxa_exception.cpp:152
  __cxa_throw (thrown_object=0x61079458380,
    tinfo=0x61003540200 <typeinfo for std::bad_alloc>,
    dest=0x6100353a340 <std::exception::~exception()>)
    at .../llvm/libcxxabi/src/cxa_exception.cpp:283
  operator new at .../llvm/libcxx/src/new.cpp:76

Exiting via abort(3) is also a relatively controlled way of dying.
Arguably it's a bit less clean here in troff than in mandoc because
signals are involved, and Unix signals are among the worst parts of
the C and POSIX programming environment and should be avoided whenever
possible, since they are generally fragile and often invite
vulnerabilities. But in this case, this is not the fault of GNU troff.
This downside merely follows from the choice of the implementation
language C++, which suffers from ill-designed, very messy error
handling in general.

I'm not sure why you see a SIGKILL getting thrown at the troff process
on your machine - but i *suspect* that may have nothing to do with
GNU troff either and may be an implementation detail of whatever
operating system, C++ compiler, and C++ standard library you are using.
Sure, at first sight, an explicit abort(3) being called on the C library
level *might* look slightly safer than SIGKILL flying around - then
again, i'm not really sure it makes a difference.
Whether that actually is a security risk depends on many details
you did not disclose. Quite possibly it isn't.

> because formatted diversion contents
> have to be kept in memory until they're called for. A multiplicity of
> moderately sized diversions would do it too. Conditional blocks would
> be another problem. When encountering a brace escape sequence \{, the
> formatter has to scan ahead in the input. Or at least GNU troff does.
> Maybe AT&T troff did something clever, but its source code is famously
> opaque.
>
> I'll say it before Ingo does: mandoc(1) (as I understand it) _does_
> build a syntax tree for the entire document before producing output,
> which enables some of the nice features that it has.

Correct.

However, before Alejandro gets carried away with enthusiasm, let me
emphasize that it does the opposite of what Alejandro is asking for:
He wants all the man(7) macros converted to roff(7) requests.
Instead, mandoc *removes* all roff requests from the document such that
it gets a pure man(7) syntax tree with (almost) no roff left in it -
still making sure most of those roff requests take effect before being
removed. Sound impossible, almost paradoxical? Yes it does, but it
works surprisingly well all the same. See this 12-year-old presentation,

  https://www.openbsd.org/papers/bsdcan11-mandoc-openbsd.html

in particular page 12 "No way around some low-level roff requests."
and page 13 "Desperation lead to success: Paradigmatic switch"

Yours,
  Ingo
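
To make the idea of "roff requests take effect, then get removed" a bit
more concrete, here is a toy Python sketch. It is emphatically not
mandoc's actual code or data model, just an illustration under strong
assumptions: the names Node, MAN_MACROS, STRING_REF, parse, and SAMPLE
are hypothetical, and the only roff features handled are comments, the
.ds request, and \*[name] string interpolation.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry of the toy syntax tree (hypothetical, not mandoc's)."""
    kind: str                      # "macro" or "text"
    name: str = ""                 # macro name such as "SH"; empty for text
    args: list = field(default_factory=list)
    text: str = ""                 # content of a text line


# Toy subset of man(7) macros that are kept in the tree (assumption).
MAN_MACROS = {"TH", "SH", "SS", "PP", "TP", "B", "I", "BR"}

# Matches a roff string reference of the form \*[name].
STRING_REF = re.compile(r"\\\*\[([^\]]+)\]")


def parse(source):
    """Return man(7)-only nodes; roff requests act on state and vanish."""
    strings = {}   # roff state: filled by .ds, consumed by \*[name]
    nodes = []
    for raw in source.splitlines():
        if raw.startswith('.\\"'):
            continue               # roff comment line: drop it
        # Apply the effect of earlier .ds requests before anything else.
        line = STRING_REF.sub(lambda m: strings.get(m.group(1), ""), raw)
        if not line.startswith("."):
            nodes.append(Node("text", text=line))
            continue
        words = line[1:].split()
        if not words:
            continue               # a lone "." does nothing
        name, args = words[0], words[1:]
        if name == "ds":
            # Low-level roff request: record its effect, emit no node.
            if args:
                strings[args[0]] = " ".join(args[1:])
        elif name in MAN_MACROS:
            nodes.append(Node("macro", name=name, args=args))
        # Any other request is silently dropped in this toy.
    return nodes


SAMPLE = "\n".join([
    r'.\" a comment',
    r".ds proj groff",
    r".SH NAME",
    r"\*[proj] \- a typesetter",
    r".B \*[proj]",
])
```

Calling parse(SAMPLE) yields three nodes: an SH macro, one text line,
and a B macro. The .ds request itself never shows up in the tree, yet
its effect is visible, because both \*[proj] references have been
replaced by "groff".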