Re: Proposed: change `pm` request argument semantics (was: process man(7) (or any other package of macros) without typesetting)

Ingo Schwarze Thu, 17 Aug 2023 18:25:38 -0700

Hi,

G. Branden Robinson wrote on Thu, Aug 17, 2023 at 06:44:14PM -0500:
> At 2023-08-17T21:12:35+0200, Alejandro Colomar wrote:


>> The problem is that at no point you can have the .roff source, after
>> the man(7) macros have been expanded.  Would it be possible to split
>> the groff(1) pipeline to have one more preprocessor, let's call it
>> woman(1) (because man(1) is already taken), so that it translates
>> man(7) to roff(7)?

> In other words, you want to see what a *roff document looks like after
> all macro expansions have been (recursively) performed.
> 
> I wanted this, too, back in 2017 when I first started working on groff.
> 
> The short answer is "no".
> 
> The longer answer is that this is hard because GNU troff, like AT&T
> troff, never builds a complete syntax tree for the document the way
> "modern" document formatters do.  nroff and troff were written and
> deployed on DEC PDP-11 machines that are today considered embedded
> microcontroller environments.  Therefore they handled as little input at
> one time as possible.  Roughly, this meant that input was collected,
> macro-expanded as soon as it was seen, and then as soon as it was time
> to break an output line, a lot of formatter state related to parsing was
> flushed, and it started reading input again.
> 
> Understanding *roff a little better 6 years later, I can more easily
> imagine ways to run AT&T troff out of memory on a PDP-11.  Ultra-long
> diversions would be one way,[1]
> [1] Nobody _except_ mandoc(1) seems to handle this well.  Credit where
>     it's due.  https://savannah.gnu.org/bugs/?64229

Praise is usually nice to have, but i must admit this particular praise
surprises me on more than one level.  :-)

https://man.openbsd.org/roff.7#di says:

  di divname
    Begin a diversion. Currently unsupported.   [by mandoc(1)]

I'm not completely convinced not supporting a particular request
at all amounts to "handling it well".

Besides,

   $ time { printf '.di foo\n.nf\n'; yes abcdefghijklm; } | mandoc
  mandoc: Cannot allocate memory
    0m07.61s real     0m05.67s user     0m01.81s system

i.e. infinite input crashes mandoc - admittedly via err(3) after
malloc(3) returns NULL, which is relatively controlled, but
still a crash.

But GNU troff isn't actually *that* much worse:

   $ time { printf '.di foo\n.nf\n'; yes abcdefghijklm; } | troff
  Abort trap (core dumped) 
    0m24.72s real     0m04.43s user     0m03.82s system

with this backtrace:

  _libc_abort at /usr/src/lib/libc/stdlib/abort.c:51
  abort_message at .../llvm/libcxxabi/src/abort_message.cpp:78
  demangling_terminate_handler at .../libcxxabi/src/cxa_default_handlers.cpp:66
  std::__terminate at .../llvm/libcxxabi/src/cxa_handlers.cpp:59
  __cxxabiv1::failed_throw (exception_header=0x61079458300)
    at .../llvm/libcxxabi/src/cxa_exception.cpp:152
  __cxa_throw (thrown_object=0x61079458380, 
    tinfo=0x61003540200 <typeinfo for std::bad_alloc>, 
    dest=0x6100353a340 <std::exception::~exception()>)
    at .../llvm/libcxxabi/src/cxa_exception.cpp:283
  operator new at .../llvm/libcxx/src/new.cpp:76

Exiting via abort(3) is also a relatively controlled way of dying.
Arguably it's a bit less clean here in troff than in mandoc
because signals are involved, and Unix signals are among the
worst parts of the C and POSIX programming environment and should
be avoided whenever possible, since they are generally fragile
and often invite vulnerabilities.  But in this case, this is not
the fault of GNU troff.  This downside merely follows from the
choice of the implementation language C++, which suffers from
ill-designed, very messy error handling in general.

I'm not sure why you see a SIGKILL getting thrown at the troff process
on your machine - but i *suspect* that may have nothing to do with GNU
troff either and may be an implementation detail of whatever operating
system, C++ compiler, and C++ standard library you are using.  Sure, at
first sight, an explicit abort(3) being called at the C library level
*might* look slightly safer than SIGKILL flying around - then again,
i'm not really sure it makes a difference.  Whether that actually
is a security risk depends on many details you did not disclose.
Quite possibly it isn't.
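
To tell these cases apart when reproducing such an experiment, it may
help to look at the wait status instead of the message the shell
prints.  Here is a minimal sketch in Python, assuming a POSIX system.
It is not part of mandoc or groff: the helper name describe_exit is
made up for illustration, and the two child commands are stand-ins
that merely mimic an err(3)-style exit and a call to abort(3).

```python
import signal
import subprocess
import sys


def describe_exit(argv):
    """Run argv and report how it ended (hypothetical helper).

    On POSIX, subprocess reports death by signal N as returncode -N;
    a normal exit yields the non-negative exit status.
    """
    proc = subprocess.run(argv,
                          stdin=subprocess.DEVNULL,
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)
    if proc.returncode < 0:
        return "signal", signal.Signals(-proc.returncode).name
    return "exit", proc.returncode


# Stand-in for an err(3)-style death: an ordinary exit with status 1.
clean = describe_exit([sys.executable, "-c", "import sys; sys.exit(1)"])

# Stand-in for abort(3): the child raises SIGABRT against itself.
aborted = describe_exit([sys.executable, "-c", "import os; os.abort()"])

print(clean)
print(aborted)
```

Passing the real troff or mandoc command line to describe_exit
instead of the stand-ins would show whether the process exited on
its own, aborted, or was killed from the outside.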

> because formatted diversion contents
> have to be kept in memory until they're called for.  A multiplicity of
> moderately sized diversions would do it too.  Conditional blocks would
> be another problem.  When encountering a brace escape sequence \{, the
> formatter has to scan ahead in the input.  Or at least GNU troff does.
> Maybe AT&T troff did something clever, but its source code is famously
> opaque.
> 
> I'll say it before Ingo does: mandoc(1) (as I understand it) _does_
> build a syntax tree for the entire document before producing output,
> which enables some of the nice features that it has.

Correct.

However, before Alejandro gets carried away with enthusiasm, let
me emphasize that it does the opposite of what Alejandro is asking
for: He wants all the man(7) macros converted to roff(7) requests.
Instead, mandoc *removes* all roff requests from the document such
that it gets a pure man(7) syntax tree with (almost) no roff left in
it - still making sure most of those roff requests take effect before
being removed.  Sound impossible, almost paradoxical?  Yes it does,
but it works surprisingly well all the same.  See this 12-year-old
presentation,

  https://www.openbsd.org/papers/bsdcan11-mandoc-openbsd.html

  in particular page 12 "No way around some low-level roff requests."
  and page 13 "Desperation lead to success: Paradigmatic switch"
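
To make the idea a bit more concrete, here is a toy sketch in Python.
It is emphatically not how mandoc is implemented.  The names parse,
MAN_MACROS, and STRING_REF are invented for illustration, and only
the ds request, the \*[name] interpolation, and a handful of man(7)
macros are handled.  The point is merely that a request can take
effect while the input is read and still leave no node behind.

```python
import re

# A tiny, illustrative subset of man(7) macros (assumption of this toy).
MAN_MACROS = {"TH", "SH", "SS", "PP", "TP", "B", "I"}

# Matches string interpolations of the form \*[name].
STRING_REF = re.compile(r"\\\*\[([^\]]+)\]")


def parse(source):
    """Return a list of man(7) nodes with the roff requests removed."""
    strings = {}   # roff-level state: strings defined with .ds
    tree = []      # resulting nodes: ("macro", name, args) or ("text", line)
    for line in source.splitlines():
        # Let earlier .ds requests take effect on this line.
        line = STRING_REF.sub(lambda m: strings.get(m.group(1), ""), line)
        if not line.startswith("."):
            tree.append(("text", line))
            continue
        fields = line[1:].split()
        if not fields:
            continue
        name, args = fields[0], fields[1:]
        if name == "ds" and args:
            # roff request: record its effect, but emit no node.
            strings[args[0]] = " ".join(args[1:])
        elif name in MAN_MACROS:
            tree.append(("macro", name, args))
        # Anything else (comments, unknown requests) is dropped here.
    return tree


doc = r""".ds pkg mandoc
.TH DEMO 1
.SH NAME
demo \- uses \*[pkg]
"""

tree = parse(doc)
for node in tree:
    print(node)
```

The resulting list contains only man(7) macro nodes and text lines.
The .ds line is gone, yet the text line carries the string it
defined.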

Yours,
  Ingo
