Hi Alex,

At 2023-08-17T21:12:35+0200, Alejandro Colomar wrote:
> I've had this desire for a long time, and maybe now I have a strong
> reason to ask for it.
[...]
> The problem is that at no point you can have the .roff source, after
> the man(7) macros have been expanded. Would it be possible to split
> the groff(1) pipeline to have one more preprocessor, let's call it
> woman(1) (because man(1) is already taken), so that it translates
> man(7) to roff(7)?

In other words, you want to see what a *roff document looks like after
all macro expansions have been (recursively) performed.

I wanted this, too, back in 2017 when I first started working on groff.

The short answer is "no".

The longer answer is that this is hard because GNU troff, like AT&T
troff, never builds a complete syntax tree for the document the way
"modern" document formatters do. nroff and troff were written and
deployed on DEC PDP-11 machines, which today would be considered
embedded microcontroller environments. Therefore they handled as little
input at one time as possible. Roughly, this meant that input was
collected, macro-expanded as soon as it was seen, and then, as soon as
it was time to break an output line, a lot of formatter state related to
parsing was flushed, and the formatter started reading input again.

Understanding *roff a little better 6 years later, I can more easily
imagine ways to run AT&T troff out of memory on a PDP-11. Ultra-long
diversions would be one way,[1] because formatted diversion contents
have to be kept in memory until they're called for. A multiplicity of
moderately sized diversions would do it too.

Conditional blocks would be another problem. When encountering a brace
escape sequence \{, the formatter has to scan ahead in the input. Or at
least GNU troff does. Maybe AT&T troff did something clever, but its
source code is famously opaque.

I'll say it before Ingo does: mandoc(1) (as I understand it) _does_
build a syntax tree for the entire document before producing output,
which enables some of the nice features that it has.

I see Lennart has replied with some further exploration of the
challenges here. Rather than duplicate his comments, let me move on to
something vaguely related but, I hope, potentially useful.

Can we do something that might help without re-architecting GNU troff?

I think we can.

I've been mulling this for months, and now that I'm on the threshold of
implementing a `for` request as a string iterator,[2] I think I want
something else first, largely to help me test it.

I want a string/macro/diversion dumper.

groff(7):
    .pm     Report, to the standard error stream, the names and sizes
            in bytes of defined macros, strings, and diversions.

groff_diff(7):
    In AT&T troff the pm request reports macro, string, and diversion
    sizes in units of 128‐byte blocks, and an argument reduces the
    report to a sum of the above in the same units. GNU troff ignores
    any arguments and reports the sizes in bytes.

That's fine, but what if we want to look _inside_ a macro, string, or
diversion?

I propose to implement this:

    .pm name
            Report the contents of macro, string, or diversion name to
            the standard error stream. If name is undefined, an error
            is produced (to distinguish this case from an empty
            object). Newlines and ordinary characters are written
            as-is on lines indented one space. Special characters are
            represented in \[xx] notation regardless of the selected
            escape character or input syntax. Tabs, leaders,
            unprintable control characters, and nodes are described on
            lines with no indentation.

I suggest that this won't break existing code because:

A. GNU troff has ignored arguments to `pm` for ~33 years; and
B. The format of debugging output (`troff -a`, `pm`, `pnr`, `pev`,
   `ptr`) is not, and likely should not be, rigidly specified.

Example of an interactive session using the feature (purely notional,
typed into my editor window):

    $ groff
    .ds foo hello \(aq apostrophe\" string contents are read in copy mode
    .pm foo
     hello \(aq apostrophe
    .de bar
    . ft B
    . nop Hello, world!
    . ft
    ..
    .pm bar
    .de bar
     . ft B
     . nop Hello, world!
     . ft
    ..
    .ds toc*entry 1.1^IIntroduction^Aiii
    .pm toc*entry
     1.1
    tab
     Introduction
    leader
     iii
    .de OB\"noxious old fart who knows tricks
    . if ^B\\$1^Bfatal^B .ab \" get out in a panic
    . ex \" exit more calmly
    ..
    .pm OB
    .de OB
     . if
    ^B
     \\$1
    ^B
     fatal
    ^B
      .ab
     . ex
    ..

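
To make the notional format above a little more concrete, here is a
minimal sketch in Python. It is an illustration only, not GNU troff code
(which is C++), and the function name `dump_contents` is hypothetical.
It assumes the object's contents are available as a plain string, that
^I is a tab and ^A a leader, and that other control characters are
reported in caret notation. It does not model special characters in
\[xx] notation or nodes.

```python
def dump_contents(contents: str) -> list:
    """Describe `contents` roughly as the proposed `pm name` might.

    Hypothetical model: runs of ordinary characters go on lines indented
    one space; tabs, leaders, and other control characters are described
    on lines with no indentation.
    """
    lines = []
    run = []               # ordinary characters collected so far
    at_line_start = True   # True initially and right after a newline

    def flush():
        # Write out any pending ordinary characters, indented one space.
        if run:
            lines.append(" " + "".join(run))
            run.clear()

    for ch in contents:
        if ch == "\n":
            # End the current line; an empty input line becomes " ".
            if run or at_line_start:
                lines.append(" " + "".join(run))
                run.clear()
            at_line_start = True
        elif ch == "\t":
            flush()
            lines.append("tab")
            at_line_start = False
        elif ch == "\x01":     # ^A, assumed here to be the leader
            flush()
            lines.append("leader")
            at_line_start = False
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            flush()
            # Caret notation, e.g. "\x02" is reported as "^B".
            lines.append("^" + chr(ord(ch) ^ 0x40))
            at_line_start = False
        else:
            run.append(ch)
    flush()
    return lines


if __name__ == "__main__":
    # The toc*entry string from the notional session above.
    for line in dump_contents("1.1\tIntroduction\x01iii"):
        print(line)
```

Run on the toc*entry string, this prints " 1.1", "tab", " Introduction",
"leader", and " iii" on successive lines, matching the notional session.
Like that session, it does nothing to make trailing spaces visible.
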
A problem with the above format is that trailing spaces before newlines
would not be obvious. I'm thinking that won't be too hard to address;
the dumper can count spaces until it encounters something that isn't a
space, newline, or the end of the object. We could then have something
like this.

    .pm OB
    .de OB
     . if
    space
    newline
    ^B
     \\$1
    ^B
     fatal
    ^B
      .ab
    space
    newline
     . ex
    space
    newline
    ..

It would be more consistent, and possibly better, to just mark all
newlines thus.

I admit I don't really know yet what I'll be dealing with when it comes
to dumping nodes (which will be all over the place in diversions). But,
then, that aspect of groff seems to have mystified many over the
years.[2] I very much hope that being able to "debug print" them will
start to clear away the smoke and confusion.

I want to do more than just say that a node has been encountered. I
want something like this.

    .di mydiv
    ca-fe
    .ft B
    heavy
    .di
    .pm mydiv
    node {type=glyph, id='c', font-position=1}
    node {type=glyph, id='a', font-position=1}
    node {type=glyph, id='\hy', font-position=1}
    node {type=glyph, id='f', font-position=1}
    node {type=glyph, id='e', font-position=1}
    newline
    node {type=glyph, id='h', font-position=3}
    node {type=glyph, id='e', font-position=3}
    node {type=glyph, id='a', font-position=3}
    node {type=glyph, id='v', font-position=3}
    node {type=glyph, id='y', font-position=3}

True node data will, I'm sure, be much more complex and verbose.

Likely my first cut would be lamer.

    .pm mydiv
    node
    node
    node
    node
    node
    newline
    node
    node
    node
    node
    node

But I would want to swiftly improve that to report at least some basic
type information about the node. Once I know what that looks like.

Any objections?

Regards,
Branden

[1] Nobody _except_ mandoc(1) seems to handle this well. Credit where
    it's due.

    https://savannah.gnu.org/bugs/?64229

[2] https://lists.gnu.org/archive/html/groff/2020-10/msg00105.html