Hi Branden,
Thank you very, very much for your dedication and the profound insight
you share with the groff user community.
On my side, I am happy that my experiments brought a hidden (?) issue to
light. It is interesting to see that it is ms, of all packages, which
stands out in stark, negative contrast to the other macro packages. Yet,
from an investigative point of view, even a failure can be very helpful
indeed.
Anyway, the document models of classical text vs. HTML (do HTML documents
really have _pages_?) are fundamentally different. So for a while, I
thought that the -Thtml option was not at all supposed to be used with
any of the classical macro packages but should have its own set of
requests and macros. *That* was the origin of my question, and that was
why it did not even come to my mind that I should try a different macro
package instead of ms. So it goes.
Last time, I wrote that I had managed to crash the compile run (groff:
error: troff: Aborted (core dumped)) with the following command line:
$ groff -k myfile.ms -Thtml -w all > myfile.htm
with
.mso s.tmac
in the preamble of the document. (I am too lazy to type complex command
lines, even when recalling one takes no more than a cursor-up
keystroke.)
During some experiments, I suddenly noticed that the crashes did not
occur anymore. Since I had introduced multiple small changes in the
course of my edit session, it took me a while to identify the culprit.
I had a title display which more or less went as follows:
.DS C
.AU
A. U. Thor
.TL
Opera Minora
.DE
That display reliably crashed the compile run.
However, if I replaced
.DS C
with
.CD
(same visual result when compiled properly, let's say with -Tpdf)
then the compile run with -Thtml was successful in the sense that it did
not abort prematurely. Very interesting!
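For anyone who wants to try this at home, the comparison can be sketched
as a small shell script. The file names (ds-c.ms, cd.ms) and the
temporary directory are my own choices for illustration, the document
body is only the title display from above, and I have not verified that
this stripped-down input aborts the way my full document did.

```shell
#!/bin/sh
# Sketch: build two variants of the title display and report troff's exit
# status for each. File names and the work directory are assumptions.
workdir=$(mktemp -d)

# write_variant NAME OPENING: write a small ms document whose title
# display is opened with the given macro call (".DS C" or ".CD").
write_variant() {
    cat > "$workdir/$1.ms" <<EOF
.mso s.tmac
$2
.AU
A. U. Thor
.TL
Opera Minora
.DE
EOF
}

write_variant ds-c '.DS C'
write_variant cd '.CD'

# Run groff only if it is installed; otherwise just leave the files behind.
if command -v groff >/dev/null 2>&1; then
    for name in ds-c cd; do
        groff -k -Thtml -w all "$workdir/$name.ms" \
            > "$workdir/$name.htm" 2> "$workdir/$name.err"
        echo "$name: exit status $?"
    done
else
    echo "groff not found; variants written to $workdir"
fi
```

A non-zero exit status for ds-c but not for cd would match what I saw
with my full document.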
Best regards,
Oliver.
On 14/05/2025 20:10, G. Branden Robinson wrote:
Hi Oliver,
At 2025-05-12T12:32:18+0200, Oliver Corff via GNU roff typesetting
system discussion wrote:
for the first time, I am experimenting with the html output features
of groff.
When attempting to compile the attached document (which is compiled
without problem when using any other -T option) by saying
$ groff -k -Thtml TA_html.ms > test.html
The generated html file test.html displays a lot of garbage.
I'm afraid I am missing some basic information here.
I even managed to crash groff (core dump) with longer input files to
be typeset with the ms macro set.
I have some follow-up findings on this problem, which as I noted is an
old one (10+ years) and is proving to be a real devil to track down.
Let me offer first, possibly unhelpfully, that you apparently picked the
single worst macro package to start your experiments with.
The reason is that I see this bug manifest _only_ with the ms macro
package, and not any of the other full-service ones we supply.
(I didn't check mom(7) because I don't know how to write a minimal
document in it, and I suspect Peter has already checked mom's results
with grohtml(1), at least up to the point where the pre/post-processing
and "devtagging" machinery frustrated progress.)
I'm attaching a set of closely similar documents for the ms, me, mm,
man, and mdoc macros. The only one that shows this word-space
destruction defect is ms. The problem appears to be happening inside
the formatter, since grout output clearly shows the issue when diffing
output to the "html" and "utf8" devices, respectively.
After, that is, a lot of noisy preamble used only for HTML output, which
I'm not sure is a helpful feature (or if it is, why it's restricted to
this output device); maybe it's debugging scaffolding for
grohtml-related changes to the formatter that were never taken out.
$ diff -U0 MS-HTML MS-UTF8
--- MS-HTML 2025-05-14 12:46:48.252239912 -0500
+++ MS-UTF8 2025-05-14 12:46:51.556223272 -0500
@@ -1 +1 @@
-x T html
+x T utf8
@@ -4,12 +3,0 @@
-x F /home/branden/src/GIT/groff/build/../tmac/troffrc
-x F composite.tmac
-x F fallbacks.tmac
-x F html.tmac
-x F www.tmac
-x F devtag.tmac
-x F en.tmac
-x F latin1.tmac
-x F pspic.tmac
-x F pdfpic.tmac
-x F /home/branden/src/GIT/groff/build/../tmac/troffrc-end
-x F html-end.tmac
@@ -17,23 +4,0 @@
-V40
-H240
-DFd
-h1560
-n40 0
-x F -
-x F s.tmac
-x F devtag.tmac
-x F refer-ms.tmac
-x F refer.tmac
-x F de.tmac
-x F trans.tmac
-x F latin1.tmac
-V40
-H0
-x X devtag:.fi 1
-x X devtag:.rj 0
-x X devtag:.in 0
-x X devtag:.ll 24
-x X devtag:.po 0
-x X devtag:.ta L 120
-x X devtag:.ce 0
-x X devtag:.br
@@ -43 +8 @@
-V40
+V280
@@ -45,0 +11 @@
+DFd
@@ -47,3 +13 @@
-n0 0
-V40
-H0
+wh24
@@ -51,4 +15,4 @@
-n0 0
-V2147483480
-H24
-n0 0
+n40 0
+V2560
+H1560
+n40 0
@@ -56 +20 @@
-V2147483600
+V2640
The heart of this issue is changes like this:
x font 1 R
f1
s10
-V40
+V280
H0
md
+DFd
tbaz
-n0 0
-V40
-H0
+wh24
tqux
-n0 0
-V2147483480
-H24
-n0 0
+n40 0
Here we can see in the "utf8" output, a horizontal motion flagged as a
word space "wh24", that is missing from "html" output. We also have
these suspiciously useless 'n0 0' commands in the "html" output.
groff_out(5):
n b a Indicate a break. No action is performed; the command is
present to make the output more easily parsed. The integers
b and a describe the vertical space amounts before and after
the break, respectively. GNU troff issues this command but
groff’s output driver library ignores it. See v and V.
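A quick way to check a grout file for this symptom is to count the two
commands in question. The sketch below is only an illustration: the
function name count_grout, the labels it prints, and the two sample
files (reconstructed by hand from the excerpt above, not complete
formatter output) are assumptions.

```shell
#!/bin/sh
# count_grout FILE: count lines of the form 'whN' (word-space flag followed
# by a horizontal motion) and lines that are exactly 'n0 0'.
count_grout() {
    awk '
        /^wh[0-9]+$/ { ws++ }
        /^n0 0$/     { nb++ }
        END { printf "word-spaces=%d zero-breaks=%d\n", ws, nb }
    ' "$1"
}

# Hand-made samples: the "html" and "utf8" sides of the excerpt above.
html_sample=$(mktemp)
utf8_sample=$(mktemp)

cat > "$html_sample" <<'EOF'
x font 1 R
f1
s10
V40
H0
md
tbaz
n0 0
V40
H0
tqux
n0 0
V2147483480
H24
n0 0
EOF

cat > "$utf8_sample" <<'EOF'
x font 1 R
f1
s10
V280
H0
md
DFd
tbaz
wh24
tqux
n40 0
EOF

echo "html: $(count_grout "$html_sample")"
echo "utf8: $(count_grout "$utf8_sample")"
```

On these samples it reports no word spaces and three 'n0 0' commands for
the "html" side, and one word space and no 'n0 0' for the "utf8" side.
For real data, the same function could be pointed at a file saved from
something like `groff -Z -ms -Thtml file.ms`, where -Z keeps groff from
running the postprocessor.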
But the weirdest part is that, despite these indications that we have a
problem in the formatter itself, no other macro package causes the
problem, even when formatting very similar output.
$ diff -U0 MS-HTML MM-HTML
--- MS-HTML 2025-05-14 12:46:48.252239912 -0500
+++ MM-HTML 2025-05-14 12:47:16.224099732 -0500
@@ -23 +23 @@
-x F s.tmac
+x F m.tmac
@@ -25 +25 @@
-x F refer-ms.tmac
+x F refer-mm.tmac
@@ -30,2 +30,2 @@
-V40
-H0
+V80
+H168
@@ -35,3 +35,3 @@
-x X devtag:.ll 24
-x X devtag:.po 0
-x X devtag:.ta L 120
+x X devtag:.ll 1440
+x X devtag:.po 168
+x X devtag:.ta L 120 L 240 L 360 L 480 L 600 L 720 L 840 L 960 L 1080 L 1200 L 1320 L 1440
@@ -39 +38,0 @@
-x X devtag:.br
@@ -43,2 +42,2 @@
-V40
-H0
+V80
+H168
@@ -47,3 +46 @@
-n0 0
-V40
-H0
+wh24
@@ -51,4 +48 @@
-n0 0
-V2147483480
-H24
-n0 0
+n40 0
The mm package is having no problem getting the formatter to put 'wh24'
commands on the output, and also causes it to produce 'n' commands that
wouldn't be nilpotent even if they weren't documentary.
Another clue is that the `pline` request I stuck between "baz" and "qux"
in my input produced a populated node list to the standard error stream
in every case except the buggy one. Something odd is going on inside
the formatter; it's not like it (normally) waits until it's seen a word
space to populate the pending output line. Observe:
$ printf 'ab\\c\n.pline\n' | ~/groff-HEAD/bin/groff -a
<beginning of page>
[{"type": "line_start_node", "diversion level": 0, "is_special_node": false},
{"type": "glyph_node", "diversion level": 0, "is_special_node": false, "character": "a"},
{"type": "glyph_node", "diversion level": 0, "is_special_node": false, "character": "b"},
{"type": "transparent_dummy_node", "diversion level": 0, "is_special_node": false}]
ab
My advice for the time being is to select _any_ other full-service macro
package with which to pursue your experiments with grohtml. That feels
pretty lame to say, I admit.
Regards,
Branden
--
Dr. Oliver Corff
Wittelsbacherstr. 5A
10707 Berlin
G E R M A N Y
Tel.: +49-30-85727260
Mail:oliver.co...@email.de