[bug #64360] [PATCH] [gropdf] does not correctly handle white space after 'w' command

G. Branden Robinson Tue, 15 Aug 2023 05:14:46 -0700

Follow-up Comment #29, bug #64360 (project groff):

[comment #27 comment #27:]
> First I'd like to try to reduce the scope of this discussion, since it seems
to have grown in multiple directions.


Sure.
 
> Am I correct in the assumption that the grout files for any given input
would not be identical when produced by different roff implementations?

Yes.  Here's some simple input given to Heirloom Doctools _troff_ and then GNU
_troff_.


$ printf -- '.nf\na b\n-\\-\n' | ./bin/troff -Tps 
x T ps
x res 72000 1 1
x init
V0
p1
x font 1 R /home/branden/heirloom/lib/doctools/font/devps/R.afm 4
x font 2 I /home/branden/heirloom/lib/doctools/font/devps/I.afm 4
x font 3 B /home/branden/heirloom/lib/doctools/font/devps/B.afm 4
x font 4 BI /home/branden/heirloom/lib/doctools/font/devps/BI.afm 4
x font 5 CW /home/branden/heirloom/lib/doctools/font/devps/CW.afm 4
x font 6 H /home/branden/heirloom/lib/doctools/font/devps/H.afm 4
x font 7 HB /home/branden/heirloom/lib/doctools/font/devps/HB.afm 4
x font 8 HX /home/branden/heirloom/lib/doctools/font/devps/HX.afm 4
x font 9 S1 /home/branden/heirloom/lib/doctools/font/devps/S1.afm 516
x font 10 S /home/branden/heirloom/lib/doctools/font/devps/S.afm 1028
s10
f1
x X LC_CTYPE en_US.UTF-8
H72000
V12000
ca
wh7770cb
n12000 0
H72000
V24000
c-
h3330C\-
n12000 0
x trailer
V792000
x stop
$ printf -- '.nf\na b\n-\\-\n' | troff -Tps 
x T ps
x res 72000 1 1
x init
p1
x font 5 TR
f5
s10000
V12000
H72000
md
DFd
ta
wh2500
tb
n12000 0
V24000
H72000
t-
C\-
h5640
n12000 0
x trailer
V792000
x stop


In the Heirloom output, I find the line


wh7770cb


noteworthy for a reason I'll return to.

> I assume this is true, else the output drivers from different
implementations would be interchangeable. I don't believe they are.

I have a _feeling_ that Kernighan might have been reaching for this, but
didn't quite nail down the syntax tightly enough to strictly permit it. 
However, I could be retrojecting my thoughts in 2023 onto his in 1980, when
(a) there was no other device-independent troff implementation, (b) one was
not going to appear for nearly a decade, and (c) given the challenges he
described in CSTR #97, he might have thought it unlikely that another would
ever be created.  As further speculation, (d) in 1980 the eventual divestiture
of the AT&T monopoly was not yet seen as inevitable, and as long as that hope
held, there wasn't much need for anyone to reimplement device-independent
troff since it could be had for low or zero cost.  If anyone held that belief,
it was the most swiftly overturned of these.
 
> This means, whilst they may all be based on the information in cstr#54, each
roff has developed its own private API between the formatter and output
drivers. For this reason the decision on whether this is a change to the groff
version of the API, has to be confined to what is contained in the groff
documentation.

Or we can accept that _groff_'s documentation doesn't adequately describe its
implementation, which I believe I just demonstrated in bug #63544.
 
> Empirical observation shows that groff uses a simple rule of one operation
per line

No.  Glyph output and horizontal motions are frequently mixed when the
`tcommand` directive is not present.  If we remove it from font/devps/DESC, we
get this.


$ printf -- '.nf\na b\n-\\-\n' | ./build/test-groff -T ps -Z | grep cawh
cawh6940


And it is ubiquitous when the "obsolete" (to use Bernd's term) output command
is used.  (I would term it "legacy".)


$ printf -- '.nf\na b\n-\\-\n' | groff -T X100 -Z | grep caw
caw10bh7


Further, (recalling the point I promised to return to above) Heirloom breaks
this rule in yet another respect: it puts a `c` command directly after an `h`
command, with no separation at all, even though the `h` argument is _not_
fixed-width (the integers it uses are not zero-padded on the left).
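
To make the packing concrete, here is a minimal Python sketch of how lines such as `wh7770cb`, `cawh6940`, and `caw10bh7` can be split into commands. It is an illustration only, not code from _groff_ or Heirloom: the name `split_grout_line` is made up, only the command forms seen in the examples above are handled, and the two-digit form is read as a two-digit horizontal motion followed by one glyph, per groff_out(5).

```python
import re

# One alternative per command form seen in the examples above; this is a
# deliberately incomplete subset of the intermediate output language.
_COMMAND = re.compile(
    r"(w)"                      # word-space marker, takes no argument
    r"|([hvHV])[ \t]*(-?\d+)"   # motion with an integer argument
    r"|c(\S)"                   # one ordinary character
    r"|C[ \t]*(\S+)"            # named glyph, runs to the next white space
    r"|t[ \t]*(\S+)"            # text run, runs to the next white space
    r"|(\d\d)(\S)"              # legacy form: two-digit motion, then a glyph
)


def split_grout_line(line):
    """Split one line of packed output commands into (name, *args) tuples."""
    line = line.strip()
    tokens = []
    pos = 0
    while pos < len(line):
        if line[pos] in " \t":          # optional separator between commands
            pos += 1
            continue
        m = _COMMAND.match(line, pos)
        if m is None:
            raise ValueError(f"unrecognized command at column {pos}: {line!r}")
        w, motion, amount, char, glyph, text, dd, dd_glyph = m.groups()
        if w:
            tokens.append(("w",))
        elif motion:
            # The integer simply ends at the first non-digit, which is why
            # "h7770cb" needs no separator before the "c".
            tokens.append((motion, int(amount)))
        elif char:
            tokens.append(("c", char))
        elif glyph:
            tokens.append(("C", glyph))
        elif text:
            tokens.append(("t", text))
        else:
            tokens.append(("dd", int(dd), dd_glyph))
        pos = m.end()
    return tokens


print(split_grout_line("wh7770cb"))   # [('w',), ('h', 7770), ('c', 'b')]
print(split_grout_line("caw10bh7"))   # [('c', 'a'), ('w',), ('dd', 10, 'b'), ('h', 7)]
```

The point of the sketch is that no separator is needed after a variable-width integer as long as the next command letter is not a digit.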

> and using a single space to avoid a "clashing between the command code and
the arguments without the space.", even though Kernighan states that it is
permitted to use newlines as well as a space and tab for this purpose, none of
our drivers support this.

I believe I have empirically refuted this claim in bug #63544, comment #3.

> The w command is not an operation, it is just a marker for a paddable word
space so following with a newline is against our own documentation:-

I find it unnecessary to bifurcate the class of "output commands" into
"operations" and "non-operations".  It certainly isn't required to explain
present (and, I have to guess, decades-long outstanding) behavior of
_libdriver_.


> in 'gtroff''s intermediate output, every command with
> at least one argument is followed by a line break,


This is demonstrably false, as I showed above.


> thus providing
> excellent readability.


...and that's an unnecessary sales pitch.

> The w command has no arguments so under this rule it should not have a
following new line.

I think you have extrapolated an invalid rule from vague and inaccurate
documentation.

> As regards having a space after the w command, our documentation says:-

> The 'gtroff' output parser, however, is smart about whitespace by making it
> maximally optional.


> Which I take to mean it only uses a space to avoid the "clashes" mentioned
above,

This is another sales pitch, and vague besides.

> and it further says:-

> Commands and arguments with a known, fixed length need
> not be separated by syntactical space.


It does say that.  Unfortunately this claim contradicts one you already
quoted.


> in 'gtroff''s intermediate output, every command with
> at least one argument is followed by a line break,


According to the above, does a command with at least one fixed-length argument
get followed by a line break or not?

I don't think that question is answerable without relying on an additional
information channel, like reading the source code or experimenting.

You may be beginning to see why I am critical of this documentation.

> The w command is fixed length, so to satisfy "maximally optional" no space
is used.

As noted previously, I cannot elicit any clear semantics from the modifier
"maximally" here.
 
> I never said that white space cannot follow a w command, but if we change to
include white space after it then that goes against how the groff version of
the API has been documented for many years.

That, I agree with.  Our documentation in this area is inaccurate.  Please
understand that I am exercising restraint by not saying more.

> I believe the change Branden has on his private branch is to output a new
line after a w command, but this bug concerns white space after the w.
Contrary to cstr#54, which classes space/tab/newline as the same, groff does
not allow newline to be used as the white space between a command and its
arguments (this difference is not documented).

That's a good, interesting point and one I want to explore with further
testing.  I see no reason _groff_ output drivers _shouldn't_ accept a newline
thus, given the clarity of CSTR #54's wording on the subject.
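
For that kind of testing, a small helper can generate the same command with each separator that CSTR #54 is said above to treat alike. This is a hypothetical Python sketch (the name `separator_variants` is invented here); it makes no claim about which forms any output driver actually accepts.

```python
def separator_variants(command, argument):
    """Return one command/argument pair joined by each candidate separator.

    The keys name the separator; the values are the text that could be
    spliced into a test "grout" file in place of the original command.
    """
    separators = {"none": "", "space": " ", "tab": "\t", "newline": "\n"}
    return {name: f"{command}{sep}{argument}" for name, sep in separators.items()}


for name, text in separator_variants("h", "2500").items():
    print(f"{name}: {text!r}")
```

Each variant can then be substituted into an otherwise identical file and fed to each driver to see which ones accept it.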

> If groff 1.23++ is going to use w followed by a new line, none of the
proposed patches is optimal. A loop is no longer required, since no further
commands will be on that line. There is no point in producing code to cater
for situations which will not arise. Gropdf is written to parse grout output
from groff, if that output is altered so that it no longer complies with our
own documentation and gropdf fails to handle it then it is not a bug, but a
change in the API and should involve a change request and at least a wider
discussion than just us three.

I propose for _gropdf_ to accept the same inputs _grops_ does and to interpret
them in a compatible way.  That is all.
 
> Apparently, the reason for wanting to make this change is to "generate
"grout" that is more easily lexically analyzed". Citing posix shell's poor
lexical processing capabilities, I don't see what difference wh2500 and
w\nh2500 makes. If that's what you want, would a simple filter like this
help:-

> [derij@pip busgrap]$ perl -pe 's/(.)(.*)/$1\n$2/ if m/^w/; s/^(.)(\S.*)/$1 $2/mg' zfile
> x T ps
> x res 72000 1 1
> x init
> p 1
> x font 5 TR
> f 5
> s 10000
> V 12000
> H 72000
> m d
> D Fd
> t Deri
> w
> h 2500
> D l 100000 0
> n 12000 0
> x trailer
> V 792000
> x stop
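
For readers who do not read Perl, the quoted filter's two substitutions can be sketched in Python as follows. The function name `spread_grout` is made up for this illustration, and the sketch assumes the intended replacement in the second substitution is `$1 $2` (command letter, space, rest of line), as the quoted output suggests.

```python
import re


def spread_grout(text):
    """Mimic the quoted Perl filter on a whole grout document."""
    out = []
    for line in text.splitlines():
        if line.startswith("w"):
            # First substitution: break the line after its first character,
            # so "wh2500" becomes "w" and "h2500" on separate lines.
            line = re.sub(r"(.)(.*)", r"\1\n\2", line, count=1)
        # Second substitution: at the start of each (sub)line, put a space
        # between the command letter and an argument that directly follows.
        line = re.sub(r"^(.)(\S.*)", r"\1 \2", line, flags=re.MULTILINE)
        out.append(line)
    return "\n".join(out) + "\n"


print(spread_grout("p1\ntDeri\nwh2500\nDl 100000 0\nx T ps\n"), end="")
```

Lines whose second character is already white space, such as `x T ps`, pass through unchanged.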


I don't think writing such a tool is desirable.  It "feels" too small to be a
shippable tool--not supportive of its own weight in terms of making it a
proper command with `--help` and `--version` and a man page, or the effort of
coming up with a good name for it; too long to expect anyone to type it; and
too obscure for it to make it into many people's shell start-up files as a
function.

In my opinion, GNU _troff_ should simply produce output that is easy for
humans to read in the first place.

> In fact I'd be very happy to write a proper grout tool with multiple output
options (pretty print, markup, XML). Markup could look like:-

> x T ps                        # for grops
> x res 72000 1 1
> x init
> p1                      # Page 1
> x font 5 TR
> f5                      # Times-Roman
> s10000                  # ps 10
> V12000                  # V 1/6th in
> H72000                  # H 1in
> md                      # Default text colour: Black
> DFd                     # Default fill colour: Black
> tDeri
> wh2500                  # Word Space: 2.5p
> Dl 100000 0             # Line from x,y to x1,y1
> n12000 0                # New Line
> x trailer
> V792000
> x stop


> Using some of the code from gropdf which keeps track of current position,
XML output could tag the x,y position of every element.

That would indeed be useful--a "grout annotator" if you will.  But I think
it's outside the scope of this ticket.
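
The unit arithmetic behind the annotations in the quoted mock-up follows from the `x res 72000 1 1` line: 72000 units per inch means 1000 units per point. A minimal Python sketch, with invented function names and covering only motion commands (type size scaling for `s` is not handled), might look like this:

```python
from fractions import Fraction


def units_to_points(units, res=72000):
    """Convert device units to points, taking 72 points to the inch."""
    return Fraction(units * 72, res)


def units_to_inches(units, res=72000):
    """Convert device units to inches for a device with `res` units per inch."""
    return Fraction(units, res)


def describe_motion(command, res=72000):
    """Annotate a motion command such as 'V12000' with points and inches."""
    name, amount = command[0], int(command[1:])
    points = units_to_points(amount, res)
    inches = units_to_inches(amount, res)
    return f"{command}  # {name} {float(points):g}p = {inches}in"


print(describe_motion("V12000"))   # V12000  # V 12p = 1/6in
print(describe_motion("H72000"))   # H72000  # H 72p = 1in
```

Exact fractions are used so that values like one sixth of an inch print without rounding noise.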

> So it is unnecessary to change the format of grout to achieve what you say
you want.

As noted above and in bug #63544, I don't have to.  We just need to correct
the documentation and align _gropdf_ with the other output drivers here.
 
> The danger in changing the current grout format is we do not know what tools
have been written which parse our current grout format, didn't someone write a
parser which output html/javascript, how do we know our changes won't affect
them.

If they accept what _grops_, _grotty_, _grodvi_, _grohtml_, and so forth do,
then they'll be fine.  They do risk disruption if they relied upon our badly
composed documentation in this area.  If I don't treat CSTR #54 as scripture
as some of our mailing list subscribers do, I'm surely not going to bring a
higher level of reverence to our own, particularly where it has demonstrable
problems.

> Given that there are non-intrusive methods to achieve the result you want, I
hope your hankerings can be satisfactorily assuaged.

I have to reject your conclusion here as ill-premised.  Fortunately, I see no
need to alter _libdriver_ in any way (pending the "newlines everywhere"
research).  My tasks are to (1) revise our erroneous and unclear documentation
and (2) assemble a patch for _gropdf_ that you're willing to accept, assuming
you lack the time or desire to do so yourself.

> AOB
> 
> Many people have praised Branden for his contributions to the
documentation,

I don't think you need two hands to count them.  :P

> as I do, it just felt wrong to see open criticism of a fellow contributers
use of english. I am more than happy for Branden to make our documentation
more "pellucid", but I think it is  nicer to do it without denigrating
previous efforts which were made with the best intentions.

I have tried (I do not claim always successfully) to critique the _code_, not
the person.  As I understand it, this is an aspect of
[https://en.wikipedia.org/wiki/Egoless_programming egoless programming].  You
may have observed Alex Colomar expressing a pretty low opinion of the
`is_family_valid()` function on the _groff_ list recently.  He may not have
known at the time that I had written it.  While I was taken aback at first, I
did not get upset with him, either on the mailing list or privately.  On the
contrary, I largely agreed with his assessment; the code's form arose from an
unfortunate constraint problem we have with our decision (which I guess I'm
the main person driving) to stick to ISO C++98.

The same goes for documentation.  We all put a bit of ourselves into our
written words, but the text is not the author.  Ingo Schwarze is another
_groff_ contributor who pulls no punches when expressing opinions of code. 
But I don't remember seeing him engage in personal attacks.  I expect the same
latitude when evaluating documentation, but I am also prepared to endure
criticism of my own product.  All 3 of us can likely remember Ralph Corderoy's
withering assessments of my emails (particularly their length).  He seemed to
have difficulty believing that I could write concisely.  Whether he ever
actually read any of the documentation I have written for _groff_ (when it
wasn't pitched to the mailing list first--a tiny proportion), I don't know--he
never offered any evidence of having done so.  Ralph's unrelentingly negative
attitude about _groff_ and my work on it (in contrast to other *roffs)
irritated me but I didn't, and don't, let that stop me from considering and
crediting such contributions as he makes.  In other words, I can work with
him.  If Bernd should return, I would expect to be able to work with him, too,
and I hope he'd reciprocate that.

> The latest incarnation of gropdf (in the deri-gropdf-ng git branch, give it
a go :-)) is now 80 lines short of 5000 lines.

So far your work is getting rave reviews.  I'm envious!  :D


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?64360>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
