Re: groff 1.23.0.rc4 on Solaris 11 OpenIndiana

2023-04-19 Thread G. Branden Robinson
At 2023-04-17T04:29:47-0500, G. Branden Robinson wrote:
> 1.  The doubled backslashes inside the single-quoted sed expression
> are unnecessary.  They would be required if the escape character
> _hadn't_ been changed, but it was.  This solecism came in that way
> with the first commit of pdfpic.tmac in August 2015.[1]
> 2.  The '\n' in the _replacement_ text of a sed expression is an
> error.[2]  It was part of the same commit.

I have to retract this.  They are necessary.

But the reason is devilish.

I was really wondering about this because even the GNU sed Texinfo
manual (as far as I can find) doesn't say it is possible for

s/foo/bar\nbaz/

to work to stick a newline into the replacement text.  But it is, and
macOS sed supports this, too.
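
A minimal sketch of the difference, assuming a sed that implements the extension (GNU sed in this case). The input text and the variable names `ext` and `std` are made up for illustration. The second command uses the backslash-newline form that POSIX does specify for replacement text.

```shell
#!/bin/sh
# Non-standard: the two characters '\' and 'n' in the replacement text
# are turned into a newline by sed itself (GNU sed assumed here).
ext=$(printf 'foo\n' | sed 's/foo/bar\nbaz/')

# Portable: a backslash followed by a literal newline character.
std=$(printf 'foo\n' | sed 's/foo/bar\
baz/')

printf '%s\n' "$ext"
printf '%s\n' "$std"
```

On a sed with the extension, both commands should print "bar" and "baz" on separate lines. A sed without it is free to do something else with the first form.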

And that's a damn good thing, as it turns out.

To fully understand the quoting issues involved in using *roff's `sy`
request, we have to understand that after the formatter interprets the
request's arguments, they become a C/C++ language character string that
is processed by a function that reads the argument byte by byte for
passage to the old C library system() function.  With no state tracking
of whether the previous byte seen was '\'.
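
To make the byte-level distinction concrete, here is a small sketch in plain shell with no groff involved. The variables `arg` and `nl` are hypothetical stand-ins: `arg` holds what a command string would contain when "\n" is copied through unprocessed, and `nl` holds what a C string literal "A\nB" would contain after the compiler handles the escape.

```shell
#!/bin/sh
# Copied through byte by byte: 'A', '\', 'n', 'B' (four characters).
# Single quotes keep the backslash literal in the shell.
arg='A\nB'

# With escape processing: 'A', newline, 'B' (three characters).
# printf interprets the '\n' in its format string.
nl=$(printf 'A\nB')

printf '%s\n' "${#arg}"
printf '%s\n' "${#nl}"
```

The first length is 4 and the second is 3. Only the four-character form can reach sed when the escape is never interpreted, which is why sed itself has to be the one to turn '\', 'n' into a newline.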

If Bernd understood this, I wish he had explained it to the rest of us.

So I refactored this byzantine carnival to clarify the situation.  I
tested it on (Debian) GNU/Linux, macOS 12, and Solaris 10 and 11, the
last three via FSF France's compiler farm.

The change is attached, with (more) detailed explanation.

We still won't be able to support Solaris sed, in either its /usr/bin or
/usr/xpg4/bin forms.  Unless we change the way GNU troff processes `sy`
arguments, we will forever need a non-standard extension to sed(1) to
pull off this sort of tomfoolery.  (Or unless POSIX standardizes this
application of the '\', 'n' sequence in sed commands.  If they do, I
hope they apply it to 'a', 'c', and 'i' as well.)

But as it happens, changing GNU troff's `sy` handling coincides with
ideas I had years ago to enable GNU troff to open files with weird
characters in their names...

Regards,
Branden
commit fcfb185d96aaaf123d98696a1402e1f05bf3da24
Author: G. Branden Robinson 
Date:   Mon Apr 17 16:41:33 2023 -0500

[pdfpic]: Fix Savannah #64061.

* tmac/pdfpic.tmac: Refactor to make comprehensible some woefully
  undocumented cleverness and improve efficiency.

  (PDFPIC): Break out flaming-hoop-leaping "clever" bit of `sy` usage
  into its own macro, calling from here and relocating its requests from
  here...

  (pdfpic*system): ...to here.  When using `sy` request to collect and
  munge output of pdfinfo(1), (a) disable the escape character while
  defining the macro; (b) construct the command in a roff string,
  appending to it in discrete, hopefully comprehensible chunks; (c)
  disable the escape character during macro interpretation wherever
  possible (most of it); (d) retain doubled backslashes so that they
  survive subsequent string interpolation; (e) stop using grep(1) in the
  pipeline when sed(1) is perfectly capable of performing its own input
  filtering; (f) invoke sed with '-n' option and emit output only upon a
  successful substitution; (g) use multiple sed expressions with '-e'
  because some sed implementations don't support semicolons after
  test/branch or label commands; and (h) replace unportable POSIX
  character class '[:digit:]' in substitution matching text with
  '[0-9]'.  Annotate portability and escaping challenges.  Tested on
  GNU/Linux, macOS 12, and (with simulated pdfinfo(1) output), on
  Solaris 11.

Even with all of that, there is _still_ a problem; the C++ function that
GNU troff uses to assemble the command string {character by character}
_does not recognize C/C++ string literal escape sequences_.  This means
that you _cannot_ embed "\n" in `sy`'s arguments and have it survive, as
a newline character, into the command string passed to the standard C
library's system(3) function.  ("A\nB" gets encoded as 'A', '\', 'n',
'B', not 'A', '\n', 'B'.)  Unfortunately, this appears to be AT&T
troff-compatible behavior.  But it means that you _cannot_ construct a
portable multi-line replacement text for sed's 's' command.  (Other sed
commands like 'a', 'c', and 'i' will be similarly affected.)  We
therefore (continue to) rely upon a non-standard feature of GNU and
macOS sed, such that the sequence "\n" in replacement text becomes a
newline in sed's pattern space.

Fixes Savannah #64061.  Thanks to Bruno Haible
for the report, and to him and Ralph Corderoy for the discussion of
portable sed constructs.

diff --git a/ChangeLog b/ChangeLog
index b755f6dc4..d5f373271 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,48 @@
+2023-04-19  G. Branden Robinson 
+
+	* tmac/pdfpic.tmac: Refactor to make comprehensib

Re: groff 1.23.0.rc4 on Solaris 11 OpenIndiana

2023-04-19 Thread G. Branden Robinson
At 2023-04-17T13:15:50+0100, Ralph Corderoy wrote:
> Why the dance with ‘tprint’?  sed -n s/foo/bar/p

I overlooked that 'p' is a standard replacement flag going back (at
least) to SUSv1/POSIX Issue 4.  Thanks, I'll change that.

> The \ in \.nr isn't needed.  It isn't in the other one.

Right; I already got rid of it.

> To match one or more p's in a BRE, the idiom is pp* rather than p*p.
> Though I'm not sure it's necessary here for the spaces.
> The substitution's address is different from its pattern in that
> ‘Pagesize:’ matches the former but not the latter.

I'll fix that/those too.  I'd prefer to be a bit flexible with the
pattern matching; I don't feel the poppler developers carefully designed
their report format.  For example:

CreationDate:   Sat Mar 25 18:43:18 2023 CDT
ModDate:        Sat Mar 25 18:43:18 2023 CDT
UserProperties: no
Page size:      612 x 792 pts (letter)
Page rot:       0
File size:      106745 bytes

Why StudlyCaps for some field names and multi-word phrases for others?
Why abbreviate "rotation" when it will fit, thanks to the lengthy
"UserProperties"?  My money is on people heaping more stuff in without
giving thought to what was already present.

The bottom line is that I don't trust poppler not to change the quantity
of spaces we encounter.  If there were a standard sed 's' command flag
for case-insensitive matching, I'd add that in, too.  But I'll continue
to expect at least one after 'Page' and after 'size'.
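
A sketch of that kind of tolerant match, run against simulated pdfinfo(1) output rather than a real PDF. The variable names `report` and `dims` are hypothetical, and this is not the expression used in pdfpic.tmac. It only shows "one or more spaces" written as a BRE, together with '-n' and the 'p' flag, so that output appears only when the substitution succeeds.

```shell
#!/bin/sh
# Simulated pdfinfo(1) output; a real report may differ.
report='UserProperties: no
Page size:      612 x 792 pts (letter)
Page rot:       0'

# "  *" (two spaces, then a star) matches one or more spaces.
# -n suppresses default output; the 'p' flag prints the pattern space
# only if the substitution was made, so no grep(1) is needed.
dims=$(printf '%s\n' "$report" \
  | sed -n 's/^Page  *size:  *\([0-9.][0-9.]*\)  *x  *\([0-9.][0-9.]*\).*$/\1 \2/p')

printf '%s\n' "$dims"
```

This prints "612 792" for the sample above. Lines such as "Page rot:" fail the match and produce nothing.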

> One could
> 
> sed -ne '/^Page *size: *\([0-9.][0-9.]*\) *x *\([0-9.][0-9.]*\).*$/s//.nr pdfpic*width (p;\1)\
> .nr pdfpic*height (p;\2)/p'

I think the ultra-long line is difficult to read, so I broke it up into
chunks by assembling a roff string.  That in turn resurrected the
necessity for doubling the backslashes.

> It's idiomatic to have the pipe at the end of the line.

I'm not familiar with this idiom, and don't agree with it.  It may be
obviated now, given the roff string construction approach.

> By design, this also avoids the backslash clutter in the shell.

Not relevant to the contents of a roff macro file.

Thanks for the tips.  It turned out sed wasn't really the problem here,
but an anemic means of constructing C/C++ strings for passage to
system(3).

Regards,
Branden

