At 2023-04-17T04:29:47-0500, G. Branden Robinson wrote: > 1. The doubled backslashes inside the single-quoted sed expression > are unnecessary. They would be required if the escape character > _hadn't_ been changed, but it was. This solecism came in that way > with the first commit of pdfpic.tmac in August 2015.[1] > 2. The '\n' in the _replacement_ text of a sed expression is an > error.[2] It was part of the same commit.
I have to retract this. They are necessary. But the reason is devilish. I was really wondering about this because even the GNU sed Texinfo manual, (as far as I can find) doesn't say it is possible for s/foo/bar\nbaz/ to work to stick a newline into the replacement text. But it is, and macOS sed supports this, too. And that's a damn good thing, as it turns out. To fully understand the quoting issues involved in using *roff's `sy` request, we have to understand that after the formatter interprets the request's arguments, they become a C/C++ language character string that is processed by a function that reads the argument byte by byte for passage to the old C library system() function. With no state tracking of whether the previous byte seen was '\'. If Bernd understood this, I wish he had explained it to the rest of us. So I refactored this byzantine carnival to clarify the situation. I tested it on (Debian) GNU/Linux, macOS 12, and Solaris 10 and 11, the last three via FSF France's compiler farm. The change is attached, with (more) detailed explanation. We still won't be able to support Solaris sed, in either its /usr/bin or /usr/xpg4/bin forms. Unless we change the way GNU troff processes `sy` arguments, we will forever need a non-standard extension to sed(1) to pull of this sort of tomfoolery. (Or unless POSIX standardizes this application of the '\', 'n' sequence in sed commands. If they do, I hope they apply it to 'a', 'c', and 'i' as well.) But as it happens, changing GNU troff's `sy` handling coincides with ideas I had years ago to enable GNU troff to open files with weird characters in their names... Regards, Branden
commit fcfb185d96aaaf123d98696a1402e1f05bf3da24 Author: G. Branden Robinson <g.branden.robin...@gmail.com> Date: Mon Apr 17 16:41:33 2023 -0500 [pdfpic]: Fix Savannah #64061. * tmac/pdfpic.tmac: Refactor to make comprehensible some woefully undocumented cleverness and improve efficiency. (PDFPIC): Break out flaming-hoop-leaping "clever" bit of `sy` usage into its own macro, calling from here and relocating its requests from here... (pdfpic*system): ...to here. When using `sy` request to collect and munge output of pdfinfo(1), (a) disable the escape character while defining the macro; (b) construct the command in a roff string, appending to it in discrete, hopefully comprehensible chunks; (c) disable the escape character during macro interpretation wherever possible (most of it); (d) retain doubled backslashes so that they survive subsequent string interpolation; (e) stop using grep(1) in the pipeline when sed(1) is perfectly capable of performing its own input filtering; (f) invoke sed with '-n' option and emit output only upon a successful substitution; (g) use multiple sed expressions with '-e' because some sed implementations don't support semicolons after test/branch or label commands; and (h) replace unportable POSIX character class '[:digit:]' in substitution matching text with '[0-9]'. Annotate portability and escaping challenges. Tested on GNU/Linux, macOS 12, and (with simulated pdfinfo(1) output), on Solaris 11. Even with all of that, there is _still_ a problem; the C++ function that GNU troff uses to assemble the command string {character by character} _does not recognize C/C++ string literal escape sequences_. This means that you _cannot_ embed "\n" in `sy`'s arguments and have it survive, as a newline character, into the command string passed to the standard C library's system(3) function. ("A\nB" gets encoded as 'A', '\', 'n', 'B', not 'A', '\n', 'B'.) Unfortunately, this appears to be AT&T troff-compatible behavior. But it means that you _cannot_ construct a portable multi-line replacement text for sed's 's' command. (Other sed commands like 'a', 'c', and 'i' will be similarly affected.) We therefore (continue to) rely upon a non-standard feature of GNU and macOS sed, such that the sequence "\n" in replacement text becomes a newline in sed's pattern space. Fixes <https://savannah.gnu.org/bugs/?64061>. Thanks to Bruno Haible for the report, and to him and Ralph Corderoy for the discussion of portable sed constructs. diff --git a/ChangeLog b/ChangeLog index b755f6dc4..d5f373271 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,48 @@ +2023-04-19 G. Branden Robinson <g.branden.robin...@gmail.com> + + * tmac/pdfpic.tmac: Refactor to make comprehensible some + woefully undocumented cleverness and improve efficiency. + (PDFPIC): Break out flaming-hoop-leaping "clever" bit of `sy` + usage into its own macro, calling from here and relocating its + requests from here... + (pdfpic*system): ...to here. When using `sy` request to collect + and munge output of pdfinfo(1), (a) disable the escape character + while defining the macro; (b) construct the command in a roff + string, appending to it in discrete, hopefully comprehensible + chunks; (c) disable the escape character during macro + interpretation wherever possible (most of it); (d) retain + doubled backslashes so that they survive subsequent string + interpolation; (e) stop using grep(1) in the pipeline when + sed(1) is perfectly capable of performing its own input + filtering; (f) invoke sed with '-n' option and emit output only + upon a successful substitution; (g) use multiple sed expressions + with '-e' because some sed implementations don't support + semicolons after test/branch or label commands; and (h) replace + unportable POSIX character class '[:digit:]' in substitution + matching text with '[0-9]'. Annotate portability and escaping + challenges. Tested on GNU/Linux, macOS 12, and (with simulated + pdfinfo(1) output), on Solaris 11. + + Even with all of that, there is _still_ a problem; the C++ + function that GNU troff uses to assemble the command string + {character by character} _does not recognize C/C++ string + literal escape sequences_. This means that you _cannot_ embed + "\n" in `sy`'s arguments and have it survive, as a newline + character, into the command string passed to the standard C + library's system(3) function. ("A\nB" gets encoded as 'A', '\', + 'n', 'B', not 'A', '\n', 'B'.) Unfortunately, this appears to + be AT&T troff-compatible behavior. But it means that you + _cannot_ construct a portable multi-line replacement text for + sed's 's' command. (Other sed commands like 'a', 'c', and 'i' + will be similarly affected.) We therefore (continue to) rely + upon a non-standard feature of GNU and macOS sed, such that the + sequence "\n" in replacement text becomes a newline in sed's + pattern space. + + Fixes <https://savannah.gnu.org/bugs/?64061>. Thanks to Bruno + Haible for the report, and to him and Ralph Corderoy for the + discussion of portable sed constructs. + 2023-04-05 G. Branden Robinson <g.branden.robin...@gmail.com> * tmac/tty.tmac: Add angle bracket fallbacks. diff --git a/tmac/pdfpic.tmac b/tmac/pdfpic.tmac index f81b66e7f..5b7e1238c 100644 --- a/tmac/pdfpic.tmac +++ b/tmac/pdfpic.tmac @@ -52,6 +52,41 @@ . rr pdfpic*desired-height .. . +.\" Get image dimensions. The `tr` command to strip null bytes is +.\" distasteful, but its necessity is imposed on us. See +.\" <https://gitlab.freedesktop.org/poppler/poppler/-/issues/776>. +.\" +.\" The following is a circus of portability and escaping constraints. +.\" See <https://savannah.gnu.org/bugs/index.php?64061>. Modify it +.\" only with great caution and after testing with a variety of seds. +.\" (See the "HACKING" file in groff's source distribution.) Keep in +.\" mind that the argument(s) to `sy` are assembled into a C string +.\" and then passed to system(3), which invokes "sh -c" on the string. +.\" +.\" We therefore shut off roff's escape mechanism _twice_: once while +.\" defining the macro, and once while interpreting part of it, to +.\" preserve backslashes that we need in the constructed C string. +.\" +.\" We _still_ must escape the backslashes in the string appendments to +.\" keep them from being interpreted as commencing roff escape sequences +.\" when the string they populate is later interpolated. +.eo +.de pdfpic*system +. ds pdfpic*command pdfinfo \$1 +. eo +. as pdfpic*command " | tr -d '\\000' +. as pdfpic*command " | sed -n -e '/Page *size:/ +. as pdfpic*command s/Page *size: *\\([0-9.]*\\) *x *\([0-9.]*\\).*$/ +. as pdfpic*command .nr pdfpic*width (p;\\1)\\n +. as pdfpic*command .nr pdfpic*height (p;\\2)/' +. as pdfpic*command " -e tprint -e b -e :print -e p +. ec +. as pdfpic*command " > \*[pdfpic*temporary-file] +. sy \*[pdfpic*command] +. rm pdfpic*command +.. +.ec +. .de PDFPIC . if !\\n[.U] \{\ . pdfpic@error use of \\$0 requires GNU troff's unsafe mode \ @@ -161,19 +196,7 @@ skipping '\\$1' . return . \} . ds pdfpic*temporary-file \\*[pdfpic*temporary-directory]/pdfpic\n[$$] -. -. \" Get image dimensions. The `tr` command to strip null bytes is -. \" distasteful, but its necessity is imposed on us. See -. \" <https://gitlab.freedesktop.org/poppler/poppler/-/issues/776>. -. ec @ -. sy pdfinfo @$1 | \ -tr -d '\000' | \ -grep "Page *size" | \ -sed -e 's/Page *size: *\\([[:digit:].]*\\) *x *\\([[:digit:].]*\\).*$/\ -.nr pdfpic*width (p;\\1)\\n\ -.nr pdfpic*height (p;\\2)/' \ -> @*[pdfpic*temporary-file] -. ec +. pdfpic*system \\$1 . if \\n[systat] \{\ . pdfpic@error retrieval of '\\$1' image dimensions failed with \ exit status \\n[systat]; skipping
signature.asc
Description: PGP signature