[bug #65108] [troff] support construction of general file name request arguments

G. Branden Robinson Wed, 04 Sep 2024 15:34:22 -0700

Follow-up Comment #14, bug #65108 (group groff):

[comment #13 comment #13:]
> [comment #12 comment #12:]
> > I feel like we're saying the same thing, or compatible things.
> 
> Quite possibly.
> 
> > A file named "résumé1.ms" might be stored on the file system
> > using either character encoding,
> 
> ...or, as my example attempted to illustrate, _two_ files might be stored,
> each using a different encoding.


Yes, a better point than I initially gave it credit for.  So, ideally, we want
GNU _troff_ requests to be able to refer unambiguously to either one.

> Similar to the contents of a file, a filename is just a string of bytes.
> What characters those bytes _mean_ is defined by the encoding.

This, I'll quibble with.  An encoding is simply a map between integers and
abstract characters.  Nowadays, in the post-ISO 8859 watershed when encoding
designers got more woke to the difficulties of large character sets and
clashing cultural interpretations of certain symbols, these abstract
characters tend to have names.  In the innocent days of USAS X3.4-1968, one
simply printed a chart with numbered boxes and unnamed glyphs, implying that a
rendering device should "make the characters
[http://koplowicz.com/content/kde-vs-gnome-2 look *just like that*!]" 

Importantly, what distinguishes ISO 10646 from Unicode is that the former is
_only_ a character encoding standard--the aforementioned mapping--whereas
Unicode is a character _set_ standard, the normative responsibilities of which
have cast a surprisingly large penumbra regarded from the perspective of more
innocent 7- and 8-bit character days.
 
> A file can contain metadata to indicate its encoding; if not, there's often
> enough context for tools like preconv (or even the system's "file" command)
> to correctly guess it.

Right.  But a file _name_ *can't*; not on POSIX systems.  There's no "resource
fork" to indicate this.  The file system may impose an encoding (_maybe_), but
as far as I know there's no portable way to query such information.
 
> The settings of one's terminal and LC_CTYPE environment variable affect how
> the string of bytes in a filename is interpreted.

Not always.  And there's the rub.


fopen(3):
       #include <stdio.h>

       FILE *fopen(const char *pathname, const char *mode);



$ sed -n '/^static void do_open/,/^}/p' src/roff/troff/input.cpp
static void do_open(bool append)
{
  symbol stream = get_name(true /* required */);
  if (!stream.is_null()) {
    symbol filename = get_long_name(true /* required */);
    if (!filename.is_null()) {
      errno = 0;
      FILE *fp = fopen(filename.contents(), append ? "a" : "w");
      if (0 /* nullptr */ == fp) {
        error("cannot open file '%1' for %2: %3",
              filename.contents(),
              append ? "appending" : "writing",
              strerror(errno));
        fp = (FILE *)stream_dictionary.remove(stream);
      }
      else
        fp = (FILE *)stream_dictionary.lookup(stream, fp);
      if (fp)
        fclose(fp);
    }
  }
  skip_line();
}


> There may not be enough context to guess.  There's no metadata (that I'm
> aware of, though I'd be happy to be wrong) to make the name's encoding
> definitive.

Precisely.

The way we're getting at file names is a C string with *no implied encoding*.

They're just bytes.  And GNU _troff_ requests are not expressive enough, at
present, to supply _fopen_() with a sequence of "just bytes".  Mostly, that's
a good thing, because it keeps the formatter's own language more sane.  But
we're limited to printable ASCII characters (with fuzz around the edges, like
space 0x20 and delete 0x7F).  Tabs are right out.  Backslashes...should work? 
Theoretically?  If doubled?  Do we need to double them again for C's sake,
given that it's an escape character there too?  CSTR #54 offers no
specification in this area.

We need an escape hatch, as Kernighan famously noted when critiquing Pascal's
lack of one in CSTR #100.

That escape hatch is what I mean to provide, by repurposing GNU _troff_'s
Unicode special character escape sequence syntax.  That choice I knew would
pinch a little when I made it, because it's not actually representing special
characters here...or even, in this application, Unicode, due to the range
limitation--and that pinch is something I'm feeling now while trying to reach
a meeting of the minds with Deri over what we mean when we type these things
in non-formatting contexts.
 
> > That's why I want to be able to support:
> > 

> > $ grep -F .so résumé.ms
> > .so r\[u00E9]sum\[u00E9]1.ms
> > .so r\[u00E9]sum\[u00E9]2.ms
> > .so r\[u00E9]sum\[u00E9]3.ms


> 
> Agreed, but I think it's ambiguous which of the two files I created in
> comment #11 a construction like this refers to.

My answer is straightforward.  I mean to apply a transformation to
`filename.contents()` in the `do_open()` function above (actually via a helper
function, because I'll need it for bug #64071 too) such that sequences
matching `\[u0000]..\[u00FF]` map to the single bytes that C writes as the
octal escapes \000 through \377.  That transformed string is what I would hand
to _fopen_().

Some complications arise:

* \000 itself won't work as "desired".  But this is not a practical problem,
as 50+ years of Unix and C have led no one to expect that they can infix nulls
in any file name anywhere.

* The matter of other C0 controls (so, \001 to \037) is a vexing one.  I would
strongly prefer to stay out of the morass altogether.  To see what I mean, and
if you have an hour or so to spare, peruse
[https://www.austingroupbugs.net/view.php?id=251 Austin Group ticket 251]. 
This issue has received deep attention from experts.

Consequently my plan right now is to reject `\[u0000]` through `\[u001F]`,
inclusive--meaning throw an error diagnostic and abort the request.
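
To make the plan concrete, here is a minimal sketch of such a helper.  It is
not the actual groff change: the name `decode_file_name` is hypothetical, and
the sketch assumes that an escape is exactly `\[u` plus four uppercase
hexadecimal digits plus `]`, that code points above 00FF are rejected along
with the C0 controls, and that anything else passes through untouched.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not groff source).  Copies `name` to `out`,
// replacing each \[uXXXX] sequence (four uppercase hex digits) with the
// single byte 0xXX.  Returns false if a sequence names a C0 control
// (0000..001F) or a value above 00FF; the caller would then issue an
// error diagnostic and abort the request.  On failure the contents of
// `out` are unspecified.
bool decode_file_name(const std::string &name, std::string &out)
{
  out.clear();
  std::size_t i = 0;
  while (i < name.size()) {
    // A candidate occupies eight characters: \ [ u X X X X ]
    if (name.compare(i, 3, "\\[u") == 0
        && i + 7 < name.size() && name[i + 7] == ']') {
      unsigned value = 0;
      bool is_hex = true;
      for (std::size_t j = i + 3; j < i + 7; j++) {
        char c = name[j];
        if (c >= '0' && c <= '9')
          value = value * 16 + (c - '0');
        else if (c >= 'A' && c <= 'F')
          value = value * 16 + (c - 'A' + 10);
        else {
          is_hex = false;
          break;
        }
      }
      if (is_hex) {
        if (value <= 0x1F || value > 0xFF)
          return false;  // rejected: C0 control or out of byte range
        out += static_cast<char>(value);
        i += 8;
        continue;
      }
    }
    // Not an escape we recognize (assumption: pass it through as-is).
    out += name[i++];
  }
  return true;
}
```

Under these assumptions, `r\[u00E9]sum\[u00E9]1.ms` becomes the 10-byte name
with 0xE9 in the second and sixth positions, whereas a name containing
`\[u0009]` is refused.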

> They both, from some viewpoint, have the base filename "résumé".

That viewpoint is not the one taken by _fopen_(), which sees only a sequence
of 8-bit bytes, to which it ascribes no particular meaning.  From that stance,
the Latin-1 vs. UTF-8 encodings of "résumé" plainly differ.
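
A small illustration of that difference, using only standard C++.  The names
`hex_bytes`, `latin1_name`, and `utf8_name` are made up for this example; the
byte values are written as explicit escapes so that the encoding of the source
file itself plays no part.

```cpp
#include <cstddef>
#include <string>

// Render each byte of `s` as two uppercase hex digits, space-separated.
std::string hex_bytes(const std::string &s)
{
  static const char digits[] = "0123456789ABCDEF";
  std::string out;
  for (std::size_t i = 0; i < s.size(); i++) {
    unsigned char b = static_cast<unsigned char>(s[i]);
    if (i > 0)
      out += ' ';
    out += digits[b >> 4];
    out += digits[b & 0x0F];
  }
  return out;
}

// "résumé" in ISO 8859-1 (Latin-1): é is the single byte 0xE9.
const std::string latin1_name = "r\xE9sum\xE9";
// "résumé" in UTF-8: é is the two-byte sequence 0xC3 0xA9.
const std::string utf8_name = "r\xC3\xA9sum\xC3\xA9";
```

Here `hex_bytes(latin1_name)` yields "72 E9 73 75 6D E9" and
`hex_bytes(utf8_name)` yields "72 C3 A9 73 75 6D C3 A9": six bytes versus
eight, and a plain byte comparison never treats them as equal.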

> They can both coexist on the same file system, even in the same directory.

Yes!  And that's why it's good that _fopen_() can tell them apart, and so can
we, if we will meet it on its own terms!


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?65108>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
