Follow-up Comment #14, bug #65108 (group groff):

[comment #13 comment #13:]
> [comment #12 comment #12:]
> > I feel like we're saying the same thing, or compatible things.
>
> Quite possibly.
>
> > A file named "résumé1.ms" might be stored on the file system
> > using either character encoding,
>
> ...or, as my example attempted to illustrate, _two_ files might be
> stored, each using a different encoding.

Yes, a better point than I initially gave it credit for. So, ideally, we want GNU _troff_ requests to be able to refer unambiguously to either one.

> Similar to the contents of a file, a filename is just a string of
> bytes. What characters those bytes _mean_ is defined by the encoding.

This, I'll quibble with. An encoding is simply a map between integers and abstract characters. Nowadays, in the post-ISO 8859 watershed when encoding designers got more woke to the difficulties of large character sets and clashing cultural interpretations of certain symbols, these abstract characters tend to have names. In the innocent days of USAS X3.4-1968, one simply printed a chart with numbered boxes and unnamed glyphs, implying that a rendering device should "make the characters [http://koplowicz.com/content/kde-vs-gnome-2 look *just like that*!]"

Importantly, what distinguishes ISO 10646 from Unicode is that the former is _only_ a character encoding standard--the aforementioned mapping--whereas Unicode is a character _set_ standard, the normative responsibilities of which have cast a surprisingly large penumbra when regarded from the perspective of more innocent 7- and 8-bit character days.

> A file can contain metadata to indicate its encoding; if not, there's
> often enough context for tools like preconv (or even the system's
> "file" command) to correctly guess it.

Right. But a file _name_ *can't*; not on POSIX systems. There's no "resource fork" to indicate this. The file system may impose an encoding (_maybe_), but as far as I know there's no portable way to query such information.

> The settings of one's terminal and LC_CTYPE environment variable
> affect how the string of bytes in a filename is interpreted.

Not always. And there's the rub.

fopen(3):

    #include <stdio.h>

    FILE *fopen(const char *pathname, const char *mode);

    $ sed -n '/^static void do_open/,/^}/p' src/roff/troff/input.cpp
    static void do_open(bool append)
    {
      symbol stream = get_name(true /* required */);
      if (!stream.is_null()) {
        symbol filename = get_long_name(true /* required */);
        if (!filename.is_null()) {
          errno = 0;
          FILE *fp = fopen(filename.contents(), append ? "a" : "w");
          if (0 /* nullptr */ == fp) {
            error("cannot open file '%1' for %2: %3",
                  filename.contents(),
                  append ? "appending" : "writing",
                  strerror(errno));
            fp = (FILE *)stream_dictionary.remove(stream);
          }
          else
            fp = (FILE *)stream_dictionary.lookup(stream, fp);
          if (fp)
            fclose(fp);
        }
      }
      skip_line();
    }

> There may not be enough context to guess. There's no metadata (that
> I'm aware of, though I'd be happy to be wrong) to make the name's
> encoding definitive.

Precisely. The way we're getting at file names is a C string with *no implied encoding*. They're just bytes.

And GNU _troff_ requests are not expressive enough, at present, to supply _fopen_() with a sequence of "just bytes". Mostly, that's a good thing, because it keeps the formatter's own language more sane. But we're limited to printable ASCII characters (with fuzz around the edges, like space 0x20 and delete 0x7F). Tabs are right out. Backslashes...should work? Theoretically? If doubled? Do we need to double them again for C's sake, given that it's an escape character there too? CSTR #54 offers no specification in this area.

We need an escape hatch, as Kernighan famously noted when critiquing Pascal's lack of them in CSTR #100.

That escape hatch is what I mean to provide, by repurposing GNU _troff_'s Unicode special character escape sequence syntax.

That choice I knew would pinch a little when I made it, because it's not actually representing special characters here...or even, in this application, Unicode, due to the range limitation--and that pinch is something I'm feeling now while trying to reach a meeting of the minds with Deri over what we mean when we type these things in non-formatting contexts.

> > That's why I want to be able to support:
> >
> > $ grep -F .so résumé.ms
> > .so r\[u00E9]sum\[u00E9]1.ms
> > .so r\[u00E9]sum\[u00E9]2.ms
> > .so r\[u00E9]sum\[u00E9]3.ms
>
> Agreed, but I think it's ambiguous which of the two files I created in
> comment #11 a construction like this refers to.

My answer is straightforward. I mean to apply a transformation to `filename.contents()` in the `do_open()` function above (actually via a helper function, because I'll need it for bug #64071 too) such that sequences matching `\[u0000]..\[u00FF]` map to C language octal escapes in the range \000 to \377. That transformed string is what I would hand to _fopen_().

Some complications arise:

* \000 itself won't work as "desired". But this is not a practical problem, as 50+ years of Unix and C have led no one to expect that they can infix nulls in any file name anywhere.

* The matter of other C0 controls (so, \001 to \037) is a vexing one. I would strongly prefer to stay out of the morass altogether. To see what I mean, and if you have an hour or so to spare, peruse [https://www.austingroupbugs.net/view.php?id=251 Austin Group ticket 251]. This issue has received deep attention from experts. Consequently my plan right now is to reject `\[u0000]` through `\[u001F]`, inclusive--meaning throw an error diagnostic and abort the request.

> They both, from some viewpoint, have the base filename "résumé".

That viewpoint is not the one taken by _fopen_(), which sees only a sequence of 8-bit bytes, to which it ascribes no particular meaning. From that stance, the Latin-1 vs. UTF-8 encodings of "résumé" plainly differ.

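To make the proposed transformation concrete, here is a minimal sketch. It is not code from the groff tree: the names `decode_file_name`, `escape_value`, and `hex_digit` are hypothetical, and the sketch assumes an escape is written with exactly four uppercase hexadecimal digits. Text that does not have the `\[uXXXX]` shape is copied through unchanged, which is also an assumption; the real helper might well diagnose it instead.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers, for illustration only; not groff's actual code.

// Value of an uppercase hexadecimal digit, or -1 if c is not one.
static int hex_digit(char c)
{
  if (c >= '0' && c <= '9')
    return c - '0';
  if (c >= 'A' && c <= 'F')
    return c - 'A' + 10;
  return -1;
}

// Value of a "\[uXXXX]" escape starting at in[i], or -1 if the text
// there does not have that shape.
static long escape_value(const std::string &in, std::size_t i)
{
  if (in.compare(i, 3, "\\[u") != 0 || i + 8 > in.size()
      || in[i + 7] != ']')
    return -1;
  long value = 0;
  for (std::size_t j = i + 3; j < i + 7; j++) {
    int d = hex_digit(in[j]);
    if (d < 0)
      return -1;
    value = value * 16 + d;
  }
  return value;
}

// Rewrite each \[u0020]..\[u00FF] escape in `in` as the single byte
// it names, storing the result in `out`.  Returns false (leaving `out`
// partially filled) on a C0 control, \[u0000] through \[u001F], or on
// a value above 0xFF; the caller would then issue a diagnostic and
// abandon the request.
bool decode_file_name(const std::string &in, std::string &out)
{
  out.clear();
  std::size_t i = 0;
  while (i < in.size()) {
    long v = escape_value(in, i);
    if (v < 0) {
      out += in[i++];		// not an escape; copy the byte through
      continue;
    }
    if (v < 0x20 || v > 0xFF)
      return false;
    out += static_cast<char>(v);
    i += 8;			// length of "\[uXXXX]"
  }
  return true;
}
```

Under this sketch, `r\[u00E9]sum\[u00E9]1.ms` yields the Latin-1 bytes of the name (0xE9 for each "é"), whereas the UTF-8 file would be spelled `r\[u00C3]\[u00A9]sum\[u00C3]\[u00A9]1.ms` (0xC3 0xA9 for each "é"). The two inputs produce different byte strings, so the string handed to _fopen_() identifies exactly one of the two files.
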
> They can both coexist on the same file system, even in the same
> directory.

Yes! And that's why it's good that _fopen_() can tell them apart, and so can we, if we will meet it on its own terms!

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?65108>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/