Date:        Mon, 8 Nov 2021 18:14:23 -0500 (EST)
From:        Mouse <mo...@rodents-montreal.org>
Message-ID:  <202111082314.saa13...@stone.rodents-montreal.org>

| > <slash> is posix speak for '/'
|
| But is that "Unicode codepoint 47" or "ASCII codepoint 0x2f" or
| "whatever the character set in use provides that is a line between
| upper right and lower left" or what?

That's XBD 6...

    6.1 Portable Character Set

    Conforming implementations shall support one or more coded
    character sets. Each supported locale shall include the portable
    character set, which is the set of symbolic names for characters
    in Table 6-1. This is used to describe characters within the text
    of POSIX.1-202x. The first eight entries in Table 6-1 and all
    characters in Table 6-2 (on page 110) are defined in the
    ISO/IEC 6429: 1992 standard. The rest of the characters in
    Table 6-1 are defined in the ISO/IEC 10646-1: 2000 standard.

Table 6-1 contains all the usual suspects (ascii, mostly non-control),
one of which is

    <slash>, <solidus>  /  <U002F>  SOLIDUS

where the columns are Symbolic Name(s) (<slash> and <solidus> here),
Glyph ('/'), UCS (<U002F>) and Description (SOLIDUS). (My cut & paste
doesn't extract tables very well, the borders and column dividers
simply go missing...)

The "first 8" are (the required, or "portable" control chars)

    <NUL>  <U0000>  NULL (NUL)
    <alert>, <BEL>  <U0007>  BELL
    <backspace>, <BS>  <U0008>  BACKSPACE
    <tab>, <HT>  <U0009>  CHARACTER TABULATION
    <newline>, <LF>  <U000A>  LINE FEED (LF)
    <vertical-tab>, <VT>  <U000B>  LINE TABULATION
    <form-feed>, <FF>  <U000C>  FORM FEED (FF)
    <carriage-return>, <CR>  <U000D>  CARRIAGE RETURN (CR)

(The "glyph" column exists in those entries, but is empty, and cannot
be seen at all in this cut & paste.)

The rest of 6-1 includes most of the rest of the chars you'd expect to
find in any good ascii char set (the printables, plus <space> (0x20),
which is #9 and also has an empty glyph .. or a glyph field containing
just spaces, in a printed/printable form there's no real difference) -
aside from those first 8, no control chars, and not DEL.

The last two in the table are (big surprise):

    <right-brace>, <right-curly-bracket>  }  <U007D>  RIGHT CURLY BRACKET
    <tilde>  ~  <U007E>  TILDE

(sorry for lack of column alignment, I'm too lazy to attempt to fix it,
and it probably wouldn't render correctly anyway).

Unless you need to know what POSIX decides to call them (the symbolic
names) which are mostly obvious, if you have an ascii chart, everything
0x20..0x7e is identical, so there's no need to show them all here (if
you don't think you have an ascii chart, look at /usr/share/misc/ascii).

Table 6-2 is the "non-portable control chars", and includes the rest of
the chars that are in ascii 0x01-0x06, 0x0E-0x1F, and DEL (0x7F). For
obvious reasons there is no "glyph" column in 6-2:

    <SOH>  <U0001>  START OF HEADING
    <STX>  <U0002>  START OF TEXT
    [....]
    <IS1>, <US>  <U001F>  INFORMATION SEPARATOR ONE
    <DEL>  <U007F>  DELETE

So:

| Does POSIX mandate an ASCII superset, for example?

Yes (though perhaps subset would be more accurate). Or a unicode
(10646) subset, however you want to think of it. But POSIX is
describing traditional unix, and there, ascii is king (and however you
want to imagine the filesystem filename strings to be represented, in
practice, ascii is required for values <= 127, beyond that anything
goes - that's outside the standard, both the practical standard, as in
what "everyone" does, and the written one).

| > 3.141 Filename
|
| > A sequence of bytes consisting of 1 to {NAME_MAX} bytes used to
| > name a file. The bytes composing the name shall not contain the <NUL>
| > or <slash> characters. [...]
|
| I think for some character sets that may be ill-defined,

The portable char set is the only one that matters, and that's ascii
(or unicode, for code points 0..127 they're the same).

| and it definitely contradicts existing practice (which is that the octet
| string shall not contain 0x00 or 0x2f octets,

As above, <NUL> is 0x00 (U0000) and <slash> is 0x2f (U002f), so what it
is saying is exactly what you expect, except that anything which is not
using the portable char set is out of scope, and not specified.

| >> For example, what happens if you find that you have both, say, ls.0
| >> and %6Cs.0 in a cat1/ directory somewhere?

| > Obviously, whenever one picks a character to have special meaning,
| > there needs to be a way to encode that character,
|
| No, that's not what I mean.

Yes, I realised that when I saw someone else's reply, which turned on
the light bulb in my brain (it's only a couple of watts, incandescent,
it is that old - may in fact even be a tallow candle .. it is behind my
eyes, so I cannot actually see it).

| The point is, man(1) has to find the underlying file.

You want to think of this from the encoding point of view, rather than
decoding. There should be a canonical encoding, for any given name,
which is what man(1) would be looking for. Any files that happen to
exist which would decode to the same thing, but aren't canonically
encoded, would simply not be found.

As long as you apply the same canonical encoding routine all the time
(when creating files, and when looking for one) there is no issue at
all. If someone decides to ignore that and create other files, they
still can, they just don't work properly. Tough.

The canonical encoding form can be whatever floats your boat, but I'd
suggest most chars represent themselves, except when that doesn't work,
and in that case we encode that char (and of course, as in my mistaken
interpretation of your question, always encode the magic char).

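As a rough illustration of that canonical encoding, here is a minimal
sketch in C. It assumes %XX (upper-case hex only) as the escape form,
with '%' as the magic char and '/' as the only other byte that needs
encoding; apart from the %6C example in the quoted question, none of
that is fixed by anything above, and this body for Man_Encode() is
purely hypothetical. It allocates the result and exits on allocation
failure, so if it returns it never returns NULL.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical canonical encoder (an assumption, not an existing API).
 * Every byte represents itself, except '/' (cannot appear in a
 * filename) and '%' (the magic char), which become %2F and %25.
 * Only upper-case hex is ever produced, so each name has exactly one
 * encoded form.  The caller is responsible for free()ing the result.
 */
char *
Man_Encode(const char *name)
{
	static const char hex[] = "0123456789ABCDEF";
	char *out, *p;

	assert(name != NULL);

	/* Worst case: every input byte expands to three output bytes. */
	out = malloc(3 * strlen(name) + 1);
	if (out == NULL) {
		perror("malloc");
		exit(1);
	}

	for (p = out; *name != '\0'; name++) {
		unsigned char c = (unsigned char)*name;

		if (c == '/' || c == '%') {
			*p++ = '%';
			*p++ = hex[c >> 4];
			*p++ = hex[c & 0x0F];
		} else {
			*p++ = (char)c;
		}
	}
	*p = '\0';
	return out;
}
```

Under these assumptions "ls.0" encodes to itself, while a page whose
name really is "%6Cs.0" encodes to "%256Cs.0", so the two can never
refer to the same file.
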
With this, the code to look up an arbitrary possible name would be

	if ((fd = open(Man_Encode(name), O_RDONLY)) >= 0)
		display_page(name, fd);
	else
		error("No man page: ", name);

or something like that (and of course, fd & open can be replaced by
fp & fopen, and in either case, something needs to close the file...)

The assumption in this form is that Man_Encode() cannot fail - it might
perhaps abort and exit if it gets ENOMEM or similar errors, but if it
returns it always returns a pointer to the encoded name. Dealing with
memory management for that is someone else's problem...

kre