Re: Encoding non-alphanumeric characters in manpage filenames

Robert Elz Tue, 09 Nov 2021 01:08:28 -0800

    Date:        Mon, 8 Nov 2021 18:14:23 -0500 (EST)
    From:        Mouse <mo...@rodents-montreal.org>
    Message-ID:  <202111082314.saa13...@stone.rodents-montreal.org>


  | > <slash> is posix speak for '/'
  |
  | But is that "Unicode codepoint 47" or "ASCII codepoint 0x2f" or
  | "whatever the character set in use provides that is a line between
  | upper right and lower left" or what?

That's XBD 6...

6.1 Portable Character Set

        Conforming implementations shall support one or more coded
        character sets. Each supported locale shall include the portable
        character set, which is the set of symbolic names for characters in
        Table 6-1. This is used to describe characters within the text of
        POSIX.1-202x. The first eight entries in Table 6-1 and all
        characters in Table 6-2 (on page 110) are defined in the
        ISO/IEC 6429: 1992 standard. The rest of the characters in Table 6-1
        are defined in the ISO/IEC 10646-1: 2000 standard.

Table 6-1 contains all the usual suspects (ascii, mostly non-control),
one of which is

                <slash>, <solidus>  / <U002F> SOLIDUS

where the columns are Symbolic Name(s) (<slash> and <solidus> here),
Glyph ('/'), UCS (<U002F>) and Description (SOLIDUS).

(My cut & paste doesn't extract tables very well, the borders and column
dividers simply go missing...)

The "first 8" are (the required, or "portable" control chars)

<NUL>                    <U0000>  NULL (NUL)
<alert>, <BEL>           <U0007>  BELL
<backspace>, <BS>        <U0008>  BACKSPACE
<tab>, <HT>              <U0009>  CHARACTER TABULATION
<newline>, <LF>          <U000A>  LINE FEED (LF)
<vertical-tab>, <VT>     <U000B>  LINE TABULATION
<form-feed>, <FF>        <U000C>  FORM FEED (FF)
<carriage-return>, <CR>  <U000D>  CARRIAGE RETURN (CR)

(The "glyph" column exists in those entries, but is empty, and cannot be
seen at all in this cut & paste.)

The rest of 6-1 includes most of the other chars you'd expect to find
in any good ascii char set: the printables, plus <space> (0x20), which is
entry #9 and also has an empty glyph (or a glyph field containing just
spaces; in a printed/printable form there's no real difference). Aside
from those first 8, there are no control chars, and no DEL.

The last two in the table are (big surprise):

<right-brace>, <right-curly-bracket> } <U007D> RIGHT CURLY BRACKET
<tilde>  ~ <U007E> TILDE

(sorry for lack of column alignment, I'm too lazy to attempt to fix
it, and it probably wouldn't render correctly anyway).

Unless you need to know what POSIX decides to call them (the symbolic names,
which are mostly obvious), an ascii chart is all you need: everything in
0x20..0x7e is identical, so there's no need to show them all here (if you
don't think you have an ascii chart, look at /usr/share/misc/ascii).
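
As a rough illustration of the ranges just described, membership in
Table 6-1 can be tested on byte values. This is only a sketch: the
function name in_portable_charset() is made up for this example (it is
not a POSIX or libc interface), and it assumes the ASCII-compatible
encoding discussed here.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not from POSIX or any library): report whether a
 * byte value is one of the characters listed in Table 6-1, assuming an
 * ASCII-compatible encoding.
 */
bool
in_portable_charset(unsigned char c)
{
        if (c == 0x00)                  /* <NUL> */
                return true;
        if (c >= 0x07 && c <= 0x0D)     /* <alert> .. <carriage-return> */
                return true;
        return c >= 0x20 && c <= 0x7E;  /* <space> .. <tilde> */
}
```

Everything else below 0x80 (0x01-0x06, 0x0E-0x1F and 0x7F) is rejected
by this check.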

Table 6-2 is the "non-portable control chars", and includes the rest
of the chars that are in ascii 0x01-0x06, 0x0E-0x1F, and DEL (0x7F).

For obvious reasons there is no "glyph" column in 6-2:

<SOH>  <U0001> START OF HEADING
<STX>  <U0002> START OF TEXT
[....]
<IS1>, <US>  <U001F> INFORMATION SEPARATOR ONE
<DEL>        <U007F> DELETE

So:
  | Does POSIX mandate an ASCII superset, for example?

Yes (though perhaps subset would be more accurate).
Or a unicode (10646) subset, however you want to think of it.

But POSIX is describing traditional unix, and there, ascii is king
(and however you want to imagine the filesystem filename strings to
be represented, in practice, ascii is required for values <= 127,
beyond that anything goes - that's outside the standard, both the
practical standard, as in what "everyone" does, and the written one).

  | > 3.141 Filename
  |
  | >         A sequence of bytes consisting of 1 to {NAME_MAX} bytes used to
  | >      name a file. The bytes composing the name shall not contain the <NUL>
  | >      or <slash> characters.  [...]
  |
  | I think for some character sets that may be ill-defined,

The portable char set is the only one that matters, and that's ascii (or
unicode, for code points 0..127 they're the same).

  | and it definitely contradicts existing practice (which is that the octet
  | string shall not contain 0x00 or 0x2f octets,

As above, <NUL> is 0x00 (U0000) and <slash> is 0x2f (U002f), so what it
is saying is exactly what you expect, except that anything which is not
using the portable char set is out of scope, and not specified.
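
Read that way, the 3.141 rule reduces to a length check plus a scan for
two byte values. A minimal sketch follows; filename_bytes_ok() is a
hypothetical name, and the limit is passed in by the caller (for
instance from pathconf(dir, _PC_NAME_MAX)) rather than assumed.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: check a candidate filename (a byte sequence of
 * length len) against the 3.141 wording quoted above.  name_max is
 * supplied by the caller.
 */
bool
filename_bytes_ok(const unsigned char *name, size_t len, size_t name_max)
{
        size_t i;

        if (len < 1 || len > name_max)
                return false;
        for (i = 0; i < len; i++)
                if (name[i] == 0x00 || name[i] == 0x2F) /* <NUL>, <slash> */
                        return false;
        return true;
}
```

Bytes above 127 pass this check untouched, which matches the "beyond
that anything goes" practice described earlier.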

  | >> For example, what happens if you find that you have both, say, ls.0
  | >> and %6Cs.0 in a cat1/ directory somewhere?
  | > Obviously, whenever one picks a character to have special meaning,
  | > there needs to be a way to encode that character,
  |
  | No, that's not what I mean.

Yes, I realised that when I saw someone else's reply, which turned on
the light bulb in my brain (it's only a couple of watts, incandescent,
it is that old - may in fact even be a tallow candle .. it is behind my
eyes, so I cannot actually see it).

  | The point is, man(1) has to find the underlying file.

You want to think of this from the encoding point of view, rather than
decoding.  There should be a canonical encoding, for any given name,
which is what man(1) would be looking for.   Any files that happen to
exist which would decode to the same thing, but aren't canonically encoded
would simply not be found.   As long as you apply the same canonical encoding
routine all the time (when creating files, and when looking for one) there
is no issue at all.   If someone decides to ignore that and create other
files, they still can, they just don't work properly.   Tough.

The canonical encoding form can be whatever floats your boat, but I'd
suggest most chars represent themselves, except when that doesn't work,
and in that case we encode that char (and of course, as in my mistaken
interpretation of your question, always encode the magic char).

With this, the code to look up an arbitrary possible name would be

        if ((fd = open(Man_Encode(name), O_RDONLY)) >= 0)
                display_page(name, fd);
        else
                error("No man page: ", name);

or something like that (and of course, fd & open can be replaced by
fp & fopen, and in either case, something needs to close the file...)

The assumption in this form is that Man_Encode() cannot fail - it might
perhaps abort and exit if it gets ENOMEM or similar errors, but if it
returns it always returns a pointer to the encoded name.   Dealing with
memory management for that is someone else's problem...
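
One possible shape for such an encoder, purely as a sketch. Several
things here are assumptions and not anything settled in this thread:
that '%' is the magic char and the encoding is %XX with upper-case hex
(as the %6Cs.0 example suggests), that the chars which "don't work" are
'/', space, control chars and anything above 0x7e, and that the caller
frees the result (which the one-line open() call above does not do).
It encodes only the page name, not a full path.

```c
#include <stdlib.h>
#include <string.h>

/* Assumption: '%' is the magic char, as in the %6Cs.0 example. */
#define MAGIC '%'

/* Assumption: which bytes do not work as themselves in a filename. */
static int
needs_encoding(unsigned char c)
{
        return c == MAGIC || c == '/' || c < 0x21 || c > 0x7E;
}

/*
 * Return a newly allocated canonical encoding of name.  Never returns
 * NULL: aborts if memory cannot be allocated.  Caller frees the result.
 */
char *
Man_Encode(const char *name)
{
        static const char hex[] = "0123456789ABCDEF";
        size_t len = strlen(name);
        char *out = malloc(3 * len + 1);        /* worst case: all %XX */
        char *p = out;

        if (out == NULL)
                abort();
        for (; *name != '\0'; name++) {
                unsigned char c = (unsigned char)*name;

                if (needs_encoding(c)) {
                        *p++ = MAGIC;
                        *p++ = hex[c >> 4];
                        *p++ = hex[c & 0x0F];
                } else
                        *p++ = (char)c;
        }
        *p = '\0';
        return out;
}
```

With this sketch "ls" encodes to "ls", so a stray %6Cs.0 is never the
file that gets opened for ls; a page really named "%6Cs" would be looked
up as "%256Cs", since the magic char itself is always encoded.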

kre
