    Date:        Mon, 8 Nov 2021 13:47:09 -0500 (EST)
    From:        Mouse <mo...@rodents-montreal.org>
    Message-ID:  <202111081847.naa28...@stone.rodents-montreal.org>

  | What does POSIX say?

From XBD (Base Definitions):

	3.243 Pathname

	A string that is used to identify a file. In the context of
	POSIX.1-202x, a pathname may be limited to {PATH_MAX} bytes,
	including the terminating null byte. It has optional beginning
	<slash> characters, followed by zero or more filenames separated
	by <slash> characters. A pathname can optionally contain one or
	more trailing <slash> characters. Multiple successive <slash>
	characters are considered to be the same as one <slash>, except
	it is implementation-defined whether the case of exactly two
	leading <slash> characters is treated specially.

(<slash> is POSIX-speak for '/'.)

And:

	3.141 Filename

	A sequence of bytes consisting of 1 to {NAME_MAX} bytes used to
	name a file. The bytes composing the name shall not contain the
	<NUL> or <slash> characters. In the context of a pathname, each
	filename shall be followed by a <slash> or a <NUL> character;
	elsewhere, a filename followed by a <NUL> character forms a
	string (but not necessarily a character string). The filenames
	dot and dot-dot have special meaning. A filename is sometimes
	referred to as a ``pathname component''. See also Section 3.243
	(on page 63).

  | What about POSIX layers atop filesystems that
  | _don't_ represent pathnames as relatively unstructured octet strings?

Unspecified. As soon as you step outside a POSIX-defined filesystem
you're in uncharted territory, and POSIX does not apply.

That includes filesystems that are non-conforming in relatively minor
ways, like NFS (which has no concept of open files, and hence cannot
retain a file in the filesystem after it has been unlinked while it
remains open, and requires tricks to simulate that), as well as
filesystems like FAT and NTFS, which are "kind of" similar in general
operation, but don't support much of what is required (especially FAT).
Anything wilder than that would be right off the chart.

  | As for the problem at immediate hand, it strikes me as somewhat
  | difficult to define if you can encode any octet.
  | For example, what happens if you find that you have both, say,
  | ls.0 and %6Cs.0 in a cat1/ directory somewhere?

Obviously, whenever one picks a character to have special meaning, there
needs to be a way to encode that character itself, even though it looks
like it could just be stored literally. So if there were an encoding
scheme like that, a filename like "%6Cs.0" would be encoded as %256Cs.0
(or something). There's nothing odd about this, we do it all the time
(in a C string, '\' needs to be written "\\", as \ is used as part of
the encoding of \n, \t, ...).

I have a (private use) encoding scheme for filenames like this, though I
use it to represent book and movie (etc) titles as filenames, mostly for
conversion to HTML to create web indexes. I use ',' as the (main) magic
char. There are a few others: _ represents a space, for example, ,u an
underscore, and ,z (for reasons too bizarre to go into) an actual ',',
except that ,_ represents a comma followed by a space, which is the
normal way that a comma is found in one of these titles. This thing grew
over time, and is kind of (no, actually more than that, very) ugly.

It would not be suitable for the proposed purpose: while it can encode
any Unicode char, it does so in a way derived from HTML, and uses HTML
char names where they exist if there isn't a shorter encoding (like
,=agrave= and stuff like that).

kre
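
The escaping argument above can be sketched in a few lines of Python.
This is only an illustration under assumptions, not a scheme anyone in
this thread proposed: the %XX hex form, the SAFE set of bytes stored
literally, and the names encode_name and decode_name are all made up
for the example. The point it shows is that once '%' itself is always
escaped, the names "ls.0" and "%6Cs.0" get different encoded forms.

```python
# Hypothetical percent-style escaping for filenames (illustrative only).
# Bytes in SAFE are stored literally; every other byte, including '%'
# itself, is written as '%' followed by two uppercase hex digits.
SAFE = frozenset(
    b"abcdefghijklmnopqrstuvwxyz"
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"0123456789._-"
)
HEXDIGITS = "0123456789abcdefABCDEF"


def encode_name(raw: bytes) -> str:
    """Encode an arbitrary byte string as an escaped name."""
    out = []
    for b in raw:
        if b in SAFE:
            out.append(chr(b))
        else:
            out.append("%%%02X" % b)
    return "".join(out)


def decode_name(enc: str) -> bytes:
    """Reverse encode_name; reject malformed escapes."""
    out = bytearray()
    i = 0
    while i < len(enc):
        if enc[i] == "%":
            pair = enc[i + 1:i + 3]
            if len(pair) != 2 or any(c not in HEXDIGITS for c in pair):
                raise ValueError("malformed escape at offset %d" % i)
            out.append(int(pair, 16))
            i += 3
        else:
            out.append(ord(enc[i]))
            i += 1
    return bytes(out)


print(encode_name(b"ls.0"))      # ls.0
print(encode_name(b"%6Cs.0"))    # %256Cs.0
print(decode_name("%256Cs.0"))   # b'%6Cs.0'
```

Note that decode_name("%6Cs.0") still yields b"ls.0", so the two names
only collide if something writes un-escaped names into the directory
behind the encoder's back.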