> While most ASCII punctuation characters are legal in Unix filenames,

I actually would warn against some thinking that could be (not "is") present here.

UNIX filenames are not character strings. They are octet strings, which may be - often are - interpreted as encoding character strings. Two octets, 0x2f and 0x00, have special significance. But it is the octets, not any characters they may or may not represent, that have the significance. (For example, posit a character encoding with shift states in which a 0x2f octet may, because of shift state, represent something other than /. Trying to put that character in a filename is going to cause trouble even though the _character_ is not an "ASCII punctuation character". UTF-8 would cause similar issues if it didn't promise that octets in the 0x00-0x7f range never occur as part of a multi-octet sequence, which is what makes 0x00 and 0x2f safe.)

The difference tends to get blurred, especially in view of code like if (path[x] == '/') rather than if (path[x] == 0x2f), but it is still an important distinction to at least keep in the back of your mind. (Related issues are why SSH, as standardized, is, strictly speaking, unimplementable on many UNIX variants.)

I don't know whether anyone has done anything UNIXy based on any character encoding where / is not 0x2f (EBCDIC maybe?) or 0x00 is not the canonical string terminator (I think I've heard of using 0xff for that). If there is such a thing, it would be interesting to examine its choices. Does it use 0x2f, /, or something else as its pathname separator? What does it use as the terminator? How does it handle the i14y (interoperability) issues resulting from its choice (either choice has such issues, just different ones)? What does POSIX say?

What about POSIX layers atop filesystems that _don't_ represent pathnames as relatively unstructured octet strings? ISTR that at least one Windows FS represents pathname components as strings of two-octet BMP Unicode codepoints - how is the impedance mismatch handled?

As for the problem at immediate hand, it strikes me as somewhat difficult to define if you can encode any octet.
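
A rough sketch of both points, in Python only because it keeps bytes and str apart. The helper names special_octets and decoded_collisions are made up for illustration, ISO-2022-JP is just one convenient example of an encoding with shift states, and reading "encode any octet" as %XX escapes decoded by urllib is an assumption on my part:

```python
from urllib.parse import unquote_to_bytes

SLASH = 0x2F  # the octet treated as the pathname separator
NUL = 0x00    # the octet that terminates a pathname


def special_octets(name: bytes) -> list[int]:
    """Return the offsets of octets that cannot appear inside one pathname component."""
    return [i for i, octet in enumerate(name) if octet in (SLASH, NUL)]


def decoded_collisions(names: list[bytes]) -> dict[bytes, list[bytes]]:
    """Group %XX-escaped names by the octet string they decode to.

    Only groups with more than one member are returned, i.e. the cases
    where two distinct on-disk names mean the same thing after decoding.
    """
    groups: dict[bytes, list[bytes]] = {}
    for name in names:
        groups.setdefault(unquote_to_bytes(name), []).append(name)
    return {decoded: originals for decoded, originals in groups.items() if len(originals) > 1}


# U+304F (HIRAGANA LETTER KU) is not ASCII punctuation, but in ISO-2022-JP
# its second octet, after the shift sequence, is 0x2f.
ku = "\u304f"
shifted = ku.encode("iso2022_jp")
print(shifted, special_octets(shifted))

# In UTF-8 the same character uses only octets >= 0x80, so nothing is flagged.
print(ku.encode("utf-8"), special_octets(ku.encode("utf-8")))

# Distinct octet strings on disk, identical once the escapes are decoded.
print(decoded_collisions([b"ls.0", b"%6Cs.0", b"foo::bar.0", b"foo%3A%3Abar.0"]))
```

Nothing here says which of two colliding names ought to win; it only shows that the check has to be made on octets, and that the decoded octet strings coincide.
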
For example, what happens if you find that you have both, say, ls.0 and %6Cs.0 in a cat1/ directory somewhere? Or both foo::bar.0 and foo%3A%3Abar.0? (And, strictly speaking, even those two examples blur the distinction between octets and characters in pathnames. It's things like that that make it hard to maintain the mental distinction. Those encodings assume, of course, the use of ASCII, or at least an ASCII superset.)

I've found myself caring about this, too, because I find myself using both 8859-1 and 8859-14. I'm not sure what the right resolution is. (To forestall one likely suggestion: I am, however, sure that - at least for my purposes - it is not UTF-8. Variable-sized characters are a disaster I do not want to go anywhere near.)

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B