Noah Lavine <noah.b.lav...@gmail.com> writes:

> Mark is right that paths are basically just strings, even though
> occasionally they're not. I sort of like the idea of the PEP-383
> encoding (making paths strings that can potentially contain unused
> codepoints, which represent non-character bytes), but would that make
> path strings break under some Guile string operations?

Yes, this is indeed a problem. Instead of using isolated surrogate code
points as recommended by PEP-383, I think we should use one of the
alternative mappings proposed in section 3.7.4 of Unicode Technical
Report #36 <http://www.unicode.org/reports/tr36/>:

  1. Use 256 private-use code points, somewhere in the ranges
     F0000..FFFFD or 100000..10FFFD. This would probably cause the
     fewest security and interoperability problems. There is, however,
     some possibility of collision with other uses of private-use
     characters.

  2. Use pairs of noncharacter code points in the range FDD0..FDEF.
     These are "super" private-use characters, and are discouraged for
     general interchange. The transformation would take each nibble of
     a byte Y, and add to FDD0 and FDE0, respectively. However,
     noncharacter code points may be replaced by U+FFFD ( � )
     REPLACEMENT CHARACTER by some implementations, especially when
     they use them internally. (Again, incoming characters must never
     be deleted, because that can cause security problems.)

> Also, when we convert strings to paths, we need to know what encoding
> the local filesystem uses. That will usually be UTF-8, but potentially
> might not be, correct?

Yes, that is correct. I haven't looked deeply into this, but clearly a
lot of software uses the current locale encoding to interpret these
POSIX byte strings, and I suspect at least some software uses UTF-8 to
interpret filenames. Fortunately, most popular modern distributions of
GNU are now using UTF-8 locales by default, which basically makes the
problem disappear.

Regardless, this method of mapping ill-formed byte sequences to
private-use code points can be used with _any_ encoding, not just UTF-8.

    Best,
     Mark
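
For illustration only, here is a rough sketch of the two TR36 mappings in
Python (chosen because PEP-383 is a Python proposal; this is not Guile code
and not a proposed implementation). The base code point PUA_BASE, the handler
name "puaescape", and the helpers decode_path, encode_path and nibble_pair
are all hypothetical names picked for this example; the report does not fix a
particular block of 256 private-use code points.

```python
import codecs

# Assumption: base of the 256 reserved private-use code points. This is a
# hypothetical choice inside 100000..10FFFD; TR36 does not name a block.
PUA_BASE = 0x10FE00


def _pua_escape(exc):
    """Decode error handler: map each ill-formed byte to PUA_BASE + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume.
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("puaescape", _pua_escape)


def decode_path(raw, encoding="utf-8"):
    """Option 1: bytes -> string, never failing and never dropping a byte."""
    return raw.decode(encoding, errors="puaescape")


def encode_path(text, encoding="utf-8"):
    """Inverse of decode_path: reserved code points become raw bytes again.

    Encoding one character at a time is fine for UTF-8 and other stateless
    encodings; it is not suitable for encodings that emit a BOM or keep
    shift state.
    """
    out = bytearray()
    for ch in text:
        offset = ord(ch) - PUA_BASE
        if 0 <= offset < 256:
            out.append(offset)           # escaped ill-formed byte
        else:
            out += ch.encode(encoding)   # ordinary character
    return bytes(out)


def nibble_pair(byte):
    """Option 2: one byte -> two noncharacters, FDD0 + high and FDE0 + low nibble."""
    return chr(0xFDD0 + (byte >> 4)) + chr(0xFDE0 + (byte & 0x0F))


raw = b"caf\xe9.txt"            # Latin-1 byte in a name read as UTF-8
text = decode_path(raw)
assert encode_path(text) == raw  # the original bytes are recovered
```

Note that this sketch exhibits the collision mentioned in option 1: a name
whose bytes are a well-formed encoding of one of the reserved code points
decodes to that code point and would then be written back as a single raw
byte. A real implementation would probably have to treat those sequences as
ill-formed on input as well.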