Hello all,

Andy and I have been discussing on IRC how to deal with pathnames.
The tentative plan is to use normal strings to represent pathnames, command-line arguments, environment variable values, and other such POSIX byte strings.

We'd need to implement alternative conversions between POSIX byte strings and SCM strings which would implement a bijective (one-to-one) mapping between the set of all byte vectors and a subset of SCM strings. For purposes of this email, suppose they are called scm_to_permissive_stringn and scm_from_permissive_stringn. On top of these we would implement scm_to_permissive_locale_stringn, scm_from_permissive_locale_stringn, and some other convenience functions. These alternative mappings would be used to convert between POSIX byte strings and SCM strings.

We'd reserve 256 private-use code points (somewhere in the ranges U+F0000..U+FFFFD or U+100000..U+10FFFD) which would represent bytes of ill-formed byte sequences. For purposes of this email, suppose we choose the range U+109700..U+1097FF.

scm_from_permissive_locale_stringn would be used to convert filenames et al. to SCM strings. Ill-formed byte sequences in the filename would be mapped to a sequence of Unicode characters in that range. For example, when using a UTF-8 locale, the filename 0x46 0x6F 0x6F 0xC0 0x80 0x41 would become an SCM string containing the characters: F, o, o, U+1097C0, U+109780, A.

A few details:

It is important for security reasons that the mapping be bijective (one-to-one) between all byte vectors and a subset of SCM strings. The subset would include all SCM strings that do not include characters within the reserved range U+109700..U+1097FF.

Since scm_from_permissive_stringn maps invalid bytes to private-use code points in the range U+109700..U+1097FF, we must ensure that properly encoded code points in that range are mapped to something else. Otherwise, two distinct POSIX byte strings might map to the same SCM string.
The simplest solution is to consider any byte sequence which would map to our reserved range to be invalid, and thus mapped one byte at a time using this scheme. For example, U+1097FF is represented in UTF-8 as 0xF4 0x89 0x9F 0xBF. Although scm_from_stringn would map this sequence of bytes to the single code point U+1097FF (when using UTF-8), scm_from_permissive_stringn would consider this entire byte sequence to be invalid, and instead map it to the 4 code points U+1097F4, U+109789, U+10979F, U+1097BF.

We must also make sure that scm_to_permissive_stringn never maps two distinct SCM strings to the same POSIX byte string. In particular, we must make sure that the U+1097xx code points are only used to generate _invalid_ byte sequences, and never valid ones. The simplest way to do this is to apply scm_from_permissive_stringn to the result and make sure that it yields the original SCM string. If not, an exception would be thrown.

So the tentative plan is to provide this alternative mapping, and use it whenever accessing POSIX byte strings, whether they be filenames, command-line arguments, environment variable values, fields within a passwd, group, wtmp, or utmp file, system information (e.g. the hostname or information from uname), etc.

We should allow the user to access this mapping directly, via scm_{to,from}_permissive_stringn, scm_{to,from}_permissive_locale_stringn, and scm_{to,from}_permissive_utf8_stringn. We should also provide conversions between strings and bytevectors in both Scheme and C: permissive-string->utf8, permissive-utf8->string, scm_permissive_string_to_utf8, and scm_permissive_utf8_to_string. We should probably add procedures to convert between strings and bytevectors using other encodings as well, most importantly the locale encoding. We'd also need permissive-string->pointer and permissive-pointer->string.

I'm not sure about the names. Suggestions welcome.
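
To make the mapping concrete, here is a rough model of the UTF-8 case, written in Python purely for illustration. It is a sketch under the assumptions of this email, not the proposed C implementation: the reserved range is assumed to be U+109700..U+1097FF, and the names from_permissive_utf8 and to_permissive_utf8 are hypothetical stand-ins for the conversions described above.

```python
BASE = 0x109700                       # assumed start of the reserved range
RESERVED = range(BASE, BASE + 0x100)  # U+109700..U+1097FF (assumption)


def from_permissive_utf8(data: bytes) -> str:
    """Decode bytes; each byte of an ill-formed sequence becomes BASE + byte."""
    out = []
    i = 0
    while i < len(data):
        ch = None
        # Look for one well-formed UTF-8 sequence (1 to 4 bytes) starting at i.
        for length in (1, 2, 3, 4):
            try:
                ch = data[i:i + length].decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            break
        if ch is not None and ord(ch) not in RESERVED:
            out.append(ch)
            i += length
        else:
            # Ill-formed, or a well-formed encoding of a reserved code point:
            # treat it as invalid and escape a single byte.
            out.append(chr(BASE + data[i]))
            i += 1
    return "".join(out)


def to_permissive_utf8(s: str) -> bytes:
    """Encode a string; reserved code points turn back into raw bytes."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp in RESERVED:
            out.append(cp - BASE)
        else:
            out += ch.encode("utf-8")
    data = bytes(out)
    # Round-trip check: reserved code points may only yield invalid sequences.
    if from_permissive_utf8(data) != s:
        raise ValueError("reserved code points would form a valid byte sequence")
    return data


# The filename example from above: F, o, o, U+1097C0, U+109780, A.
name = bytes([0x46, 0x6F, 0x6F, 0xC0, 0x80, 0x41])
s = from_permissive_utf8(name)
assert [hex(ord(c)) for c in s[3:5]] == ["0x1097c0", "0x109780"]
assert to_permissive_utf8(s) == name
```

In this model, the string U+1097C3 U+1097A9 is rejected by to_permissive_utf8, because the bytes 0xC3 0xA9 form a valid UTF-8 sequence and would decode to a different string.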
Regarding Noah's proposal to allow handling pathnames as sequences of path components: both Andy and I like this idea. However, as always, the devil's in the details. I'll write more about this in another email.

Best,
Mark