Filenames and other POSIX byte strings as SCM strings without loss

Mark H Weaver Mon, 23 May 2011 12:43:26 -0700

Hello all,

Andy and I have been discussing on IRC how to deal with pathnames.

The tentative plan is to use normal strings to represent pathnames,
command-line arguments, environment variable values, and other such
POSIX byte strings.

We'd need to implement alternative conversions between POSIX byte
strings and SCM strings that provide a bijective (one-to-one) mapping
between the set of all byte vectors and a subset of SCM strings.
For purposes of this email, suppose they are called
scm_to_permissive_stringn and scm_from_permissive_stringn.  On top of
these we would implement scm_to_permissive_locale_stringn,
scm_from_permissive_locale_stringn, and some other convenience
functions.

These alternative mappings would be used to convert between POSIX byte
strings and SCM strings.  We'd reserve 256 private-use code points
(somewhere in the ranges U+F0000..U+FFFFD or U+100000..U+10FFFD) which
would represent bytes of ill-formed byte sequences.  For purposes of
this email, suppose we choose the range U+109700..U+1097FF.

scm_from_permissive_locale_stringn would be used to convert filenames
and the like to SCM strings.  Ill-formed byte sequences in the filename
would be mapped to a sequence of Unicode characters in that range, one
character per byte.  For example, when using a UTF-8 locale, the
filename 0x46 0x6F 0x6F 0xC0 0x80 0x41 would become an SCM string
containing the characters: F, o, o, U+1097C0, U+109780, A.
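
To make the byte-to-code-point step concrete, here is a minimal sketch
in Python rather than in Guile's C API.  It assumes UTF-8 as the
encoding and the illustrative range U+109700..U+1097FF; the names
escape_ill_formed_utf8 and "permissive-sketch" are hypothetical.  It
covers only ill-formed input and does not treat well-formed encodings
of the reserved range specially, a case discussed further on in this
email.

```python
import codecs

# Base of the 256 reserved private-use code points; the email picks
# U+109700..U+1097FF purely for illustration.
RESERVED_BASE = 0x109700


def _escape_ill_formed(exc):
    """Error handler: map each undecodable byte b to U+109700 + b."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume.
    return "".join(chr(RESERVED_BASE + b) for b in bad), exc.end


# "permissive-sketch" is a hypothetical handler name, not a standard one.
codecs.register_error("permissive-sketch", _escape_ill_formed)


def escape_ill_formed_utf8(data: bytes) -> str:
    """Decode UTF-8, turning ill-formed bytes into reserved code points."""
    return data.decode("utf-8", errors="permissive-sketch")


# The filename from the example above.
name = escape_ill_formed_utf8(b"Foo\xc0\x80A")
print([f"U+{ord(c):04X}" for c in name])
```

Well-formed input passes through unchanged, so ordinary filenames look
like ordinary strings.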

A few details: it is important for security reasons that the mapping be
bijective (one-to-one) between all byte vectors and a subset of SCM
strings.  The subset would include all SCM strings that do not include
characters within the reserved range U+109700..U+1097FF.

Since scm_from_permissive_stringn maps invalid bytes to private-use code
points in the range U+109700..U+1097FF, we must ensure that properly
encoded code points in that range are mapped to something else.
Otherwise, two distinct POSIX byte strings might map to the same SCM
string.  The simplest solution is to consider any byte sequence which
would map to our reserved range to be invalid, and thus mapped one byte
at a time using this scheme.  For example, U+1097FF is represented in
UTF-8 as 0xF4 0x89 0x9F 0xBF.  Although scm_from_stringn would map this
sequence of bytes to the single code point U+1097FF (when using UTF-8),
scm_from_permissive_stringn would consider the entire byte sequence to
be invalid, and map it to the 4 code points U+1097F4, U+109789,
U+10979F, U+1097BF.
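
A sketch of a decoder that follows this rule, again in Python and
again assuming UTF-8 and the illustrative reserved range.  The function
name from_permissive_utf8 is hypothetical.  As an implementation
shortcut (my assumption, not part of the proposal), it lets Python's
built-in "surrogateescape" handler mark the ill-formed bytes first and
then rewrites the result.

```python
RESERVED_BASE = 0x109700  # illustrative range from the email
RESERVED_END = 0x1097FF


def from_permissive_utf8(data: bytes) -> str:
    """Map any byte string to a string, injectively."""
    out = []
    # surrogateescape turns each ill-formed byte b (always >= 0x80)
    # into the lone surrogate U+DC00 + b; strict UTF-8 never decodes
    # to a surrogate, so these markers are unambiguous.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # An ill-formed byte: move it into the reserved range.
            out.append(chr(RESERVED_BASE + (cp - 0xDC00)))
        elif RESERVED_BASE <= cp <= RESERVED_END:
            # A well-formed encoding of a reserved code point is
            # treated as invalid and mapped one byte at a time.
            out.extend(chr(RESERVED_BASE + b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


# U+1097FF encoded as UTF-8, as in the example above.
s = from_permissive_utf8(b"\xf4\x89\x9f\xbf")
print([f"U+{ord(c):04X}" for c in s])
```

With this rule, a reserved code point in the output always stands for
exactly one input byte, which is what keeps distinct byte strings from
colliding.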

We must also make sure that scm_to_permissive_stringn never maps two
distinct SCM strings to the same POSIX byte string.  In particular, we
must make sure that the U+1097xx code points are only used to generate
_invalid_ byte sequences, and never valid ones.  The simplest way to do
this is to apply scm_from_permissive_stringn to the result and make sure
that it yields the original SCM string.  If not, an exception would be
thrown.
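
A sketch of the encoding direction with this round-trip check, in
Python, assuming UTF-8 and the illustrative reserved range.  The names
to_permissive_utf8 and from_permissive_utf8 are hypothetical, and
ValueError stands in for whatever exception Guile would throw.  The
block includes a decoder so that it runs on its own.

```python
RESERVED_BASE = 0x109700  # illustrative range from the email
RESERVED_END = 0x1097FF


def from_permissive_utf8(data: bytes) -> str:
    """Decode; ill-formed bytes and reserved code points are escaped."""
    out = []
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(chr(RESERVED_BASE + (cp - 0xDC00)))
        elif RESERVED_BASE <= cp <= RESERVED_END:
            out.extend(chr(RESERVED_BASE + b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def to_permissive_utf8(text: str) -> bytes:
    """Encode; reject strings that no byte string decodes to."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if RESERVED_BASE <= cp <= RESERVED_END:
            # A reserved code point stands for one raw byte.
            out.append(cp - RESERVED_BASE)
        else:
            out.extend(ch.encode("utf-8"))
    data = bytes(out)
    # Round-trip check: the raw bytes must decode back to the same
    # string, otherwise they formed a valid sequence by accident.
    if from_permissive_utf8(data) != text:
        raise ValueError("string has no permissive byte representation")
    return data


print(to_permissive_utf8("Foo\U001097C0\U00109780A"))
```

For instance, the single character U+109741 would produce the byte
0x41, which decodes to "A" rather than back to U+109741, so the sketch
rejects it.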

So the tentative plan is to provide this alternative mapping, and use it
whenever accessing POSIX byte strings, whether they be filenames,
command-line arguments, environment variable values, fields within a
passwd, group, wtmp, or utmp file, system information (e.g. the hostname
or information from uname), etc.

We should allow the user to access this mapping directly, via

  scm_{to,from}_permissive_stringn,
  scm_{to,from}_permissive_locale_stringn,
  scm_{to,from}_permissive_utf8_stringn,

and also between strings and bytevectors in both Scheme and C:

  permissive-string->utf8,
  permissive-utf8->string,
  scm_permissive_string_to_utf8,
  scm_permissive_utf8_to_string,

and we should probably add procedures to convert between strings and
bytevectors using other encodings as well, most importantly the locale
encoding.

We'd also need permissive-string->pointer and
permissive-pointer->string.

I'm not sure about the names.  Suggestions welcome.

Regarding Noah's proposal to allow handling pathnames as sequences of
path components: both Andy and I like this idea.  However, as always,
the devil's in the details.  I'll write more about this in another
email.

    Best,
     Mark
