> From: Ludovic Courtès l...@gnu.org
>
> I believe that we should aim for R6RS strings.

I think the most important thing is to have humility in the face of an impossible problem: how to encode all textual information. It is important to "stand on the shoulders of giants" here. It becomes a matter of deciding which actively developed library of wide character functions is to be used and how to integrate it.

There are 3 good, actively developed solutions of which I am aware.

1. Use GNU libc functionality. Encode wide strings as wchar_t.

2. Use GLib functionality. Encode wide strings as UTF-8. Possibly give up on O(1) indexing. Possibly add indexing information to the string to allow O(1) indexing, which might negate the space advantage of UTF-8.

3. Use IBM's ICU4C. Encode wide strings as UTF-16. Thus, add an obscure dependency.

Option 3 is likely a non-starter, because it seems that Guile has tried to avoid adding new non-GNU dependencies. It is technologically a great solution, IMHO.

Option 1 is probably the way to go, because it keeps Guile close to the metal and keeps dependencies out of it. Unfortunately, UTF-8 strings would require conversion.

> 1. IMO it'd be nice to have ASCII strings special-cased so that they
> are always encoded in ASCII. This would allow for memory savings
> since, e.g., most symbols are expected to contain only ASCII
> characters. It might also simplify interaction with C in certain
> cases; for instance, it would make it easy to have statically
> initialized ASCII Scheme strings.

Why not? It does solve the initialization problem of dealing with strings before setlocale has been called.

Let's say that a string is a union of either an ASCII char vector or a wchar_t vector. A "character" then is just a Unicode codepoint. String-ref returns a wchar_t. This is all in line with R6RS as I understand it.

There could then be a separate iterator and function set that does (likely O(n)) operations on the grapheme clusters of strings. A grapheme cluster is a single written symbol which may be made up of several codepoints.

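
As a rough sketch of the ASCII-or-wchar_t union described above, the representation and an O(1) string-ref might look something like this in C. All of the names here (str_kind, struct str, str_ref, hello) are hypothetical and are not taken from Guile's source; this is only meant to illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical names throughout; not Guile's actual internals. */
enum str_kind { STR_ASCII, STR_WIDE };

struct str {
    enum str_kind kind;
    size_t len;                 /* length in codepoints */
    union {
        const char *ascii;      /* one byte per character, all below 128 */
        const wchar_t *wide;    /* one complete codepoint per wchar_t */
    } u;
};

/* O(1) indexing in either representation; always yields a codepoint. */
static wchar_t str_ref(const struct str *s, size_t i)
{
    assert(i < s->len);
    return s->kind == STR_ASCII
        ? (wchar_t)(unsigned char)s->u.ascii[i]
        : s->u.wide[i];
}

/* An ASCII string can be statically initialized, with no locale involved. */
static const struct str hello = { STR_ASCII, 5, { .ascii = "hello" } };
```

With this layout, str_ref(&hello, 0) yields the codepoint for 'h' without any conversion step, and the same accessor works unchanged on a wchar_t-backed string.
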
Unicode Standard Annex #29 describes how to partition a string into a set of graphemes.[1]

There is the problem of systems where wchar_t is 2 bytes instead of 4 bytes, like Cygwin. For those systems, I'd recommend restricting functionality to 16-bit characters instead of trying to add an extra UTF-16 encoding/decoding step. I think there should always be a complete codepoint in each wchar_t.

-- Mike Gran

[1] http://www.unicode.org/reports/tr29/