On Sun, 2009-09-06 at 12:45 +0200, Andy Wingo wrote:
> Hey Mike,
>
> Would you mind posting to the list a "state of unicode & guile" summary?
> I'm very excited about finally being able to say "Guile does unicode",
> and was wondering what was left to do :)
>
> Andy

OK.  First, here's the stuff I've already put in NEWS

** Characters

Characters can take the whole Unicode range.  char-upcase and
char-downcase use default Unicode casing rules.  Character comparisons
such as char<? and char-ci<? now sort based on Unicode code points.
Combining characters are printed with dotted circles: #\◌́

** Strings

String and SRFI-13 functions can operate on Unicode strings.  Strings
can contain the new string escapes \uHHHH and \UHHHHHH for 4 and 6 hex
digit characters.

** SRFI-14 char-sets are modified for Unicode

The default char-sets are no longer locale dependent and contain
characters from the whole Unicode range.  There is a new char-set,
char-set:designated, which contains all assigned Unicode characters.
There is a new debugging function: %char-set-dump.

** Ports do transcoding

Ports now have an associated character encoding, and port read/write
operations do conversion to/from locales automatically.  Ports also
have an associated strategy for how to deal with locale conversion
failures.  Four functions support this: set-port-encoding!,
port-encoding, set-port-conversion-strategy!, port-conversion-strategy.

** Non-ASCII source code files can be read, but require coding
declarations

The default reader now handles source code files for some of the
non-ASCII character encodings, such as UTF-8.  A non-ASCII source file
should have an encoding declaration near the top of the file.  Also,
there is a new function 'file-encoding' that scans a port for a coding
declaration.

The pre-1.9.3 reader handled 8-bit clean but otherwise unspecified
source code.  This use is now discouraged.

-------------------------------------------------------------

Here's some stuff that is complete but not working quite right.

** There are undocumented things: %string-dump, %symbol-dump,
setbinary, and a discussion about why ISO-8859-1 is the fastest
encoding to process and why it should be used by default.

** Non-ASCII symbols and keywords are supported, and variables and
procedures can have non-ASCII names.  These probably need wide-symbol
and wide-keyword support in the VM, instead of the locale-specific
implementation that they have now, to avoid some corner cases where
locales switch.

-------------------------------------------------------------

Here's the stuff left to be done, in no particular order

* The disassembler doesn't handle wide strings gracefully

* Some parts of Goops expect 8-bit strings.  This is probably fine for
  now, but it needs to be documented.  I've avoided touching this
  because I've never used goops for anything, so I'm not sure what
  does what.

* The i18n library hasn't been touched.  It should probably move to
  use functions like u32_casecmp from libunistring for Unicode-capable
  locale-specific sorting.  But the #ifdef and locale madness in i18n
  is deep.  I've avoided hacking it.  Also we'll have to write our own
  functions for locale-string->double and locale-string->int.  Bruno
  has some suggestions on how to do that at
  http://savannah.gnu.org/support/?106998

* I haven't done any testing on readline or gettext

* Unicode-capable regex has not been implemented.  Libunistring might
  do this someday.  Until then, there will probably have to be the
  hack where strings are converted to UTF-8 encoding to pass through
  regex.  This doesn't get you Unicode regex, but it keeps non-ASCII
  from being mangled by regex.

* EMACS has a lot of aliases that can be used in the
  "-*- coding: XXXXX -*-" line, like latin-1, that aren't valid
  encoding names.  The reader should be modified to understand the
  common ones.

* The whole issue of R6RS compliance will have to be dealt with some
  day.  For example, I went with \xHH, \uHHHH, and \UHHHHHH escapes
  because they were backwards compatible with the \xHH we already
  had.  R6RS uses a variable length hex escape terminated by a
  semicolon: \xHH; \xHHH;.  These are not backward compatible.
  There are some R6RS functions that are missing: string-foldcase and
  the string normalization routines.  Also, R6RS and R5RS seem to
  disagree on the definition of string-upcase et al.  R6RS is clear
  that the result of string-upcase can have more letters than its
  input, and it gets rid of string-upcase! for the same reason.

That's all I remember off the top of my head.

Thanks,

Mike
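
P.S. To make the escape incompatibility concrete, here is a rough sketch of the two rules.  It is in Python purely for illustration and is not Guile's reader: the function names decode_fixed and decode_r6rs are made up, and it ignores details such as escaped backslashes.  It only shows that the same literal text decodes differently under the fixed-width \xHH rule and the semicolon-terminated R6RS rule.

```python
import re

# Fixed-width escapes as described above: \xHH, \uHHHH, \UHHHHHH.
# (Hypothetical helper; a \UHHHHHH value above 10FFFF would raise ValueError.)
_FIXED = re.compile(
    r"\\x([0-9a-fA-F]{2})|\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{6})"
)

# R6RS-style escape: \x, one or more hex digits, then a terminating semicolon.
_R6RS = re.compile(r"\\x([0-9a-fA-F]+);")


def decode_fixed(text):
    # Exactly one alternative matched, so exactly one group is non-empty.
    def repl(m):
        digits = m.group(1) or m.group(2) or m.group(3)
        return chr(int(digits, 16))
    return _FIXED.sub(repl, text)


def decode_r6rs(text):
    # The semicolon is part of the escape and is consumed with it.
    return _R6RS.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    for literal in (r"\x41;", r"\x3bb;", r"\u03bb"):
        print(repr(literal), repr(decode_fixed(literal)), repr(decode_r6rs(literal)))
```

With these two toy decoders, the text \x3bb; comes out as ";b;" under the fixed-width rule (\x3b is a semicolon, followed by a literal "b;") but as a single lambda under the R6RS rule, which is why a reader can't simply accept both forms for \x.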