Hi Mike! Glad to see Unicode has landed! :-)
Here's a quick review of the big patch. "Michael Gran" <spk...@yahoo.com> writes: > Add Unicode strings and symbols > > This adds full Unicode strings as a datatype, and it adds some > minimal functionality. The terminal and port encoding is assumed > to be ISO-8859-1. Non-ISO-8859-1 characters are written or > input as string character escapes. > > The string character escapes now have 3 forms: \xXX \uXXXX and > \UXXXXXX, for unprintable characters that have 2, 4 or 6 hex digits. > > The process for writing to strings has been modified. There is now a > function scm_i_string_start_writing that does the copy-on-write > conversion if necessary. > > To compile strings that may be wide, the VM storage of strings and > string-likes has changed. > > Most string-using functions have not yet been updated and may break > when used with wide strings. Most of these comments really belong in the source code, close to the things they refer to. > libguile/ports.c | 90 +++- > libguile/ports.h | 3 + > libguile/print.c | 157 +++++-- > libguile/print.h | 1 + > libguile/read.c | 233 ++++++---- > libguile/rw.c | 2 + > libguile/socket.c | 3 + > libguile/srfi-13.c | 23 +- > libguile/strings.c | 649 > +++++++++++++++++++++---- > libguile/strings.h | 59 ++- > libguile/vm-engine.h | 1 + > libguile/vm-i-loader.c | 87 +++- > module/language/assembly.scm | 12 +- > module/language/assembly/compile-bytecode.scm | 26 +- > test-suite/tests/asm-to-bytecode.test | 6 +- > 15 files changed, 1046 insertions(+), 306 deletions(-) I would feel more confident if the number of lines of tests added was proportional to the number of new C lines of code. Do you think some more tests could be added? Or maybe this will come at a later stage? > + else if (c == '\b') > + { > + SCM_DECCOL (port); > + } Style! ;-) > +SCM_API void scm_charprint (scm_t_uint32 c, SCM port); This ought to be internal, no? > #define STRINGBUF_F_SHARED 0x100 > #define STRINGBUF_F_INLINE 0x200 > +#define STRINGBUF_F_WIDE 0x400 Although other flags miss this, can you add a comment to the right saying that this means UCS-4 encoding? > +static SCM > +make_wide_stringbuf (size_t len) > +{ > + scm_t_wchar *mem; > +#if SCM_DEBUG > + if (len < 1000) > + lenhist[len]++; > + else > + lenhist[1000]++; > +#endif I would use "#ifdef SCM_STRING_LENGTH_HISTOGRAM" for that. Conversely, I'd make `%string-dump' and `%symbol-dump' always available (with docstring and possibly manual entry). > + (scm_t_wchar) (unsigned char) STRINGBUF_INLINE_CHARS (buf)[i]; Is the double cast needed? > + if (scm_i_is_narrow_string (name)) "Narrow strings" are Latin-1, right? > +SCM_DEFINE (scm_sys_string_dump, "%string-dump", 1, 0, 0, (SCM str), "") > #define FUNC_NAME s_scm_sys_string_dump > { > SCM_VALIDATE_STRING (1, str); > fprintf (stderr, "%p:\n", str); > fprintf (stderr, " start: %u\n", STRING_START (str)); > fprintf (stderr, " len: %u\n", STRING_LENGTH (str)); > + if (scm_i_is_narrow_string (str)) > + fprintf (stderr, " format: narrow\n"); > + else > + fprintf (stderr, " format: wide\n"); How about returning an alist with all this information instead of using printf? It would allow higher-level debugging functions to be implemented and may also be useful in unit tests. > +SCM_DEFINE (scm_sys_symbol_dump, "%symbol-dump", 1, 0, 0, (SCM sym), "") Same here. > +SCM_DEFINE (scm_string_width, "string-width", 1, 0, 0, > + (SCM string), > + "Return the bytes used to represent a character in @var{string}." > + "This will return 1 or 4.") I was wondering whether we should expose as much of the internal representation, but I think it's a good debugging aid and it can't hurt. I find the name slightly misleading, but I can't think of a better one. > - "Return character @var{k} of @var{str} using zero-origin\n" > - "indexing. @var{k} must be a valid index of @var{str}.") > + "Return character @var{k} of @var{str} using zero-origin\n" > + "indexing. @var{k} must be a valid index of @var{str}.") Please avoid unneeded formatting changes, to keep the diff smaller. > +SCM_INTERNAL char *scm_to_stringn (SCM str, size_t *lenp, > + const char *encoding, > + enum iconv_ilseq_handler handler); I suppose this would eventually become public. What do you think? Should we use a different type for HANDLER before that happens? > +SCM_API const scm_t_wchar *scm_i_string_wide_chars (SCM str); Should be SCM_INTERNAL. Thank you! Ludo'.