Hi, Mike Gran <spk...@yahoo.com> writes:
> I've been playing with this wide char stuff, and I have a patch that > would move the encoding of characters to UCS-4. Thanks for the good news! > This is completely useless on its own, because, in this > patch, the internal encoding of strings is still 8-bit chars, and, > thus, there is no way to use the wide characters in strings. Yes. I think the best thing will be to let you experiment in a dedicated branch, so we can progressively see things take shape. > It is all pretty simple. Since the internal representation of chars > becomes UCS-4, I used scm_t_uint32 as the char type, and I removed the > code that supported EBCDIC-encoded characters. I changed the tables > of character names to deal with more names and discontiguous control > characters. And, as a temporary kludge, I made a macro > SCM_MAKE_8BIT_CHAR to cast the 8-bit characters used in strings to a > scm_t_uint32. Also, I used functions from the Gnulib unicase and > unictype modules for character properties, including a couple that > Bruno Haible of Gnulib was kind enough to create for me. That sounds good. I only have minor comments at this point, see below. IMO it'd be good to augment `chars.test' or `reader.test' to test some of the new characters. > The gnulib invocation for this was This should appear as a diff of `m4/gnulib-cache.m4' and as the addition of relevant Gnulib files since we now store them in the repository. The best is probably to make this a separate commit. > +#include "lib/unicase.h" #include <unicase.h> should be enough. > SCM_DEFINE1 (scm_char_leq_p, "char<=?", scm_tc7_rpsubr, > (SCM x, SCM y), > "Return @code{#t} iff @var{x} is less than or equal to @var{y} in > the\n" > - "ASCII sequence, else @code{#f}.") > + "Imocpde sequence, else @code{#f}.") Typo? :-) > SCM_DEFINE1 (scm_char_ci_eq_p, "char-ci=?", scm_tc7_rpsubr, > (SCM x, SCM y), > "Return @code{#t} iff @var{x} is the same character as @var{y} > ignoring\n" > - "case, else @code{#f}.") > + "case, else @code{#f}. Case is computed in the Unicode locale.") The phrase "Unicode locale" looks confusing to me. This function is locale-independent, right? > - return scm_from_bool > (scm_c_upcase(SCM_CHAR(x))==scm_c_upcase(SCM_CHAR(y))); > + return scm_from_bool (uc_toupper(SCM_CHAR(x))==uc_toupper(SCM_CHAR(y))); Please leave a space before opening parentheses. > -int > -scm_c_upcase (unsigned int c) > +scm_t_uint32 > +scm_c_upcase (scm_t_uint32 c) This is an API change, but probably acceptable (and unavoidable). > +char *const scm_charnames[] = Could even be "const char *const scm_charnames[]". > + { > + /* C0 controls */ > + "nul", "soh", "stx", "etx", "eot", "enq", "ack", "bel", > + "bs", "ht", "newline", "vt", "np", "cr", "so", "si", > + "dle", "dc1", "dc2", "dc3", "dc4", "nak", "syn", "etb", > + "can", "em", "sub", "esc", "fs", "gs", "rs", "us", > + "del", > + /* C1 controls */ > + "bph", "nbh", "ind", "nel", "ssa", "esa", > + "hts", "htj", "vts", "pld", "plu", "ri" , "ss2", "ss3", > + "dcs", "pu1", "pu2", "sts", "cch", "mw" , "spa", "epa", > + "sos", "sci", "csi", "st", "osc", "pm", "apc" > + }; Are the new names standard? > int scm_n_charnames = sizeof (scm_charnames) / sizeof (char *); Could be `const'. > +SCM_API const scm_t_uint32 scm_charnums[]; > +SCM_API char *const scm_alt_charnames[]; > +SCM_API int scm_n_alt_charnames; > +SCM_API const scm_t_uint32 scm_alt_charnums[]; This should all be marked `SCM_INTERNAL'. Besides, instead of exposing these arrays, could we instead have two functions in `chars.c', say: scm_t_uint32 scm_i_lookup_character (const char *name); const char *scm_i_character_name (scm_t_uint32 chr); > + return (unsigned long)(scm_c_downcase(SCM_CHAR(obj))) % n; The cast shouldn't be needed. > - /* Dirk:FIXME:: This type of character syntax is not R5RS > - * compliant. Further, it should be verified that the constant > - * does only consist of octal digits. Finally, it should be > - * checked whether the resulting fixnum is in the range of > - * characters. */ > + /* FIXME:: This type of character syntax is not R5RS > + * compliant. */ I think this comment remains valid, doesn't it? > @@ -59,7 +59,7 @@ sf_flush (SCM port) > { > /* write the byte. */ > scm_call_1 (SCM_SIMPLE_VECTOR_REF (stream, 0), > - SCM_MAKE_CHAR (*pt->write_buf)); > + SCM_MAKE_8BIT_CHAR (*pt->write_buf)); It's actually not a byte. Thanks! Ludo'.