Re: [PATCH] Wide characters

Ludovic Courtès Mon, 23 Feb 2009 14:46:33 -0800

Hi,

Mike Gran <spk...@yahoo.com> writes:


> I've been playing with this wide char stuff, and I have a patch that
> would move the encoding of characters to UCS-4.

Thanks for the good news!

> This is completely useless on its own, because, in this
> patch, the internal encoding of strings is still 8-bit chars, and,
> thus, there is no way to use the wide characters in strings.

Yes.  I think the best thing will be to let you experiment in a
dedicated branch, so we can progressively see things take shape.

> It is all pretty simple.  Since the internal representation of chars
> becomes UCS-4, I used scm_t_uint32 as the char type, and I removed the
> code that supported EBCDIC-encoded characters.  I changed the tables
> of character names to deal with more names and discontiguous control
> characters.  And, as a temporary kludge, I made a macro
> SCM_MAKE_8BIT_CHAR to cast the 8-bit characters used in strings to a
> scm_t_uint32.  Also, I used functions from the Gnulib unicase and
> unictype modules for character properties, including a couple that
> Bruno Haible of Gnulib was kind enough to create for me.

That sounds good.

I only have minor comments at this point, see below.

IMO it'd be good to augment `chars.test' or `reader.test' to test some
of the new characters.

> The gnulib invocation for this was

This should appear as a diff of `m4/gnulib-cache.m4' and as the addition
of relevant Gnulib files since we now store them in the repository.  The
best is probably to make this a separate commit.

> +#include "lib/unicase.h"

#include <unicase.h> should be enough.

>  SCM_DEFINE1 (scm_char_leq_p, "char<=?", scm_tc7_rpsubr,
>               (SCM x, SCM y),
>            "Return @code{#t} iff @var{x} is less than or equal to @var{y} in 
> the\n"
> -          "ASCII sequence, else @code{#f}.")
> +          "Imocpde sequence, else @code{#f}.")

Typo?  :-)

>  SCM_DEFINE1 (scm_char_ci_eq_p, "char-ci=?", scm_tc7_rpsubr,
>               (SCM x, SCM y),
>            "Return @code{#t} iff @var{x} is the same character as @var{y} 
> ignoring\n"
> -          "case, else @code{#f}.")
> +          "case, else @code{#f}.  Case is computed in the Unicode locale.")

The phrase "Unicode locale" looks confusing to me.  This function is
locale-independent, right?

> -  return scm_from_bool 
> (scm_c_upcase(SCM_CHAR(x))==scm_c_upcase(SCM_CHAR(y)));
> +  return scm_from_bool (uc_toupper(SCM_CHAR(x))==uc_toupper(SCM_CHAR(y)));

Please leave a space before opening parentheses.

> -int
> -scm_c_upcase (unsigned int c)
> +scm_t_uint32
> +scm_c_upcase (scm_t_uint32 c)

This is an API change, but probably acceptable (and unavoidable).

> +char *const scm_charnames[] = 

Could even be "const char *const scm_charnames[]".

> +  {
> +    /* C0 controls */
> +    "nul", "soh", "stx", "etx", "eot", "enq", "ack", "bel",
> +    "bs",  "ht",  "newline",  "vt",  "np",  "cr",  "so",  "si",
> +    "dle", "dc1", "dc2", "dc3", "dc4", "nak", "syn", "etb", 
> +    "can", "em",  "sub", "esc", "fs",  "gs",  "rs",  "us",
> +    "del",
> +    /* C1 controls */
> +    "bph", "nbh", "ind", "nel", "ssa", "esa", 
> +    "hts", "htj", "vts", "pld", "plu", "ri" , "ss2", "ss3",
> +    "dcs", "pu1", "pu2", "sts", "cch", "mw" , "spa", "epa",
> +    "sos", "sci", "csi", "st",  "osc", "pm",  "apc"
> +  };

Are the new names standard?

>  int scm_n_charnames = sizeof (scm_charnames) / sizeof (char *);

Could be `const'.

> +SCM_API const scm_t_uint32 scm_charnums[];
> +SCM_API char *const scm_alt_charnames[];
> +SCM_API int scm_n_alt_charnames;
> +SCM_API const scm_t_uint32 scm_alt_charnums[];

This should all be marked `SCM_INTERNAL'.

Besides, instead of exposing these arrays, could we instead have two
functions in `chars.c', say:

  scm_t_uint32 scm_i_lookup_character (const char *name);
  const char *scm_i_character_name (scm_t_uint32 chr);

> +      return (unsigned long)(scm_c_downcase(SCM_CHAR(obj))) % n;

The cast shouldn't be needed.

> -      /* Dirk:FIXME::  This type of character syntax is not R5RS
> -       * compliant.  Further, it should be verified that the constant
> -       * does only consist of octal digits.  Finally, it should be
> -       * checked whether the resulting fixnum is in the range of
> -       * characters.  */
> +      /* FIXME:: This type of character syntax is not R5RS
> +       * compliant.  */

I think this comment remains valid, doesn't it?

> @@ -59,7 +59,7 @@ sf_flush (SCM port)
>      {
>        /* write the byte. */
>        scm_call_1 (SCM_SIMPLE_VECTOR_REF (stream, 0),
> -               SCM_MAKE_CHAR (*pt->write_buf));
> +               SCM_MAKE_8BIT_CHAR (*pt->write_buf));

It's actually not a byte.

Thanks!
Ludo'.

Re: [PATCH] Wide characters

Reply via email to