Re: Unicode strings and symbols

Mike Gran Sun, 09 Aug 2009 15:42:11 -0700

On Sun, 2009-08-09 at 20:22 +0200, Ludovic Courtès wrote:
> Hi Mike!


> I would feel more confident if the number of lines of tests added was
> proportional to the number of new C lines of code.  Do you think some
> more tests could be added?  Or maybe this will come at a later stage?

I should probably add some tests now.  They would have to use hex
escapes, since the character encoding/decoding functionality is not
ready.  That's not a problem, but, it doesn't look as cool as using
actual non-latin glyphs in the tests.

> 
> > +  else if (c == '\b')
> > +    {
> > +      SCM_DECCOL (port);
> > +    }
> 
> Style!  ;-)

OK. The SCM_DECCOL et al macros are missing enclosing do/while
statements, making them conflict with the if statement, so do/while
statements will have to be added.

> 
> > +SCM_API void scm_charprint (scm_t_uint32 c, SCM port);
> 
> This ought to be internal, no?

Could be.  A couple of the types are given their own print functions:
scm_intprint and an scm_uintprint.  Most types don't have their own
print functions.  Are int and uint given special treatment because of
their radix term?

> 
> >  #define STRINGBUF_F_SHARED      0x100
> >  #define STRINGBUF_F_INLINE      0x200
> > +#define STRINGBUF_F_WIDE        0x400
> 
> Although other flags miss this, can you add a comment to the right
> saying that this means UCS-4 encoding?

OK.

> 
> > +#if SCM_DEBUG
> > +  if (len < 1000)
> > +    lenhist[len]++;
> > +  else
> > +    lenhist[1000]++;
> > +#endif
> 
> I would use "#ifdef SCM_STRING_LENGTH_HISTOGRAM" for that.  Conversely,
> I'd make `%string-dump' and `%symbol-dump' always available (with
> docstring and possibly manual entry).

I like that idea.

> 
> > +          (scm_t_wchar) (unsigned char) STRINGBUF_INLINE_CHARS (buf)[i];
> 
> Is the double cast needed?

Sort of.  Unsigned char will successfully be implicitly cast to
scm_t_wchar, so the scm_t_wchar term is just for clarity.  The unsigned
char term is definitely needed. Negative 8-bit chars are the upper half
of the 8-bit charset (128 - 255).  Casting them directly to scm_t_wchar
may return 0xFFFFFF80 - 0xFFFFFFFF instead of 128-255.  I don't have any
problem removing the scm_t_wchar cast.  Would you prefer that? 

> 
> > +      if (scm_i_is_narrow_string (name))
> 
> "Narrow strings" are Latin-1, right?

Right.

> 
> > +SCM_DEFINE (scm_sys_string_dump, "%string-dump", 1, 0, 0, (SCM str), "")
>
> How about returning an alist with all this information instead of using
> printf?

OK

> > +SCM_DEFINE (scm_string_width, "string-width", 1, 0, 0,
> > +            (SCM string),
> > +            "Return the bytes used to represent a character in 
> > @var{string}."
> > +            "This will return 1 or 4.")
> 
> I was wondering whether we should expose as much of the internal
> representation, but I think it's a good debugging aid and it can't hurt.
> 
> I find the name slightly misleading, but I can't think of a better one.

I put it in because that information needs to be available in the
bytecode compiler.  A slightly clearer name would probably be
string-bytes-per-character, I suppose.

> > +SCM_INTERNAL char *scm_to_stringn (SCM str, size_t *lenp, 
> > +                                   const char *encoding,
> > +                                   enum iconv_ilseq_handler handler);
> 
> I suppose this would eventually become public.  What do you think?
> Should we use a different type for HANDLER before that happens?

The simplest thing would be to make some constants like

scm_c_define ("STRING_ESCAPE", scm_from_int(iconveh_escape_sequence))

Something similar is done in the scm_seek function's constants, such as
SEEK_CUR.

> 
> > +SCM_API const scm_t_wchar *scm_i_string_wide_chars (SCM str);
> 
> Should be SCM_INTERNAL.

OK

> 
> Thank you!

Thanks for taking the time to check it out.

> 
> Ludo'.

Mike

Re: Unicode strings and symbols

Reply via email to