Re: scm_to_locale_stringbuf

Mike Gran Tue, 03 Feb 2009 15:47:08 -0800

> From: Neil Jerram n...@ossau.uklinux.net

> I'm afraid I don't understand the problem, on two counts.
> 
> 1. The doc (in the manual) says that scm_to_locale_stringbuf doesn't
> add a terminating \0.  So presumably any \0s present must be padding.
> 
> 2. The doc also says that if scm_to_locale_stringbuf's return value
> is > max_len (as it would be in your case), the caller should call it
> again with a larger buffer.
>


Right now, the internal encoding of strings is an unspecified 8-bit encoding 
that is assumed to be compatible with the locale in which the program is running.

So if I have a guile string with some 8-bit character that is between 128 and 
255, it just gets passed through.  If I request the contents of that string 
from C with scm_to_locale_string, it just returns the buffer of the scheme 
string.

But, in future, scm_to_locale_string or scm_to_locale_stringbuf should actually 
do the proper conversion to the current locale so that wide characters are 
printed properly.

So, if we move the internal representation of strings away
from unspecified 8-bit data and toward something concrete,
like ISO-8859-1 or UCS-4, and if a program is running in an
environment where the locale has a multibyte encoding
like UTF-8, then the created locale string could contain
multibyte characters.

Consider a scheme string that is internally the single
character "LATIN SMALL LETTER A WITH ACUTE", which is
U+00E1.  If the locale were some sort of UTF-8, like
en_US.utf-8, this letter should become the two bytes 0xC3
and 0xA1 when converted to the locale.
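
To make the byte values above concrete, here is a minimal sketch of the
UTF-8 encoding step for code points below U+0800. The function name
utf8_encode_small is made up for this example and is not part of Guile.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a libguile function.
   Encode a code point below U+0800 as UTF-8 into OUT and return the
   number of bytes written (1 or 2). */
static size_t
utf8_encode_small (unsigned int cp, unsigned char out[2])
{
  assert (cp < 0x800);          /* larger code points need 3 or 4 bytes */
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;              /* plain ASCII */
      return 1;
    }
  out[0] = (unsigned char) (0xC0 | (cp >> 6));  /* lead byte: 110xxxxx */
  out[1] = (unsigned char) (0x80 | (cp & 0x3F));/* continuation: 10xxxxxx */
  return 2;
}
```

Calling utf8_encode_small (0xE1, out) returns 2 and stores 0xC3 in out[0]
and 0xA1 in out[1].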

So what should happen in this case if I call
scm_to_locale_stringbuf (str, buf, 1)?  Note that here BUF
can only contain 1 byte.  Should the one byte 0xC3 be
copied into it, which creates an invalid string?  Or
should nothing be copied into it?  In either case, there
should be some mechanism in the API to report that the
last character was incomplete, because outputting just
the one byte 0xC3 would cause problems somewhere down
the road.
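
One way to detect the incomplete last character, sketched under the
assumption that the converted text is valid UTF-8 (other multibyte
encodings would need a different test). The name utf8_whole_prefix is
hypothetical and does not exist in Guile.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a libguile function.
   Return the length of the longest prefix of the UTF-8 text S (LEN bytes)
   that fits in MAX_LEN bytes without ending in the middle of a character.
   Assumes S is valid UTF-8. */
static size_t
utf8_whole_prefix (const unsigned char *s, size_t len, size_t max_len)
{
  size_t n = len < max_len ? len : max_len;

  assert (s != NULL);
  if (n == len)
    return n;                   /* the whole text fits */

  /* s[n] is the first byte that does not fit.  If it is a continuation
     byte (10xxxxxx), cutting here would split a character, so back up
     until the cut falls just before a lead byte. */
  while (n > 0 && (s[n] & 0xC0) == 0x80)
    n--;
  return n;
}
```

For the two bytes 0xC3 0xA1 and a max_len of 1, this returns 0: no whole
character fits, so a caller can tell that copying one byte would have
split the character.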

So what I was saying was that in this case maybe the best
thing to do would be to pad the output buffer with '\0'
instead of putting in half of a multibyte character, and
then signal that there is some padding at the end of the
string.

For instance, one could have a function
scm_to_locale_stringbufn (SCM str, char *buf, size_t max_len, size_t *len_used)
where LEN_USED receives the number of bytes of the buffer
that were actually filled with string data, excluding any
padding.
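
A rough sketch of how such a function might behave. It is only an
illustration: it takes already-converted UTF-8 bytes (SRC, SRC_LEN)
in place of an SCM string so that it compiles without libguile, the name
locale_stringbufn_sketch is invented, and returning the total number of
bytes needed is an assumption carried over from the documented behaviour
of scm_to_locale_stringbuf.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not a libguile function.
   Copy as many whole UTF-8 characters from SRC as fit in BUF, pad the
   rest of BUF with '\0', store the number of data bytes in *LEN_USED,
   and return the number of bytes needed for all of SRC. */
static size_t
locale_stringbufn_sketch (const char *src, size_t src_len,
                          char *buf, size_t max_len, size_t *len_used)
{
  size_t n = src_len < max_len ? src_len : max_len;

  assert (src != NULL && buf != NULL && len_used != NULL);

  /* If the text is being truncated, do not cut inside a character:
     back up while the first excluded byte is a continuation byte. */
  if (n < src_len)
    while (n > 0 && ((unsigned char) src[n] & 0xC0) == 0x80)
      n--;

  memcpy (buf, src, n);                 /* whole characters only */
  memset (buf + n, 0, max_len - n);     /* pad the unused tail */
  *len_used = n;
  return src_len;                       /* > max_len means "try again" */
}
```

With the three bytes 'a' 0xC3 0xA1 and a two-byte buffer, this stores 'a'
followed by one '\0' of padding, sets *len_used to 1, and returns 3, so
the caller knows both that the buffer was too small and how much of it
holds real data.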

Sorry for the book-length explanation,

Mike Gran
