RE: Transcoding patch

Gibbs Tanton - tgibbs Tue, 09 Oct 2001 12:34:33 -0700

> At 07:03 PM 10/8/2001 -0500, Gibbs Tanton - tgibbs wrote:
> >This looks good.
> >
> >Also, WRT the utf8_t, utf16_t, and utf32_t can we not just use utf32_t and
> >then mask off the lower 8 or 16 bits?  We can still have utf8_t be defined
> >as char to allow sizeof to work right and we can do sizeof(utf8_t)*2 to get
> >the utf16_t's size.
>
> utf8 and utf16 are both variable length encodings for space reasons.
> There's not much reason to space-compact something then expand the heck out
> of it.


#I think he was just referring to the internal type used to hold a
#character during processing, not to expanding the whole string.

Yep, the strings would still be stored in UTF8 or UTF16 format internally. I
was trying to find a way to avoid needing a 16 bit type, since one might be
hard to find on some systems.  For that matter, a 32 bit type can be hard to
find on some systems too.  It seems we need one type that is wide enough, plus
some macros to fish out the narrower values.
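
To make the "one type plus macros" idea concrete, here is a rough sketch. All
of the names (parrot_unit_t, UNIT_AS_UTF8, and so on) are made up for
illustration and are not from the patch. It assumes unsigned long as the
holder, since ISO C guarantees that type has at least 32 bits even where no
exact 16 or 32 bit type exists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical holder type: wide enough for any code unit or code point.
   ISO C guarantees unsigned long has at least 32 bits. */
typedef unsigned long parrot_unit_t;

/* Fish the narrower values out of the wide holder by masking. */
#define UNIT_AS_UTF8(u)   ((parrot_unit_t)(u) & 0xFFUL)
#define UNIT_AS_UTF16(u)  ((parrot_unit_t)(u) & 0xFFFFUL)
#define UNIT_AS_UTF32(u)  ((parrot_unit_t)(u) & 0xFFFFFFFFUL)

/* Storage size of one code unit in the encoded string, in 8 bit bytes.
   These are independent of sizeof(parrot_unit_t). */
#define UTF8_UNIT_BYTES   1
#define UTF16_UNIT_BYTES  2
#define UTF32_UNIT_BYTES  4

/* Read one big-endian UTF-16 code unit from a byte buffer into the
   wide holder, without ever needing a native 16 bit type. */
static parrot_unit_t read_utf16be_unit(const unsigned char *p)
{
    assert(p != NULL);
    return UNIT_AS_UTF16(((parrot_unit_t)p[0] << 8) | (parrot_unit_t)p[1]);
}
```

The string data stays in its compact encoding; only the single unit being
processed lives in the wide holder.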

> On the other hand, I'd really, *really* rather not have Unicode
> constants in anything other than UTF-32, so I'd as soon we chopped out the
> utf-8 and utf-16 constant support from this.
>
> A should be the prefix for US-ASCII characters.
> U should be the prefix for Unicode characters
> N should be the prefix for the native character set (and the default)
>
> Beyond that I'm not sure what, if anything, we should accommodate in the
> assembler.

#What does US-ASCII correspond to internally - we don't have an
#encoding for that, unless you're planning to mark it as UTF-8 and
#rely on US-ASCII being a subset of UTF-8 of course ;-)

Besides that, I'm not sure who would want to write a string in parrot
assembly in iso latin 1 if it wasn't their native character set...it seems to
me they would go straight to unicode.  I can understand the need for native
and U32, but I question the need for latin1 and US-ASCII.
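
On the point about US-ASCII being a subset of UTF-8: if an A-prefixed constant
were simply marked as UTF-8, the only check needed would be that every byte is
below 0x80. A minimal sketch follows; is_us_ascii is a hypothetical helper,
not something in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if every byte is in the US-ASCII range
   (0x00..0x7F), in which case the same bytes are also valid UTF-8.
   Returns 0 as soon as a byte with the high bit set is seen. */
static int is_us_ascii(const unsigned char *s, size_t len)
{
    size_t i;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (s[i] >= 0x80) {
            return 0;
        }
    }
    return 1;
}
```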

#The only other thing is that writing tests for UTF-8 and UTF-16 strings
#and the transcoder is going to be quite tricky if we can't generate
#them using the assembler.

Yeah, perhaps we could keep this in, with a note that it will be removed in
the future once our tests pass.  However, it seems like this is something we
will want in the test harness anyway, so we would still need UTF8 and UTF16
constants for that.
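
If the assembler only accepted UTF-32 constants, the harness could still
produce UTF-8 test data by encoding each code point itself. Below is a sketch
of the standard UTF-8 encoding; utf8_encode is a hypothetical name, and the
U+10FFFF upper limit and the rejection of surrogates are assumptions about what
the tests would want.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one code point as UTF-8 into out[0..3].
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate (U+D800..U+DFFF) or lies above U+10FFFF. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80UL) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0UL | (cp >> 6));
        out[1] = (unsigned char)(0x80UL | (cp & 0x3FUL));
        return 2;
    }
    if (cp < 0x10000UL) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800UL && cp <= 0xDFFFUL) {
            return 0;                        /* surrogates are not characters */
        }
        out[0] = (unsigned char)(0xE0UL | (cp >> 12));
        out[1] = (unsigned char)(0x80UL | ((cp >> 6) & 0x3FUL));
        out[2] = (unsigned char)(0x80UL | (cp & 0x3FUL));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {                  /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0UL | (cp >> 18));
        out[1] = (unsigned char)(0x80UL | ((cp >> 12) & 0x3FUL));
        out[2] = (unsigned char)(0x80UL | ((cp >> 6) & 0x3FUL));
        out[3] = (unsigned char)(0x80UL | (cp & 0x3FUL));
        return 4;
    }
    return 0;                                /* out of range */
}
```

For example, U+20AC comes out as the three bytes E2 82 AC. This only covers
producing the input data; testing the transcoder against known-good UTF-8 or
UTF-16 output would still need expected byte sequences from somewhere.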
