RE: Transcoding patch

Dan Sugalski Tue, 09 Oct 2001 13:08:56 -0700

At 03:03 PM 10/9/2001 -0500, Gibbs Tanton - tgibbs wrote:
> > At 07:03 PM 10/8/2001 -0500, Gibbs Tanton - tgibbs wrote:
> > >This looks good.
> > >
> > >Also, WRT the utf8_t, utf16_t, and utf32_t can we not just use utf32_t
> > >and then mask off the lower 8 or 16 bits?  We can still have utf8_t be
> > >defined as char to allow sizeof to work right and we can do
> > >sizeof(utf8_t)*2 to get the utf16_t's size.
> >
> > utf8 and utf16 are both variable length encodings for space reasons.
> > There's not much reason to space-compact something then expand the heck
> > out of it.
>
>#I think he was just referring to the internal type used to hold a
>#character during processing, not to expanding the whole string.
>
>Yep, they would still be in UTF8 or UTF16 format internally, but I was
>trying to find a way where a 16 bit type was not needed as it might be hard
>to find on some systems.  For that matter a 32 bit type can be hard to find
>on some systems.  It seems we need one type and then some macros to fish out
>the other types.


Then I suppose you can fall back to an INTVAL and hope for the best. This 
stuff'll all be hidden in the string handling library for UTF-8 and UTF-16 
data, so it doesn't make that much difference.
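The "one type and some macros" idea can be sketched roughly as below. This is only an illustration: wide_unit is a hypothetical stand-in for INTVAL (assumed here to be at least 32 bits wide), and UNIT8, UNIT16 and store_unit16 are made-up names, not anything in the patch.

```c
#include <assert.h>

/* Hypothetical stand-in for INTVAL: unsigned long is at least
 * 32 bits on any conforming C compiler. */
typedef unsigned long wide_unit;

/* Fish the narrower code units out of the wide type by keeping
 * only the low 8 or 16 bits. */
#define UNIT8(x)  ((wide_unit)(x) & 0xFFUL)
#define UNIT16(x) ((wide_unit)(x) & 0xFFFFUL)

/* Hold a UTF-16 code unit in a wide slot, so no native 16-bit
 * type is ever required. The range check is an assumption about
 * how strict the library would want to be. */
static wide_unit store_unit16(unsigned long v)
{
    assert(v <= 0xFFFFUL);
    return (wide_unit)v;
}
```

The strings themselves would stay in their compact encodings; only the value being processed at any moment sits in the wide type.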

> > On the other hand, I'd really, *really* rather not have Unicode
> > constants in anything other than UTF-32, so I'd as soon we chopped out
> > the utf-8 and utf-16 constant support from this.
> >
> > A should be the prefix for US-ASCII characters.
> > U should be the prefix for Unicode characters
> > N should be the prefix for the native character set (and the default)
> >
> > Beyond that I'm not sure what, if anything, we should accommodate in
> > the assembler.
>
>#What does US-ASCII correspond to internally - we don't have an
>#encoding for that. unless you're planning to mark it as UTF-8 and
>#rely on US-ASCII being a subset of UTF-8 of course ;-)
>
>Besides that, I'm not sure who would want to write a string in parrot
>assembly in ISO Latin-1 if it wasn't their native character set...seems
>to me they would go straight to Unicode.  I can understand the need for
>native and U32, but I question the latin1 US-ASCII need.

US-ASCII's guaranteed 7-bit. If there's a high-bit character set you can 
legitimately pitch a fit when assembling. Though it seems rather silly, 
thinking about it, as we could just have it as a restricted Unicode.

I was thinking of it for those cases where the native character set isn't a 
superset of ASCII, but I'm not sure if there are any. Seemed a bit 
presumptuous to assume so, though.
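The 7-bit guarantee is cheap to enforce when assembling. A minimal sketch, assuming the assembler has the raw bytes of an A-prefixed constant in hand (first_non_ascii is a hypothetical helper name, not part of the assembler):

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first byte with the high bit set, or -1
 * if every byte is 7-bit US-ASCII. The caller can use the index to
 * report where the offending character is. */
static long first_non_ascii(const unsigned char *s, size_t len)
{
    size_t i;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (s[i] & 0x80)
            return (long)i;
    }
    return -1;
}
```

Anything other than -1 is where the assembler gets to pitch its fit.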

>#The only other thing is that writing tests for UTF-8 and UTF-16 strings
>#and the transcoder is going to be quite tricky if we can't generate
>#them using the assembler.
>
>Yeah, perhaps we could keep this in, but note that it will be removed in the
>future once our tests pass...however, it seems like this is something that
>we will want to put in the test harness, so it seems we would need UTF8 and
>UTF16 constants for that.

No. No UTF-8 or UTF-16 constants in the assembler. Internally Parrot will 
deal with Unicode only in UTF-32 format. UTF-8 and UTF-16 are for I/O only, 
along with possibly a trivial amount of processing (chomp. Maybe.)
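To make the I/O boundary concrete, here is a rough sketch of expanding UTF-8 input into one code point per slot on the way in. It is an illustration only, not the transcoder from the patch: utf8_to_utf32 is a hypothetical name, unsigned long stands in for whatever 32-bit-capable type Parrot ends up using, and rejecting malformed input outright is an assumed policy.

```c
#include <assert.h>
#include <stddef.h>

/* Decode in_len bytes of UTF-8 into code points, one per slot of out.
 * Returns the number of code points written, or (size_t)-1 if the
 * input is malformed or out is too small. */
static size_t utf8_to_utf32(const unsigned char *in, size_t in_len,
                            unsigned long *out, size_t out_cap)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);
    while (i < in_len) {
        unsigned char b = in[i++];
        unsigned long cp, min;
        int extra;

        /* The lead byte gives the sequence length and the smallest
         * code point that legitimately needs that length. */
        if (b < 0x80)                { cp = b;        extra = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000UL; }
        else return (size_t)-1;      /* stray continuation or bad lead byte */

        if ((size_t)extra > in_len - i)
            return (size_t)-1;       /* sequence runs off the end */

        while (extra-- > 0) {
            unsigned char c = in[i++];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;   /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(c & 0x3F);
        }

        /* Reject overlong forms, surrogates, and out-of-range values. */
        if (cp < min || cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
            return (size_t)-1;
        if (n >= out_cap)
            return (size_t)-1;
        out[n++] = cp;
    }
    return n;
}
```

Once the data is in this form, everything past the I/O layer can treat a string as a plain array of fixed-width characters.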

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
