At 03:03 PM 10/9/2001 -0500, Gibbs Tanton - tgibbs wrote:
> > At 07:03 PM 10/8/2001 -0500, Gibbs Tanton - tgibbs wrote:
> > >This looks good.
> > >
> > >Also, WRT the utf8_t, utf16_t, and utf32_t can we not just use utf32_t and
> > >then mask off the lower 8 or 16 bits? We can still have utf8_t be defined
> > >as char to allow sizeof to work right and we can do sizeof(utf8_t)*2 to get
> > >the utf16_t's size.
> >
> > utf8 and utf16 are both variable length encodings for space reasons.
> > There's not much reason to space-compact something then expand the heck out
> > of it.
>
>#I think he was just referring to the internal type used to hold a
>#character during processing, not to expanding the whole string.
>
>Yep, they would still be in UTF8 or UTF16 format internally, but I was
>trying to find a way where a 16 bit type was not needed as it might be hard
>to find on some systems. For that matter a 32 bit type can be hard to find
>on some systems. It seems we need one type and then some macros to fish out
>the other types.
Then I suppose you can fall back to an INTVAL and hope for the best. This
stuff'll all be hidden in the string handling library for UTF-8 and UTF-16
data, so it doesn't make that much difference.

> > On the other hand, I'd really, *really* rather not have Unicode
> > constants in anything other than UTF-32, so I'd as soon we chopped out the
> > utf-8 and utf-16 constant support from this.
> >
> > A should be the prefix for US-ASCII characters.
> > U should be the prefix for Unicode characters
> > N should be the prefix for the native character set (and the default)
> >
> > Beyond that I'm not sure what, if anything, we should accommodate in the
> > assembler.
>
>#What does US-ASCII correspond to internally - we don't have an
>#encoding for that. unless you're planning to mark it as UTF-8 and
>#rely on US-ASCII being a subset of UTF-8 of course ;-)
>
>Besides that, I'm not sure who would want to write a string in parrot
>assembly in iso latin 1 if it wasn't their native character set...seems like
>to me they would go straight to unicode. I can understand the need for
>native and U32, but I question the latin1 US-ASCII need.

US-ASCII's guaranteed 7-bit. If there's a high-bit character set you can
legitimately pitch a fit when assembling. Though it seems rather silly,
thinking about it, as we could just have it as a restricted Unicode. I was
thinking of it for those cases where the native character set isn't a
superset of ASCII, but I'm not sure if there are any. Seemed a bit
presumptuous to assume so, though.

>#The only other thing is that writing tests for UTF-8 and UTF-16 strings
>#and the transcoder is going to be quite tricky if we can't generate
>#them using the assembler.
>
>Yeah, perhaps we could keep this in, but note that it will be removed in the
>future once our tests pass...however, it seems like this is something that
>we will want to put in the test harness, so it seems we would need UTF8 and
>UTF16 constants for that.

No.

No UTF-8 or UTF-16 constants in the assembler. Internally Parrot will deal
with Unicode only in UTF-32 format. UTF-8 and UTF-16 are for I/O only, along
with possibly a trivial amount of processing (chomp. Maybe.)

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
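
For readers following the thread, here is a minimal sketch in C of the two checks discussed above: rejecting high-bit bytes in a US-ASCII constant, and expanding UTF-8 into UTF-32 once at the I/O boundary so that everything downstream sees only fixed-width characters. It is an illustration under stated assumptions, not Parrot's actual string library. The names `utf32_char`, `is_us_ascii`, `utf8_decode_one`, and `utf8_to_utf32` are hypothetical, and the use of C99's `uint_least32_t` (a type at least 32 bits wide, available even where no exact 32-bit type exists) is one possible answer to the portability worry raised in the thread, not something the posters proposed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. One integer type wide enough for any UTF-32 value;
   uint_least32_t is required by C99 even without an exact 32-bit type. */
typedef uint_least32_t utf32_char;

/* Return 1 if every byte is 7-bit US-ASCII, 0 if any byte has the high
   bit set (the case where an assembler could refuse the constant). */
static int is_us_ascii(const unsigned char *s, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (s[i] & 0x80)
            return 0;
    }
    return 1;
}

/* Decode one UTF-8 sequence starting at s (len bytes available) into *out.
   Returns the number of bytes consumed, or 0 on malformed input. */
static size_t utf8_decode_one(const unsigned char *s, size_t len,
                              utf32_char *out)
{
    utf32_char cp, min;
    size_t need, i;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {                      /* single byte: plain ASCII */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 2-byte sequence */
        cp = s[0] & 0x1F; need = 1; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 3-byte sequence */
        cp = s[0] & 0x0F; need = 2; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 4-byte sequence */
        cp = s[0] & 0x07; need = 3; min = 0x10000;
    } else {
        return 0;                           /* stray continuation or invalid lead */
    }

    if (len < need + 1)
        return 0;                           /* truncated sequence */
    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        cp = (cp << 6) | (utf32_char)(s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    *out = cp;
    return need + 1;
}

/* Decode a whole UTF-8 buffer into out (capacity cap, in characters).
   Returns the number of characters written, or (size_t)-1 on error. */
static size_t utf8_to_utf32(const unsigned char *s, size_t len,
                            utf32_char *out, size_t cap)
{
    size_t n = 0;
    while (len > 0) {
        size_t used;
        if (n == cap)
            return (size_t)-1;
        used = utf8_decode_one(s, len, &out[n]);
        if (used == 0)
            return (size_t)-1;
        s += used;
        len -= used;
        n++;
    }
    return n;
}
```

In this sketch a reader would call `utf8_to_utf32` once when data comes in, and from then on index the result like an ordinary array. That is the trade the thread argues over: more memory per character in exchange for never having to handle variable-length sequences during processing.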