At 11:36 PM +0000 3/15/04, [EMAIL PROTECTED] wrote:
 >Another possibility is to use a UTF-8 extended system where you use
 >values over 0x10FFFF to encode temporary code block swaps in the
 >encoding. I.e., some magic value means the one byte UTF-8 codes now
 >mean the Greek block instead of the ASCII block.

You could do that, but then I'd be forced to do something well and truly horrible to you, and we'd rather not have that. :)


Character set and encoding are metadata, and ought to be stored out-of-band, at least once the data makes it into your program. Twiddling the internal representation of the bytes is a fairly sub-optimal way to do that, so I'd as soon not mandate that we have to. (I do dislike publicly breaking mandates like that. Terribly inconvenient.)
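
To make "out-of-band" concrete, here is a minimal sketch in Python. The TaggedString type and its field names are hypothetical, invented for illustration only, and are not anything from an actual implementation. The point is just that the bytes stay untouched and the charset and encoding ride alongside them as separate fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedString:
    """Raw bytes plus out-of-band metadata (hypothetical type, for illustration)."""

    data: bytes    # the bytes exactly as stored; no in-band switch codes
    charset: str   # which character repertoire the code points belong to
    encoding: str  # how those code points are serialized into bytes

    def decode(self) -> str:
        # Python's codec names bundle charset and encoding together,
        # so this sketch uses the encoding field to pick the codec.
        return self.data.decode(self.encoding)


# Greek text stored as single bytes; the metadata, not the byte stream,
# says that these bytes mean Greek letters.
greek = TaggedString(
    data="αβγ".encode("iso8859_7"),
    charset="ISO-8859-7",
    encoding="iso8859_7",
)
```

Nothing in greek.data signals "Greek block from here on"; a consumer that ignores the metadata sees three opaque bytes, which is the intent.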

At 12:28 AM +0100 3/16/04, Karl Brodowsky wrote:
 >Anyway, it will be necessary to specify the encoding of unicode in
 >some way, which could possibly allow even to specify even some
 >non-unicode-charsets.

 While I'll skip diving deeper into the swamp that is character sets
 and encoding (I'm already up to my neck in it, thanks, and I don't
 have any long straws handy :) I'll point out that the above statement
 is meaningless--there *are* no Unicode non-unicode charsets.

 It is possible to use the UTF encodings on non-unicode charsets--you
 could reasonably use UTF-8 to encode, say, Shift-JIS characters.
 (where Shift-JIS is both an encoding and a character set, and it can
 be separated into pieces)
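
 As a rough sketch of that idea in Python: the UTF-8 byte layout works on
 any non-negative integer up to 0x10FFFF, so it can carry code values from
 a character set other than Unicode. The names utf8_pack and utf8_unpack
 are hypothetical helpers written for this example, and the unpacker does
 no validation of continuation bytes. The value 0x82A0 is the Shift-JIS
 code for hiragana "a".

```python
from typing import List


def utf8_pack(value: int) -> bytes:
    """Serialize one integer code value using the UTF-8 byte layout."""
    if value < 0:
        raise ValueError("code value must be non-negative")
    if value < 0x80:
        return bytes([value])
    if value < 0x800:
        return bytes([0xC0 | (value >> 6), 0x80 | (value & 0x3F)])
    if value < 0x10000:
        return bytes([
            0xE0 | (value >> 12),
            0x80 | ((value >> 6) & 0x3F),
            0x80 | (value & 0x3F),
        ])
    if value < 0x110000:
        return bytes([
            0xF0 | (value >> 18),
            0x80 | ((value >> 12) & 0x3F),
            0x80 | ((value >> 6) & 0x3F),
            0x80 | (value & 0x3F),
        ])
    raise ValueError("code value too large for four-byte UTF-8")


def utf8_unpack(data: bytes) -> List[int]:
    """Recover the integer code values; minimal, no error checking of tails."""
    values = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            extra, value = 0, lead
        elif lead >> 5 == 0b110:
            extra, value = 1, lead & 0x1F
        elif lead >> 4 == 0b1110:
            extra, value = 2, lead & 0x0F
        elif lead >> 3 == 0b11110:
            extra, value = 3, lead & 0x07
        else:
            raise ValueError("invalid lead byte")
        for b in data[i + 1:i + 1 + extra]:
            value = (value << 6) | (b & 0x3F)
        values.append(value)
        i += 1 + extra
    return values


SJIS_A = 0x82A0              # Shift-JIS code value, not a Unicode code point
packed = utf8_pack(SJIS_A)   # UTF-8 layout carrying a Shift-JIS value
```

 The bytes in packed are identical to what a Unicode-aware encoder would
 emit for U+82A0, an unrelated character. Nothing in the byte stream says
 which character set the number belongs to, so that has to be recorded
 somewhere else.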

 It's not unwise (and, in practice, at least in implementation quite
 sensible) to separate the encoding from the character set, but you
 need to be careful to keep the separation clear, though many of the
 sets and encodings don't go out of their way to help with that.

-- Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
