On 2004-03-16 at 00:28:32, Karl Brodowsky wrote:
> Mark J. Reed wrote:
>
> >Unicode per se doesn't do anything to file sizes; it's all in how you
> >encode it.
>
> Yes. And basically there are common ways to encode this: utf-8 and utf-16
> (or similar variants requiring >= 2 bytes per character)

There are many ways to encode it: UCS-4/UTF-32 (4 bytes per character),
UCS-2/UTF-16 (2 bytes for every character in the Basic Multilingual Plane,
which covers nearly everything in common use; UTF-16 uses 4 bytes, a
surrogate pair, for the rarely-used characters that lie outside the BMP,
which UCS-2 cannot represent at all), and UTF-8 (1 byte for ASCII, 2 bytes
for code points U+0080 through U+07FF, 3 bytes for code points U+0800
through U+FFFF, 4 bytes outside the BMP).

> >there are other encoding schemes like SCSU which get you Unicode
> >compatibility while not taking up much more space than the locale's native
> >charset.
>
> These make sense for languages like Japanese, Korean, Chinese etc, where
> you need more than one byte per character anyway.

No. You have mischaracterized or misunderstood the situation. UTF-8 is
*not* the only encoding that requires as little as one byte per character.
That is why I specifically mentioned SCSU - it provides a sliding "window"
accessible via single-byte offsets. In SCSU, a 128-character window can be
positioned over the script in use, not just the part corresponding to
US-ASCII, and the characters inside it are then represented by a series of
single bytes. It adds a small amount of overhead for code-switching, but in
general file sizes are very close to what you get with the corresponding
national character set, while still allowing you to escape out of that
range and include any Unicode character.

> Anyway, it will be necessary to specify the encoding of
> unicode in some way, which could possibly allow even to specify even some
> non-unicode-charsets.

There are no non-Unicode charsets from the Unicode standpoint. National
charsets are just encodings of Unicode - incomplete encodings, since only a
subset of code points is representable, but encodings nevertheless. Making
this possible is the reason Unicode has characters that are redundant with
sequences using combining forms: every character which exists as a unique
character in some established character set also exists as a unique
character in Unicode.
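
To make the byte counts above concrete, here is a minimal Python sketch. It is an illustration only: the helper names `encoded_sizes` and `representable_in` are made up for this example, ISO-8859-1 stands in for "a national charset", and SCSU is left out because the Python standard library does not ship a codec for it.

```python
# Sample characters, one from each UTF-8 length class.
SAMPLES = {
    "U+0041 LATIN CAPITAL LETTER A": "A",
    "U+00E9 LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "U+20AC EURO SIGN": "\u20ac",
    "U+10330 GOTHIC LETTER AHSA (outside the BMP)": "\U00010330",
}

# The "-be" variants are used so that no byte-order mark is counted.
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_sizes(ch):
    """Return the number of bytes `ch` occupies in each Unicode encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


def representable_in(ch, charset):
    """True if `charset` (treated as a partial encoding of Unicode) covers `ch`."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(label, encoded_sizes(ch))

    # A national charset encodes only a subset of code points:
    # ISO-8859-1 covers U+00E9 in one byte but has no slot for U+20AC.
    print(representable_in("\u00e9", "iso-8859-1"))
    print(representable_in("\u20ac", "iso-8859-1"))
```

Decoding the single ISO-8859-1 byte for U+00E9 gives back the same code point, which is the sense in which a national charset behaves as an incomplete encoding of Unicode.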

-- 
Mark REED                    | CNN Internet Technology
1 CNN Center Rm SW0831G      | [EMAIL PROTECTED]
Atlanta, GA 30348 USA        | +1 404 827 4754