Re: Latin-1-characters

Mark J. Reed Tue, 16 Mar 2004 07:03:25 -0800

On 2004-03-16 at 00:28:32, Karl Brodowsky wrote:
> Mark J. Reed wrote:
> 
> >Unicode per se doesn't do anything to file sizes; it's all in how you
> >encode it.
> 
> Yes.  And basically there are common ways to encode this: utf-8 and utf-16
> (or similar variants requiring >= 2 bytes per character)


There are many ways to encode it: UCS-4/UTF-32 (4 bytes per character),
UTF-16 (2 bytes for the 80% of all currently-defined characters that lie
in the Basic Multilingual Plane, 4 bytes for the rarely-used 20% outside
it; plain UCS-2 is the 2-byte form alone and cannot reach beyond the BMP),
and UTF-8 (1 byte for ASCII, 2 bytes for code points U+0080 through U+07FF,
3 bytes for code points U+0800 through U+FFFF, 4 bytes outside the BMP).
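
To make those byte counts concrete, here is a minimal Python sketch that
measures one sample character from each range.  The sample characters and
the helper name encoded_sizes are my own choices for illustration; the
codecs are the standard ones shipped with Python.

```python
# Encoded size in bytes of a single character under three Unicode encodings.
samples = {
    "A": "U+0041, ASCII",
    "\u00e9": "U+00E9, in the U+0080..U+07FF range",
    "\u20ac": "U+20AC, in the U+0800..U+FFFF range",
    "\U00010400": "U+10400, outside the BMP",
}


def encoded_sizes(ch):
    # The explicit little-endian codecs are used so that no byte-order
    # mark is prepended and counted in the length.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


for ch, label in samples.items():
    print(label, encoded_sizes(ch))
```

The character outside the BMP takes 4 bytes in all three encodings, since
UTF-16 has to use a surrogate pair for it.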

> >there are other encoding schemes like SCSU which get you Unicode
> >compatibility while not taking up much more space than the locale's native 
> >charset.
> 
> These make sense for languages like Japanese, Korean, Chinese etc, where 
> you need more than one byte per character anyway.

No. You have mischaracterized or misunderstood the situation.  UTF-8 is
*not* the only encoding that requires as little as one byte per
character.  That is why I specifically mentioned SCSU - it provides a
sliding "window" accessible via single-byte offsets.  In SCSU, *any*
128-character portion of the Unicode range, not just the part corresponding
to US-ASCII, may be represented by a series of single bytes.  It adds a
small amount of overhead for code-switching, but in general file sizes
are very close to what you get with the corresponding national character
set, while still allowing the ability to escape out of that range and
include any Unicode character.
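
The window idea can be sketched in a few lines of Python.  This is a toy
illustration only, not a conforming SCSU encoder: it handles a single
fixed window plus printable ASCII and nothing else.  The window base
(U+0400, the Cyrillic block) and the tag byte 0x12 are assumptions modelled
on my reading of SCSU's predefined windows and should be checked against
the specification; the function name toy_window_encode is hypothetical.

```python
WINDOW_BASE = 0x0400  # assumed start of the 128-code-point window (Cyrillic)
SELECT_TAG = 0x12     # assumed tag byte that switches to that window


def toy_window_encode(text, base=WINDOW_BASE, tag=SELECT_TAG):
    """Encode text as one tag byte followed by one byte per character."""
    out = bytearray([tag])
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp < 0x80:
            # Printable ASCII passes through unchanged.
            out.append(cp)
        elif base <= cp < base + 0x80:
            # Characters inside the window become 0x80 plus their offset.
            out.append(0x80 + (cp - base))
        else:
            raise ValueError("outside the toy window: U+%04X" % cp)
    return bytes(out)


word = "\u041c\u043e\u0441\u043a\u0432\u0430"  # "Moskva" in Cyrillic letters
print(len(toy_window_encode(word)))   # one tag byte plus one byte per letter
print(len(word.encode("koi8_r")))     # national charset: one byte per letter
print(len(word.encode("utf-8")))      # UTF-8: two bytes per Cyrillic letter
```

For this six-letter word the windowed form costs one byte more than
KOI8-R, against twice the size for UTF-8.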

> Anyway, it will be necessary to specify the encoding of 
> unicode in some way, which could possibly allow even to specify even some 
> non-unicode-charsets.

There are no non-Unicode charsets from the Unicode standpoint.  National
charsets are just encodings of Unicode - incomplete encodings, since
only a subset of code points is representable, but encodings
nevertheless.  Making this possible is the reason Unicode has characters
that are redundant with sequences using combining forms: every character
which exists as a unique character in some established character set
also exists as a unique character in Unicode.
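
A small Python sketch of that point, using Latin-1 as the national charset
and e-acute as the redundant character.  The helper name fits_latin1 is
mine; unicodedata and the latin-1 codec are part of the standard library.

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, a single code point
# Canonical decomposition: "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", precomposed)

# The precomposed character maps one-to-one onto the Latin-1 byte 0xE9,
# so Latin-1 text converts to Unicode and back without loss.
latin1_bytes = precomposed.encode("latin-1")
round_trip = latin1_bytes.decode("latin-1")


def fits_latin1(text):
    """Return True if every character of text is representable in Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(len(precomposed), len(decomposed))
print(latin1_bytes, round_trip == precomposed)
print(fits_latin1(precomposed), fits_latin1(decomposed))
```

The decomposed form does not fit in Latin-1 because the combining accent
has no Latin-1 byte, which is why the precomposed character is needed for
a lossless round trip.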

-- 
Mark REED                    | CNN Internet Technology
1 CNN Center Rm SW0831G      | [EMAIL PROTECTED]
Atlanta, GA 30348      USA   | +1 404 827 4754
