> On Wed, Jul 17, 2002 at 12:13:47PM -0400, Dan Sugalski wrote:
> > I thought Java used UTF-16. It's a variable-width encoding, so it 
> > should be fine. (Though I bet a lot of folks will be rather surprised 
> > when it happens...)
Update:

Since Unicode 3.1 (3.2 is the current version), characters have in fact
been defined outside the 16-bit range U+0000 to U+FFFF.
For instance, the block U+1D100 to U+1D1FF contains musical symbols.

Since Java 'char's are 16-bit quantities, characters outside
the range U+0000 to U+FFFF have to be represented by pairs of
'char's from the 'surrogates' range, U+D800 through U+DFFF.
Java does not handle this conversion transparently; for instance,
the \uXXXX sequence to include a Unicode character code point
takes exactly four hexadecimal digits.  So to represent, e.g.,
U+1D107  MUSICAL SYMBOL RIGHT REPEAT SIGN, you have to manually
compose the surrogate pair (U+D834, U+DD07).  This is a good thing
from the point of view of the Java programmer since it means a
'char' is always the same size, even though it may not represent the
entire desired character.  In that, however, it is not fundamentally
different from composition within the 16-bit range - e.g. composing
'a' (U+0061) and the combining version of '~' (U+0303) to get 'ã',
instead of using the single character U+00E3.
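The two-'char' behavior described above can be seen directly in Java.
This is a minimal sketch (the class name is mine); it shows that the
manually composed surrogate pair for U+1D107 occupies two 'char's, just
as the combining-character form of 'ã' does:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D107 written as the manually composed surrogate pair
        String repeatSign = "\uD834\uDD07";
        System.out.println(repeatSign.length());        // 2: two 'char's, one character
        System.out.printf("%04X %04X%n",
                (int) repeatSign.charAt(0),
                (int) repeatSign.charAt(1));            // D834 DD07

        // Analogous composition within the 16-bit range:
        // 'a' (U+0061) plus combining tilde (U+0303) is also two 'char's
        String aTilde = "a\u0303";
        System.out.println(aTilde.length());            // 2
    }
}
```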

Note that surrogates are bypassed when encoding in UTF-8: you encode
the desired code point directly, resulting in a UTF-8 sequence of four
octets (characters up through U+FFFF require at most three octets in
UTF-8).  Perl 5.6.1 already handles this correctly for \x{...} values
greater than 0xffff; e.g. perl -e 'print "\x{1d107}\n";' will output
the four-byte UTF-8 encoding of that character.
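Java's standard UTF-8 encoder does the same surrogate-bypassing
transform: a sketch (class name mine) showing that getBytes("UTF-8")
combines the pair (U+D834, U+DD07) into the single four-octet sequence
for U+1D107:

```java
import java.io.UnsupportedEncodingException;

public class Utf8Demo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The surrogate pair is merged back into code point U+1D107
        // before UTF-8 encoding, yielding four octets, not two
        // three-octet sequences for the surrogates themselves.
        byte[] utf8 = "\uD834\uDD07".getBytes("UTF-8");
        StringBuffer hex = new StringBuffer();
        for (int i = 0; i < utf8.length; i++) {
            hex.append(Integer.toHexString(utf8[i] & 0xFF)).append(' ');
        }
        System.out.println(utf8.length);                          // 4
        System.out.println(hex.toString().trim().toUpperCase());  // F0 9D 84 87
    }
}
```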

-- 
Mark REED                    | CNN Internet Technology
1 CNN Center Rm SW0831G      | [EMAIL PROTECTED]
Atlanta, GA 30348      USA   | +1 404 827 4754 
--
Going the speed of light is bad for your age.
