> On Wed, Jul 17, 2002 at 12:13:47PM -0400, Dan Sugalski wrote:
> > I thought Java used UTF-16. It's a variable-width encoding, so it
> > should be fine. (Though I bet a lot of folks will be rather surprised
> > when it happens...)

Update:
Since Unicode 3.1 (3.2 is the current version), there have in fact been
characters defined outside the 16-bit range U+0000 to U+FFFF. For
instance, the block U+1D100 to U+1D1FF contains musical symbols.

Since Java 'char's are 16-bit quantities, characters outside the range
U+0000 to U+FFFF have to be represented by pairs of characters from the
'surrogates' range, U+D800 through U+DFFF. Java does not handle this
conversion transparently; for instance, the \uXXXX escape for including
a Unicode code point takes exactly four hexadecimal digits. So to
represent, e.g., U+1D107 MUSICAL SYMBOL RIGHT REPEAT SIGN, you have to
compose the surrogate pair (U+D834, U+DD07) manually.

This is a good thing from the point of view of the Java programmer,
since it means a 'char' is always the same size, even though it may not
represent the entire desired character. In that respect, it is not
fundamentally different from composition within the 16-bit range -
e.g. composing 'a' (U+0061) with the combining version of '~' (U+0303)
to get 'ã', instead of using the single character U+00E3.

Note that surrogates are bypassed when encoding in UTF-8; you transform
the desired code point directly, resulting in a UTF-8 sequence of four
octets (characters through U+FFFF require a maximum of three octets in
UTF-8).

Perl 5.6.1 already handles this correctly for \x{...} values greater
than 0xffff; e.g.

    perl -e 'print "\x{1d107}\n";'

will output the four-byte UTF-8 encoding for that character.

--
Mark REED               | CNN Internet Technology
1 CNN Center Rm SW0831G | [EMAIL PROTECTED]
Atlanta, GA 30348 USA   | +1 404 827 4754
--
Going the speed of light is bad for your age.
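P.S. A small sketch of the points above, for anyone who wants to see
the surrogate-pair and UTF-8 behavior concretely. This uses the
`Character.toChars` and `StandardCharsets` APIs from later Java
releases (not available at the time of this mail), so take it as an
illustration, not a recipe for Java as it stood in 2002:

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D107 MUSICAL SYMBOL RIGHT REPEAT SIGN: outside the
        // 16-bit range, so it cannot fit in a single Java 'char'.
        int codePoint = 0x1D107;

        // toChars splits the code point into a high/low surrogate pair.
        char[] pair = Character.toChars(codePoint);
        System.out.printf("chars: %d%n", pair.length);        // 2
        System.out.printf("high:  U+%04X%n", (int) pair[0]);  // U+D834
        System.out.printf("low:   U+%04X%n", (int) pair[1]);  // U+DD07

        // UTF-8 encodes the code point directly, bypassing the
        // surrogates: four octets for anything above U+FFFF.
        byte[] utf8 = new String(pair).getBytes(StandardCharsets.UTF_8);
        System.out.printf("utf-8 octets: %d%n", utf8.length); // 4

        // Composition within the 16-bit range is analogous: 'a' plus
        // combining tilde renders the same as the precomposed U+00E3,
        // but they are different char sequences.
        String decomposed = "a\u0303";
        String precomposed = "\u00E3";
        System.out.println(decomposed.equals(precomposed));   // false
    }
}
```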