kesh...@us.ibm.com wrote on 04/23/2009 08:02:18 PM:

> > UTF-8 and UTF-16 are character encodings [1], representing the
> > characters defined by Unicode as sequences of bytes. These encodings
> > have a representation for every character in Unicode. Like any of
> > the other encodings they're decoded into Java chars on input so it's
> > all the same to the consumer of the SAX API regardless of what the
> > document's encoding was.
>
> More specifically: Characters too long to represent in a single Java
> char will take two chars; that's how UTF-16 works. (UTF-8 is
> similar, except that it takes one, two, or three bytes

or four bytes (for code points 0x10000 - 0x10FFFF).

> to cover the
> same range of values rather than UTF-16's two or four.)
>
> Yes, this means that full Unicode string manipulation in Java is
> more complex than just moving individual chars around. Luckily, most
> alphabetical languages fit in the Basic Multilingual Plane (U+0000 -
> U+FFFF) and need only one char per character. (The range 0xD800 -
> 0xDFFF is reserved for the surrogates that make up two-char pairs.)
>
> Note that this is general Java behavior, nothing unique to Xerces.
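
To make the sizes above concrete, here is a minimal sketch using only
standard JDK calls. The class name SurrogateDemo and its helper methods
are made up for illustration; nothing here is specific to Xerces or SAX.
It compares the number of Java chars, code points and UTF-8 bytes for a
few sample characters, including one outside the BMP.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name and helpers are illustrative only.
public class SurrogateDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    public static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points (actual characters) in the string.
    public static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so Java
        // stores it as a surrogate pair of two chars.
        String clef = new String(Character.toChars(0x1D11E));
        String[] samples = { "A", "\u00E9", "\u20AC", clef };
        for (String s : samples) {
            System.out.printf("U+%04X: %d char(s), %d code point(s), %d UTF-8 byte(s)%n",
                    s.codePointAt(0), charCount(s), codePointCount(s), utf8Length(s));
        }

        // Walk a string by code point instead of by char, so the
        // surrogate pair is never split in the middle.
        String text = "G" + clef + "!";
        text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

A SAX characters() callback hands you a char[] in exactly this form, so
a supplementary character arrives as two chars. Since a parser may
deliver text in several chunks, it is probably safest not to assume that
both halves of a pair always land in the same callback.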

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org
