> UTF-8 and UTF-16 are character encodings [1], representing the 
> characters defined by Unicode as sequences of bytes. These encodings
> have a representation for every character in Unicode. Like any of 
> the other encodings they're decoded into Java chars on input so it's
> all the same to the consumer of the SAX API regardless of what the 
> document's encoding was.

More specifically: a character whose code point does not fit in a single 
Java char (anything above U+FFFF) takes two chars, a surrogate pair; 
that's how UTF-16 works. (UTF-8 is similar, except that it takes one to 
four bytes per character where UTF-16 takes two or four. Characters up to 
U+FFFF need at most three UTF-8 bytes.)
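
A small sketch to make those sizes concrete, assuming any reasonably 
recent JDK. The class name EncodingLengths and its two helper methods are 
made up for this example; only Character.charCount, Character.toChars and 
String.getBytes are real library calls.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Xerces or SAX.
class EncodingLengths {
    // Number of Java chars (UTF-16 code units) needed for one code point.
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Number of bytes the same code point occupies when encoded as UTF-8.
    // Only meaningful for real characters, not for lone surrogate values.
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'A', e-acute, the euro sign, and a musical symbol above U+FFFF.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1D11E};
        for (int cp : samples) {
            System.out.printf("U+%04X: %d char(s), %d UTF-8 byte(s)%n",
                    cp, utf16Units(cp), utf8Bytes(cp));
        }
    }
}
```

Only the last sample needs two chars; in UTF-8 the four samples need one, 
two, three and four bytes respectively.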

Yes, this means that full Unicode string manipulation in Java is more 
complex than just moving individual chars around. Luckily, most 
alphabetic scripts fit entirely in the range U+0000 to U+FFFF, so they 
need only one char per character. (The values 0xD800 through 0xDFFF are 
reserved as surrogates, which signal that two chars together form one 
character.)
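
For what "more complex" looks like in practice, here is a minimal sketch 
of working by code point rather than by char. The class CodePointDemo and 
its methods are invented for illustration; codePointBefore, 
codePointCount, appendCodePoint and Character.charCount are standard 
java.lang methods (Java 5 and later).

```java
// Hypothetical demo class, not part of Xerces or SAX.
class CodePointDemo {
    // Count characters (code points), not chars (UTF-16 code units).
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Reverse a string one code point at a time so that a surrogate
    // pair is never split or swapped into an invalid order.
    static String reverseByCodePoint(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = s.length();
        while (i > 0) {
            int cp = s.codePointBefore(i);   // reads 1 or 2 chars ending at i
            out.appendCodePoint(cp);
            i -= Character.charCount(cp);    // step back by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'a', U+1D11E written as a surrogate pair, 'b'
        String s = "a\uD834\uDD1Eb";
        System.out.println(s.length());          // chars
        System.out.println(countCodePoints(s));  // characters
        System.out.println(reverseByCodePoint(s).equals("b\uD834\uDD1Ea"));
    }
}
```

A naive char-by-char reversal of the same string would emit the low 
surrogate before the high one and produce malformed UTF-16.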

Note that this is general Java behavior, nothing unique to Xerces.
