> UTF-8 and UTF-16 are character encodings [1], representing the
> characters defined by Unicode as sequences of bytes. These encodings
> have a representation for every character in Unicode. Like any of
> the other encodings they're decoded into Java chars on input so it's
> all the same to the consumer of the SAX API regardless of what the
> document's encoding was.
More specifically: a character whose code point is above U+FFFF does not fit in a single Java char and takes two chars (a surrogate pair); that's how UTF-16 works. (UTF-8 is similar, except that it takes one to four bytes per character to cover the same range of values, where UTF-16 takes two or four.) Yes, this means that full Unicode string manipulation in Java is more complex than just moving individual chars around. Luckily, most languages in current use rarely need to go beyond 16 bits per character: nearly all of their characters sit in the Basic Multilingual Plane (U+0000 through U+FFFF) and fit in one char. (There is no reserved high bit; instead the 16-bit values U+D800 through U+DFFF are set aside as surrogates, and a value in that range signals that it is half of a two-char pair.) Note that this is general Java behavior, nothing unique to Xerces.
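
To make the char versus character distinction concrete, here is a minimal sketch using only standard java.lang methods. The class name SurrogateDemo, its two helper methods, and the sample code point U+1D538 are my own picks for illustration; nothing here comes from Xerces or the SAX API.

```java
// Hypothetical demo class; the name and the sample string are illustrative only.
class SurrogateDemo {

    // Number of Unicode code points in s. Compare with s.length(),
    // which counts 16-bit Java chars (UTF-16 code units).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Reverses s one code point at a time, so the two halves of a
    // surrogate pair are never swapped or separated.
    static String reverseByCodePoint(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = s.length();
        while (i > 0) {
            int cp = s.codePointBefore(i);   // reads a whole pair if there is one
            out.appendCodePoint(cp);
            i -= Character.charCount(cp);    // step back 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D538 is above U+FFFF, so it needs two chars in a Java String.
        String s = "A" + new String(Character.toChars(0x1D538));

        System.out.println(s.length());                           // 3 chars
        System.out.println(codePoints(s));                        // 2 characters
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(reverseByCodePoint(s).charAt(2));      // A
    }
}
```

The same applies to text handed to you by a parser: if you index or split it char by char, check for surrogates (Character.isHighSurrogate / isLowSurrogate) or work in code points as above.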