kesh...@us.ibm.com wrote on 04/23/2009 08:02:18 PM:

> > UTF-8 and UTF-16 are character encodings [1], representing the
> > characters defined by Unicode as sequences of bytes. These encodings
> > have a representation for every character in Unicode. Like any of
> > the other encodings they're decoded into Java chars on input so it's
> > all the same to the consumer of the SAX API regardless of what the
> > document's encoding was.
>
> More specifically: Characters too long to represent in a single java
> char will take two chars; that's how UTF-16 works. (UTF-8 is
> similar, except that it takes one, two, or three bytes

or four bytes (0x10000 - 0x10FFFF).

> to cover the
> same range of values rather than UTF16's two or four.)
>
> Yes, this means that full unicode string manipulation in Java is
> more complex than just moving individual chars around. Luckily, most
> alphabetical languages don't need to go over 15 bits per character.
> (The high bit is reserved for signalling when more bits are needed.)
>
> Note that this is general Java behavior, nothing unique to Xerces.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org
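
For anyone who wants to see the char counts discussed above, here is a minimal Java sketch. It is only an illustration, not anything taken from Xerces: the class name SurrogateDemo and its helper methods are made up for this example, and it assumes Java 7 or later for StandardCharsets. The sample string holds one character from each UTF-8 length class (one, two, three, and four bytes); the last one, U+1D11E, lies above 0xFFFF and so occupies two Java chars (a surrogate pair).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class SurrogateDemo {

    // Number of Java chars (UTF-16 code units) in the string.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+0041, U+00E9, U+20AC, and U+1D11E (outside the 16-bit range).
        String s = "A\u00E9\u20AC" + new String(Character.toChars(0x1D11E));

        System.out.println("chars:       " + utf16Units(s));
        System.out.println("code points: " + codePoints(s));
        System.out.println("UTF-8 bytes: " + utf8Bytes(s));

        // Walk the string one code point at a time instead of one char at a time.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int width = Character.charCount(cp); // 1, or 2 for a surrogate pair
            System.out.printf("U+%04X uses %d char(s)%n", cp, width);
            i += width;
        }
    }
}
```

The loop advances by Character.charCount(cp) rather than by 1, which is the usual way to avoid splitting a surrogate pair when processing text received through character callbacks.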