Hi Raimon,

Raimon Bosch <raimon.bo...@gmail.com> wrote on 04/23/2009 06:59:42 PM:

> I see that characters method is always interpreting the characters as 16-bit
> characters, because is an array of type char. How Xerces manage the
> non-16-bit characters? For example, in UTF8 there is a lot of characters
> between 16 and 32 bits.
>
> If I found a char outside the 16 bit UTF-8 range, can I suppose that it is
> not an UTF-8 character?

UTF-8 and UTF-16 are character encodings [1], representing the characters defined by Unicode as sequences of bytes. These encodings have a representation for every character in Unicode. Like any of the other encodings they're decoded into Java chars on input so it's all the same to the consumer of the SAX API regardless of what the document's encoding was.

Thanks.

[1] http://en.wikipedia.org/wiki/Character_encoding

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org
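
To illustrate the point about Java chars: a character that needs four bytes in UTF-8 lies outside the 16-bit range, and in a Java char array it shows up as two chars (a surrogate pair) rather than one. The sketch below walks a (char[], int, int) slice of the same shape that the SAX characters callback receives and reassembles the Unicode code points. The class and method names (CodePointDemo, codePoints) are made up for illustration and are not part of Xerces or SAX. It also assumes that a surrogate pair is not split across two slices; an unpaired surrogate would simply be reported as is.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Xerces or SAX.
class CodePointDemo {

    // Collects the Unicode code points in ch[start .. start + length),
    // the same argument shape that ContentHandler.characters receives.
    static List<Integer> codePoints(char[] ch, int start, int length) {
        List<Integer> result = new ArrayList<>();
        int end = start + length;
        int i = start;
        while (i < end) {
            // Joins a high/low surrogate pair into one code point if one starts at i.
            int cp = Character.codePointAt(ch, i, end);
            result.add(cp);
            // Advance by 1 char for a 16-bit character, 2 for a surrogate pair.
            i += Character.charCount(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        // "a" followed by U+1D11E (MUSICAL SYMBOL G CLEF), which takes
        // 4 bytes in UTF-8 and 2 chars in a Java char array.
        char[] buf = "a\uD834\uDD1E".toCharArray();
        for (int cp : codePoints(buf, 0, buf.length)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Here the array holds three chars but only two characters, so seeing a char value in the surrogate range (0xD800 to 0xDFFF) does not mean the input was invalid UTF-8; it is half of a character above U+FFFF.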