I like to break characters into two concepts: character sets and character encodings.
Unicode is a character set. UTF-8, UTF-16, etc. are encodings of the Unicode set. ISO-8859-1 is a character set and a character encoding.

On Thu, Apr 23, 2009 at 6:23 PM, Michael Glavassevich <mrgla...@ca.ibm.com> wrote:
> Hi Raimon,
>
> Raimon Bosch <raimon.bo...@gmail.com> wrote on 04/23/2009 06:59:42 PM:
>
>> I see that characters method is always interpreting the characters as
>> 16-bit characters, because is an array of type char. How Xerces manage
>> the non-16-bit characters? For example, in UTF8 there is a lot of
>> characters between 16 and 32 bits.
>>
>> If I found a char outside the 16 bit UTF-8 range, can I suppose that it
>> is not an UTF-8 character?
>
> UTF-8 and UTF-16 are character encodings [1], representing the characters
> defined by Unicode as sequences of bytes. These encodings have a
> representation for every character in Unicode. Like any of the other
> encodings they're decoded into Java chars on input so it's all the same to
> the consumer of the SAX API regardless of what the document's encoding was.
>
> Thanks.
>
> [1] http://en.wikipedia.org/wiki/Character_encoding
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrgla...@ca.ibm.com
> E-mail: mrgla...@apache.org
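
For what it's worth, the question about characters beyond 16 bits can be made concrete with a small sketch. A Java char is a 16-bit UTF-16 code unit, so a Unicode character above U+FFFF shows up as two chars (a surrogate pair) no matter which encoding the document used on disk. The class and method names below are made up for illustration and are not part of Xerces or SAX; the last helper only mimics the (char[], int, int) shape that the SAX characters callback receives.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Xerces or SAX.
public class SupplementaryCharDemo {

    // Number of UTF-16 code units (Java chars) in the string.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points, i.e. characters in the Unicode sense.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Counts code points in a char[] slice, the same argument shape
    // that a SAX characters(char[] ch, int start, int length) call gets.
    static int codePointsInSlice(char[] ch, int start, int length) {
        return Character.codePointCount(ch, start, length);
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) does not fit in 16 bits.
        String clef = new String(Character.toChars(0x1D11E));
        // U+00E9 (e with acute accent) fits in a single char.
        String eAcute = "\u00E9";

        System.out.println(utf16Units(clef));   // 2: a surrogate pair
        System.out.println(codePoints(clef));   // 1: still one character
        System.out.println(utf8Bytes(clef));    // 4: four bytes in UTF-8

        System.out.println(utf16Units(eAcute)); // 1
        System.out.println(utf8Bytes(eAcute));  // 2

        // The first char of the pair is a high surrogate, not a character
        // on its own.
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
    }
}
```

So a char value that looks odd in the array is not a sign that the text "is not UTF-8"; by the time a handler sees the data it is UTF-16 code units, and Character.isHighSurrogate / Character.isLowSurrogate can be used to spot the pairs.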