Hi,

References to surrogates are not allowed in XML documents.
Here's the range of allowed characters in the XML 1.0 specification [1]:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Surrogate pairs are used to represent characters in the [#x10000-#x10FFFF] range: code points in the supplementary planes. You need a reference to one of these instead. You can use java.lang.Character.toCodePoint(char high, char low) to compute the code point value.

Thanks.

[1] http://www.w3.org/TR/2008/REC-xml-20081126/#charsets

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org

Ilya Sokolov <ilya_soko...@symantec.com> wrote on 08/22/2014 03:23:57 PM:

> Hi!
>
> I have an issue parsing XML containing Unicode strings with
> surrogate characters (Xerces 2.11.0). The following exception is thrown:
>
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber:
> 18; Character reference "&#xd840" is an invalid XML character.
>     at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>
> Simple code to reproduce the issue:
>
> byte[] enc1 = new byte[] {(byte)0xd8, 0x40, (byte)0xdc, 0x2a};
> String result = new String(enc1, "UTF-16");
> System.out.println(result); // Outputs 𠀪 correctly
>
> String saml="<name>lz1&#xd840;&#xdc2a;.cct.cm</name>";
> DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
> DocumentBuilder builder = factory.newDocumentBuilder();
> Document document= builder.parse(new InputSource(new StringReader(
> saml))); // Throws exception
>
> Do I parse the XML correctly?
>
> The XML I parse contains the following string:
> lz1𠀪.cct.cm
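
To illustrate the suggestion in the reply about java.lang.Character.toCodePoint, here is a minimal sketch rather than a definitive fix. It assumes the two surrogates are D840 and DC2A, the values implied by the byte array in the reproduction code. The class name SurrogateRef and the helpers toCharRef and parseName are hypothetical names made up for this example; the parser is whatever JAXP implementation is on the classpath.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SurrogateRef {
    // Hypothetical helper: combines a UTF-16 surrogate pair into a single
    // supplementary code point and formats it as a hex character reference.
    static String toCharRef(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        int codePoint = Character.toCodePoint(high, low);
        return String.format("&#x%X;", codePoint);
    }

    // Hypothetical helper: parses the document and returns the text
    // content of its root element.
    static String parseName(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        return document.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Assumed surrogate values, taken from the byte array in the question.
        String ref = toCharRef('\uD840', '\uDC2A');
        System.out.println(ref);

        // One reference to the supplementary code point replaces the two
        // references to the individual surrogates.
        String text = parseName("<name>lz1" + ref + ".cct.cm</name>");
        System.out.println(Integer.toHexString(text.codePointAt(3)));
    }
}
```

In the parsed string the character is still stored as a surrogate pair, since Java strings are UTF-16. The restriction applies only to how the character is written as a reference in the XML source.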