Re: Problem with surrogate characters

Michael Glavassevich Tue, 26 Aug 2014 13:26:56 -0700

Hi,

Character references to individual surrogates, such as the &#55360; and 
&#56362; in your document, are not allowed in XML documents.


Here's the range of allowed characters in the XML 1.0 specification [1]:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
[#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate 
blocks, FFFE, and FFFF. */
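
As a rough illustration of that production, the check below tests a code 
point against the same ranges. The class and method names (XmlChars, 
isXmlChar) are made up for this sketch and are not part of Xerces or the JDK.

```java
public class XmlChars {
    // Hypothetical helper: mirrors the XML 1.0 Char production quoted above.
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // 55360 is 0xD840, a high surrogate, so it falls outside every range.
        System.out.println(isXmlChar(55360));   // false
        // 0x2002A is a supplementary-plane code point, so it is allowed.
        System.out.println(isXmlChar(0x2002A)); // true
    }
}
```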

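Surrogate pairs are how UTF-16 represents characters in the 
[#x10000-#x10FFFF] range, i.e. code points in the supplementary planes. 
You need a single reference to one of these code points instead of two 
references to the surrogates. You can use 
java.lang.Character.toCodePoint(char high, char low) to compute the code 
point value. For the pair in your example (0xD840, 0xDC2A) that gives 
0x2002A, so the reference would be &#131114; (or &#x2002A;).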
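
A minimal sketch of that conversion, assuming the surrogate values from the 
reported document (0xD840 and 0xDC2A). The class SurrogateRef and the method 
toCharRef are hypothetical names for illustration; only 
Character.isSurrogatePair, Character.toCodePoint and the standard JAXP 
classes are real APIs. The sketch uses whatever parser the JAXP factory 
returns, which may not be Xerces 2.11.0.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SurrogateRef {
    // Hypothetical helper: turns a UTF-16 surrogate pair into one decimal
    // character reference to the supplementary code point.
    public static String toCharRef(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return "&#" + Character.toCodePoint(high, low) + ";";
    }

    public static void main(String[] args) throws Exception {
        // The two values that appeared as separate references in the report.
        String ref = toCharRef((char) 0xD840, (char) 0xDC2A);
        String xml = "<name>lz1" + ref + ".cct.cm</name>";

        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // After parsing, the text holds the character as a surrogate pair
        // again, starting at index 3 (right after "lz1").
        String text = doc.getDocumentElement().getTextContent();
        System.out.println(ref);
        System.out.println(text.codePointAt(3) == 0x2002A);
    }
}
```

If the references are produced by your own serializer, it may be simpler to 
iterate over the string by code point (String.codePointAt) rather than by 
char, so that a pair is never split in the first place.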

Thanks.

[1] http://www.w3.org/TR/2008/REC-xml-20081126/#charsets

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org

Ilya Sokolov <ilya_soko...@symantec.com> wrote on 08/22/2014 03:23:57 PM:

> Hi!
> 
> I have an issue parsing XML containing Unicode strings with 
> surrogate characters (Xerces 2.11.0). The following exception is thrown:
> 
>         org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 18; Character reference "&#55360" is an invalid XML character.
>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> 
> Simple code to reproduce the issue:
> 
>          byte[] enc1 = new byte[] {(byte)0xd8, 0x40, (byte)0xdc, 0x2a};
>          String result = new String(enc1, "UTF-16");
>          System.out.println(result); // Outputs 𠀪 correctly
> 
>          String saml="<name>lz1&#55360;&#56362;.cct.cm</name>";
>          DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
>          DocumentBuilder builder = factory.newDocumentBuilder();
>          Document document = builder.parse(new InputSource(new StringReader(saml))); // Throws exception
> 
> 
> Do I parse the XML correctly?
> 
> The XML I parse contains the following string:
> lz1𠀪.cct.cm
