I'm trying to figure out whether this is a bug. I created a DOM with
an element containing a CDATA section and set its value to a String
that includes a division symbol (0xF7). (I actually do this by reading
the characters from a file and converting the bytes to a String,
specifying the Windows-1252 encoding.) When I serialize this DOM out to
a String, a byte array, or anything else, the CDATA section is split
around the division symbol, and the division symbol is emitted as a
character reference (&#xF7;). I do request UTF-8 when serializing.
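
A minimal sketch of the setup described above, for anyone who wants to
try it. It uses the JDK's DOM Level 3 Load and Save serializer as a
stand-in, so it may not behave the same way as the Xerces-J serializer.
The class name, method names, element name "root", and sample text are
made up for illustration.

```java
import java.io.StringWriter;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical reproduction class; not the code from the original report.
class CdataDivisionSign {

    // Decode raw bytes as Windows-1252, as when reading the text from a file.
    static String decode(byte[] raw) {
        return new String(raw, Charset.forName("windows-1252"));
    }

    // Build <root><![CDATA[text]]></root> and serialize it, requesting UTF-8.
    static String serialize(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("root");
        root.appendChild(doc.createCDATASection(text));
        doc.appendChild(root);

        // Assumption: the standard DOM Level 3 LS serializer, used as a stand-in.
        DOMImplementationLS ls = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        LSSerializer serializer = ls.createLSSerializer();
        LSOutput out = ls.createLSOutput();
        out.setEncoding("UTF-8");
        StringWriter writer = new StringWriter();
        out.setCharacterStream(writer);
        serializer.write(doc, out);
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // 0xF7 is the division symbol in Windows-1252.
        String text = decode(new byte[] {'6', ' ', (byte) 0xF7, ' ', '3'});
        System.out.println(serialize(text));
    }
}
```

If the split occurs, the output contains two CDATA sections with the
character reference between them; otherwise the division symbol appears
literally inside a single CDATA section.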
I see in the documentation that this is the correct behavior when the
serializer encounters a Unicode character that isn't recognized; I'm not
sure whether that means not recognized in its Unicode (internal) form or
that it has no UTF-8 equivalent. But U+00F7 seems to be the correct
Unicode code point for a division symbol, and there is a UTF-8 encoding
for it. Other "special" characters seem to serialize to UTF-8 without
this split.
I can send code. I've tried this on the latest Xerces-J. Anyone have any
thoughts about it?
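
On the point that U+00F7 has a UTF-8 encoding, here is a quick check
using only the standard library. The class and method names are
hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class DivisionSignUtf8 {

    // Return the UTF-8 bytes of a string as space-separated hex pairs.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00F7 (division symbol) encodes as the two bytes C3 B7.
        System.out.println(utf8Hex("\u00F7")); // prints "C3 B7"
    }
}
```

So the character itself is representable in UTF-8 as a two-byte
sequence.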
Thanks,
Steve Carton