Hi Steve,

Do you have serializer.jar (containing the LSSerializer from Xalan) on your classpath? I can only reproduce this with Xerces' own implementation of LSSerializer, which, I might add, is also deprecated.
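One quick way to see which implementation is actually being picked up is to print the class name of the serializer that the registry hands back. This is only a minimal sketch: the class name `SerializerCheck` and the method `serializerClassName` are made up for illustration, it relies on the default registry lookup rather than forcing a particular DOMImplementationSource, and the name it reports depends entirely on what is on the classpath.

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

public class SerializerCheck {

    /** Returns the fully qualified class name of the LSSerializer in use. */
    public static String serializerClassName() {
        try {
            // Default lookup; no DOMImplementationRegistry.PROPERTY override here.
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementation domImpl = registry.getDOMImplementation("LS");
            if (domImpl == null) {
                throw new IllegalStateException("No DOM implementation with LS support found");
            }
            DOMImplementationLS implLS = (DOMImplementationLS) domImpl;
            LSSerializer serializer = implLS.createLSSerializer();
            return serializer.getClass().getName();
        } catch (ReflectiveOperationException e) {
            // newInstance() declares several checked reflection exceptions.
            throw new IllegalStateException("Could not bootstrap the DOM registry", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("LSSerializer implementation: " + serializerClassName());
    }
}
```

If I remember the packaging correctly, the Xalan serializer lives under `org.apache.xml.serializer` and the deprecated Xerces one under `org.apache.xml.serialize`, so the printed name should indicate which of the two is in play. Treat that mapping as an assumption to verify against your jars.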
Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [EMAIL PROTECTED]

"Steve Carton" <[EMAIL PROTECTED]> wrote on 11/08/2007 11:48:16 AM:

> Hi Michael,
>
> I've fooled with this in several forms, always with the same
> results. My current incarnation of the code uses the LSSerializer
> API. I've also used the (deprecated) XMLSerializer. In either case,
> I've tried StringWriter, FileWriter, and ByteArrayOutputStream (then
> to a FileOutputStream to write to a file). I specify UTF-8 as the
> output encoding. Here's a snippet of the code:
>
> System.setProperty(DOMImplementationRegistry.PROPERTY,
>     "org.apache.xerces.dom.DOMImplementationSourceImpl");
> DOMImplementationRegistry registry =
>     DOMImplementationRegistry.newInstance();
> DOMImplementation domImpl = registry.getDOMImplementation("LS 3.0");
> DOMImplementationLS implLS = (DOMImplementationLS)domImpl;
> LSSerializer dom3Writer = implLS.createLSSerializer();
> LSOutput output=implLS.createLSOutput();
> ByteArrayOutputStream bs = new ByteArrayOutputStream();
> output.setByteStream(bs);
> output.setEncoding("UTF-8");
> dom3Writer.write(doc,output);
>
> Here's what gets written to a file from that byte stream:
>
> <test><div>¦º3 times: ÷ ÷ ÷º¬</div><divCDATA><![CDATA[¦º3 times: ]]>÷<![CDATA[ ]]>÷<![CDATA[ ]]>÷<![CDATA[º¬]]></divCDATA></test>
>
> Note that the serialized element that is *not* a CDATA section
> converts the division symbol to UTF-8 without a problem.
>
> Steve
>
> -----Original Message-----
> From: Michael Glavassevich [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 07, 2007 11:04 PM
> To: j-users@xerces.apache.org
> Cc: Steve Carton
> Subject: Re: Split CDATA Sections and the division Symbol (x00f7)
>
> Hi Steve,
>
> "Steve Carton" <[EMAIL PROTECTED]> wrote on 11/06/2007
> 04:10:45 PM:
>
> > I'm trying to figure out if this is a bug or not.
> > I created a DOM with an element with a CDATA section and I set the
> > value to a String of characters which include a division symbol
> > (xF7). (I actually do this by reading the characters in from a file
> > and converting them from bytes to a String specifying a Windows-1252
> > encoding.) When I serialize this DOM out to a String, byte array or
> > anything else, the CDATA section is split around the division symbol
> > and the division symbol is emitted as an entity (÷). I do try to
> > serialize this as UTF-8.
>
> Some questions ...
>
> What API are you using for serialization? Are you specifying an
> output encoding? What type of output are you writing to? A
> java.io.OutputStream? A java.io.Writer?
>
> > I see in the documentation that this is the correct behavior when the
> > serializer encounters a Unicode character that isn't recognized; not
> > sure if this means not recognized in the Unicode (internal) form or
> > there is no UTF-8 equivalent. But x00F7 seems to be the correct
> > Unicode value for a division symbol and there is a UTF-8 encoding for
> > it. Other "special" characters seem to serialize to UTF-8 without
> > this split.
>
> I think what you meant to say here is "not expressible in the output
> encoding". For instance ASCII is only capable of representing
> Unicode code points from 0x00-0x7F. 0xF7 isn't representable in ASCII.
>
> > I can send code. I've tried this on the latest Xerces-J. Anyone have
> > any thoughts about it?
> >
> > Thanks,
> >
> > Steve Carton
>
> Thanks.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: [EMAIL PROTECTED]
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
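
To make the "not expressible in the output encoding" point from the thread concrete, here is a minimal sketch that uses only the JDK's java.nio.charset classes and does not touch any serializer. The class and method names (`DivisionSignEncoding`, `decodeDivisionSign`, `expressibleIn`, `utf8Hex`) are made up for illustration. It decodes the byte 0xF7 as Windows-1252, as described in the original report, then asks each charset's encoder whether it can represent the resulting character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DivisionSignEncoding {

    /** Decodes the single byte 0xF7 as Windows-1252, giving U+00F7. */
    public static String decodeDivisionSign() {
        return new String(new byte[] {(byte) 0xF7}, Charset.forName("windows-1252"));
    }

    /** True if every character of the text can be encoded in the given charset. */
    public static boolean expressibleIn(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    /** Returns the UTF-8 bytes of the text as upper-case hex digits. */
    public static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String division = decodeDivisionSign();
        System.out.println("US-ASCII: " + expressibleIn(division, StandardCharsets.US_ASCII)); // false
        System.out.println("UTF-8:    " + expressibleIn(division, StandardCharsets.UTF_8));    // true
        System.out.println("UTF-8 bytes: " + utf8Hex(division));                               // C3B7
    }
}
```

Since U+00F7 is expressible in UTF-8, a serializer that really is writing UTF-8 should have no encoding-related reason to split the CDATA section around it. This sketch only checks the charset side of that argument, not what any particular LSSerializer does.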