"Kurt Klinner" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Hello, > > while trying to parse a "large" XML document i found a > strange behaviour of the Parser Module(s) (XML::Parser:PerlSAX, > XML::Parser, XML::Parser::Expat > > If my file XML file is larger then 65536 bytes > the actual character string is interrupted and a whitespace > is added. > > For Example > > <DATASET> > <DATA><![CDATA["NOVDEC_B"]]></DATA> > <DATA><![CDATA["November\December"]]></DATA> > <DATA><![CDATA["Nov\Dec"]]></DATA> > <DATA><![CDATA["01.11."]]></DATA> > <DATA><![CDATA[11]]></DATA> > <DATA><![CDATA["begin_2month"]]></DATA> > <DATA><![CDATA[11]]></DATA> > </DATASET> > > if now "Novemver\December" is at the 65536 border the String is > splitted in "Nov WHITESPACE ember\December"

Hi Kurt,

Not sure if this is your problem, but it is something that often trips
people up. If you're using a Char handler to parse your XML, you might be
surprised to learn that a single call won't always receive the full text
of a CDATA section like the ones you describe. Sometimes the handler is
called twice: first with the first half of the data, and again with the
second half. Your code needs to cope with this behaviour, for example by
appending each piece to a buffer and only using the result once the
element has ended.

To quote the XML::Parser documentation:

...
    Char (Expat, String)

    This event is generated when non-markup is recognized. The
    non-markup sequence of characters is in String. A single
    non-markup sequence of characters may generate multiple calls to
    this handler. Whatever the encoding of the string in the original
    document, this is given to the handler in UTF-8.
...

Are you sure you're joining your CDATA pieces correctly? Post your code
if this hasn't helped.

Cheers,

Rob

> Any ideas how to avoid/fix that problem?
>
> Thanks in advance
>
> Regards
>
> Kurt
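
For what it's worth, here is a minimal sketch of the buffering approach the reply describes. It is not Kurt's code (that was never posted); the sub name parse_data_values and the variables @values and $buffer are made up for illustration, and it assumes the plain XML::Parser handler interface (Start, Char, End) with the <DATA> element from the example. The key point is that the Char handler appends to a buffer instead of assigning, and the value is only used in the End handler.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: return the text of every <DATA> element,
# joining the pieces that the Char handler may deliver in several calls.
sub parse_data_values {
    my ($xml) = @_;
    my @values;    # completed <DATA> values, in document order
    my $buffer;    # text gathered so far; undef while outside <DATA>

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element) = @_;
                # Begin a fresh buffer for each <DATA> element.
                $buffer = '' if $element eq 'DATA';
            },
            Char => sub {
                my ($expat, $string) = @_;
                # Append, never assign: one text run may arrive in pieces.
                $buffer .= $string if defined $buffer;
            },
            End => sub {
                my ($expat, $element) = @_;
                if ($element eq 'DATA') {
                    # Only now is the text known to be complete.
                    push @values, $buffer;
                    undef $buffer;
                }
            },
        },
    );

    $parser->parse($xml);
    return @values;
}

# Small usage example based on the data in the original post.
my $xml = '<DATASET>'
        . '<DATA><![CDATA["November\December"]]></DATA>'
        . '<DATA><![CDATA[11]]></DATA>'
        . '</DATASET>';

print "$_\n" for parse_data_values($xml);
```

If the handler did `$buffer = $string` instead of `$buffer .= $string`, any value delivered in two calls would keep only its last piece, which looks a lot like the truncated string in the original report.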