Re: Parlsax parsing problem

Rob Anderson Fri, 22 Aug 2003 06:07:27 -0700

"Kurt Klinner" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Hello,
>
> while trying to parse a "large" XML document I found some
> strange behaviour in the parser modules (XML::Parser::PerlSAX,
> XML::Parser, XML::Parser::Expat).
>
> If my XML file is larger than 65536 bytes,
> the character string at that position is interrupted and whitespace
> is added.
>
> For example:
>
> <DATASET>
> <DATA><![CDATA["NOVDEC_B"]]></DATA>
> <DATA><![CDATA["November\December"]]></DATA>
> <DATA><![CDATA["Nov\Dec"]]></DATA>
> <DATA><![CDATA["01.11."]]></DATA>
> <DATA><![CDATA[11]]></DATA>
> <DATA><![CDATA["begin_2month"]]></DATA>
> <DATA><![CDATA[11]]></DATA>
> </DATASET>
>
> If "November\December" now falls on the 65536-byte boundary, the string
> is split into "Nov WHITESPACE ember\December".


Hi Kurt,

Not sure if this is your problem, but it is something that often trips
people up. If you're using a Char handler for parsing your XML, you might
be surprised to learn that it won't always receive the full text of a CDATA
section like the one you describe. Sometimes it will be called twice, first
with the first half of the data and then again with the second half. Your
code needs to cope with this behaviour by appending each piece to a buffer
until the element ends.

To quote the XML::Parser documentation:

...
Char (Expat, String)

This event is generated when non-markup is recognized. The non-markup
sequence of characters is in String. A single non-markup sequence of
characters may generate multiple calls to this handler. Whatever the
encoding of the string in the original document, this is given to the
handler in UTF-8.
...

Are you sure you're joining the pieces of your CDATA correctly?
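
For what it's worth, here is a minimal sketch of the kind of joining I
mean. It is not your code, just an illustration: it assumes you only care
about the text of the <DATA> elements from your example, and collect_data
is a name I made up. The Char handler only appends to a buffer, and the
value is treated as complete in the End handler.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: return the complete text of every <DATA> element,
# however many Char events the parser delivers for each one.
sub collect_data {
    my ($xml) = @_;
    my @values;    # finished <DATA> strings
    my $buffer;    # undef while we are outside a <DATA> element

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element) = @_;
                # Start a fresh buffer for each <DATA> element.
                $buffer = '' if $element eq 'DATA';
            },
            Char => sub {
                my ($expat, $string) = @_;
                # Append only; one call is not guaranteed to be the whole value.
                $buffer .= $string if defined $buffer;
            },
            End => sub {
                my ($expat, $element) = @_;
                return unless $element eq 'DATA';
                # Only here do we know the value is complete.
                push @values, $buffer;
                undef $buffer;
            },
        },
    );

    $parser->parse($xml);
    return @values;
}

my @values = collect_data(
    '<DATASET><DATA><![CDATA["November\December"]]></DATA></DATASET>'
);
print "$_\n" for @values;
```

Written this way, it should not matter whether the string arrives in one
Char call or in several, so a value that straddles the 65536-byte border
ought to come out in one piece.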

Post your code if this hasn't helped.

Cheers, Rob

> Any ideas how to avoid/fix that problem?
>
>
> Thanks in advance
>
> Regards
>
> Kurt


