I have looked into the libxml2 code and found the function
htmlParseScript() in HTMLparser.c.

https://gitlab.gnome.org/GNOME/libxml2/blob/master/HTMLparser.c

It describes the problem with the "<" character inside scripts, but it
also offers the option of using recover mode to ignore the tags.

I have used

xmllint --html --htmlout --recover mypage.html

and the output then contains the last </td> tag. The PHP equivalent does
not work (the DOMDocument class has a "recover" flag, but the output is
always the same). So I will look into the DOMDocument code (if it is
available).
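
A possible workaround in the direction Eric suggests, as a rough sketch
only: preprocess the page so that every "</" inside a script block is
written as "<\/". JavaScript reads both as the same string, and the
escaped form should no longer look like an end tag to an HTML parser.
The function name escape_end_tags_in_scripts is hypothetical (it is not
part of libxml2 or DOMDocument), the regex assumes simple, non-nested
<script> elements, and I have not checked how libxml2 2.9.1 handles the
result.

```python
import re

# Matches one <script ...> ... </script> element (assumption: no nesting,
# and the first "</script" really ends the element).
_SCRIPT_RE = re.compile(
    r"(<script\b[^>]*>)(.*?)(</script\s*>)",
    re.IGNORECASE | re.DOTALL,
)

# "</" followed by a letter is what looks like an end tag inside a script.
_END_TAG_START_RE = re.compile(r"</(?=[A-Za-z])")


def escape_end_tags_in_scripts(html: str) -> str:
    """Rewrite "</x" as "<\\/x" inside script bodies only."""

    def fix(match: "re.Match[str]") -> str:
        open_tag, body, close_tag = match.groups()
        # A callable replacement avoids backslash handling in templates.
        escaped = _END_TAG_START_RE.sub(lambda _m: "<\\/", body)
        return open_tag + escaped + close_tag

    return _SCRIPT_RE.sub(fix, html)


if __name__ == "__main__":
    page = "<td><script>printwin.document.writeln('</td>');</script></td>"
    print(escape_end_tags_in_scripts(page))
```

Markup outside the script element is left untouched; only the string
literal inside the script changes.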

~André

On 18.08.2018 00:33, Eric S Eberhard wrote:
> I could be way off base -- don't you have to encode the portions in the js?  
> Otherwise I can see it being confused.  The js looks like data and it can't 
> have < or > in it.
> 
> https://stackoverflow.com/questions/1398571/html-inside-xml-should-i-use-cdata-or-encode-the-html
> 
> Eric
> 
> 
> -----Original Message-----
> From: xml [mailto:xml-boun...@gnome.org] On Behalf Of André Rothe
> Sent: Friday, August 17, 2018 5:43 AM
> To: xml@gnome.org
> Subject: [xml] Error on parsing HTML with libxml
> 
> Hi,
> 
> I ran into an HTML parser problem during PHP development. The DOMDocument
> class uses libxml2 to parse HTML and XML documents. I found that there is
> a problem with HTML documents containing inline JavaScript code that uses
> HTML tags inside JavaScript string variables.
> 
> Here is a small code example that shows the problem:
> 
> https://3v4l.org/O0iEf
> 
> As you can see there, the last </td> tag is lost in the output.
> I get exactly the same error with xmllint:
> 
> xmllint --html --htmlout /tmp/page.html
> 
> where page.html contains the HTML part of the example code above. The output 
> is
> 
> page.html:11: HTML parser error : Unexpected end tag : td
>         printwin.document.writeln('</td>');
> 
> and in the output, the string is empty:
> 
> printwin.document.writeln('');
> 
> So I think the PHP error comes from this error in libxml2. I use
> libxml2 version 2.9.1.
> 
> Is it possible to fix that, or is it already fixed in a newer version?
> 
> Best regards
> André
> 
> _______________________________________________
> xml mailing list, project page  http://xmlsoft.org/ xml@gnome.org 
> https://mail.gnome.org/mailman/listinfo/xml
> 
> 
