Re: [fpc-pascal] XML DOM and HTML

Sebastian Günther Fri, 20 Jun 2008 19:04:53 -0700

Johannes Nohl wrote:

Dear list,


I played around with the units dom and xmlread. I liked them very
much. Now I thought I could parse websites with them. But the two
formats are slightly different as far as I know. In XML everything is
within a node, while in HTML there is more than one value in a node. E.g.:

possible XML:

<div>
 asdf1
 <span>qwer1</span>
 <span>qwer2</span>
</div>

HTML:
<div>
 asdf1
 <span>qwer1</span>
 asdf2
 <span>qwer2</span>
 asdf3
</div>

Using the XML DOM I can access only the value "asdf1". I think the
second example is not valid XML, is it?

Has anybody used XML to parse HTML-files? Is there a unit?



Yes.

HTML is based on SGML, and XML is a subset of SGML, so you cannot simply
parse any HTML file using an XML parser. Instead of the XML parser, you
can try the HTML parser in packages/fpc-xml/sax_html.pp. It relies on
more or less correct HTML code, but it should be able to parse most
websites.
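
As a rough illustration, here is a minimal sketch of reading the HTML
sample from the question and walking the children of the div. It assumes
the SAX_HTML unit exports ReadHTMLFile with a (THTMLDocument, TStream)
overload and that DOM_HTML provides THTMLDocument, as in current fcl-xml
releases; the FindElement helper and the wrapping html/body tags are made
up for this example, and none of it was checked against the 2008 sources.

```pascal
program MixedContent;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

const
  // The HTML sample from the question, wrapped in html/body for the parser.
  Sample = '<html><body><div>asdf1<span>qwer1</span>asdf2' +
           '<span>qwer2</span>asdf3</div></body></html>';

// Hypothetical helper: depth-first search for the first element whose
// name matches AName, ignoring case.
function FindElement(Node: TDOMNode; const AName: string): TDOMNode;
var
  Child: TDOMNode;
begin
  Result := nil;
  if Node = nil then
    Exit;
  if (Node.NodeType = ELEMENT_NODE) and
     SameText(string(Node.NodeName), AName) then
  begin
    Result := Node;
    Exit;
  end;
  Child := Node.FirstChild;
  while Assigned(Child) and (Result = nil) do
  begin
    Result := FindElement(Child, AName);
    Child := Child.NextSibling;
  end;
end;

var
  Doc: THTMLDocument;
  Stream: TStream;
  DivNode, Node: TDOMNode;
begin
  Doc := nil;
  Stream := TStringStream.Create(Sample);
  try
    // Assumed API: builds a DOM tree from HTML via the SAX-based reader.
    ReadHTMLFile(Doc, Stream);
    DivNode := FindElement(Doc, 'div');
    if Assigned(DivNode) then
    begin
      // Each run of text between the spans is a separate text node,
      // a sibling of the span elements.
      Node := DivNode.FirstChild;
      while Assigned(Node) do
      begin
        if Node.NodeType = TEXT_NODE then
          WriteLn('text:    ', Trim(string(Node.NodeValue)))
        else if Node.NodeType = ELEMENT_NODE then
          WriteLn('element: ', Node.NodeName, ' -> ', Node.TextContent);
        Node := Node.NextSibling;
      end;
    end;
  finally
    Doc.Free;
    Stream.Free;
  end;
end.
```

The loop visits every child of the div through FirstChild and
NextSibling rather than reading a single value from the element, so the
text between the spans is reached as well as the text before them.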



Regards,
Sebastian
_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-pascal
