Johannes Nohl wrote:
Dear list, dear Michael!
There are multiple problems with HTML parsing: HTML is generally not a
well-formed XML document, because
- the tags are case insensitive (in XML they are case sensitive)
- not all tags have to be closed.
If the HTML is XHTML, then the DOM unit can be used to parse it.
But how do I retrieve more than the first part of the node's value?
If I read in:
<div>
asdf1
<span>qwer1</span>
asdf2
<img src="" />
asdf3
</div>
FindNode('div').NodeValue returns "asdf1", but not "asdf2" or "asdf3".
Isn't the example above valid XHTML?
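
For reference, a DOM tree stores mixed content like the <div> above as
separate child nodes: three text nodes (asdf1, asdf2, asdf3) with the <span>
and <img> elements between them. A single NodeValue can therefore hold at most
one of the pieces. Below is a minimal sketch, assuming FPC's DOM and XMLRead
units, that walks the children and joins the text nodes. DirectText is a
hypothetical helper name, and the markup is compacted onto one line so that no
line breaks end up in the text nodes.

```pascal
program MixedContent;
{$mode objfpc}{$H+}

uses
  Classes, DOM, XMLRead;

{ Hypothetical helper: concatenates the values of all text nodes that are
  direct children of Node. Text inside child elements (such as the <span>)
  is skipped. }
function DirectText(Node: TDOMNode): DOMString;
var
  Child: TDOMNode;
begin
  Result := '';
  Child := Node.FirstChild;
  while Child <> nil do
  begin
    if Child.NodeType = TEXT_NODE then
      Result := Result + Child.NodeValue;
    Child := Child.NextSibling;
  end;
end;

var
  Doc: TXMLDocument;
  Src: TStringStream;
begin
  Src := TStringStream.Create(
    '<div>asdf1<span>qwer1</span>asdf2<img src="" />asdf3</div>');
  try
    ReadXMLFile(Doc, Src);  { parse the XHTML fragment as XML }
    try
      WriteLn(DirectText(Doc.DocumentElement));
    finally
      Doc.Free;
    end;
  finally
    Src.Free;
  end;
end.
```

With this input the program should print the three pieces joined together
(asdf1asdf2asdf3). If the text of nested elements is wanted as well, the
TextContent property of TDOMNode may be an alternative, assuming the installed
DOM unit provides it.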
If I were going to parse web pages, I would probably opt to use regular
expressions. I believe FPC includes a regex unit, but I tend to use this one
since it's compatible with both FPC and Delphi:
http://regexpstudio.com/TRegExpr/TRegExpr.html
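
As a rough sketch of the regex approach, the following pulls every run of text
that sits between two tags out of the sample markup from the question. It
assumes the RegExpr unit from the TRegExpr distribution linked above (recent
FPC releases ship a unit of the same name). The pattern is only an
illustration and will not cope with comments, scripts, or attribute values
that contain angle brackets.

```pascal
program ExtractText;
{$mode objfpc}{$H+}

uses
  RegExpr;

var
  Re: TRegExpr;
begin
  Re := TRegExpr.Create;
  try
    { Capture one or more non-'<' characters between a '>' and the next '<'. }
    Re.Expression := '>([^<]+)<';
    if Re.Exec('<div>asdf1<span>qwer1</span>asdf2<img src="" />asdf3</div>') then
      repeat
        WriteLn(Re.Match[1]);  { first capture group: the text run }
      until not Re.ExecNext;   { continue searching after the last match }
  finally
    Re.Free;
  end;
end.
```

Unlike the DOM route, this does not require the page to be well-formed XML,
but it also knows nothing about nesting, so qwer1 from the <span> is reported
alongside asdf1, asdf2 and asdf3.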
--
Warm Regards,
Lee
_______________________________________________
fpc-pascal maillist - fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-pascal