Marco van de Voort wrote:
(maillist maintainer/jonas: I wrote a similar message from a non-subscribed
email addr. It can be discarded, sorry)
I needed an html parser, and am not in a hurry, so I decided to check FPC's
own first, in the hope that I can at least write some documentation or
examples for the wiki along the way.
The first project is simple, see program below, executed on FPC's html
documentation. I noticed that it failed like this:
An unhandled exception occurred at $004284EC :
EDOMError : EDOMError in DOMDocument.CreateElement hr/0
$004284EC
$00411A86 THTMLTODOMCONVERTER__READERSTARTELEMENT, line 500 of
src/sax_html.pp
$0042648A TSAXREADER__DOSTARTELEMENT, line 738 of src/sax.pp
$004110DC THTMLREADER__ENTERNEWSCANNERCONTEXT, line 391 of
src/sax_html.pp
$00410C80 THTMLREADER__PARSE, line 358 of src/sax_html.pp
$0042612C TSAXREADER__PARSESTREAM, line 647 of src/sax.pp
$00411F3D READHTMLFILE, line 609 of src/sax_html.pp
$00411E91 READHTMLFILE, line 593 of src/sax_html.pp
$004015DE main, line 21 of saxattempt.dpr
Some debugging suggests that it fails on <hr/>. The doctype of the document
in question is
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
Some questions for the more xmlable:
1. is this correct? I think <hr/> is more xml notation than html notation?
For html this is not correct, but that file might happen to be xhtml. In
general, FPC's xml parser is much more developed than its html parser,
which is why many FPC tools actually write xhtml.
2. can I somehow convince (override) DOM to accept it? (since modifying the
generator (tex4ht) might prove to be difficult). It could be genera
I think it would be better to fix sax_html.pp so that it either handles
this condition gracefully (strips the '/') or raises an exception. If it
raises an exception, that exception could contain the location information
you need.
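To illustrate the "strip '/'" option, here is a minimal sketch of the
string handling only. StripTrailingSlash is a hypothetical helper name and
not part of sax_html.pp; where such a call would belong in the reader is
not shown here and would need to be checked against the actual source.

```pascal
program stripslash;
{$mode objfpc}{$H+}

{ Hypothetical helper, not part of sax_html.pp: removes one trailing '/'
  from a tag name such as 'hr/' so that it becomes a valid element name. }
function StripTrailingSlash(const TagName: string): string;
begin
  Result := TagName;
  { Leave a lone '/' untouched so the result is never an empty name. }
  if (Length(Result) > 1) and (Result[Length(Result)] = '/') then
    SetLength(Result, Length(Result) - 1);
end;

begin
  WriteLn(StripTrailingSlash('hr/'));  { name as seen in the failing case }
  WriteLn(StripTrailingSlash('hr'));   { already clean, returned unchanged }
end.
```

Both calls print hr, so a name that is already clean passes through
unmodified.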
3. Is there a way to have line numbers in the exceptions? Modifying the
source with writeln's to find out which tag exactly goes wrong is a bit
ugly.
The exceptions generated by the parser contain this information (for xml,
this is EXMLReadError.Line and EXMLReadError.LinePos). sax_html seems not
to generate exceptions at all :(
The exceptions raised from DOM methods (like CreateElement) do not have
location information because these methods are primarily intended for
building a DOM tree from code, when there is no source file.
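For the xml side, a minimal sketch of reading that location information
could look like the following. It assumes the input is well-formed enough
to be fed to the xml reader (for example xhtml output); the program name
and the output format are made up for illustration.

```pascal
program xmlerrpos;
{$mode objfpc}{$H+}

uses
  SysUtils, DOM, XMLRead;

var
  Doc: TXMLDocument;
begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: xmlerrpos <file>');
    Halt(1);
  end;
  Doc := nil;
  try
    try
      ReadXMLFile(Doc, ParamStr(1));
      WriteLn('parsed without errors');
    except
      { Only parser errors are handled here; other exceptions, such as a
        missing file, propagate as usual. }
      on E: EXMLReadError do
        WriteLn(ParamStr(1), ':', E.Line, ':', E.LinePos, ': ', E.Message);
    end;
  finally
    Doc.Free;  { safe when Doc is nil }
  end;
end.
```

This only helps for errors detected by the xml parser itself. As noted
above, exceptions raised from DOM methods carry no position.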
Note that I'm already happy with pointers where to start. Anybody willing to
share private examples or documentation would be great too.
program saxattempt;
{$mode delphi}
Uses Sax_HTML,sysutils,classes,dom_html;
var d:TSearchRec;
    sx : THTMLDocument;
    Htmls: TStringList;
begin
  htmls:=TStringList.create;
  if findfirst('*.html',faanyfile,d)=0 then
    begin
      repeat
        writeln(d.name);
        sx:=THtmlDocument.create;
        ReadHtmlFile(sx,d.name);
        htmls.addobject(d.name,sx);
      until findnext(d)<>0;
      findclose(d);
    end;
end.
Regards,
Sergei
_______________________________________________
fpc-devel maillist - [email protected]
http://lists.freepascal.org/mailman/listinfo/fpc-devel