Ing. Branislav Gerzo wrote:
Hi all,

I have to parse several thousand HTML files, so I'd like to use an
HTML parser rather than my own regexps. The HTML files I am parsing are
quite complex, so I need your help. First of all, is HTML::Tree a good
and fast module?

I ask because I am not sure: if I have to look for some criteria using
if( my $h = $tree->look_down('_tag', 'sometag') ) { }
won't that be slow?

When I dumped the tree with Data::Dumper, a 300 KB HTML file produced
13 MB of dump output...

There are basically two types of parser: 1) the type that reads in HTML, XML, etc. and builds an in-memory representation of the data, usually a hierarchical tree structure, and 2) the type that reads in the file and fires off events for each of the tags/elements it encounters.


The first type is very convenient, especially when you want to reference lots of random elements from the data, but it takes up a lot of memory and is usually slower. HTML::Tree is of this type.
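
As a rough sketch of the first type, assuming HTML::TreeBuilder (part of the HTML::Tree distribution) is installed. The sample markup and the variable names are made up for illustration; a real script would read one of your files instead:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input, standing in for one of the real files.
my $html = '<html><body><table><tr><td>one</td><td>two</td></tr></table></body></html>';

# Parse the whole document into an in-memory tree of HTML::Element objects.
my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context, look_down returns every element matching the criteria.
my @cells = map { $_->as_text } $tree->look_down(_tag => 'td');

print join(', ', @cells), "\n";

# Release the tree explicitly once you are done with it.
$tree->delete;
```

Every element of the document stays in memory until the tree is deleted, which is why this approach is convenient for random access but costly for large files.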

The second type is fast and only takes up as much memory as you allow, because as each element is encountered you decide whether to hang on to the data or throw it away. HTML::Parser is of this type.
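
A minimal sketch of the second type, assuming HTML::Parser with its version 3 handler API. The sample markup, the @cells array and the $in_td flag are invented for illustration:

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample input, standing in for one of the real files.
my $html = '<table><tr><td>one</td><td>two</td></tr></table>';

my @cells;        # the only data we choose to keep
my $in_td = 0;    # true while the parser is inside a <td> element

my $parser = HTML::Parser->new(
    api_version => 3,
    # Start tag: open a new, empty cell when a <td> begins.
    start_h => [ sub {
        my ($tag) = @_;
        if ($tag eq 'td') { $in_td = 1; push @cells, ''; }
    }, 'tagname' ],
    # End tag: stop collecting text when the </td> arrives.
    end_h => [ sub {
        my ($tag) = @_;
        $in_td = 0 if $tag eq 'td';
    }, 'tagname' ],
    # Text may arrive in several chunks, so append rather than assign.
    text_h => [ sub {
        my ($text) = @_;
        $cells[-1] .= $text if $in_td;
    }, 'dtext' ],
);

$parser->parse($html);
$parser->eof;

print join(', ', @cells), "\n";
```

Nothing outside the `td` elements is stored, so memory use depends on what the handlers keep rather than on the size of the file. For files on disk, `$parser->parse_file($path)` can be used in place of `parse` and `eof`.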

Having said all that, I did a quick search on the CPAN:

<http://search.cpan.org/search?query=html%20table&mode=all>

and near the top I see two modules that might help you out. They are both based on HTML::Parser and they both deal exclusively with HTML tables:

HTML::TableExtract
HTML::TableContentParser

I haven't used either of these, but a quick look at the docs seems to indicate that HTML::TableExtract works a lot like HTML::Parser: as it encounters tables and table elements it fires off events so that you can process the data. HTML::TableContentParser seems to work like HTML::Tree: it reads the data and constructs a simple structure that holds the tables found in the document.

Randy.
