Re: HTML parsing

Felix Geerinckx Tue, 29 Mar 2005 01:10:24 -0800

On 28/03/2005, Daniel Smith wrote:

> I was tasked with parsing a set of .html files in order to extract
> the data contained within some terribly formatted tables.


[...]

> Can anyone shed some light?

I used HTML::TreeBuilder on a similar project once:

    #! /usr/bin/perl
    use warnings;
    use strict;

    use HTML::TreeBuilder;
    
    my $tree = HTML::TreeBuilder->new; 
    $tree->parse_file('yourfile.html') or die "Cannot open file: $!";

    # Get tables
    my @tables = $tree->look_down( '_tag', 'table' ); 
    for my $t (@tables) {
        # Get rows
        my @rows = $t->look_down('_tag', 'tr');
        for my $r (@rows) {
            print "Row contents:\n";
            # Get 'th' and 'td' cells (anchored, so 'thead' etc. don't match)
            my @cells = $r->look_down('_tag', qr/^t[hd]$/);
            for my $c (@cells) {
                print "\t", $c->as_text(), "\n";
            }                   
        }
    }
    $tree->delete();
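
If you need the data itself rather than a printout, the same traversal
can fill an array of arrays. A rough sketch only: the table_rows helper
and the inline sample HTML are made up for illustration, and it uses
parse_content on a string instead of parse_file so it runs without any
input file. Like the script above, it does not treat nested tables
specially.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper (not part of HTML::TreeBuilder): returns one
# array ref of cell texts per <tr> found anywhere in $html.
sub table_rows {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($html);

    my @rows;
    for my $tr ( $tree->look_down( _tag => 'tr' ) ) {
        # Collect the text of each 'th' or 'td' cell in this row
        my @cells = map { $_->as_text } $tr->look_down( _tag => qr/^t[hd]$/ );
        push @rows, \@cells;
    }

    $tree->delete;    # free the tree once the text has been copied out
    return @rows;
}

# Made-up sample input
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>apples</td><td>3</td></tr>
</table>
HTML

# One tab-separated line per row
print join( "\t", @$_ ), "\n" for table_rows($html);
```

From there each row is an ordinary array ref, so you can index into it
or write it out as CSV as needed.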

-- 
felix
