Re: HTML parsing

Offer Kaye Tue, 29 Mar 2005 01:18:49 -0800

On Mon, 28 Mar 2005 15:49:38 -0500, Daniel Smith wrote:
> Hi all,
> 
> I'm brand new to Perl, and have just a little programming background.  I was
> tasked with parsing a set of .html files in order to extract the data
> contained within some terribly formatted tables.
> Here is a sample of what I have.....
> 
> <tr>
> <th align="left" width="10%"><font size="-1">Data to be extracted </font></th>
> <td width="30%"><font size="-1">
> DATA DATA DATA
> </font></td>
> <th align="left" width="10%"><font size="-1">Need this too</font></th>
> <td colspan="3" valign="top"><font size="-1">More data I need to get 
> out</font></td>
> </tr>
> 
> This is one row from the typical four row table that is returned as a search
> result.  There are 25 of these four row tables per page.  Could someone point
> me in the right direction as to how I might go about doing this?  A colleague
> of mine told me "put the file into an array and use the 'split'
> command"....while I vaguely understand the concept, I'm not sure about the
> syntax.  Can anyone shed some light?
> 
> Thanks in advance,
> 
> Dan
> 
 
Hi Dan,
I would recommend against using a split- or regexp-based approach, as
any such approach is bound to be very fragile when parsing HTML. It is
much better to use a module. Here is one example, using
HTML::TokeParser:
################### begin code
use strict;
use warnings;
use Data::Dumper;
use HTML::TokeParser;


my @all_data; # an array to hold the data
# Parse the HTML
my $parser = HTML::TokeParser->new("input.html")
    or die "Can't open input file input.html: $!";
# Search for a font tag and extract the data.
while (defined(my $token = $parser->get_tag("font"))) {
   my $data = $parser->get_text; #get the data
   $data =~ s/^\s+//; # get rid of extra whitespace at the beginning
   $data =~ s/\s+$//; #   and at the end
   push @all_data,$data; # save the data
}

print Dumper(\@all_data);
################### end code

This approach assumes that the data always comes after a font tag
(based on your example data). If this isn't the case, the code has to
change, but that is a lot easier to do with HTML::TokeParser than
with split.
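For instance, if each label always sits in a th cell and its value in the
td cell that follows (an assumption based only on your sample row), a rough
sketch of pairing them up could look like this. The name extract_pairs is
just something I made up for illustration, and the function takes the HTML
as a string rather than a file name:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns a list of [label, value] pairs,
# assuming every <th> label is followed by the <td> holding its value.
sub extract_pairs {
    my ($html) = @_;
    # A reference to a scalar is parsed as the document text itself.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Can't create parser";
    my (@pairs, $label);
    # get_tag with several names stops at whichever start tag comes first.
    while (my $tag = $parser->get_tag("th", "td")) {
        my $name = $tag->[0];
        # Collect the text up to the matching end tag, whitespace trimmed.
        my $text = $parser->get_trimmed_text("/$name");
        if ($name eq "th") {
            $label = $text;            # remember the label
        }
        elsif (defined $label) {
            push @pairs, [$label, $text];
            undef $label;              # wait for the next label
        }
    }
    return @pairs;
}
```

You could then slurp each file into a string, call extract_pairs on it and
loop over the pairs. A td with no th before it is silently skipped here,
which may or may not be what you want.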
If you insist on using split, read "perldoc -f split".
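Just so the syntax is clear, here is a minimal split sketch on a made-up
line of pipe-delimited text (not your HTML, where I still would not use
it). The first argument is a pattern, the second the string to split:

```perl
use strict;
use warnings;

# Made-up sample line; the fields are separated by a pipe character.
my $line = "Data to be extracted | DATA DATA DATA | Need this too";

# Split on a pipe, also swallowing any whitespace around it.
my @fields = split /\s*\|\s*/, $line;

print "$_\n" for @fields;
```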

Hope this helps,
-- 
Offer Kaye
