On Mon, 28 Mar 2005 15:49:38 -0500, Daniel Smith wrote:
> Hi all,
>
> I'm brand new to Perl, and have just a little programming background. I was
> tasked with parsing a set of .html files in order to extract the data
> contained within some terribly formatted tables.
> Here is a sample of what I have.....
>
> <tr>
> <th align="left" width="10%"><font size="-1">Data to be extracted </font></th>
> <td width="30%"><font size="-1">
> DATA DATA DATA
> </font></td>
> <th align="left" width="10%"><font size="-1">Need this too</font></th>
> <td colspan="3" valign="top"><font size="-1">More data I need to get
> out</font></td>
> </tr>
>
> This is one row from the typical four row table that is returned as a search
> result. There are 25 of these four row tables per page. Could someone point
> me in the right direction as to how I might go about doing this? A colleague
> of mine told me "put the file into an array and use the 'split'
> command"....while I vaguely understand the concept, I'm not sure about the
> syntax. Can anyone shed some light?
>
> Thanks in advance,
>
> Dan
>

Hi Dan,

I would recommend against using a split or regexp based approach, as any
such approach is bound to be very fragile when parsing HTML. It is much
better to use a module. Here is one example, using HTML::TokeParser :

################### begin code
use strict;
use warnings;
use Data::Dumper;
use HTML::TokeParser;
my @all_data; # an array to hold the data

# Parse the HTML
my $parser = HTML::TokeParser->new("input.html") ||
    die "Can't open input file input.html: $!";

# Search for a font tag and extract the data.
while (defined(my $token = $parser->get_tag("font"))) {
    my $data = $parser->get_text; # get the text up to the next tag
    $data =~ s/^\s+//;            # get rid of extra whitespace at
    $data =~ s/\s+$//;            # the beginning and end
    push @all_data, $data;        # save the data
}
print Dumper(\@all_data);
################### end code

This approach assumes that the data always comes after a font tag (based on
your example data). If this isn't the case, the code has to change, but that
is a lot easier to do with HTML::TokeParser than with split.

If you insist on using split, read "perldoc -f split".

Hope this helps,
--
Offer Kaye
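
P.S. The loop above collects every font-wrapped string into one flat list. If each label needs to stay with its value, one possible variation is sketched below. It is only a sketch, under assumptions taken from the sample row: every <td> value follows the <th> label it belongs to, and cells are not nested. The names $html, @pairs and $label are made up for this illustration, and the sample row is inlined as a string so the script runs on its own; passing a file name to new() works as before.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample row copied from the question (assumption: real pages look like this).
my $html = <<'HTML';
<tr>
<th align="left" width="10%"><font size="-1">Data to be extracted </font></th>
<td width="30%"><font size="-1">
DATA DATA DATA
</font></td>
<th align="left" width="10%"><font size="-1">Need this too</font></th>
<td colspan="3" valign="top"><font size="-1">More data I need to get
out</font></td>
</tr>
HTML

# A reference to a string makes the parser read from memory instead of a file.
my $parser = HTML::TokeParser->new(\$html)
    or die "Can't create a parser for the HTML string";

my @pairs;   # list of [label, value] array refs
my $label;   # text of the most recent <th> cell

# Stop at each <th> or <td> start tag, whichever comes next.
while (defined(my $tag = $parser->get_tag("th", "td"))) {
    my $name = $tag->[0];                            # "th" or "td"
    # Text up to the matching end tag, with whitespace collapsed and trimmed;
    # the <font> tags in between are skipped.
    my $text = $parser->get_trimmed_text("/$name");
    if ($name eq "th") {
        $label = $text;                              # remember the label
    }
    else {
        push @pairs, [ $label, $text ];              # pair it with the value
        undef $label;
    }
}

for my $pair (@pairs) {
    my $shown = defined $pair->[0] ? $pair->[0] : "(no label)";
    print "$shown => $pair->[1]\n";
}
```

Because get_trimmed_text does the whitespace cleanup, the two substitutions from the first example are not needed here. A <td> that turns up without a preceding <th> is kept with an undefined label rather than dropped, so rows that break the assumption are easy to spot in the output.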