Re: Regex...HTML::Parser...Getting webpage data?

Rob Dixon Sat, 05 Aug 2006 13:40:48 -0700

Wesley Bresson wrote:
>
> Thanks for your example script using HTML::TreeBuilder, however I'm
> trying to figure out why it appears to grab some items but not others.
> I've removed the $20-100 limitation (I didn't need it, I really just
> need to poll one item) but am still missing some of the items. The
> most obvious example is the two 1986-2006 eagles at the top of the
> page: the script grabs one but not the other. Any idea why? Does it
> have to do with it looking for the 5 td's?


Hello Wesley.

The script fails because the site is an appalling example of HTML and
HTML::TreeBuilder cannot parse it successfully. There are many spurious closing
tags without matching opening ones, as well as a lot of missing closing tags;
the page as a whole simply doesn't hold together.

I have managed to establish that the HTML tables containing the pricing
information will parse on their own, so I offer this hack to get the information
you need. It works by scanning the input and extracting just the pricing tables,
then submitting these to HTML::TreeBuilder. It's not pretty but it will probably
suffice for what you need. Please buy from these people: they need your money
for better Web development staff!

Cheers,

Rob


use strict;
use warnings;

use LWP::Simple;
use HTML::TreeBuilder;

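# LWP::Simple's get returns undef on failure, so stop early if the fetch fails
my $html = get('http://www.apmex.com/shop/buy/Silver_American_Eagles.asp?orderid=0')
  or die "Could not fetch the page\n";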
my @newhtml;

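# Nesting depth: zero while outside a pricing table
my $in_table = 0;

foreach (split /\n/, $html) {

  # Skip lines that are nothing but an HTML comment
  next if /^\s*<!--.*-->\s*$/;

  # Start counting at a pricing table; count nested tables once inside one
  if (m%<table\b%) {
    $in_table++ if /"pricesTable"/ or $in_table;
  }

  # Keep every line of the table, up to and including its closing tag
  if ($in_table) {
    push @newhtml, $_;
    $in_table-- if m%</table\b%;
  }
}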

# Rejoin with newlines so that text on adjacent lines does not run together
my $tree = HTML::TreeBuilder->new_from_content(join "\n", @newhtml);

my @table = $tree->look_down(_tag => 'table', id => 'pricesTable');

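foreach my $table (@table) {

  my @content = $table->content_list;

  foreach my $elem (@content) {
    # content_list can return plain text strings as well as elements
    next unless ref $elem;
    print $elem->as_trimmed_text, "\n";
  }
}

# Free the tree explicitly once we are done with it
$tree->delete;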
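
If you want to see the table-extraction step on its own, without fetching the
page, here is a minimal sketch of the same depth-counting idea. The helper name
extract_tables and the sample HTML are made up for illustration; they are not
taken from the site. Like the script above, it assumes each table tag sits on a
line of its own.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): returns the lines of
# every table whose opening line contains $marker, nested tables included.
sub extract_tables {
  my ($html, $marker) = @_;
  my ($depth, @kept) = (0);

  foreach my $line (split /\n/, $html) {
    # Start counting at a marked table; count nested tables once inside one
    if ($line =~ /<table\b/ and ($depth or index($line, $marker) >= 0)) {
      $depth++;
    }
    if ($depth) {
      push @kept, $line;
      $depth-- if $line =~ m{</table\b};
    }
  }
  return @kept;
}

# Made-up sample input: stray closing tags around one marked table
my $sample = join "\n",
  '</div></td>',
  '<table id="pricesTable">',
  '<tr><td>1 oz Silver Eagle</td><td>$14.50</td></tr>',
  '</table>',
  '</font>',
  '<table id="other"><tr><td>ignored</td></tr></table>';

my @kept = extract_tables($sample, '"pricesTable"');
print "$_\n" for @kept;
```

Only the three lines of the marked table should come out; the stray closing
tags and the unmarked table are dropped before any parser sees them.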
