On 09/12/2010 18:00, shawn wilson wrote:
I decided to use another module to get my data, but I'm having a bit
of an issue with XPath.
The data I want looks like this:
<table class="someclass" style="width:508px;" id="Any_20">
<tbody>
<tr>
<td>name</td>
<td>attribute</td>
<td>name2</td>
<td>attribute2</td>
<td>possible name3</td>
<td>possible attribute3</td>
<td>
....
</tr><tr>
more of the same format
With this code, I'm only getting the first row of data (i.e., <td> ...
</td>). I realize that I'm only getting the first and second td, which
is fine, but how do I get multiple rows? I'm also reading the HTML
from a file so that I don't needlessly keep hitting their web
server.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use LWP::Simple;
use Web::Scraper;
use Data::Dumper::Simple;
my( $infile ) = $ARGV[ 0 ] =~ m/^([\ A-Z0-9_.-]+)$/ig;
my $pagedata = scraper {
    process '//*/table[@class="someclass"]', 'table[]' => scraper {
        process '//tr/td[1]', 'name' => 'TEXT';
        process '//tr/td[2]', 'attr' => 'TEXT';
    };
};
open( FILE, "< $infile" );
my $content = do { local $/;<FILE> };
my $res = $pagedata->scrape( $content )
or die "Can't define content to parser $!";
print Dumper( $res );
You will get more than one <tr> if you move the tr step up a level in
the XPath hierarchy, so that the outer expression selects the rows and the
inner scraper runs once per row. Try something like this:
my $pagedata = scraper {
    process '//table[@class="someclass"]/tbody/tr', 'tr[]' => scraper {
        process '//td[1]', 'name' => 'TEXT';
        process '//td[2]', 'attr' => 'TEXT';
    };
};
But Web::Scraper is a large and cumbersome solution to your problem. I
suggest you are better off using HTML::TreeBuilder, which will build a
parse tree for the HTML and let you traverse it with the methods in
HTML::Element.
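
For illustration, here is a rough sketch of how that might look. It is untested against your real page: the sub name rows_from_html and the inline sample HTML are made up for the example, and the class name "someclass" is taken from the markup you posted. It uses look_down to find the table, then each row, then the cells in that row.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return one hashref per table row,
# built from the first two <td> cells of that row.
sub rows_from_html {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @rows;
    # In list context look_down returns every descendant matching all criteria
    for my $table ( $tree->look_down( _tag => 'table', class => 'someclass' ) ) {
        for my $tr ( $table->look_down( _tag => 'tr' ) ) {
            my @td = $tr->look_down( _tag => 'td' );
            next if @td < 2;    # skip rows without a name/attribute pair
            push @rows, {
                name => $td[0]->as_trimmed_text,
                attr => $td[1]->as_trimmed_text,
            };
        }
    }

    $tree->delete;    # release the parse tree
    return @rows;
}

# Made-up sample data in the shape of the posted table
my $html = <<'HTML';
<table class="someclass" id="Any_20">
<tr><td>name</td><td>attribute</td><td>name2</td><td>attribute2</td></tr>
<tr><td>other name</td><td>other attribute</td></tr>
</table>
HTML

printf "%s => %s\n", $_->{name}, $_->{attr} for rows_from_html($html);
```

Since you are reading the page from a file, HTML::TreeBuilder->new_from_file( $infile ) should let you skip slurping the content yourself.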
HTH,
Rob