On 09/12/2010 18:00, shawn wilson wrote:
i decided to use another module to get my data but, i'm having a bit
of an issue with xpath.

the data i want looks like this:

<table class="someclass" style="width:508px;" id="Any_20">
  <tbody>
   <tr>
    <td>name</td>
    <td>attribute</td>

    <td>name2</td>
    <td>attribute2</td>

    <td>possible name3</td>
    <td>possible attribute3</td>

    <td>
....
    </tr><tr>
more of the same format


with this code, i'm only getting the first row of data (i.e., <td> ...
</td>). i realize that i'm only getting the first and second td, which
is fine, but how do i get multiple rows? i'm also grabbing the html
from a file so that i don't needlessly keep hitting up their web
server.

#!/usr/bin/perl

use strict;
use warnings;


use LWP::UserAgent;
use LWP::Simple;
use Web::Scraper;
use Data::Dumper::Simple;

my( $infile ) = $ARGV[ 0 ] =~ m/^([\ A-Z0-9_.-]+)$/ig;

my $pagedata = scraper {
    process '//*/table[@class="someclass"]', 'table[]' =>  scraper {
       process '//tr/td[1]', 'name' =>  'TEXT';
       process '//tr/td[2]', 'attr' =>  'TEXT';
    };
};


open( FILE, "<  $infile" );

my $content = do { local $/;<FILE>  };

    my $res = $pagedata->scrape( $content )
       or die "Can't define content to parser $!";

print Dumper( $res );

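Your outer process matches the table only once, so the inner scraper
runs once and each scalar field keeps only its first match. You will
find more than one <tr> if you move it up to the previous level in the
XPath hierarchy, so that the inner scraper runs once per row. Try
something like this: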

  my $pagedata = scraper {
    process '//table[@class="someclass"]/tbody/tr', 'tr[]' => scraper {
      process '//td[1]', 'name' => 'TEXT';
      process '//td[2]', 'attr' => 'TEXT';
    }
  };

But Web::Scraper is a large and cumbersome solution to your problem. I
suggest you are better off using HTML::TreeBuilder, which will build a
parse tree for the HTML and let you traverse it with the methods in
HTML::Element.
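
A minimal sketch of that approach follows, assuming the table layout shown
in the question. The sample HTML, the cell values, and the names @rows and
@cells are made up for illustration; in real use the markup would come from
the saved file instead. It relies only on new_from_content, look_down,
as_trimmed_text and delete from HTML::TreeBuilder / HTML::Element.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Sample markup modelled on the table in the question (values are invented).
my $html = <<'HTML';
<table class="someclass" style="width:508px;" id="Any_20">
  <tbody>
    <tr><td>name</td><td>attribute</td><td>name2</td><td>attribute2</td></tr>
    <tr><td>other name</td><td>other attribute</td></tr>
  </tbody>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @rows;

# In list context look_down returns every matching descendant element.
for my $table ( $tree->look_down( _tag => 'table', class => 'someclass' ) ) {
    for my $tr ( $table->look_down( _tag => 'tr' ) ) {
        my @cells = map { $_->as_trimmed_text } $tr->look_down( _tag => 'td' );
        next unless @cells >= 2;    # skip rows without a name/attribute pair
        push @rows, { name => $cells[0], attr => $cells[1] };
    }
}

# HTML::TreeBuilder trees should be deleted explicitly when finished with.
$tree->delete;

printf "%s => %s\n", $_->{name}, $_->{attr} for @rows;
```

Only the first two cells of each row are kept here, matching the td[1] and
td[2] selections above; stepping through @cells two at a time would pick up
the remaining name/attribute pairs in a row.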

HTH,

Rob
