tr/Robin Norwood redundant replies//g;

On 8/26/06, Robin Norwood <[EMAIL PROTECTED]> wrote:
jobst müller <[EMAIL PROTECTED]> writes:

> hello dear Perl addicts,
>
> To admit it - I am a Perl novice and I have not much experience in Perl. But I am willing to learn - I want to learn Perl. For now I have to solve some tasks for college. I have to do some investigations on a board where I have no access to the db.
>
> First off - I have to explain something: I have to grab some data out of a phpBB board in order to do some field research. I need the data out of a forum that is run by a user community. I need the data to analyze the discussions. To give an example - let us take this forum here. How can I grab all the data out of this forum, get it local, and afterwards put it into a local database - of a phpBB forum - is this possible?
> http://www.nukeforums.com/forums/viewforum.php?f=17
>
> Nothing harmful - nothing bad - nothing serious or dangerous. But the issue is: I have to get the data - so what?
> I need the data in an almost full and complete format. So I need all the data, like:
>
> username
> forum
> thread
> topic
> text of the posting, and so on.
>
> How can I do that?
>

Jobst,

As the debate between Randal and me shows, there are some ethical and
legal concerns when running a web spider.  I can't advise you on those.

From a technical point of view, your code appears to work, so I'm not
quite sure what you are asking. If you want to 'loop' over the URLs,
you could either run the spider multiple times, or put a 'foreach' loop
around the main body of your program:

my @urls = ("http://www.example.com/first.html",
            "http://www.example.com/second.html");

foreach my $url (@urls) {
  # main code
}
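
And since the index you're after is paginated, you don't have to list
every URL by hand: phpBB index pages advance with a 'start' offset in
the query string, so you can generate them.  A rough sketch - the page
size here is a guess on my part, so check what the forum's own 'next
page' links actually use:

my $base     = 'http://www.nukeforums.com/forums/viewforum.php?f=17';
my $per_page = 50;   # a guess; verify against the real pagination links
my $pages    = 51;   # you mentioned "more than 50 pages"
my @urls     = map { $base . '&start=' . $_ * $per_page } 0 .. $pages - 1;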

However, I would be *very* careful about this, because it is easy to
write a spider that 'behaves badly', and could overload the server in
question.  At the very least, I would make liberal use of the 'sleep'
function so that there is a delay of a few seconds between each request.
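Since you're already using LWP::RobotUA, it also has a built-in delay
between requests to the same server - note that it's measured in
minutes.  Something like this (the agent name and address are
placeholders; LWP::RobotUA wants both):

my $ua = LWP::RobotUA->new('my-spider/0.1', 'me@example.com');
$ua->delay(10/60);   # roughly ten seconds between requests
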
And a well-behaved spider would inspect the robots.txt file.  It looks
like there is a CPAN module for that (of course), though I've never used
it personally:

http://search.cpan.org/~rse/lcwa-1.0.0/lib/lwp/lib/WWW/RobotRules.pm
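
Going by its documentation, usage would be something like this (the
robots.txt URL below is just illustrative):

use WWW::RobotRules;
use LWP::Simple qw(get);

my $rules = WWW::RobotRules->new('my-spider/0.1');
my $robots_url = 'http://www.nukeforums.com/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

# then, inside the fetch loop:
# next unless $rules->allowed($page);

That said, I believe LWP::RobotUA already fetches and honors robots.txt
for you, which is one more reason to prefer it over a plain
LWP::UserAgent.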

Again, I'd be very careful here, since there are potential legal
ramifications if you inadvertently do something the site admin does not
like.
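
One more technical note: for the '# would instead have database insert
statement at this point' placeholder in your loop, a minimal sketch
using DBI with SQLite might look like this (the 'posts' table and its
schema are made up for the example; substitute your own design):

use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=forum.db', '', '',
                       { RaiseError => 1 });
$dbh->do(q{CREATE TABLE IF NOT EXISTS posts (title TEXT, name TEXT, post TEXT)});
my $sth = $dbh->prepare('INSERT INTO posts (title, name, post) VALUES (?, ?, ?)');

# inside your foreach loop, in place of the Dumper call:
my $data = get_thread($stream);
for my $post (@{ $data->{thread} }) {
    $sth->execute($data->{title}, $post->{name}, $post->{post});
}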

-RN

> http://www.nukeforums.com/forums/viewforum.php?f=3
> http://www.nukeforums.com/forums/viewforum.php?f=17
>
>
>
> [code]
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> use LWP::RobotUA;
> use HTML::LinkExtor;
> use HTML::TokeParser;
> use URI::URL;
>
> use Data::Dumper; # for show and troubleshooting
>
> my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
> my $ua = LWP::RobotUA->new;
> my $lp = HTML::LinkExtor->new(\&wanted_links);
>
> my @links;
> get_threads($url);
>
> foreach my $page (@links) { # this loops over each link collected from the index
>       my $r = $ua->get($page);
>       if ($r->is_success) {
>               my $stream = HTML::TokeParser->new(\$r->content) or die "Parse error in $page: $!";
>               # just printing what was collected
>               print Dumper get_thread($stream);
>               # would instead have database insert statement at this point
>        } else {
>               warn $r->status_line;
>        }
> }
>
> sub get_thread {
>       my $p = shift;
>       my ($title, $name, @thread);
>       while (my $tag = $p->get_tag('a','span')) {
>               if (exists $tag->[1]{'class'}) {
>                       if ($tag->[0] eq 'span') {
>                               if ($tag->[1]{'class'} eq 'name') {
>                                       $name = $p->get_trimmed_text('/span');
>                               } elsif ($tag->[1]{'class'} eq 'postbody') {
>                                       my $post = $p->get_trimmed_text('/span');
>                                       push @thread, {'name'=>$name, 'post'=>$post};
>                               }
>                       } else {
>                               if ($tag->[1]{'class'} eq 'maintitle') {
>                                       $title = $p->get_trimmed_text('/a');
>                               }
>                       }
>               }
>       }
>       return {'title'=>$title, 'thread'=>\@thread};
> }
>
> sub get_threads {
>       my $page = shift;
>       # fetch the page we were handed (not the global $url);
>       # the wanted_links callback fills @links as a side effect
>       my $r = $ua->request(HTTP::Request->new(GET => $page), sub {$lp->parse($_[0])});
>       # Expand URLs to absolute ones
>       my $base = $r->base;
>       return [map { $_ = url($_, $base)->abs; } @links];
> }
>
> sub wanted_links {
>       my($tag, %attr) = @_;
>       return unless exists $attr{'href'};
>       return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
>       push @links, $attr{'href'}; # push just the link target, not every attribute value
> }
>
> [/code]
>
>
>
> If you have the necessary modules installed and run it from the command line, you'll see output such as the following:
>
>
>
> [code]
>
> $VAR1 = {
>           'thread' => [
>                         {
>                           'post' => 'Hello, I\'m pretty new to PHPNuke. I\'ve got my site up and running great! I\'m now starting to make modifications, add modules etc. I\'m using the most recent RavenPHP76. I want to display the 5 most recent forum posts at the top of the forum page. I\'m not sure if this functionality is built in, if so, how to activate. Or if there is a module or block made to do this. I looked at Raven\'s Collapsing Forum block but wasn\'t crazy about the format, and I don\'t want it to be collapsable. Thanks! mopho',
>                           'name' => 'mopho'
>                         },
>                         {
>                           'post' => 'hi there',
>                           'name' => 'sail'
>                         },
>                         {
>                           'post' => 'thanks for asking this; :not very sure if i got you right; Do you want to have a feed of the last forumthreads? guess the easiest way is to go to raven and ask how he did it. hth sail.',
>                           'name' => 'sail'
>                         },
>                         {
>                           'post' => 'Thanks. i found what I was looking for. It wasn\'t so easy to find! It\'s called glance_mod. mopho',
>                           'name' => 'mopho'
>                         },
>                         {
>                           'post' => 'hi there thx',
>                           'name' => 'sail'
>                         },
>                         {
>                           'post' => 'it sound interesting - i will have also a look i google after it - and try to find out more regards sailor',
>                           'name' => 'sail'
>                         }
>                       ],
>           'title' => 'Recent Forum Posts Module'
>         };
>
> [/code]
>
>
>
> To be honest - I think the thing is that the script, as it stands, just loops over the first index page here:
> http://www.nukeforums.com/forums/viewforum.php?f=17
> But I need it to loop over all of the more than 50 pages. Therefore I need a routine here.
>
> This must become a subroutine ... so that this code is looped:
>
> [code]
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> use LWP::RobotUA;
> use HTML::LinkExtor;
> use HTML::TokeParser;
> use URI::URL;
>
> use Data::Dumper; # for show and troubleshooting
>
> my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
> my $ua = LWP::RobotUA->new;
> my $lp = HTML::LinkExtor->new(\&wanted_links);
>
> my @links;
> get_threads($url);
>
> foreach my $page (@links) { # this loops over each link collected from the index
>       my $r = $ua->get($page);
>       if ($r->is_success) {
>               my $stream = HTML::TokeParser->new(\$r->content) or die "Parse error in $page: $!";
>               # just printing what was collected
>               print Dumper get_thread($stream);
>               # would instead have database insert statement at this point
>        } else {
>               warn $r->status_line;
>        }
> }
>
>
> [/code]
>
>
> This must become a subroutine - doesn't it?
>
> It has to become a subroutine in order to let the script loop over all the pages in the forum http://www.nukeforums.com/forums/viewforum.php?f=17 - in the above version no loop is set up to grab each of the index pages, but someone may consider that trivial. The demonstration is very impressive - and makes me think that Perl is very, very powerful. I will try to harvest this category of the forum (note: only these two categories are of interest to me, nothing more):
> http://www.nukeforums.com/forums/viewforum.php?f=3
> http://www.nukeforums.com/forums/viewforum.php?f=17
>
> Question - am I able to get the results of the above-mentioned forum categories, and can I get the forum threads that are stored in the two forums above?
>
>
> I look forward to hearing from you,
>
> fllobee
>
>
>
>

--
Robin Norwood
Red Hat, Inc.

"The Sage does nothing, yet nothing remains undone."
-Lao Tzu, Te Tao Ching

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>



