tr/Robin Norwood redundant replies//g;
On 8/26/06, Robin Norwood <[EMAIL PROTECTED]> wrote:
jobst müller <[EMAIL PROTECTED]> writes:

> hello dear Perl-addicted,
>
> to admit it - I am a Perl novice and do not have much experience with
> Perl, but I am willing to learn. I want to learn Perl. For now I have
> to solve some tasks for college; I have to do some investigations on a
> board where I have no access to the database.
>
> First off, I have to explain something: I have to grab some data out
> of a phpBB forum in order to do some field research. I need the data
> from a forum that is run by a user community, so that I can analyze
> the discussions. To give an example, let us take this forum here. How
> can I grab all the data out of this forum, get it locally, and
> afterwards put it into a local database - is this possible with a
> phpBB forum?
> [URL]=http://www.nukeforums.com/forums/viewforum.php?f=17[/URL]
>
> Nothing harmful, nothing bad, nothing serious or dangerous. But the
> issue is: I have to get the data. I need the data in an almost full
> and complete format, so I need all the data like:
>
> username
> forum
> thread
> topic
> text of the posting, and so on.
>
> how to do that?

Jobst,

As the debate between Randal and me shows, there are some ethical and
legal concerns when running a web spider. I can't advise you on those.

From a technical point of view, your code appears to work, so I'm not
quite sure what you are asking. If you want to 'loop' over the URLs,
you could either run the spider multiple times, or put a 'foreach' loop
around the main body of your program:

    my @urls = ("http://www.example.com/first.html",
                "http://www.example.com/second.html");

    foreach my $url (@urls) {
        # main code
    }

However, I would be *very* careful about this, because it is easy to
write a spider that 'behaves badly' and could overload the server in
question. At the very least, I would make liberal use of the 'sleep'
function so that there is a delay of a few seconds between each
request.
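To make that concrete, here is a minimal sketch of the loop-plus-delay
structure described above. The URLs and the process_page() routine are
placeholders, not part of your spider; a real version would fetch and
parse each page inside process_page().

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder URLs; in the real spider these would be forum index pages.
my @urls = (
    "http://www.example.com/first.html",
    "http://www.example.com/second.html",
);

my @visited;

# Hypothetical stand-in for the main body of the spider: a real
# version would fetch and parse the page here.
sub process_page {
    my $url = shift;
    push @visited, $url;
}

foreach my $url (@urls) {
    process_page($url);
    sleep 1;    # in a real spider, pause several seconds between requests
}

print "fetched: $_\n" for @visited;
```

Note also that since your script already uses LWP::RobotUA, you get
some of this for free: LWP::RobotUA consults robots.txt before
fetching, and its delay() method (which takes minutes, not seconds,
e.g. $ua->delay(10/60) for roughly ten seconds) enforces a pause
between requests to the same host.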
And a well-behaved spider would inspect the robots.txt file. It looks
like there is a CPAN module for that (of course), though I've never
used it personally:

http://search.cpan.org/~rse/lcwa-1.0.0/lib/lwp/lib/WWW/RobotRules.pm

Again, I'd be very careful here, since there are potential legal
ramifications if you inadvertently do something the site admin does not
like.

-RN

> [URL]=http://www.nukeforums.com/forums/viewforum.php?f=3[/URL]
> [URL]=http://www.nukeforums.com/forums/viewforum.php?f=17[/URL]
>
> [code]
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> use LWP::RobotUA;
> use HTML::LinkExtor;
> use HTML::TokeParser;
> use URI::URL;
>
> use Data::Dumper;   # for show and troubleshooting
>
> my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
> my $ua  = LWP::RobotUA->new;
> my $lp  = HTML::LinkExtor->new(\&wanted_links);
>
> my @links;
> get_threads($url);
>
> # loop over each thread link collected from the index page
> foreach my $page (@links) {
>     my $r = $ua->get($page);
>     if ($r->is_success) {
>         my $stream = HTML::TokeParser->new(\$r->content)
>             or die "Parse error in $page: $!";
>         # just printing what was collected; a database insert
>         # statement would go here instead
>         print Dumper get_thread($stream);
>     } else {
>         warn $r->status_line;
>     }
> }
>
> sub get_thread {
>     my $p = shift;
>     my ($title, $name, @thread);
>     while (my $tag = $p->get_tag('a', 'span')) {
>         if (exists $tag->[1]{'class'}) {
>             if ($tag->[0] eq 'span') {
>                 if ($tag->[1]{'class'} eq 'name') {
>                     $name = $p->get_trimmed_text('/span');
>                 } elsif ($tag->[1]{'class'} eq 'postbody') {
>                     my $post = $p->get_trimmed_text('/span');
>                     push @thread, {'name' => $name, 'post' => $post};
>                 }
>             } else {
>                 if ($tag->[1]{'class'} eq 'maintitle') {
>                     $title = $p->get_trimmed_text('/a');
>                 }
>             }
>         }
>     }
>     return {'title' => $title, 'thread' => \@thread};
> }
>
> sub get_threads {
>     my $page = shift;
>     my $r = $ua->request(HTTP::Request->new(GET => $page),
>                          sub { $lp->parse($_[0]) });
>     # expand URLs to absolute ones
>     my $base = $r->base;
>     return [map { $_ = url($_, $base)->abs; } @links];
> }
>
> sub wanted_links {
>     my ($tag, %attr) = @_;
>     return unless exists $attr{'href'};
>     return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
>     push @links, values %attr;
> }
> [/code]
>
> If we have the necessary modules installed and run it from the
> command line, you'll see output such as the following:
>
> [code]
> $VAR1 = {
>     'thread' => [
>         {
>             'post' => 'Hello, I\'m pretty new to PHPNuke. I\'ve got my site up and running great! I\'m now starting to make modifications, add modules etc. I\'m using the most recent RavenPHP76. I want to display the 5 most recent forum posts at the top of the forum page. I\'m not sure if this functionality is built in, if so, how to activate. Or if there is a module or block made to do this. I looked at Raven\'s Collapsing Forum block but wasn\'t crazy about the format, and I don\'t want it to be collapsable. Thanks! mopho',
>             'name' => 'mopho'
>         },
>         {
>             'post' => 'hi there',
>             'name' => 'sail'
>         },
>         {
>             'post' => 'thanks for asking this; :not very sure if i got you right; Do you want to have a feed of the last forumthreads? guess the easiest way is to go to raven and ask how he did it. hth sail.',
>             'name' => 'sail'
>         },
>         {
>             'post' => 'Thanks. i found what I was looking for. It wasn\'t so easy to find! It\'s called glance_mod. mopho',
>             'name' => 'mopho'
>         },
>         {
>             'post' => 'hi there thx',
>             'name' => 'sail'
>         },
>         {
>             'post' => 'it sound interesting - i will have also a look i google after it - and try to find out more regards sailor',
>             'name' => 'sail'
>         }
>     ],
>     'title' => 'Recent Forum Posts Module'
> };
> [/code]
>
> To be honest, I think the thing is to run the script just looped over
> the first index page here:
> [URL]=http://www.nukeforums.com/forums/viewforum.php?f=17[/URL]
> But I need it to loop over all of the more than 50 pages.
> Therefore I need a routine here - this must become a subroutine, so
> that the code can be looped:
>
> [code]
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> use LWP::RobotUA;
> use HTML::LinkExtor;
> use HTML::TokeParser;
> use URI::URL;
>
> use Data::Dumper;   # for show and troubleshooting
>
> my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
> my $ua  = LWP::RobotUA->new;
> my $lp  = HTML::LinkExtor->new(\&wanted_links);
>
> my @links;
> get_threads($url);
>
> # loop over each thread link collected from the index page
> foreach my $page (@links) {
>     my $r = $ua->get($page);
>     if ($r->is_success) {
>         my $stream = HTML::TokeParser->new(\$r->content)
>             or die "Parse error in $page: $!";
>         # just printing what was collected; a database insert
>         # statement would go here instead
>         print Dumper get_thread($stream);
>     } else {
>         warn $r->status_line;
>     }
> }
> [/code]
>
> This must become a subroutine, doesn't it? It has to, in order to let
> the script loop over all the pages of the forum
> [URL]=http://www.nukeforums.com/forums/viewforum.php?f=17[/URL]
> In the above version no loop is set up to grab each of the index
> pages, but someone may consider that trivial. The demonstration is
> very impressive, and makes me think that Perl is very, very powerful.
> I will try to harvest these categories of the forum (note that only
> these two categories are of interest to me, nothing more):
> [URL]=http://www.nukeforums.com/forums/viewforum.php?f=3[/URL]
> [URL]=http://www.nukeforums.com/forums/viewforum.php?f=17[/URL]
>
> Question: am I able to get the results of the above-mentioned forum
> categories, and can I get the forum threads that are stored in those
> two forums?
>
> I look forward to hearing from you
>
> fllobee
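On looping over all 50-plus index pages: phpBB paginates viewforum.php
with a "start" offset, so one approach is to generate the index-page
URLs up front and hand each one to get_threads() in turn. This is only
a sketch - the topics-per-page count and the number of pages below are
assumptions you would check against the forum's own page footer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $base     = "http://www.nukeforums.com/forums/viewforum.php?f=17";
my $per_page = 50;   # assumed number of topics per index page
my $pages    = 4;    # assumed page count -- check the forum footer

# Build one URL per index page; phpBB's "start" parameter is the
# offset of the first topic shown on that page.
my @index_urls;
for my $n (0 .. $pages - 1) {
    push @index_urls, $base . "&start=" . $n * $per_page;
}

# Each of these would then be passed to get_threads() in turn,
# with a polite sleep between requests.
print "$_\n" for @index_urls;
```

With that list in hand, the existing foreach-over-@links body becomes
the inner loop and the index pages the outer one.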
--
Robin Norwood
Red Hat, Inc.

"The Sage does nothing, yet nothing remains undone."
   -Lao Tzu, Te Tao Ching

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>