Hello list, hello Rob. Many thanks for the reply.
To avoid confusion, I am trying a first reply to your address, not to the list. I am aware that I have to explain the issue, the problem, and the needs more clearly, so I will try to do that here. Rob, please give me feedback on this; if you need more input, let me know. I will do all I can!

So I start here and describe the problems. I need to collect some of the data out of a site; here is an example: http://www.bamaclubgp.org/forum/sitemap.php. This is very similar to the site I am interested in.

Why do I need to harvest and collect the data, you may ask: I am a researcher and I want to do some socio-ethnographic research (see the research field described at http://opensource.mit.edu and http://opensource.mit.edu/online_papers.php). Therefore I need to harvest the data. Harvest is an integrated set of tools to gather, extract, organize, search, cache, and replicate relevant information. I need to gather information out of a phpBB2 board. The question is: can we tailor HTTrack to harvest and digest information in some different formats? I need to fetch data out of an online forum (a phpBB board) and store it locally in a MySQL database. Is this possible with Perl? First snippets to solve it were available here:
http://forums.devshed.com/perl-programming-6/data-grabbing-and-mining-need-scripthelp-370550.html
http://forums.devshed.com/perl-programming-6/minor-change-in-lwp-need-ideas-how-to-accomplish-388061.html
You already reviewed them, at a first glance.

Now the problem: I have to get the site above as an almost full and complete data set. In my view the problem is twofold:

1. grabbing the data out of the site and then parsing it; and finally
2. storing the data in the new (local) database.

The question of restoring is not too hard if I can pull an almost full thread data set out of the site. The tables are shown on this page: http://www.phpbbdoctor.com/doc_columns.php?id=24. If we do the first job (grabbing and parsing) very well, then the second job should not be too hard: as a result I would have a large file of CSV data, wouldn't I? The final question is how the job of restoring can be done; then I would have a full set of data. I guess it can be done with some help from the guys of the http://www.phpBB.com team.

The question is how I should get the data with the robot USER-AGENT: does the agent give me back most of the data, so that I can use it for an investigation? By the way, the investigation needs to be done with some retrieval operations, so I need to store the gathered data in a MySQL database.

So that is it: I need to build up an almost 100 per cent copy of the original site and store it locally, here on my machine. I need to collect some of the data out of the site I am interested in: http://www.karakas-online.de/forum/sitemap.php. Once the data is gained with a script, I have to set up some Perl DBI code and try to store the data in a phpBB database. Rob, what do you think about it: are we able to do so? Perhaps with a good converter, or at least a part of a converter, I can restore the whole CSV dump with ease. What do you think? If we do the first job, then I think the second part can be done also.
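To make the CSV idea concrete, here is a minimal sketch of how the parsed posts could be written out, assuming the data has the shape that the get_thread routine in the script below returns (the file name and column layout are only my own invention):

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# sample data, shaped like the return value of get_thread() in the script below
my $thread = {
    'title'  => 'Example topic',
    'thread' => [ { 'name' => 'some_user', 'post' => 'Hello world' } ],
};

my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die Text::CSV->error_diag;

# 'threads.csv' is a made-up file name; one CSV row per post
open my $fh, '>', 'threads.csv' or die "threads.csv: $!";
for my $post (@{ $thread->{'thread'} }) {
    $csv->print($fh, [ $thread->{'title'}, $post->{'name'}, $post->{'post'} ]);
}
close $fh;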
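And for the restoring side, a rough sketch of the Perl DBI part, assuming a simplified local table of my own invention (the real phpBB2 schema from the phpbbdoctor page is more involved, and the database name, user, and password are placeholders):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# hypothetical table, not the real phpBB schema:
#   CREATE TABLE forum_posts (title TEXT, poster VARCHAR(64), post TEXT)
my $dbh = DBI->connect('DBI:mysql:database=forum;host=localhost',
                       'user', 'password', { RaiseError => 1 });

my $sth = $dbh->prepare(
    'INSERT INTO forum_posts (title, poster, post) VALUES (?, ?, ?)');

# again assuming the structure returned by get_thread() in the script below
my $thread = {
    'title'  => 'Example topic',
    'thread' => [ { 'name' => 'some_user', 'post' => 'Hello world' } ],
};

for my $post (@{ $thread->{'thread'} }) {
    $sth->execute($thread->{'title'}, $post->{'name'}, $post->{'post'});
}

$dbh->disconnect;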
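Regarding the robot USER-AGENT question: as far as I understand, LWP::RobotUA honours robots.txt and can wait between requests by itself, so the crawling stays polite. A small sketch (the agent name and the contact address are placeholders, as in my script):

use strict;
use warnings;
use LWP::RobotUA;

# agent name and contact address are placeholders
my $ua = LWP::RobotUA->new('my-robot/0.1', '[EMAIL PROTECTED]');
$ua->delay(1);      # minutes to wait between requests (1 is the default)
$ua->use_sleep(1);  # sleep until the delay has passed instead of failing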
Rob, I look forward to hearing from you.

Best regards,
Martin (aka jobst)

Rob, here is the script:

#!e:/Server/xampp/perl/bin/perl.exe -w
use strict;
use warnings;

use CGI;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);

use LWP::RobotUA;
use HTTP::Request;
use HTML::LinkExtor;
use HTML::TokeParser;
use URI::URL;

use Data::Dumper;    # for show and troubleshooting

my $cgi = CGI->new();
print $cgi->header();    # sends the Content-type header once
warningsToBrowser(1);

my $url = "http://www.mysite.com/forums/";
my $ua  = LWP::RobotUA->new('my-robot/0.1', '[EMAIL PROTECTED]');
my $lp  = HTML::LinkExtor->new(\&wanted_links);

# debug output
print "Surfer variablen ua PRINT: $ua \n";
print "Surfer variablen lp PRINT: $lp \n";

my @links;
get_threads($url);

# loop over each link collected from the index
foreach my $page (@links) {
    my $r = $ua->get($page);
    if ($r->is_success) {
        my $stream = HTML::TokeParser->new(\$r->content)
            or die "Parse error in $page: $!";
        # just printing what was collected
        print Dumper get_thread($stream);
        print "surfer variablen stream PRINT: $stream \n";
    } else {
        warn $r->status_line;
    }
}

sub get_thread {
    my $p = shift;
    my ($title, $name, @thread);
    while (my $tag = $p->get_tag('a', 'span')) {
        if (exists $tag->[1]{'class'}) {
            if ($tag->[0] eq 'span') {
                if ($tag->[1]{'class'} eq 'name') {
                    $name = $p->get_trimmed_text('/span');
                } elsif ($tag->[1]{'class'} eq 'postbody') {
                    my $post = $p->get_trimmed_text('/span');
                    push @thread, { 'name' => $name, 'post' => $post };
                }
            } else {
                if ($tag->[1]{'class'} eq 'maintitle') {
                    $title = $p->get_trimmed_text('/a');
                }
            }
        }
    }
    return { 'title' => $title, 'thread' => \@thread };
}

sub get_threads {
    my $page = shift;
    # @links is filled by the wanted_links callback while parsing
    my $r = $ua->request(HTTP::Request->new(GET => $page),
                         sub { $lp->parse($_[0]) });
    # expand the collected URLs to absolute ones, in place
    my $base = $r->base;
    return [ map { $_ = url($_, $base)->abs; } @links ];
}

sub wanted_links {
    my ($tag, %attr) = @_;
    return unless exists $attr{'href'};
    return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
    push @links, values %attr;
}

> Hello Jobst
>
> I'm afraid I'm unclear what your question is. It is hard to read and
> understand all of the links you gave, as they are all long forum threads.
>
> The code you have written looks reasonable. Can you explain what it is you
> are trying to do and what doesn't work, please? It would help a lot if your
> post explained everything without referring to previous conversations.
>
> Also, your code uses a URL of http://www.mysite.com/forums/, which is
> clearly a placeholder. Are you saying that the live value is
> http://www.phpbbdoctor.com/doc_columns.php?id=24? The success of the
> program depends enormously on the object data, so you need to tell us what
> site you are reading from, or at least the address of a private site that
> gives the same problem.
>
> Rob