Hello list, hello Rob,

Many thanks for the reply.

To avoid confusion, I am sending this first reply to your address, not to the list. I am 
aware that I have to explain the issue, the problem, and the requirements more 
clearly, and I will try to do so here.

Rob, please give me feedback on this; if you need more input, please let me 
know. I will do everything I can!


So let me start by describing the problems:


I need to collect some of the data from a site. Here is an example: 
http://www.bamaclubgp.org/forum/sitemap.php
This is very similar to the site I am interested in.

Why do I need to harvest and collect this data, you may ask: I am a researcher 
and I want to do some socio-ethnographic research (see the research field 
described at http://opensource.mit.edu and
http://opensource.mit.edu/online_papers.php ). That is why I need the data and 
want to harvest it.

Harvesting here means an integrated set of tools to gather, extract, organize, 
search, cache, and replicate relevant information. I need to gather information 
out of a phpBB2 board. The question is: can we tailor HTTrack to harvest and to 
digest information in different formats? I need to fetch data out of an 
online forum (a phpBB board) and store it locally in a MySQL database.
Is this possible with Perl?


Some first snippets to solve it were available here:
http://forums.devshed.com/perl-programming-6/data-grabbing-and-mining-need-scripthelp-370550.html
http://forums.devshed.com/perl-programming-6/minor-change-in-lwp-need-ideas-how-to-accomplish-388061.html

You already reviewed it, at a first glance. Now the problem is: I have to 
get an almost full and complete data set from the site above.

In my view, the problem has two major parts:

1. grabbing the data out of the site and parsing it; and finally
2. storing the data in the new (local) database.

Well, the question of restoring is not too hard if I can pull an almost 
complete thread data set out of the site.
The tables are shown on this page: 
http://www.phpbbdoctor.com/doc_columns.php?id=24

If we do the first job well (grabbing the data out of the site and parsing it), 
then the second job would not be too hard. I would then have a large file of 
CSV data as a result, wouldn't I? The final question is: how can the job of 
restoring be done? Then I would have a full set of data. I guess it can be done 
with some help from the guys on the http://www.phpBB.com team.
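To make the CSV step concrete, here is a minimal sketch of writing parsed posts out as a CSV file. It assumes the Text::CSV module from CPAN and data in the same hashref shape that the get_thread() routine in my script returns; the file name and the sample values are placeholders, not real data:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;  # CPAN module that handles quoting and escaping

# hypothetical data, in the shape get_thread() returns
my $thread = {
    title  => 'Example topic',
    thread => [
        { name => 'alice', post => 'First post' },
        { name => 'bob',   post => 'A reply, with a comma in it' },
    ],
};

my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die Text::CSV->error_diag;

open my $fh, '>', 'threads.csv' or die "threads.csv: $!";

# one row per post: topic title, poster name, post body
for my $post (@{ $thread->{thread} }) {
    $csv->print($fh, [ $thread->{title}, $post->{name}, $post->{post} ]);
}

close $fh or die "threads.csv: $!";
```

Text::CSV takes care of quoting commas, quotes, and newlines inside post bodies, which a naive join(',') would get wrong.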

The question is: how should I get the data with the robot user agent? Does the 
agent give me back most of the data, so that I can use it for an investigation? 
By the way, the investigation needs to be done with some retrieval operations, 
so I need to store the gathered data in a MySQL database.

Well, that's it. I need to build up an almost 100 percent copy of the original 
site and store it locally, here on my machine. I need to collect some of the 
data from the site I am interested in: 
http://www.karakas-online.de/forum/sitemap.php

Once the data has been gathered with a script, I have to set up Perl DBI and 
try to store the data in a phpBB database.
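As a first sketch of the DBI side, something like the following might work. The connection string, the credentials, and the phpbb_posts_text table and column names are assumptions based on the standard phpBB2 schema (please verify them against the phpbbdoctor table listing); this is an untested outline, not a finished import script:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;  # requires DBD::mysql for a MySQL connection

# placeholder credentials - adjust for your local XAMPP MySQL setup
my $dbh = DBI->connect(
    'DBI:mysql:database=phpbb;host=localhost',
    'user', 'password',
    { RaiseError => 1, AutoCommit => 1 },
);

# in phpBB2, post bodies live in the phpbb_posts_text table
my $sth = $dbh->prepare(
    'INSERT INTO phpbb_posts_text (post_id, post_subject, post_text)
     VALUES (?, ?, ?)'
);

# hypothetical parsed post, in the shape get_thread() produces
my $post_id = 1;
$sth->execute($post_id, 'Example topic', 'First post');

$dbh->disconnect;
```

Using placeholders (?) in the prepared statement lets DBI handle the quoting of the post text, which will be full of apostrophes and markup.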

Rob, what do you think about it? Can we do this?


Rob, perhaps with a good converter, or at least part of one, I can restore the 
whole CSV dump with ease.
What do you think? If we do the first job, then I think the second part can be 
done as well.

Rob, I look forward to hearing from you.
best regards

martin aka jobst

Rob, here is the script:
> >
> > #!e:/Server/xampp/perl/bin/perl.exe -w
> > use strict;
> > use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
> > use CGI;
> > my $cgi = CGI->new();
> > print $cgi->header();
> > warningsToBrowser(1); #
> > use warnings;
> >
> > use LWP::RobotUA;
> > use HTML::LinkExtor;
> > use HTML::TokeParser;
> > use URI::URL;
> >
> > use Data::Dumper; # for show and troubleshooting
> >
> > my $url = "http://www.mysite.com/forums/";;
> > my $ua = LWP::RobotUA->new('my-robot/0.1', '[EMAIL PROTECTED]');
> > my $lp = HTML::LinkExtor->new(\&wanted_links); # declared once, not twice
> >
> >
> >
> > # the HTTP header was already sent by $cgi->header() above
> > print "Surfer variablen ua PRINT: $ua \n";
> > print "Surfer variablen lp PRINT: $lp \n";
> >
> > my @links;
> > get_threads($url); # fills @links via the wanted_links callback
> >
> > foreach my $page (@links) { # this loops over each link collected from the 
> > index
> > my $r = $ua->get($page);
> > if ($r->is_success) {
> > my $stream = HTML::TokeParser->new(\$r->content) or die "Parse error in 
> > $page: $!";
> > # just printing what was collected
> > print Dumper get_thread($stream);
> >
> > print "surfer variablen stream PRINT: $stream \n";
> >
> > } else {
> > warn $r->status_line;
> > }
> > }
> >
> > sub get_thread {
> > my $p = shift;
> > my ($title, $name, @thread);
> > while (my $tag = $p->get_tag('a','span')) {
> > if (exists $tag->[1]{'class'}) {
> > if ($tag->[0] eq 'span') {
> > if ($tag->[1]{'class'} eq 'name') {
> > $name = $p->get_trimmed_text('/span');
> > } elsif ($tag->[1]{'class'} eq 'postbody') {
> > my $post = $p->get_trimmed_text('/span');
> > push @thread, {'name'=>$name, 'post'=>$post};
> > }
> > } else {
> > if ($tag->[1]{'class'} eq 'maintitle') {
> > $title = $p->get_trimmed_text('/a');
> > }
> > }
> > }
> > }
> > return {'title'=>$title, 'thread'=>\@thread};
> > }
> >
> > sub get_threads {
> > my $page = shift;
> > my $r = $ua->request(HTTP::Request->new(GET => $page), sub 
> > {$lp->parse($_[0])});
> > # Expand URLs to absolute ones
> > my $base = $r->base;
> > return [map { $_ = url($_, $base)->abs; } @links];
> > }
> >
> > sub wanted_links {
> > my($tag, %attr) = @_;
> > return unless exists $attr{'href'};
> > return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
> > push @links, values %attr;
> > }
>
> Hello Jobst
>
> I'm afraid I'm unclear what your question is. It is hard to read and 
> understand
> all of the links you gave as they are all long forum threads.
>
> The code you have written looks reasonable. Can you explain what it is you are
> trying to do and what doesn't work please? It would help a lot if your post
> explained everything without referring to previous conversations.
>
> Also, your code uses a URL of http://www.mysite.com/forums/, which is clearly 
> a
> placeholder. Are you saying that the live value is
> http://www.phpbbdoctor.com/doc_columns.php?id=24? The success of the program
> depends enormously on the object data, so you need to tell us what site you 
> are
> reading from, or at least the address of a private site that gives the same 
> problem.
>
> Rob
>
>
>
>



--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/

