How can I extract five text segments on each side of a URI? My own Perl script and its usage are included.

辉王 Sat, 11 Nov 2006 16:18:25 -0800

Hello everyone,

Recently I tried to implement Chakrabarti's algorithm in Perl and
found it difficult to extract the five text segments on each side of
a URL (not counting the anchor text). I eventually got my program to
work, but it runs slowly. Can anybody tell me how to make it faster?
Thanks.


Below is my own Perl module, 'chakrabarti.pm'.

package chakrabarti;
use strict;
use warnings;
use Exporter 'import';
use HTML::TreeBuilder;
use URI;
use constant WIDTH => 5;    # text segments to keep on each side of a link

our @EXPORT = qw/extract_url_and_text/;

# Takes a reference to the HTML source and the base URL (URI object or
# string). Returns [ \%text_hash ], mapping each absolute URL to its anchor
# text plus up to WIDTH text segments on each side.
sub extract_url_and_text {
    my ($html_ref, $base) = @_;
    # Per-call state, so repeated calls do not accumulate earlier pages.
    my (%text_hash, @texts, @anchors);
    my $tree = HTML::TreeBuilder->new_from_content(${$html_ref});
    my $body = $tree->find_by_tag_name('body');
    process($body, \@texts, \@anchors) if $body;
    for my $anchor (@anchors) {
        my ($start, $end, $href) = @{$anchor};
        my $url = URI->new_abs($href, $base);
        # Clamp the window to the array bounds; a slice keeps the
        # segments in document order on both sides of the link.
        my $from = $start - WIDTH < 0       ? 0       : $start - WIDTH;
        my $to   = $end + WIDTH   > $#texts ? $#texts : $end + WIDTH;
        $text_hash{$url} = join(' ', @texts[$from .. $to]);
    }
    $tree->delete;
    return [\%text_hash];
}

# Depth-first walk: collect text segments in document order and record,
# for each <a href>, the index range of the segments it contains.
sub process {
    my ($tag, $texts, $anchors) = @_;
    my $start = @{$texts};
    for my $kid ($tag->content_list) {
        if (ref $kid) {
            process($kid, $texts, $anchors);
        } else {
            push @{$texts}, $kid;
        }
    }
    # Anchors without an href (e.g. <a name="...">) are skipped.
    my $href = $tag->tag eq 'a' ? $tag->attr('href') : undef;
    push @{$anchors}, [$start, $#{$texts}, $href] if defined $href;
}
1;
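
The windowing step of the module can also be looked at in isolation. Below is a minimal sketch, assuming the text segments are already in a flat array; `context_window` is a hypothetical helper that is not part of chakrabarti.pm, and the `t0..t9` strings are stand-in data.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of chakrabarti.pm: return the segments from
# $start..$end plus up to $width segments on each side, in document order.
sub context_window {
    my ($texts, $start, $end, $width) = @_;
    my $from = $start - $width;
    $from = 0 if $from < 0;                  # clamp at the start of the page
    my $to = $end + $width;
    $to = $#{$texts} if $to > $#{$texts};    # clamp at the end of the page
    return @{$texts}[$from .. $to];
}

my @texts = map { "t$_" } 0 .. 9;    # stand-in text segments t0..t9

# Anchor text occupies index 6; ask for two segments of context per side.
print join(' ', context_window(\@texts, 6, 6, 2)), "\n";    # t4 t5 t6 t7 t8

# Near the start of the page the left side is simply shorter.
print join(' ', context_window(\@texts, 1, 1, 2)), "\n";    # t0 t1 t2 t3
```

A single array slice replaces the two per-side loops and keeps the left context in reading order.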

My Perl program then calls this module. Here is a working example:
   
use warnings;
use strict;
use LWP::UserAgent;
use chakrabarti;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://www.cpan.org/');
if ($res->is_success) {
    my $url_text_ref = extract_url_and_text($res->content_ref, $res->base);
    for (keys %{$url_text_ref->[0]}) {
        print $_, "\n", ${$url_text_ref->[0]}{$_}, "\n\n";
    }
}
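
Since the question is about speed, one way to narrow it down is to time the stages separately before changing anything. The sketch below uses the core Time::HiRes module; `time_stage` is a hypothetical helper, and the split/join workload is only a stand-in for the HTML parse and the window-building loop in the real script.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code ref once, report the elapsed wall-clock
# time under a label on STDERR, and pass the code's return value through.
sub time_stage {
    my ($label, $code) = @_;
    my $t0      = [gettimeofday];
    my $result  = $code->();
    my $elapsed = tv_interval($t0);    # seconds since $t0
    printf STDERR "%-8s %.4f s\n", $label, $elapsed;
    return $result;
}

# Stand-in workload; in the real script the two stages would be the
# HTML::TreeBuilder parse and the loop that builds the context windows.
my $chunks = time_stage('split', sub { [ split ' ', 'a b c ' x 10_000 ] });
my $joined = time_stage('join',  sub { join ' ', @{$chunks} });
print scalar(@{$chunks}), " chunks\n";
```

If the parse turns out to dominate, tuning the windowing code will not help much, and vice versa.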
   
Chakrabarti's paper is available here:
http://www.cs.berkeley.edu/~soumen/doc/www2002m/p336-chakrabarti.pdf
   
Good luck!
   
Hui Wang

                