Hello everyone, I have recently been trying to implement Chakrabarti's algorithm in Perl, and I found it difficult to extract the five text chunks on each side of a URL (not counting the anchor text itself). I eventually got my program working, but it runs slowly. Can anybody tell me how to improve its running speed? Thanks.
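To make the goal concrete, here is a minimal sketch of the windowing step on its own. It assumes the page text has already been flattened into an array of chunks in document order and that an anchor covers chunks $start..$end. The name context_window and its arguments are made up for illustration and are not part of my module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the anchor text plus up to $width text
# chunks on each side, joined with single spaces. The window is clamped
# to the array bounds and taken as one array slice.
sub context_window {
    my ($texts, $start, $end, $width) = @_;

    my $lo = $start - $width;
    $lo = 0 if $lo < 0;

    my $hi = $end + $width;
    $hi = $#{$texts} if $hi > $#{$texts};

    return join ' ', @{$texts}[$lo .. $hi];
}

# Ten dummy chunks t0..t9; the anchor covers t4 and t5, with a width of 2.
my @texts = map { "t$_" } 0 .. 9;
print context_window(\@texts, 4, 5, 2), "\n";    # prints "t2 t3 t4 t5 t6 t7"
```

A single slice keeps the chunks in document order and avoids building the string piece by piece, though I have not measured whether this step is where the time actually goes.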
Below is the Perl module I wrote, named 'chakrabarti.pm':

    #!/usr/bin/perl
    package chakrabarti;

    use warnings;
    use strict;

    require Exporter;
    our @ISA    = qw/Exporter/;
    our @EXPORT = qw/extract_url_and_text/;

    use HTML::TreeBuilder;
    use URI;

    use constant WIDTH => 5;    # text chunks kept on each side of an anchor

    my @texts_arr    = ();      # text chunks in document order
    my @anchor_index = ();      # [first chunk, last chunk, href] for each <a>

    sub extract_url_and_text {
        # $base may be a URI object or a plain string.
        my ($html_ref, $base) = @_;
        my %text_hash;

        # Reset the module-level state so repeated calls do not accumulate.
        @texts_arr    = ();
        @anchor_index = ();

        my $tree     = HTML::TreeBuilder->new_from_content(${$html_ref});
        my $body_tag = $tree->find_by_tag_name('body');
        process($body_tag) if $body_tag;

        for (@anchor_index) {
            my ($start_index, $end_index, $url) = @$_;
            next unless defined $url;    # skip <a> elements without href
            $url = URI->new_abs($url, $base);

            # Clamp the window to the array bounds and take one slice,
            # which also keeps the left context in document order.
            my $left = $start_index - WIDTH;
            $left = 0 if $left < 0;
            my $right = $end_index + WIDTH;
            $right = $#texts_arr if $right > $#texts_arr;

            $text_hash{$url} = join(' ', @texts_arr[$left .. $right]);
        }
        $tree->delete;
        return [\%text_hash];
    }

    # Walk the tree depth-first, collecting text chunks and recording
    # which range of chunks each <a> element covers.
    sub process {
        my $tag         = shift;
        my $is_anchor   = $tag->tag eq 'a';
        my $start_index = @texts_arr;

        foreach my $kid ($tag->content_list) {
            if (ref $kid) {
                process($kid);
            }
            else {
                push @texts_arr, $kid;
            }
        }

        if ($is_anchor) {
            push @anchor_index, [$start_index, $#texts_arr, $tag->attr('href')];
        }
    }

    1;

Then I can call this module from my Perl program. Below is a working example:

    use warnings;
    use strict;
    use LWP::UserAgent;
    use chakrabarti;

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get('http://www.cpan.org/');
    if ($res->is_success) {
        my $url_text_ref = extract_url_and_text($res->content_ref, $res->base);
        for (keys %{ $url_text_ref->[0] }) {
            print $_, "\n", $url_text_ref->[0]{$_}, "\n\n";
        }
    }

Chakrabarti's article is here: http://www.cs.berkeley.edu/~soumen/doc/www2002m/p336-chakrabarti.pdf

Good luck!
Hui Wang