Hi all,

I am trying to create a search page using the swish-e search engine.

I want to spider some web pages, and swish-e comes with a Perl program named
swishspider.pl that fetches a page and extracts the addresses of all the
links in it.

It extracts the links only if the file's content type is text/html. Some of
my pages have the .shtml extension, and those are ignored.

If I change the .shtml extension to .html, the spider works and reports that
the file is text/html; otherwise it tells me that the file is text/plain.

What can I do to make it recognize that .html, .shtml, .htm, ... are all
text/html files?
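
For example, would a fallback like this work, added right after the response
comes back in swishspider.pl? (Just a sketch; the extension list and the
$content_type variable are my own invention, not anything from swish-e.)

my $content_type = $response->header( "content-type" );
if( $content_type =~ m!^text/plain! && $url =~ /\.(s?html?)$/i ) {
    # The server said text/plain, but the extension looks like HTML,
    # so treat it as text/html for the checks below.
    $content_type = "text/html";
}

The two places the script checks the content-type would then use
$content_type instead of calling $response->header("content-type") directly.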

Thank you very much!
Here is that little Perl script:

#!/perl/bin/perl -W
use strict;

use LWP::UserAgent;
use HTTP::Status;
use HTML::LinkExtor;

if (scalar(@ARGV) != 2) {
    print STDERR "Usage: SwishSpider localpath url\n";
    exit(1);
}

my $ua = new LWP::UserAgent;
$ua->agent( "SwishSpider http://swish-e.org" );

my $localpath = shift;
my $url = shift;

my $request = new HTTP::Request( "GET", $url );
my $response = $ua->simple_request( $request );

#
# Write out important meta-data.  This includes the HTTP code.  Depending on
# the code, we write out other data.  Redirects have the location printed,
# everything else gets the content-type.
#
open( RESP, ">$localpath.response" ) || die( "Could not open response file $localpath.response" );

print RESP $response->code() . "\n";
if( $response->code() == RC_OK ) {
    print RESP $response->header( "content-type" ) . "\n";
} elsif( $response->is_redirect() ) {
    print RESP $response->header( "location" ) . "\n";
}
close( RESP );

#
# Write out the actual data assuming the retrieval was successful.  Also, if
# we have actual data and it's of type text/html, write out all the links it
# refers to
#
if( $response->code() == RC_OK ) {
    my $contents = $response->content();

    open( CONTENTS, ">$localpath.contents" ) || die( "Could not open contents file $localpath.contents\n" );
    print CONTENTS $contents;
    close( CONTENTS );

    if( $response->header("content-type") =~ m!text/html! ) {
        open( LINKS, ">$localpath.links" ) || die( "Could not open links file $localpath.links\n" );
        my $p = HTML::LinkExtor->new( \&linkcb, $url );
        $p->parse( $contents );

        close( LINKS );
    }
}


sub linkcb {
    my($tag, %links) = @_;
    if (($tag eq "a") && ($links{"href"})) {
        my $link = $links{"href"};

        #
        # Remove fragments
        #
        $link =~ s/(.*)#.*/$1/;

        #
        # Remove ../  This is important because the abs() function
        # can leave these in and cause never ending loops.
        #
        $link =~ s/\.\.\///g;

        # hack for apostrophe -- changes URL, but should work for most clients.
        $link =~ s/'/%27/g;

        print LINKS "$link\n";
    }
}
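
In case it helps, this is how I am running the script by hand (the local
path here is just an example I made up):

perl swishspider.pl /tmp/swishtest http://www.example.com/page.shtml
# writes /tmp/swishtest.response and /tmp/swishtest.contents,
# plus /tmp/swishtest.links when the content-type is text/html

For the .shtml page above, /tmp/swishtest.response reports text/plain, so no
.links file gets written.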


Teddy,
[EMAIL PROTECTED]


