Hi, J77,

Your question is not entirely clear, but I will try to answer what *I* think
you are asking. First, I will assume that you have a list of web sites, some
with and some without the "http://" and "www." prefixes, that you want to
index, like so:

http://www.microsoft.com
http://ibm.com
www.ebay.com
yahoo.com

Second, I will assume that if you have already indexed
"http://www.microsoft.com" and you later come across "microsoft.com", you
do NOT want to index Microsoft again, because you have already seen the
site in one of the six forms you listed below.

The following script does that; given the 8 lines of sample data, it indexes
only 4 sites:

------BEGIN CODE------
#!/usr/bin/perl
use warnings;
use strict;

my %seen;

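while (<DATA>) {
  # Strip an optional "http://" and "www.", then capture the host name
  # up to the first slash or whitespace; skip lines that do not match.
  my ($site) = m!^(?:http://)?(?:www\.)?([^/\s]+)! or next;
  next if exists $seen{$site};   # already indexed under another form
  # code to index $site here
  $seen{$site} = undef;          # remember the host; only the key matters
}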

__DATA__
http://www.microsoft.com
http://ibm.com/
www.ebay.com
yahoo.com
microsoft.com/
www.ibm.com/
http://www.ebay.com
www.yahoo.com/
-------END CODE-------
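
As a quick sanity check, here is a small sketch that runs the same pattern
over the six forms from your message. I am assuming the prefix was meant to
be "http://" with two slashes, and the names @forms and %keys are just mine
for illustration. All six forms should collapse to a single key:

```perl
#!/usr/bin/perl
use warnings;
use strict;

# The six forms from the question, assuming "http://" was intended.
my @forms = qw(
  http://www.mysite.com/
  http://www.mysite.com
  http://mysite.com/
  http://mysite.com
  www.mysite.com/
  www.mysite.com
);

# Count how many forms map to each normalized host name.
my %keys;
for my $url (@forms) {
  my ($site) = $url =~ m!^(?:http://)?(?:www\.)?([^/\s]+)! or next;
  $keys{$site}++;
}

print "$_ => $keys{$_}\n" for sort keys %keys;   # prints "mysite.com => 6"
```

Note that the pattern is case-sensitive and only knows about "http://", so
"WWW.MySite.com" or an "https://" URL would not reduce to the same key.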

If you have another file containing links that have already been indexed,
simply populate %seen from that file's entries (reduced to bare host names
with the same pattern) before the while loop that reads the new sites.
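
A rough sketch of that idea follows. The helper names site_key and
load_seen are my own, and an in-memory string stands in for your file of
already-indexed links (I do not know its real name or layout, so
"indexed.txt" in the comment is only a placeholder):

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Reduce a URL to its bare host name, or return undef if it does not match.
sub site_key {
  my ($line) = @_;
  return $line =~ m!^(?:http://)?(?:www\.)?([^/\s]+)! ? $1 : undef;
}

# Read previously indexed links from a filehandle into the %seen hash.
sub load_seen {
  my ($fh, $seen) = @_;
  while (my $line = <$fh>) {
    my $site = site_key($line);
    next unless defined $site;
    $seen->{$site} = undef;
  }
  return;
}

my %seen;

# Stand-in for the already-indexed file (hypothetical contents). With a
# real file you would open it by name instead, e.g. 'indexed.txt'.
my $indexed = "http://www.microsoft.com\nibm.com/\n";
open my $old, '<', \$indexed or die "Cannot open in-memory file: $!";
load_seen($old, \%seen);
close $old;

# New links to consider (sample data only).
my @incoming = qw(
  www.microsoft.com/
  http://ibm.com
  www.ebay.com
  yahoo.com
  http://www.ebay.com
);

my @new_sites;
for my $url (@incoming) {
  my $site = site_key($url);
  next unless defined $site;
  next if exists $seen{$site};
  push @new_sites, $site;        # code to index $site here
  $seen{$site} = undef;
}

print "$_\n" for @new_sites;     # prints ebay.com, then yahoo.com
```

Because both the old and the new links go through the same site_key sub,
a site indexed earlier as "http://www.microsoft.com" is skipped when it
shows up again as "www.microsoft.com/".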

I hope this helps,
ZO


<[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]

> > if ($siteurl2 =~ /^(?:www.)?$FORM{'siteurl'}\/?$/) {
> >    print "Matched";
> > }

<snip>

> Ok, maybe I wasn't clear. What I want to do is check a URL against urls in
> a list, so that all six forms of the url will match so duplicates won't be
> indexed.
>
> http:/www.mysite.com/
> http:/www.mysite.com
> http:/mysite.com/
> http:/mysite.com
> www.mysite.com/
> www.mysite.com


