On 01/22/2013 01:26 PM, Ferrous Cranus wrote:
<snip>
sub hashit {
    my $url = shift;
    my @ltrs = split(//, $url);
    my $hash = 0;
    foreach my $ltr (@ltrs) {
        $hash = ($hash + ord($ltr)) % 10000;
    }
    printf "%s: %0.4d\n", $url, $hash;
}
which yields:
$ perl testMD5.pl
/index.html: 1066
/about/time.html: 1547
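
For readers on this list, here is a rough Python equivalent of the
Perl sub above. It is only a sketch of the same idea (sum the character
codes, modulo 10000); the function returns the number instead of
printing it, and it assumes plain ASCII paths so that ord() agrees with
Perl's.

```python
def hashit(url: str) -> int:
    """Sum the character codes of url, modulo 10000 (port of the Perl sub)."""
    total = 0
    for ch in url:
        total = (total + ord(ch)) % 10000
    return total


if __name__ == "__main__":
    for url in ("/index.html", "/about/time.html"):
        # Zero-pad to 4 digits, like the Perl printf format.
        print(f"{url}: {hashit(url):04d}")
```

Because the sum ignores the order of the characters, any two paths
that are anagrams of each other (say /ab.html and /ba.html) get the
same number.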
If you use that algorithm to get a 4-digit number, it'll look good for
the first few files. But if you try 100 files, you've got almost a 40%
chance of a collision, and if you try 10001, a collision is guaranteed,
since there are only 10000 possible values.
So is it really okay to reuse the same integer for different files?
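
The figures above are the usual birthday-problem arithmetic. A minimal
sketch in Python, assuming the hash spreads files evenly over the 10000
possible values (a sum of character codes will not, so real results
should be worse); the function name and the sample counts are mine:

```python
def collision_probability(n_files: int, n_slots: int = 10_000) -> float:
    """Chance that at least two of n_files share a value, for a uniform hash."""
    if n_files > n_slots:
        return 1.0  # pigeonhole: more files than possible values
    p_all_unique = 1.0
    for i in range(n_files):
        # The (i+1)-th file must avoid the i values already taken.
        p_all_unique *= (n_slots - i) / n_slots
    return 1.0 - p_all_unique


if __name__ == "__main__":
    for n in (10, 100, 1000, 10001):
        print(n, round(collision_probability(n), 3))
```

For 100 files this comes out at roughly 0.39, and for 10001 files it is
exactly 1.0.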
I tried to help you when you were using the md5 algorithm. By using
enough digits/characters, you can make the likelihood of a collision
quite small. But 4 digits? Don't be ridiculous.
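
To put a number on "enough digits", here is a hedged sketch in Python.
It is not the code from the earlier md5 discussion; short_id,
approx_collision_probability and the default of 12 hex characters are
assumptions made up for illustration. It keeps the first few hex
characters of the MD5 digest and uses the standard birthday
approximation to estimate the collision risk.

```python
import hashlib
import math


def short_id(url: str, digits: int = 12) -> str:
    """Return the first `digits` hex characters of the MD5 digest of url."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:digits]


def approx_collision_probability(n_files: int, digits: int) -> float:
    """Birthday approximation 1 - exp(-n(n-1)/(2N)) with N = 16**digits."""
    slots = 16 ** digits
    exponent = n_files * (n_files - 1) / (2 * slots)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    print(short_id("/index.html"))
    for digits in (4, 8, 12):
        print(digits, approx_collision_probability(1000, digits))
```

With 1000 files, 4 hex characters still collide almost certainly under
this estimate, while 12 characters bring the risk down to a few parts
in a billion.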
--
DaveA