On 01/22/2013 01:26 PM, Ferrous Cranus wrote:

<snip>

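sub hashit {
    my $url  = shift;
    my @ltrs = split(//, $url);    # one element per character
    my $hash = 0;

    # Sum the character codes, keeping only the last four digits.
    foreach my $ltr (@ltrs) {
        $hash = ($hash + ord($ltr)) % 10000;
    }
    printf "%s: %0.4d\n", $url, $hash;
}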


which yields:
$ perl testMD5.pl
/index.html: 1066
/about/time.html: 1547
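
For anyone following along in Python rather than Perl, here is a rough port of the same routine. It is only a sketch: the function name is kept from the Perl, and the two anagram paths at the end are made-up examples, not files from the original post. Because the routine only sums character codes, any two paths built from the same characters in a different order get the same number.

```python
def hashit(url):
    """Sum the character codes of url, keeping only the last four digits."""
    total = 0
    for ch in url:
        total = (total + ord(ch)) % 10000
    return total


# The two paths from the Perl run above.
for path in ("/index.html", "/about/time.html"):
    print(f"{path}: {hashit(path):04d}")

# Hypothetical paths: same characters, different order, so the sums match.
print(hashit("/ab.html") == hashit("/ba.html"))
```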


If you use that algorithm to get a 4-digit number, it will look fine for the first few files. But with 100 files you have almost a 40% chance of a collision (and that assumes the values are spread evenly over 0000-9999), and with 10001 files a collision is guaranteed, since there are only 10000 possible values.
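
Those figures come from the usual birthday-problem arithmetic. A minimal Python sketch, assuming every 4-digit value is equally likely (a sum of character codes is actually less evenly spread than that, so real collisions would show up sooner); the function name and the file counts in the loop are only illustrative:

```python
def collision_probability(n_files, n_slots=10_000):
    """Chance that at least two of n_files share a value, assuming uniform hashes."""
    if n_files > n_slots:
        return 1.0  # pigeonhole: more files than possible values
    p_all_unique = 1.0
    for i in range(n_files):
        # The i-th file must avoid the i values already taken.
        p_all_unique *= (n_slots - i) / n_slots
    return 1.0 - p_all_unique


for n in (10, 100, 1000, 10001):
    print(f"{n:>6} files: {collision_probability(n):.1%}")
```

For 100 files this comes out at roughly 39%, and for 10001 files it is exactly 100%.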


So is it really okay to reuse the same integer for different files?

I tried to help you when you were using the md5 algorithm. By keeping enough digits/characters of the digest, you can make the likelihood of a collision quite small. But 4 digits? Don't be ridiculous.
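
To put rough numbers on "enough digits", here is a sketch that keeps the first few hex characters of an md5 digest and estimates the collision chance with the standard birthday approximation. The helper names and the choice of 12 characters are my own assumptions for illustration, not anything from the earlier thread.

```python
import hashlib
import math


def short_id(path, hex_digits=12):
    """Hypothetical helper: the first hex_digits characters of the md5 of path."""
    return hashlib.md5(path.encode("utf-8")).hexdigest()[:hex_digits]


def approx_collision_probability(n_files, hex_digits):
    """Birthday approximation 1 - exp(-n(n-1)/(2N)), with N = 16**hex_digits."""
    n_slots = 16 ** hex_digits
    return -math.expm1(-n_files * (n_files - 1) / (2 * n_slots))


print(short_id("/index.html"))
for digits in (4, 8, 12):
    p = approx_collision_probability(100, digits)
    print(f"{digits:>2} hex digits, 100 files: {p:.3g}")
```

Each extra hex character multiplies the number of possible values by 16, so the collision chance for a fixed number of files drops off quickly as the identifier gets longer.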


--
DaveA
--
http://mail.python.org/mailman/listinfo/python-list
