From: Rob Dixon <[EMAIL PROTECTED]>
> Jenda Krynicky wrote:
> > From: Nitin Kalra <[EMAIL PROTECTED]>
> >>
> >> In a Perl script of mine I have to compare two 8M-10M
> >> files (each), which means 80-90M searches. As a normal
> >> procedure (up to 1M) I use hashes, but going beyond 1M
> >> system performance degrades drastically.
> > 
> > You mean you compute MD5 (or something similar) hashes of the files
> > and store them in a %hash for the first step of the file comparison?
> > Maybe the %hash grows too big to fit in memory together with the
> > other stuff you need and forces the OS to start paging. If this is
> > the case you may get better performance storing the %hash on disk
> > using DB_File or a similar module. Apart from the %hash declaration
> > this should not force any changes to your code, but will drastically
> > lower the memory footprint. Though of course, in cases where the
> > %hash would fit in memory, this will be slower.
> 
> AFAIK, regardless of the size of the source data, calculating an MD5 needs
> 128 bits, plus presumably a secondary copy of them, plus a few more. Let's
> say 64 bytes to be safe. It's not going to challenge the memory of any
> current PC.
> 
> Rob

That's the %hash keys only. For the %hash containing the MD5 hashes
to be of any use, I presume it contains some values as well, like the
paths to the files. And you have 16-20 million of those. With 500 B
per entry we are at 8-10 GB. Considering that this is most likely not
the only process running on the machine, it may very well challenge
the memory of the computer.
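
Switching to DB_File is usually just a matter of adding a tie(). A
rough, untested sketch (the database file name, reading one path per
line from STDIN, and the digest => path layout are just made up for
the example):

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;          # O_RDWR, O_CREAT
use DB_File;
use Digest::MD5;

# Tie the %hash to a Berkeley DB file so the entries live on disk
# instead of in RAM.
my %digest_of;
tie %digest_of, 'DB_File', 'digests.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie digests.db: $!";

# One file path per line on STDIN (assumption for the example).
while (my $path = <STDIN>) {
    chomp $path;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;

    # digest => path; apart from the tie() above this is exactly the
    # same code you would use with an ordinary in-memory %hash
    $digest_of{$md5} = $path;
}

untie %digest_of;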

Jenda
===== [EMAIL PROTECTED] === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed 
to get drunk and croon as much as they like.
        -- Terry Pratchett in Sourcery



