Bug#449602: perforate: finddup is a damn memory hog

Joerg Jaspert Tue, 06 Nov 2007 15:54:01 -0800

Package: perforate
Version: 1.2-5
Severity: important

Hi


The subject says it all - finddup eats all memory.

Looking into it, it's the Digest::MD5 sum usage of finddup; it seems the
author never tested with large files, or lots of files (worth a terabyte
of disk space). Kicking the "addfile" call and reading the file in a
while loop instead has two effects:

- does not eat any noticeable amount of memory anymore
  (other than what's needed for the file list).
- is noticeably faster, for whatever reason.

--- /usr/bin/finddup	2006-08-18 23:09:57.000000000 +0200
+++ /home/joerg/finddup	2007-11-07 00:33:01.827142588 +0100
@@ -131,11 +131,19 @@
 sub insert_md5
 {
    my $file = shift;
+   my $data;
+
    if (open(IN, "<", $file->[4]->[0]))
    {
-      my $md5 = Digest::MD5->new->addfile(*IN)->hexdigest;
-      $md5 .= "\t".$file->[1]."\t".$file->[2]."\t".$file->[3] unless $opt->{'ignore-perms'};
+      my $check = Digest::MD5->new;
+      while (sysread(IN, $data, 8192))
+      {
+         $check->add($data);
+      }
       close IN;
+      my $md5 = $check->hexdigest;
+
+      $md5 .= "\t".$file->[1]."\t".$file->[2]."\t".$file->[3] unless $opt->{'ignore-perms'};
       $md5list{$md5} = [] unless exists $md5list{$md5};
       push @{$md5list{$md5}}, $file;
    }
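
For trying the chunked read outside finddup, here is a minimal standalone
sketch of the same idea. The sub name md5_of_file and its optional
$bufsize argument are made up for illustration and are not part of
finddup. Unlike the patch, it assumes a lexical filehandle, sets binmode,
and returns nothing when a read fails instead of hashing partial data.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper, not part of finddup: hash a file by feeding it to
# Digest::MD5 one buffer at a time instead of calling addfile.
sub md5_of_file
{
   my ($path, $bufsize) = @_;
   $bufsize ||= 8192;   # same chunk size as in the patch above

   open(my $in, "<", $path) or return;
   binmode($in);

   my $check = Digest::MD5->new;
   my $data;
   while (1)
   {
      my $n = sysread($in, $data, $bufsize);
      if (!defined $n)
      {
         # read error: give up rather than return a digest of partial data
         close $in;
         return;
      }
      last if $n == 0;   # end of file
      $check->add($data);
   }
   close $in;
   return $check->hexdigest;
}
```

Called in scalar context, as in "my $sum = md5_of_file($name);", it gives
the hex digest, or undef if the file could not be opened or read.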

-- 
bye Joerg
Some AM to his NM on [11 Aug. 2004]:
You already won't get through Front Desk and Account Manager approvals before 
sarge,[...]
[Note: He made it! :) ]

