On Fri, 11 Mar 2005 14:06:27 -0800, David Eppstein <[EMAIL PROTECTED]> wrote:
>In article <[EMAIL PROTECTED]>,
> Patrick Useldinger <[EMAIL PROTECTED]> wrote:
>
>> > Well, but the spec didn't say efficiency was the primary criterion, it
>> > said minimizing the number of comparisons was.
>>
>> That's exactly what my program does.
>
>If you're doing any comparisons at all, you're not minimizing the number
>of comparisons.
>
ISTM a lot will depend on the distribution of differences between candidate
(equal-sized, obviously) files. If multi-GB files differ in the last byte only
(but this is not known in advance), they will all have to be crunched to the
last byte to eliminate them from possible duplicate-set membership.

If there is a priori knowledge of highly probable difference locations (e.g.,
an internal date stamp at some offset, etc.), then obviously the candidates
can be pre-classified into smaller candidate sets by some simple heuristic
probes.

If differences are likely to appear early in a sequential scan, perhaps
developing hashes in parallel would be a good strategy. But I doubt that
blindly pre-computing full hashes would be optimal. Compare to sort|uniq
as a sieve: you wouldn't want to read through any file any further than
you had to.

Regards,
Bengt Richter
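P.S. To make the "developing hashes in parallel" idea a bit more concrete,
here is a rough, untested sketch (the function name group_duplicates, the
block size, and the block-at-a-time splitting are just my own illustration
of one way to do it, not anyone's posted code): read all equal-sized
candidates one block at a time and split the candidate set whenever a block
differs, so no file gets read any further than is needed to separate it from
the others.

import os

def group_duplicates(paths, block_size=8192):
    # Group candidate files by size first; only same-size files can match.
    by_size = {}
    for p in paths:
        by_size.setdefault(os.path.getsize(p), []).append(p)

    duplicates = []
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        # Each work item: (candidate paths, byte offset verified equal so far).
        work = [(group, 0)]
        while work:
            candidates, offset = work.pop()
            if offset >= size:
                # Reached end of file with no difference found: duplicates.
                duplicates.append(candidates)
                continue
            # Read the next block of each candidate and split on its contents.
            buckets = {}
            for p in candidates:
                f = open(p, 'rb')
                f.seek(offset)
                buckets.setdefault(f.read(block_size), []).append(p)
                f.close()
            # Singleton buckets are unique files; bigger ones need more reading.
            for subgroup in buckets.values():
                if len(subgroup) > 1:
                    work.append((subgroup, offset + block_size))
    return duplicates

Something like group_duplicates(list_of_same_dir_paths) would return a list
of lists, one per set of byte-identical files. In the multi-GB worst case
above this still reads everything, but when differences show up early most
candidates drop out after the first block or two.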