On Fri, 11 Mar 2005 14:06:27 -0800, David Eppstein <[EMAIL PROTECTED]> wrote:

>In article <[EMAIL PROTECTED]>,
> Patrick Useldinger <[EMAIL PROTECTED]> wrote:
>
>> > Well, but the spec didn't say efficiency was the primary criterion, it 
>> > said minimizing the number of comparisons was.
>> 
>> That's exactly what my program does.
>
>If you're doing any comparisons at all, you're not minimizing the number 
>of comparisons.
>
ISTM a lot will depend on the distribution of differences between
candidate (equal-sized, obviously) files. If multi-GB files differ in
the last byte only (but this is not known), they will all have to be
crunched to the last byte to eliminate them from possible
duplicate-set membership. If there is a priori knowledge of highly
probable difference locations (e.g., an internal date stamp at some
offset, etc.), then obviously the candidates can be pre-classified
into smaller candidate sets by some simple heuristic probes.
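
Something along these lines (an untested sketch; the function name,
the probe offsets and the probe length are just made up for
illustration) is what I mean by pre-classifying:

    def classify_by_probes(paths, probe_offsets, probe_len=8):
        # Group equal-sized candidate files by a few bytes read at
        # offsets where differences are thought to be likely (the
        # offsets are pure assumption here, e.g. an internal date
        # stamp field).
        groups = {}
        for path in paths:
            f = open(path, 'rb')
            key = []
            for off in probe_offsets:
                f.seek(off)
                key.append(f.read(probe_len))
            f.close()
            groups.setdefault(tuple(key), []).append(path)
        # Only groups with more than one member can still contain
        # duplicates; the rest are eliminated without further reading.
        return [g for g in groups.values() if len(g) > 1]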

If differences are likely to appear early in a sequential scan,
perhaps developing hashes in parallel would be a good strategy. But I
doubt blindly pre-computing full hashes would be optimal. Compare to
sort|uniq as a sieve: you wouldn't want to read through any files any
further than you had to.
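
A rough (untested) sketch of that kind of sieve, assuming the
candidates in a group are already known to be equal-sized:

    def find_duplicate_sets(paths, blocksize=64 * 1024):
        # Compare a group of equal-sized candidate files by reading
        # them in parallel, one block at a time, and splitting the
        # group whenever the current blocks differ.  A file stops
        # being read as soon as it becomes unique.
        handles = dict((p, open(p, 'rb')) for p in paths)
        pending = [list(paths)]   # groups whose members still match
        duplicates = []
        while pending:
            group = pending.pop()
            if len(group) < 2:
                for p in group:
                    handles[p].close()   # unique: stop reading it
                continue
            subgroups = {}
            for p in group:
                block = handles[p].read(blocksize)
                subgroups.setdefault(block, []).append(p)
            if len(subgroups) == 1 and b'' in subgroups:
                # all members reached EOF with identical contents
                duplicates.append(group)
                for p in group:
                    handles[p].close()
            else:
                pending.extend(subgroups.values())
        return duplicates

Holding one open handle per candidate would of course hit OS limits
for very large candidate sets, in which case you'd reopen and seek
instead, but the early-elimination idea is the same.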

Regards,
Bengt Richter