Why not try NoClone? It finds and deletes duplicate files by
true byte-by-byte comparison. A smart marker filters which duplicate
files to delete. It comes with a GUI.
http://noclone.net
Xah Lee wrote:
> here's a large exercise that uses what we built before.
>
> suppose you have tens of thousands of files in various directories.
Sorry, I've been busy...
Here's the Perl code. I have yet to clean up the code and make it
compatible with the cleaned spec above. The code as it is performs the
same algorithm as the spec; it just doesn't print the output in that
exact format. In a few days, I'll post a clean version, and also a Python version, a
>> I'll post my version in a few days.
Have I missed something?
Where can I see your version?
Claudio
"Xah Lee" <[EMAIL PROTECTED]> schrieb im Newsbeitrag
news:[EMAIL PROTECTED]
> here's a large exercise that uses what we built before.
>
> suppose you have tens of thousands of files in various directories.
In article <[EMAIL PROTECTED]>, [EMAIL PROTECTED] (John J. Lee)
wrote:
> > If you read them in parallel, it's _at most_ m (m is the worst case
> > here), not 2(m-1). In my tests, it has always been significantly
> > less than m.
>
> Hmm, Patrick's right, David, isn't he?
Yes, I was only considering
Patrick Useldinger wrote:
David Eppstein wrote:
When I've been talking about hashes, I've been assuming very strong
cryptographic hashes, good enough that you can trust equal results to
really be equal without having to verify by a comparison.
I am not an expert in this field. All I know is that
Patrick Useldinger <[EMAIL PROTECTED]> writes:
> David Eppstein wrote:
>
> > The hard part is verifying that the files that look like duplicates
> > really are duplicates. To do so, for a group of m files that appear
> > to be the same, requires 2(m-1) reads through the whole files if you
> > us
On Mon, 14 Mar 2005 10:43:23 -0800, David Eppstein <[EMAIL PROTECTED]> wrote:
>In article <[EMAIL PROTECTED]>,
> "John Machin" <[EMAIL PROTECTED]> wrote:
>
>> Just look at the efficiency of processing N files of the same size S,
>> where they differ after d bytes: [If they don't differ, d = S]
>
>
David Eppstein wrote:
The hard part is verifying that the files that look like duplicates
really are duplicates. To do so, for a group of m files that appear to
be the same, requires 2(m-1) reads through the whole files if you use a
comparison based method, or m reads if you use a strong hashin
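As a side note, here is a minimal sketch of the comparison-based verification being counted here (my own illustration, not code from the thread): comparing the first of m candidate files against each of the other m-1 re-reads the first file every time, which is where the 2(m-1) full reads come from when the files really are duplicates.

import filecmp

def verify_duplicates(paths):
    """Byte-by-byte check that every file in `paths` equals the first one.

    For m true duplicates this reads paths[0] m-1 times and each other
    file once: 2*(m-1) full file reads in total.
    """
    first = paths[0]
    return [p for p in paths[1:] if filecmp.cmp(first, p, shallow=False)]

A strong hash instead reads each of the m files exactly once, which is the m-read figure above.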
David Eppstein wrote:
When I've been talking about hashes, I've been assuming very strong
cryptographic hashes, good enough that you can trust equal results to
really be equal without having to verify by a comparison.
I am not an expert in this field. All I know is that MD5 and SHA1 can
create c
In article <[EMAIL PROTECTED]>,
Patrick Useldinger <[EMAIL PROTECTED]> wrote:
> Shouldn't you add the additional comparison time that has to be done
> after hash calculation? Hashes do not give 100% guarantee.
When I've been talking about hashes, I've been assuming very strong
cryptographic hashes, good enough that you can trust equal results to
really be equal without having to verify by a comparison.
In article <[EMAIL PROTECTED]>,
"John Machin" <[EMAIL PROTECTED]> wrote:
> Just look at the efficiency of processing N files of the same size S,
> where they differ after d bytes: [If they don't differ, d = S]
I think this misses the point. It's easy to find the files that are
different. Just
John Machin wrote:
Test:
for k in range(1000):
    open('foo' + str(k), 'w')
I ran that and watched it open 2 million files, going strong ...
until I figured out that the files are closed by Python immediately
because there's no reference to them ;-)
Here's my code:
#!/usr/bin/env python
import os
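(A side note on the reference-counting point above, with illustrative code of my own rather than anything posted in the thread: to really keep the files open, the handles have to stay referenced.)

# Keeping explicit references so CPython cannot close the files early.
# With a large enough range this *will* hit the OS limit on open files,
# which the reference-less loop above never does.
handles = []
try:
    for k in range(1000):
        handles.append(open('foo' + str(k), 'w'))
    print(len(handles), 'files open simultaneously')
finally:
    for f in handles:
        f.close()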
John Machin wrote:
Oh yeah, "the computer said so, it must be correct". Even with your
algorithm, I would be investigating cases where files were duplicates
but there was nothing in the names or paths that suggested how that
might have come about.
Of course, but it's good to know that the computer
Patrick Useldinger wrote:
> John Machin wrote:
>
> > Maybe I was wrong: lawyers are noted for irritating precision. You
> > meant to say in your own defence: "If there are *any* number (n >= 2)
> > of identical hashes, you'd still need to *RE*-read and *compare* ...".
>
> Right, that is what I meant.
John Machin wrote:
Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2)
of identical hashes, you'd still need to *RE*-read and *compare* ...".
Right, that is what I meant.
2. As others have explained, with a decent
Patrick Useldinger wrote:
> John Machin wrote:
>
> > Just look at the efficiency of processing N files of the same size S,
> > where they differ after d bytes: [If they don't differ, d = S]
> >
> > PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
> > which is important for small N and large d].
François Pinard wrote:
Identical hashes for different files? The probability of this happening
should be extremely small, or else, your hash function is not a good one.
We're talking about md5, sha1 or similar. They are all known not to be
100% perfect. I agree it's a rare case, but still, why se
[Patrick Useldinger]
> Shouldn't you add the additional comparison time that has to be done
> after hash calculation? Hashes do not give 100% guarantee. If there's
> a large number of identical hashes, you'd still need to read all of
> these files to make sure.
Identical hashes for different files? The probability of this happening
should be extremely small, or else, your hash function is not a good one.
Scott David Daniels wrote:
comparisons. Using hashes, three file reads and three comparisons
of hash values. Without hashes, six file reads; you must read both
files to do a file comparison, so three comparisons is six files.
That's provided you compare always 2 files at a time. I compar
Patrick Useldinger wrote:
Just to explain why I appear to be a lawyer: everybody I spoke to about
this program told me to use hashes, but nobody has been able to explain
why. I came up with 2 possible reasons myself:
1) it's easier to program: you don't compare several files in parallel,
but process one file at a time
John Machin wrote:
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]
PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].
Hashing method: O(NS) reading time
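To make the parallel-reading idea concrete, here is a rough sketch (my own, not Patrick's fdups code) for a group of same-size files: read one block from every file, split the group by block contents, and keep reading only inside sub-groups that still agree. Reading stops for a sub-group as soon as it is down to one member or end of file is reached, so the total reading is proportional to Nd rather than NS.

def group_identical(paths, blocksize=8192):
    """Split same-size files into groups of byte-identical files,
    reading them in parallel and only as far as they keep agreeing."""
    pending = [[(p, open(p, 'rb')) for p in paths]]
    groups = []
    while pending:
        members = pending.pop()
        if len(members) == 1:
            groups.append([members[0][0]])
            members[0][1].close()
            continue
        buckets = {}
        for path, handle in members:
            block = handle.read(blocksize)
            buckets.setdefault(block, []).append((path, handle))
        for block, same in buckets.items():
            if len(same) == 1 or block == b'':
                # Unique prefix, or end of file reached: group is settled.
                groups.append([p for p, _ in same])
                for _, handle in same:
                    handle.close()
            else:
                pending.append(same)
    return groups

The trade-off discussed elsewhere in the thread applies: this keeps one open file handle per file in the group, so very large groups can bump into the operating system's open-file limit.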
David Eppstein wrote:
> In article <[EMAIL PROTECTED]>,
> Patrick Useldinger <[EMAIL PROTECTED]> wrote:
>
> > > Well, but the spec didn't say efficiency was the primary criterion, it
> > > said minimizing the number of comparisons was.
> >
> > That's exactly what my program does.
>
> If you're do
On Fri, 11 Mar 2005 14:06:27 -0800, David Eppstein <[EMAIL PROTECTED]> wrote:
>In article <[EMAIL PROTECTED]>,
> Patrick Useldinger <[EMAIL PROTECTED]> wrote:
>
>> > Well, but the spec didn't say efficiency was the primary criterion, it
>> > said minimizing the number of comparisons was.
>>
>> That's exactly what my program does.
On Fri, 11 Mar 2005 11:07:02 -0800, rumours say that David Eppstein
<[EMAIL PROTECTED]> might have written:
>More seriously, the best I can think of that doesn't use a strong slow
>hash would be to group files by (file size, cheap hash) then compare
>each file in a group with a representative of
In article <[EMAIL PROTECTED]>,
Patrick Useldinger <[EMAIL PROTECTED]> wrote:
> > Well, but the spec didn't say efficiency was the primary criterion, it
> > said minimizing the number of comparisons was.
>
> That's exactly what my program does.
If you're doing any comparisons at all, you're no
David Eppstein wrote:
Well, but the spec didn't say efficiency was the primary criterion, it
said minimizing the number of comparisons was.
That's exactly what my program does.
More seriously, the best I can think of that doesn't use a strong slow
hash would be to group files by (file size, cheap
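For illustration, a minimal sketch of that two-level grouping (the function and parameter names are mine, and a CRC32 of a short prefix stands in for whatever cheap hash one prefers): bucket files by size, then by the cheap hash, and only hand buckets with more than one member to a full comparison or a strong hash.

import os
import zlib
from collections import defaultdict

def candidate_groups(paths, prefix_size=4096):
    """Group files by (size, cheap hash of the first prefix_size bytes)."""
    by_key = defaultdict(list)
    for path in paths:
        size = os.path.getsize(path)
        with open(path, 'rb') as f:
            cheap = zlib.crc32(f.read(prefix_size))
        by_key[(size, cheap)].append(path)
    # Only buckets with more than one member can contain duplicates.
    return [group for group in by_key.values() if len(group) > 1]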
On Thursday 10 March 2005 11:02 am, Christos "TZOTZIOY" Georgiou wrote:
> On Wed, 9 Mar 2005 16:13:20 -0600, rumours say that Terry Hancock
> <[EMAIL PROTECTED]> might have written:
>
> >For anyone interested in responding to the above, a starting
> >place might be this maintenance script I wrote
In article <[EMAIL PROTECTED]>,
Patrick Useldinger <[EMAIL PROTECTED]> wrote:
> > You need do no comparisons between files. Just use a sufficiently
> > strong hash algorithm (SHA-256 maybe?) and compare the hashes.
>
> That's not very efficient. IMO, it only makes sense in network-based
> operations such as rsync.
Christos TZOTZIOY Georgiou wrote:
A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).
Changed.
David Eppstein wrote:
You need do no comparisons between files. Just use a sufficiently
strong hash algorithm (SHA-256 maybe?) and compare the hashes.
That's not very efficient. IMO, it only makes sense in network-based
operations such as rsync.
-pu
Christos TZOTZIOY Georgiou wrote:
The relevant parts from this last page:
st_dev <-> dwVolumeSerialNumber
st_ino <-> (nFileIndexHigh, nFileIndexLow)
I see. But if I am not mistaken, that would mean that I
(1) had to detect NTFS volumes
(2) use non-standard libraries to find this information (like
On Fri, 11 Mar 2005 01:12:14 +0100, rumours say that Patrick Useldinger
<[EMAIL PROTECTED]> might have written:
>> On POSIX filesystems, one has also to avoid comparing files having same
>> (st_dev,
>> st_inum), because you know that they are the same file.
>
>I then have a bug here - I consider
On Fri, 11 Mar 2005 01:24:59 +0100, rumours say that Patrick Useldinger
<[EMAIL PROTECTED]> might have written:
>> Have you found any way to test if two files on NTFS are hard linked without
>> opening them first to get a file handle?
>
>No. And even then, I wouldn't know how to find out.
MSDN is
David Eppstein wrote:
> In article <[EMAIL PROTECTED]>,
> "Xah Lee" <[EMAIL PROTECTED]> wrote:
>
>> an absolute requirement in this problem is to minimize the number of
>> comparisons made between files. This is a part of the spec.
>
> You need do no comparisons between files. Just use a sufficiently
> strong hash algorithm (SHA-256 maybe?) and compare the hashes.
In article <[EMAIL PROTECTED]>,
"Xah Lee" <[EMAIL PROTECTED]> wrote:
> an absolute requirement in this problem is to minimize the number of
> comparisons made between files. This is a part of the spec.
You need do no comparisons between files. Just use a sufficiently
strong hash algorithm (SHA-256 maybe?) and compare the hashes.
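For concreteness, a sketch of this hash-only approach using the modern hashlib module (which did not exist under that name in 2005; this is an illustration, not code from the thread): every file is read exactly once, and equal SHA-256 digests are treated as equal files without a confirming byte-by-byte comparison.

import hashlib
from collections import defaultdict

def duplicates_by_hash(paths, blocksize=1 << 16):
    """Group files by their SHA-256 digest; each file is read once."""
    by_digest = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(blocksize), b''):
                digest.update(block)
        by_digest[digest.hexdigest()].append(path)
    return [group for group in by_digest.values() if len(group) > 1]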
Christos TZOTZIOY Georgiou wrote:
That's fast and good.
Nice to hear.
A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).
I'll look into that.
Have you found any way to test if two files on NTFS are hard linked without
opening them first to get a file handle?
No. And even then, I wouldn't know how to find out.
Christos TZOTZIOY Georgiou wrote:
On POSIX filesystems, one has also to avoid comparing files having same (st_dev,
st_inum), because you know that they are the same file.
I then have a bug here - I consider all files with the same inode equal,
but according to what you say I need to consider the device as well.
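A small sketch of the fix under discussion, assuming a POSIX filesystem (illustrative code, not fdups itself): two paths refer to the same underlying file only when both st_dev and st_ino match, so keeping one representative per (device, inode) pair is enough.

import os

def drop_hard_links(paths):
    """Keep one path per (st_dev, st_ino) pair; the rest are hard links."""
    seen = set()
    unique = []
    for path in paths:
        st = os.stat(path)
        if (st.st_dev, st.st_ino) not in seen:
            seen.add((st.st_dev, st.st_ino))
            unique.append(path)
    return unique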
On Thu, 10 Mar 2005 10:54:05 +0100, rumours say that Patrick Useldinger
<[EMAIL PROTECTED]> might have written:
>I wrote something similar, have a look at
>http://www.homepages.lu/pu/fdups.html.
That's fast and good.
A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).
Have you found any way to test if two files on NTFS are hard linked without
opening them first to get a file handle?
I've written a python GUI wrapper around some shell scripts:
http://www.pixelbeat.org/fslint/
the shell script logic is essentially:
exclude hard linked files
only include files where there are more than 1 with the same size
print files with matching md5sum
Pádraig.
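For illustration, a rough Python transcription of that shell-script logic (fslint itself is shell plus a GUI, so this is not its code, just the same three steps): drop extra hard links, keep only sizes that occur more than once, then report files whose MD5 digests match within a size bucket.

import os
import hashlib
from collections import defaultdict

def fslint_style_duplicates(paths):
    # 1. Exclude hard-linked copies of the same (device, inode).
    seen, files = set(), []
    for path in paths:
        st = os.stat(path)
        if (st.st_dev, st.st_ino) not in seen:
            seen.add((st.st_dev, st.st_ino))
            files.append((path, st.st_size))
    # 2. Only keep files whose size occurs more than once.
    by_size = defaultdict(list)
    for path, size in files:
        by_size[size].append(path)
    # 3. Within each size bucket, report files with matching md5sum.
    duplicates = []
    for group in by_size.values():
        if len(group) < 2:
            continue
        by_md5 = defaultdict(list)
        for path in group:
            # Reads each file whole; fine for a sketch, not for huge files.
            with open(path, 'rb') as f:
                by_md5[hashlib.md5(f.read()).hexdigest()].append(path)
        duplicates.extend(g for g in by_md5.values() if len(g) > 1)
    return duplicates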
On Wed, 9 Mar 2005 16:13:20 -0600, rumours say that Terry Hancock
<[EMAIL PROTECTED]> might have written:
>For anyone interested in responding to the above, a starting
>place might be this maintenance script I wrote for my own use. I don't
>think it exactly matches the spec, but it addresses the
I wrote something similar, have a look at
http://www.homepages.lu/pu/fdups.html.
On Wednesday 09 March 2005 06:56 am, Xah Lee wrote:
> here's a large exercise that uses what we built before.
>
> suppose you have tens of thousands of files in various directories.
> Some of these files are identical, but you don't know which ones are
> identical with which. Write a program that prints out which files are
> redundant copies.
On 9 Mar 2005 04:56:13 -0800, rumours say that "Xah Lee" <[EMAIL PROTECTED]>
might have written:
>Write a Perl or Python version of the program.
>
>an absolute requirement in this problem is to minimize the number of
>comparisons made between files. This is a part of the spec.
http://groups-beta.g
here's a large exercise that uses what we built before.
suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical with which. Write a program that prints out which files are
redundant copies.
Here's the spec.