> On Mon, Jan 16, 2012 at 12:58:13PM -0800, Kamal Mostafa wrote:
> > * Package name : duff
> > * URL          : http://duff.sourceforge.net/
On Tue, 2012-01-17 at 09:56 +0100, Simon Josefsson wrote:
> If there aren't warnings about use of SHA1 in the tool, there should
> be.  While I don't recall any published SHA1 collisions, SHA1 is
> considered broken and shouldn't be used if you want to trust your
> comparisons.  I'm assuming the tool supports SHA256 and other SHA2
> hashes as well?  It might be useful to make sure the defaults are
> non-SHA1.

Duff supports SHA1, SHA256, SHA384, and SHA512 hashes.  The default is
SHA1.  For comparison, rdfind supports only MD5 and SHA1 hashes.

Thanks for the note, Simon -- I'll bring it to the attention of the
upstream author, Camilla Berglund.

On Tue, 2012-01-17 at 09:12 +0000, Lars Wirzenius wrote:
> rdfind seems to be quickest one, but duff compares well with hardlink,
> which (see http://liw.fi/dupfiles/) was the fastest one I knew of in
> Debian so far.
>
> This was done using my benchmark-cmd utility in my extrautils
> collection (not in Debian): http://liw.fi/extrautils/ for source.

Thanks for the pointer to your benchmark-cmd tool, Lars.  Very handy!
My results with it mirrored yours -- of the similar tools, duff appears
to lag only rdfind in performance (for my particular dataset, at
least).

I looked into duff's methods a bit and discovered a few easy
performance optimizations that may speed it up a bit more.  The author
is reviewing my proposed patch now and seems very open to
collaboration.

> Personally, I would be wary of using checksums for file comparisons,
> since comparing files byte-by-byte isn't slow (you only need to
> do it to files that are identical in size, and you need to read
> all the files anyway).

Byte-by-byte might well be slower than checksums if you end up faced
with N>2 very large (uncacheable) files of identical size but unique
contents.  They all need to be checked against each other, so each of
the N files would need to be read up to N-1 times.  With checksums,
each file is read exactly once.  Anyway, duff actually *does* offer
byte-by-byte comparison as an option (rdfind does not).
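For anyone following along, here is a rough sketch of the checksum
approach being discussed (this is illustrative Python, not duff's
actual C implementation): files are grouped by size first, and only
same-size files get hashed, so every file is read at most once
regardless of how many size-collisions there are.

```python
#!/usr/bin/env python3
# Illustrative sketch only -- not duff's actual algorithm or code.
# Groups files by size, then by SHA-256 digest of their contents.
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Return lists of paths whose contents are identical."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size has no duplicates
        by_digest = defaultdict(list)
        for p in same_size:
            h = hashlib.sha256()
            with open(p, "rb") as f:
                # Read in chunks so large files are never fully in memory.
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            by_digest[h.hexdigest()].append(p)
        duplicates.extend(g for g in by_digest.values() if len(g) > 1)
    return duplicates
```

Note that this reads each candidate file exactly once, which is the
point being made above: a byte-by-byte scheme would have to re-read
the same large files for every pairwise comparison.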
> I also think we've now got enough of duplicate file finders in
> Debian that it's time to consider whether we need so many.  It's
> too bad they all have incompatible command line syntaxes, or it
> would be possible to drop some.  (We should accept a new one if
> it is better than the existing ones, of course.  Evidence required.)

To me, the premise that a new package must be better than existing
similar ones ("evidence required", no less) seems pretty questionable.
It may not be so easy to establish just what "better than" means, and
it puts us in the position of making value judgments for our users that
they should be able to make for themselves.  While I do think it is
productive to compare the performance of these similar tools, I don't
see much value in pitting them against each other in benchmark wars as
criteria for acceptance into Debian.

Here we have a good quality DFSG-compliant package with an active
upstream and a willing DD maintainer.  While similar tools do already
exist in Debian, they do not offer identical feature sets or user
interfaces, and only one of them has been shown to outperform duff in
quick spot checks.  Some users have expressed a preference for duff
over the others.  Does that make it "better than the existing ones"?

My answer: Who cares?  Nobody is making us choose only one.  In my
view, it's not really a problem if we carry multiple duplicate file
detectors in Debian, and we will best serve our users by letting them
choose their preferred tool for the job.  And by allowing such packages
into Debian, we encourage their improvement, to everyone's benefit.

 -Kamal