Patrick Useldinger wrote:
> John Machin wrote:
>
> > Maybe I was wrong: lawyers are noted for irritating precision. You
> > meant to say in your own defence: "If there are *any* number (n >= 2)
> > of identical hashes, you'd still need to *RE*-read and *compare* ...".
>
> Right, that is what I meant.
>
> > 2. As others have explained, with a decent hash function, the
> > probability of a false positive is vanishingly small. Further, nobody
> > in their right mind [1] would contemplate automatically deleting n-1
> > out of a bunch of n reportedly duplicate files without further
> > investigation. Duplicate files are usually (in the same directory with
> > different names or in different-but-related directories with the same
> > names) and/or (have a plausible explanation for how they were
> > duplicated) -- the one-in-a-zillion-chance false positive should stand
> > out as implausible.
>
> Still, if you can get it 100% right automatically, why would you bother
> checking manually?

A human in their right mind is required to decide what to do with the
duplicates. The proponents of hashing -- of which I'm not one -- would
point out that any false positives would be picked up as part of the
human scrutiny.
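To be concrete about what I take the hash-first approach to mean: group
files of equal size, hash the candidates, and then byte-compare anything
whose hashes collide, so that a freak collision can't turn into a false
positive. A minimal sketch only -- the names and the choice of SHA-1 are
mine, nothing to do with what fdups actually does:

    import hashlib
    import filecmp
    from collections import defaultdict

    def duplicate_clusters(paths, chunk=1 << 20):
        # 'paths' is assumed to be files already known to have equal
        # sizes (the cheap filter that both approaches start with).
        by_hash = defaultdict(list)
        for path in paths:
            digest = hashlib.sha1()
            with open(path, 'rb') as f:
                for block in iter(lambda: f.read(chunk), b''):
                    digest.update(block)
            by_hash[digest.hexdigest()].append(path)
        # A hash match is only a candidate; confirm it byte-by-byte so
        # that a collision cannot be reported as a duplicate.
        for cluster in by_hash.values():
            if len(cluster) > 1:
                confirmed = [p for p in cluster
                             if filecmp.cmp(cluster[0], p, shallow=False)]
                if len(confirmed) > 1:
                    yield confirmed

Even with that, a human still has to look at what comes out the other
end.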
> Why get back to arguments like "impossible",
> "implausible", "can't be" if you can have a simple and correct answer -
> yes or no?

Oh yeah, "the computer said so, it must be correct". Even with your
algorithm, I would be investigating cases where files were duplicates
but there was nothing in the names or paths that suggested how that
might have come about.

> Anyway, fdups does not do anything else than report duplicates.
> Deleting, hardlinking or anything else might be an option depending on
> the context in which you use fdups, but then we'd have to discuss the
> context. I never assumed any context, in order to keep it as universal
> as possible.

That's very good, but it wasn't under contention.

> > Different subject: maximum number of files that can be open at once. I
> > raised this issue with you because I had painful memories of having to
> > work around max=20 years ago on MS-DOS and was aware that this magic
> > number was copied blindly from early Unix. I did tell you that
> > empirically I could get 509 successful opens on Win 2000 [add 3 for
> > stdin/out/err to get a plausible number] -- this seems high enough to
> > me compared to the likely number of files with the same size -- but you
> > might like to consider a fall-back detection method instead of just
> > quitting immediately if you ran out of handles.
>
> For the time being, the additional files will be ignored, and a warning
> is issued. fdups does not quit, why are you saying this?

I beg your pardon, I was wrong. Bad memory. It's the case of running out
of the minuscule buffer pool that you allocate by default where it
panics and pulls the sys.exit(1) rip-cord.

> A fallback solution would be to open the file before every _block_ read,
> and close it afterwards.

Ugh. Better to use more memory, so fewer blocks!

> In my mind, it would be a command-line option,
> because it's difficult to determine the number of available file handles
> in a multitasking environment.

The pythonic way is to press ahead optimistically and recover if you get
bad news; there's a sketch of what I mean below.

> Not difficult to implement, but I first wanted to refactor the code so
> that it's a proper class that can be used in other Python programs, as
> you also asked.

I didn't "ask"; I suggested. I would never suggest a class for classes'
sake. You already had a singleton class; why another? What I did suggest
was that you provide a callable interface that returns clusters of
duplicates [so that people could do their own thing instead of having to
parse your file output, which contains a mixture of warning & info
messages and data].

> That is what I have sent you tonight. It's not that I
> don't care about the file handle problem, it's just that I do changes by
> (my own) priority.
>
> > You wrote at some stage in this thread that (a) this caused problems on
> > Windows and (b) you hadn't had any such problems on Linux.
> >
> > Re (a): what evidence do you have?
>
> I've had the case myself on my girlfriend's XP box. It was certainly
> less than 500 files of the same length.

Interesting. Fewer on XP than on 2000? Maybe there's a machine-wide
limit, not a per-process limit, like the old DOS max=20. What else was
running at the time?
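By "press ahead optimistically and recover" I mean something along these
lines: keep grabbing handles until the OS complains, and put whatever
didn't fit aside for a later pass, rather than trying to guess the limit
up front. Again just a sketch with my own names, not fdups code, and it
assumes the comparison can be run on any subset of the candidates at a
time:

    import errno

    def open_candidates(paths):
        # Open handles until the OS says no; hand back whatever did not
        # fit so it can be dealt with in a later pass.
        handles, leftovers = [], []
        for i, path in enumerate(paths):
            try:
                handles.append(open(path, 'rb'))
            except (IOError, OSError) as exc:
                if exc.errno not in (errno.EMFILE, errno.ENFILE):
                    raise            # some other problem; don't hide it
                leftovers = list(paths[i:])
                break
        return handles, leftovers

That way the program finds out the real limit at run time, whatever the
OS is and whatever else happens to be running.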
> > Re (b): famous last words! How long would it take you to do a test and
> > announce the margin of safety that you have?
>
> Sorry, I do not understand what you mean by this.

Test:

    for k in range(1000):
        # blows up with "Too many open files" once the limit is reached
        open('foo' + str(k), 'w')

Announce: "I can open A files at once on box B running OS C. The most
files of the same length that I have seen is D. The ratio A/D is small
enough not to worry."

Cheers,
John
--
http://mail.python.org/mailman/listinfo/python-list