Bugs item #1234674, was opened at 2005-07-08 11:01 Message generated for change (Comment added) made by pvrijlandt You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1234674&group_id=5470
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: Extension Modules Group: Python 2.4 Status: Open Resolution: None Priority: 5 Submitted By: Mendez (goatsofmendez) Assigned to: Fred L. Drake, Jr. (fdrake) Summary: filecmp.cmp's "shallow" option Initial Comment: The filecmp.cmp function has a shallow option (set as default) to only compare files based on stats rather than doing a bit by bit comparison of the file itself. The relevant bit of the code follows. s1 = _sig(os.stat(f1)) s2 = _sig(os.stat(f2)) if s1[0] != stat.S_IFREG or s2[0] != stat.S_IFREG: return False if shallow and s1 == s2: return True if s1[1] != s2[1]: return False result = _cache.get((f1, f2)) if result and (s1, s2) == result[:2]: return result[2] outcome = _do_cmp(f1, f2) _cache[f1, f2] = s1, s2, outcome return outcome There's a check to see if the shallow mode is enabled and if that's the case and the stats match it returns true but the test for returning false is for only one of the stats attributes meaning that it's possible for a file not to match either test and the function to carry on to the full file check lower down. The check above to see if the stats match with stat.S_IFREG also looks wrong to me, but it could just be I don't understand what he's trying to do :) This code is the same in both 2.3 and 2.4 ---------------------------------------------------------------------- Comment By: Patrick Vrijlandt (pvrijlandt) Date: 2005-08-29 00:16 Message: Logged In: YES user_id=1307917 "cmp should return a bool" refers to the modules docstring. On other places, it does already. The default value for "shallow" should be a bool too. ---------------------------------------------------------------------- Comment By: Patrick Vrijlandt (pvrijlandt) Date: 2005-08-28 22:24 Message: Logged In: YES user_id=1307917 The cache is buggy too; it can be fooled by changing the cwd between calls. Also, if it caches (a, b) it does not cache (b, a). Is this caching feature really useful? Maybe it's up to the caller to record what comparisons have been made. If a caching function is to be provided, it should recognize the (b, a) case and maybe also infer from (a, b) and (b, c) about (a, c). (My programs remember an md5-signature) I propose to eliminate the caching feature. Also I propose to publish cmp_shallow and cmp_bytes which I think will make the unit easier to understand and verify. cmp_bytes is of course _do_cmp, with a check added for length equality. Then, the docs should be updated to whatever implementation for cmp is chosen, because non-shallow comparison is ill-specified. cmp should return a bool. what should cmp do with funny input like a dir? It now returns false. Would an exception be better? BTW: is it really useful in _do_cmp to have a local variable bufsize? is 8k still optimal? ---------------------------------------------------------------------- Comment By: Mendez (goatsofmendez) Date: 2005-08-26 13:38 Message: Logged In: YES user_id=1309441 On my own system I've modified the testing code as follows if s1 != s2: return False if shallow and s1 == s2: return True Which works as I expected it to work If attributes aren't identical - Always fails If the attributes are identical and it's in shallow mode - returns true If the attributes are identical and it's not in shallow mode - goes on to check if the files are byte identical. Whether there should be additional modes for finding byte identical files with different names, attributes etc. is another matter. ---------------------------------------------------------------------- Comment By: Raymond Hettinger (rhettinger) Date: 2005-08-26 10:33 Message: Logged In: YES user_id=80475 Fred, do you have thoughts on this patch? ---------------------------------------------------------------------- Comment By: Reinhold Birkenfeld (birkenfeld) Date: 2005-08-26 10:19 Message: Logged In: YES user_id=1188172 Hm. Looks like if the size is identical, but the mtime is not, the file will be read even in shallow mode. The filecmp docs say "Unless shallow is given and is false, files with identical os.stat() signatures are taken to be equal." The filecmp.cmp docstring says "shallow: Just check stat signature (do not read the files)" Two questions arise: - Should the file contents be compared in shallow mode if the mtimes do not match? - Should the mtimes matter in non-shallow mode? ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1234674&group_id=5470 _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com