Dan Stromberg wrote:
> On Thu, 22 Dec 2005 22:06:42 +0000, Dan Stromberg wrote:
>
>> Hi folks.
>>
>> Python appears to have a good sort method, but when sorting array elements
>> that are very large, and hence have very expensive compares, is there some
>> sort of already-available sort function that will merge like elements into
>> a chain, so that they won't have to be recompared as many times?
>>
>> Thanks!
>
> I probably should've been more specific.
>
> I'm wanting to sort a large number of files, like a bunch of output files
> from a large series of rsh or ssh outputs on a large series of distinct
> machines, a music collection in .ogg format (strictly redistributable and
> legally purchased music), a collection of .iso cdrom images (strictly
> redistributable and legally purchased software), and so forth.
>
> I'm not sorting the content of each file individually.
>
> I'm treating each file as a potentially very large string, and "sorting
> the strings".
>
> I've been using the following compare function, which in short checks, in
> order:
>
> 1) device number
> 2) inode number
> 3) file length
> 4) the beginning of the file
> 5) an md5 hash of the entire file
> 6) the entire file
>
> (If #1 and #2 are identical, then the file must be a hardlink to the other
> file. Also, files that do not have the same length can never be
> identical. And of course these items are generated on demand, making it
> frequently possible to avoid doing full-file-length compares on a lot of
> files.)
>
> However, my program is still doing more #6 comparisons than seems
> strictly necessary, when I could just toss all the filenames describing
> identical files into a list and avoid re-comparing files with identical
> content over and over (I don't want to compare them to each other again
> and again) when there are a lot of identical files in the input list.

Why would #5 not be enough as an indicator that the files are identical?
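
For illustration only, here is a minimal sketch of the kind of staged grouping
Dan describes: bucket by size, then by the first few KB, then by an md5 of the
whole file, so that files landing in the same final bucket are treated as
identical and each file is read in full at most once. This is not Dan's actual
code; it skips the device/inode hardlink check (#1/#2) and the final
byte-for-byte compare (#6), the 4096-byte head size is an arbitrary choice, and
it's written for a current Python with hashlib (older Pythons would use the md5
module instead).

import os
import hashlib
from collections import defaultdict

def md5sum(path, blocksize=1 << 20):
    """Hash the whole file in blocks so large files don't fill memory."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.hexdigest()

def group_identical(paths):
    """Return lists of filenames believed to hold identical content,
    grouping by progressively more expensive keys: size, head bytes, md5."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.stat(p).st_size].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) == 1:          # unique size: nothing to compare
            groups.append(same_size)
            continue
        by_head = defaultdict(list)
        for p in same_size:              # cheap check: first 4 KB only
            with open(p, 'rb') as f:
                by_head[f.read(4096)].append(p)
        for same_head in by_head.values():
            if len(same_head) == 1:
                groups.append(same_head)
                continue
            by_hash = defaultdict(list)
            for p in same_head:          # expensive check: hash whole file
                by_hash[md5sum(p)].append(p)
            groups.extend(by_hash.values())
    return groups

If #5 is trusted, the lists this returns are exactly the "filenames describing
identical files" lists Dan wants, and no file is read in full more than once;
the only thing a #6 compare adds is insurance against an md5 collision, which
is astronomically unlikely for accidental matches.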