On 7 Mar 2005 05:36:32 -0800, rumours say that "Joerg Schuster" <[EMAIL PROTECTED]> might have written:
>Hello,
>
>I am looking for a method to "shuffle" the lines of a large file.

[snip]

>So, it would be very useful to do one of the following things:
>
>- produce corpus.uniq in a such a way that it is not sorted in any way
>- shuffle corpus.uniq > corpus.uniq.shuffled
>
>Unfortunately, none of the machines that I may use has 80G RAM.
>So, using a dictionary will not help.

To implement your 'shuffle' command in Python, you can use the following
algorithm, with a couple of assumptions:

ASSUMPTIONS
-----------

The total line count in your big file is less than sys.maxint.

The algorithm as given works for systems where eol is a single '\n'.

ALGORITHM
---------

Create a temporary filelist.FileList fl (see attached file) whose items are
struct.calcsize("q") bytes each (struct.pack and the 'q' format string are
your friends), to hold the offset of each line start in big_file. fl[0] would
be 0, fl[1] would be the length of the first line including its '\n', and so
on. Read big_file once, appending the start offset of each line to fl (if you
need help with this, let me know).

random.shuffle(fl) # this is tested with the filelist.FileList as given
for offset_as_str in fl:
    offset = struct.unpack("q", offset_as_str)[0]
    big_file.seek(offset)
    sys.stdout.write(big_file.readline())

That's it. Redirect output to your preferred file. No promises for speed
though :)
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
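
For reference, here is a minimal sketch of the offset-index algorithm described above. It is not the attached filelist.FileList: as an assumption, an mmap-backed temporary file of packed 'q' offsets stands in for it, and a hand-written Fisher-Yates loop replaces random.shuffle(fl). The names build_offset_index, shuffle_index and shuffle_lines are hypothetical, and the code targets Python 3 with files opened in binary mode.

```python
import mmap
import random
import struct
import tempfile

OFFSET_FMT = "<q"                          # one signed 64-bit offset per line
OFFSET_SIZE = struct.calcsize(OFFSET_FMT)  # 8 bytes per record


def build_offset_index(big_path, index_file):
    """Append the start offset of every line in big_path to index_file.

    Returns the number of lines seen. Only one line is held in memory
    at a time; the offsets themselves go to disk.
    """
    count = 0
    offset = 0
    with open(big_path, "rb") as big_file:
        for line in big_file:
            index_file.write(struct.pack(OFFSET_FMT, offset))
            offset += len(line)  # length includes the trailing b"\n"
            count += 1
    index_file.flush()
    return count


def shuffle_index(index_map, count, rng=random):
    """Fisher-Yates shuffle of the fixed-size records in index_map, in place."""
    for i in range(count - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        a = slice(i * OFFSET_SIZE, (i + 1) * OFFSET_SIZE)
        b = slice(j * OFFSET_SIZE, (j + 1) * OFFSET_SIZE)
        # The right-hand side is copied to bytes before either assignment.
        index_map[a], index_map[b] = index_map[b], index_map[a]


def shuffle_lines(big_path, out, rng=random):
    """Write the lines of big_path to the binary stream out in random order."""
    with tempfile.TemporaryFile() as index_file:
        count = build_offset_index(big_path, index_file)
        if count == 0:
            return 0  # an empty file cannot be memory-mapped
        with mmap.mmap(index_file.fileno(), count * OFFSET_SIZE) as index_map:
            shuffle_index(index_map, count, rng)
            with open(big_path, "rb") as big_file:
                for k in range(count):
                    (offset,) = struct.unpack_from(
                        OFFSET_FMT, index_map, k * OFFSET_SIZE
                    )
                    big_file.seek(offset)
                    line = big_file.readline()
                    if not line.endswith(b"\n"):
                        line += b"\n"  # last line of the input had no newline
                    out.write(line)
    return count
```

Under these assumptions, something like shuffle_lines("corpus.uniq", sys.stdout.buffer) with the output redirected to corpus.uniq.shuffled would mirror the loop above. The index costs 8 bytes per line on disk rather than in RAM, but every output line still needs a random seek into the big file, so the caveat about speed applies here as well.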