Not tested this, run it (or some derivation thereof) over the output to get increasing randomness. You will want to keep max_buffered_lines as high as possible really I imagine. If shuffle() is too intensize you could itterate over the buffer several times randomly removing and printing lines until the buffer is empty/suitibly small removing some more processing overhead.
### START ### import random f = open('corpus.uniq') buffer = [] max_buffered_lines = 1000 for line in f: if len(buffer) < max_buffered_lines: buffer.append(line) else: buffer.shuffle() for line in buffer: print line random.shuffle(buffer) for line in buffer: print line f.close() ### END ### -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Joerg Schuster Sent: 07 March 2005 13:37 To: python-list@python.org Subject: shuffle the lines of a large file Hello, I am looking for a method to "shuffle" the lines of a large file. I have a corpus of sorted and "uniqed" English sentences that has been produced with (1): (1) sort corpus | uniq > corpus.uniq corpus.uniq is 80G large. The fact that every sentence appears only once in corpus.uniq plays an important role for the processes I use to involve my corpus in. Yet, the alphabetical order is an unwanted side effect of (1): Very often, I do not want (or rather, I do not have the computational capacities) to apply a program to all of corpus.uniq. Yet, any series of lines of corpus.uniq is obviously a very lopsided set of English sentences. So, it would be very useful to do one of the following things: - produce corpus.uniq in a such a way that it is not sorted in any way - shuffle corpus.uniq > corpus.uniq.shuffled Unfortunately, none of the machines that I may use has 80G RAM. So, using a dictionary will not help. Any ideas? Joerg Schuster -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list