"Joerg Schuster" <[EMAIL PROTECTED]> writes: >Hello,
>I am looking for a method to "shuffle" the lines of a large file.
>I have a corpus of sorted and "uniqed" English sentences that has been
>produced with (1):
>(1) sort corpus | uniq > corpus.uniq
>corpus.uniq is 80G large. The fact that every sentence appears only
>once in corpus.uniq plays an important role for the processes
>I use to involve my corpus in. Yet, the alphabetical order is an
>unwanted side effect of (1): Very often, I do not want (or rather, I
>do not have the computational capacities) to apply a program to all of
>corpus.uniq. Yet, any series of lines of corpus.uniq is obviously a
>very lopsided set of English sentences.
>So, it would be very useful to do one of the following things:
>- produce corpus.uniq in such a way that it is not sorted in any way
>- shuffle corpus.uniq > corpus.uniq.shuffled
>Unfortunately, none of the machines that I may use has 80G RAM.
>So, using a dictionary will not help.
>Any ideas?

Instead of shuffling the file itself, maybe you could index it (with dbm
for instance) and select random lines by using random indexes whenever
you need a sample.

Eddie
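
A rough sketch of the indexing idea, in case it helps. It makes one pass
over the corpus, stores the byte offset of each line in a dbm file keyed
by line number, and then seeks directly to randomly chosen lines. The
function names (build_index, sample_lines), the "count" key, and the
UTF-8 encoding are assumptions made for illustration, not anything from
the original post. Only the index lives on disk next to the corpus, so
memory use stays small, though building a dbm index with one entry per
line of an 80G file may itself take a long time.

```python
import dbm
import random


def build_index(corpus_path, index_path):
    """Record the byte offset of every line in a dbm file; return the line count."""
    count = 0
    offset = 0
    # Binary mode so that offsets are exact byte positions usable with seek().
    with open(corpus_path, "rb") as corpus, dbm.open(index_path, "n") as index:
        for line in corpus:
            index[str(count).encode("ascii")] = str(offset).encode("ascii")
            offset += len(line)
            count += 1
        # Hypothetical bookkeeping key so that readers know the valid range.
        index[b"count"] = str(count).encode("ascii")
    return count


def sample_lines(corpus_path, index_path, k, rng=random):
    """Return k distinct lines chosen uniformly at random, using the index."""
    with open(corpus_path, "rb") as corpus, dbm.open(index_path, "r") as index:
        count = int(index[b"count"])
        result = []
        # sample() over a range does not materialise the whole range in memory.
        for n in rng.sample(range(count), k):
            corpus.seek(int(index[str(n).encode("ascii")]))
            # Assumes the corpus is UTF-8 text.
            result.append(corpus.readline().decode("utf-8").rstrip("\r\n"))
    return result
```

Something like build_index("corpus.uniq", "corpus.idx") would be run
once, and sample_lines("corpus.uniq", "corpus.idx", 1000) whenever a
fresh sample is needed. Since corpus.uniq has no duplicate lines and the
sampled line numbers are distinct, each sample is also free of repeats.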