As far as I can tell, what you ultimately want is to be able to extract a random ("representative"?) subset of sentences. Given the huge size of the data, I would suggest not randomizing the file, but randomizing accesses to the file. E.g. (sorry for the off-the-cuff Python; adjust 8192 == 2**13 to your disk block size):

    import os
    import random

    length_of_file = os.path.getsize(filename)   # filename is your data file
    f = open(filename, 'rb')

    while True:
        byteno = random.randrange(length_of_file)
        # align to a disk block to avoid unnecessary IO
        byteno = (byteno >> 13) << 13            # zero out the bottom 13 bits
        f.seek(byteno)                           # set the file pointer to a random position
        data = f.read(8192)                      # read one block
        sentences = data.splitlines()[1:-1]      # drop the partial lines at both ends
        do_something(sentences)
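If it helps, here is a rough, self-contained version of the same idea that keeps one random sentence from each block until it has as many as you asked for. The name sample_sentences and the one-sentence-per-line assumption are mine:

    import os
    import random

    def sample_sentences(filename, n, block_size=8192):
        """Pick n sentences by reading random disk blocks and keeping one
        random complete line from each (assumes one sentence per line)."""
        size = os.path.getsize(filename)
        picked = []
        with open(filename, 'rb') as f:
            while len(picked) < n:
                byteno = random.randrange(size)
                byteno -= byteno % block_size       # align down to a block boundary
                f.seek(byteno)
                lines = f.read(block_size).splitlines()[1:-1]  # drop partial end lines
                if lines:                           # a block may hold no complete line
                    picked.append(random.choice(lines))  # raw bytes; decode as needed
        return picked

Blocks can repeat, so the same sentence can show up twice; for a quick representative sample that is probably not worth de-duplicating. E.g. sample_sentences('corpus.txt', 1000) for the 1000-sentence case (file name made up).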
If you only need 1000 sentences, use only one sentence from each block; if you need 1M, then use them all. [I hope I understood your problem]

-- george