Unfortunately, none of the machines that I can use has 80 GB of RAM. So, using a dictionary will not help.
Any ideas?
Why don't you index the file? I would store the byte offset of the beginning of each line in an index file. Then you can generate a random line number between 1 and the total number of lines, look up that line's offset in the index file,
open your text file, seek to that position, read one line, and close the file. Repeating this process lets you extract a random set of lines (possibly with repeats) from your 'corpus' text file.
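
Here is a rough sketch of that index-and-seek idea, in case it helps. The function names, the file paths, and the choice of a fixed 8-byte offset per index entry are all my own assumptions for illustration; adjust them to your setup. Lines are numbered from 0 here because that makes the seek arithmetic simpler.

```python
import os
import random
import struct

# Assumption: each index entry is one little-endian unsigned 64-bit offset.
OFFSET = struct.Struct("<Q")

def build_index(text_path, index_path):
    """Write the byte offset of every line start to index_path; return the line count."""
    count = 0
    offset = 0
    # Binary mode so that len(line) is a true byte count.
    with open(text_path, "rb") as text, open(index_path, "wb") as index:
        for line in text:
            index.write(OFFSET.pack(offset))
            offset += len(line)
            count += 1
    return count

def read_line(text_path, index_path, n):
    """Return line n (0-based) as bytes, using the index to seek directly to it."""
    with open(index_path, "rb") as index:
        index.seek(n * OFFSET.size)
        (offset,) = OFFSET.unpack(index.read(OFFSET.size))
    with open(text_path, "rb") as text:
        text.seek(offset)
        return text.readline()

def random_lines(text_path, index_path, k):
    """Return k randomly chosen lines; the same line may be picked more than once."""
    total = os.path.getsize(index_path) // OFFSET.size
    return [read_line(text_path, index_path, random.randrange(total))
            for _ in range(k)]
```

Building the index is a single sequential pass over the file and needs only a few bytes of memory per line being processed. The index itself costs 8 bytes per line on disk, and each lookup afterwards is two seeks and two small reads.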
You should probably also consider putting the file into a database: keep the raw text file for sure, but create a converted copy in bsddb or pytables format.
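
To give a feel for the database route, here is a minimal sketch. Note that it uses the standard-library dbm module as a stand-in, not bsddb or pytables, simply because dbm ships with Python and has a similar key/value interface. The function names and the scheme of using the line number as the key are assumptions of mine, and I have not tried this on a file anywhere near your size.

```python
import dbm

def build_db(text_path, db_path):
    """Copy every line of text_path into a dbm file keyed by 0-based line number."""
    count = 0
    # "n" always creates a new, empty database at db_path.
    with open(text_path, "rb") as text, dbm.open(db_path, "n") as db:
        for line in text:
            db[str(count).encode("ascii")] = line
            count += 1
    return count

def fetch(db_path, n):
    """Return line n (0-based) as bytes from the dbm file."""
    with dbm.open(db_path, "r") as db:
        return db[str(n).encode("ascii")]
```

Which backend dbm picks depends on what is installed on your machine, so performance on a very large corpus may vary quite a bit.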
Warren