Raymond Hettinger <[EMAIL PROTECTED]> wrote:
> >>> from random import random
> >>> out = open('corpus.decorated', 'w')
> >>> for line in open('corpus.uniq'):
>         print >> out, '%.14f %s' % (random(), line),
>
> >>> out.close()
>
> sort corpus.decorated | cut -c 18- > corpus.randomized
Very good solution!  sort is truly excellent at handling very large datasets.
If you give it a file bigger than memory, it divides the input into temporary
files of roughly memory size, sorts each one, and then merges all the
temporary files back together.  You can tune the amount of memory it uses for
the in-memory sorts with --buffer-size, though it's pretty good at
auto-tuning.  You may also want to set --temporary-directory to avoid filling
up your /tmp.

In a previous job I did a lot of work with usenet news and was forever
blowing up the server with scripts that used too much memory.  sort was
always the solution!

--
Nick Craig-Wood <[EMAIL PROTECTED]> -- http://www.craig-wood.com/nick
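For what it's worth, here is a minimal sketch (mine, not from the original
post) of driving the same sort | cut step from Python with the two options
mentioned above.  The 1G buffer size and the /var/tmp spill directory are
purely illustrative values, not anything the poster specified:

    import subprocess

    # Assumed values: 1G of sort buffer, /var/tmp for the temporary chunk files.
    with open('corpus.randomized', 'w') as out:
        sort = subprocess.Popen(
            ['sort',
             '--buffer-size=1G',                # memory used for each in-memory run
             '--temporary-directory=/var/tmp',  # where the merge chunks are written
             'corpus.decorated'],
            stdout=subprocess.PIPE)
        # strip the 17-character "0.xxxxxxxxxxxxxx " decoration, like cut -c 18-
        subprocess.check_call(['cut', '-c', '18-'], stdin=sort.stdout, stdout=out)
        sort.stdout.close()
        sort.wait()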