Raymond Hettinger <[EMAIL PROTECTED]> wrote:
> >>> from random import random
> >>> out = open('corpus.decorated', 'w')
> >>> for line in open('corpus.uniq'):
>         print >> out, '%.14f %s' % (random(), line),
>
> >>> out.close()
>
> sort corpus.decorated | cut -c 18- > corpus.randomized
Very good solution!  sort is truly excellent at handling very large datasets.
If you give it a file bigger than memory, it divides the input into temporary
files of roughly memory size, sorts each one, and then merges all the
temporary files back together.  You can tune the amount of memory it uses for
the in-memory sorts with --buffer-size, though it's pretty good at
auto-tuning.  You may also want to set --temporary-directory to avoid filling
up your /tmp.

In a previous job I did a lot of work with usenet news and was forever
blowing up the server with scripts that used too much memory.  sort was
always the solution!

--
Nick Craig-Wood <[EMAIL PROTECTED]> -- http://www.craig-wood.com/nick
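For what it's worth, here is a minimal sketch (mine, not from the original
post) of driving the same sort | cut step from Python with the two options
mentioned above.  The 1G buffer size and the /var/tmp spill directory are
purely illustrative values, not anything the poster specified:

    import subprocess

    # Assumed values: 1G of sort buffer, /var/tmp for the temporary chunk files.
    with open('corpus.randomized', 'w') as out:
        sort = subprocess.Popen(
            ['sort',
             '--buffer-size=1G',                # memory used for each in-memory run
             '--temporary-directory=/var/tmp',  # where the merge chunks are written
             'corpus.decorated'],
            stdout=subprocess.PIPE)
        # strip the 17-character "0.xxxxxxxxxxxxxx " decoration, like cut -c 18-
        subprocess.check_call(['cut', '-c', '18-'], stdin=sort.stdout, stdout=out)
        sort.stdout.close()
        sort.wait()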