[EMAIL PROTECTED] wrote:
> Thanks to all who replied. It's very appreciated.
> 
> Yes, I had to double check line counts and the number of lines is ~16
> million (instead of stated 1.6B).

    OK, that's not bad at all.

    You have a few options:

    - Get enough memory to do an in-memory sort, using something like the UNIX
        "sort" utility or Python's built-in list sort.
    - Thrash; in-memory sorts do very badly with virtual memory, but eventually
        they finish.  Might take many hours.
    - Get a serious disk-to-disk sort program. (See "http://www.ordinal.com/".
        There's a free 30-day trial.  It can probably sort your data
        in about a minute.)
    - Load the data into a database like MySQL and let it do the work.
        This is slow if done wrong, but OK if done right.
    - Write a distribution sort yourself.  Fan out the incoming file into
        one file for each first letter, sort each subfile, merge the
        results.
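
A minimal sketch of that last option in Python, assuming plain newline-terminated
text records, made-up file names ("big.txt", "big_sorted.txt"), and that no
single bucket outgrows memory:

    import os, tempfile

    def distribution_sort(in_path, out_path):
        tmpdir = tempfile.mkdtemp()
        buckets = {}                      # first character -> open file
        # Pass 1: fan the input out into one subfile per leading character.
        with open(in_path) as src:
            for line in src:
                key = line[0]
                f = buckets.get(key)
                if f is None:
                    f = open(os.path.join(tmpdir, "bucket_%04x" % ord(key)), "w")
                    buckets[key] = f
                f.write(line)
        for f in buckets.values():
            f.close()
        # Pass 2: sort each subfile in memory and append it to the output.
        # The buckets cover disjoint key ranges, so writing them out in
        # key order is the merge step.
        with open(out_path, "w") as dst:
            for key in sorted(buckets):
                path = os.path.join(tmpdir, "bucket_%04x" % ord(key))
                with open(path) as f:
                    lines = f.readlines()
                lines.sort()
                dst.writelines(lines)
                os.remove(path)
        os.rmdir(tmpdir)

    distribution_sort("big.txt", "big_sorted.txt")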

With DRAM at $64 for 4GB, I'd suggest just getting more memory and using
a standard in-memory sort.
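
A similarly minimal sketch of the in-memory route, again with made-up file
names; ~16 million lines of modest length should fit in a few GB of RAM:

    # Read everything, sort it with Python's built-in sort, write it back.
    with open("big.txt") as src:
        lines = src.readlines()
    lines.sort()
    with open("big_sorted.txt", "w") as dst:
        dst.writelines(lines)

The UNIX "sort" utility does the same job from the shell, and spills to
temporary files if memory runs short.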

                                John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list
