In article <[EMAIL PROTECTED]>, John Nagle <[EMAIL PROTECTED]> wrote: >[EMAIL PROTECTED] wrote: >> Thanks to all who replied. It's very appreciated. >> >> Yes, I had to double check line counts and the number of lines is ~16 >> million (instead of stated 1.6B). > > OK, that's not bad at all. > > You have a few options: > > - Get enough memory to do the sort with an in-memory sort, like UNIX "sort" > or Python's "sort" function.
There is no Unix sort besides the executable binary /usr/bin/sort There is qsort() in the standard c-library but you would have to write a c-program around it. > - Thrash; in-memory sorts do very badly with virtual memory, but eventually > they finish. Might take many hours. A long time I thought that too, but of course it doesn't apply to mergesorts. Then I heard reputable sources say that quicksort has good "locality". Thinking about it, I find it more than plausible. Now I wonder whether even trash to disk (page fault) could tank a qsort using application. Why would it be a worse than a batch sort? > - Get a serious disk-to-disk sort program. (See "http://www.ordinal.com/". > There's a free 30-day trial. It can probably sort your data > in about a minute.) This is the way, but you don't need/want to go with "trial" software. > - Load the data into a database like MySQL and let it do the work. > This is slow if done wrong, but OK if done right. > - Write a distribution sort yourself. Fan out the incoming file into > one file for each first letter, sort each subfile, merge the > results. > >With DRAM at $64 for 4GB, I'd suggest just getting more memory and using >a standard in-memory sort. You give the mistaken impression that normal sort programs can only sort what can be hold in memory. This is only true for MSDOS' sort.exe that was an abomination even in 1983. All sort programs on Unixes, way back in the 1980 could handle files that were restricted by disk size. With respect to Microsoft OS'es: Despite its age and using only 1 Mbyte of internal memory, my ssort.exe can handle the OP's problem with ease. It sorted a 12 Mbyte file in minutes even in 1993. http://home.hccnet.nl/a.w.m.van.der.horst/ssort.html This is a plain executable, so no hassles and no trial. It is GPL, so you can even hack the source, if you fancy C++. There are a lot of Unix compatible toolkit present for Windows nowadays. Install one, you can spare a dozen Mbytes, can't you? > > John Nagle Groetjes Albert -- -- Albert van der Horst, UTRECHT,THE NETHERLANDS Economic growth -- like all pyramid schemes -- ultimately falters. [EMAIL PROTECTED]&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst -- http://mail.python.org/mailman/listinfo/python-list