On Jan 25, 6:18 am, [EMAIL PROTECTED] wrote:
> Hello all,
>
> I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
> to sort based on first two characters.

If you mean 1.6 American billion, i.e. 1.6 * 1000 ** 3 lines, and 2 * 1024 ** 3 bytes of data, that's 1.34 bytes per line, which is hard to believe when the line terminator alone takes at least one byte. If you mean other definitions of "billion" and/or "GB", the result is even fewer bytes per line.

What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:

    print repr(open('the_file', 'rb').read(100))

and show us the results.

What does "based on [the] first two characters" mean? Do you mean raw order based on the ordinal of each character, i.e. no fancy language-specific collating sequence? Do the first two characters always belong to the ASCII subset?

You'd like to sort a large file? Why? Sorting a file is just a means to an end, and often another means is more appropriate. What are you going to do with it after it's sorted?

> I'd greatly appreciate if someone can post sample code that can help
> me do this.

I'm sure you would. However, it would benefit you even more if, instead of sitting on the beach next to the big arrow pointing to the drop zone, you were to read the manual and work out how to do it yourself. Here's a start:

http://docs.python.org/lib/typesseq-mutable.html

> Also, any ideas on approximately how long is the sort process going to
> take (XP, Dual Core 2.0GHz w/2GB RAM).

If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.

Instead of writing Python code, you are probably better off doing an external sort. You might consider looking for a Windows port of a Unicode-capable Unix sort utility. Google "GnuWin32" and see if their sort does what you want.

-- 
http://mail.python.org/mailman/listinfo/python-list
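
For anyone wondering what an external sort actually involves, here is a minimal sketch of the idea: sort the file in memory-sized runs, then merge the runs. It is an illustration under stated assumptions, not a substitute for a proper sort utility. It assumes you already know the file's encoding, that "first two characters" means raw code-point order with no collation, and that lines end in "\n". The names external_sort, sort_key and chunk_lines are made up for this example, and it is written in Python 3 syntax (newer than the one-liner above).

```python
import heapq
import itertools
import os
import tempfile


def sort_key(line):
    # Assumption: raw code-point order of the first two characters,
    # with no language-specific collation.
    return line[:2]


def external_sort(src, dst, encoding="utf-8", chunk_lines=100_000):
    """Sort the lines of src into dst by sort_key using bounded memory."""
    run_paths = []
    try:
        # Pass 1: cut the input into runs of at most chunk_lines lines,
        # sort each run in memory, and write it to its own temporary file.
        with open(src, "r", encoding=encoding, newline="") as infile:
            while True:
                chunk = list(itertools.islice(infile, chunk_lines))
                if not chunk:
                    break
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"  # the file's last line may lack a newline
                chunk.sort(key=sort_key)
                fd, path = tempfile.mkstemp(suffix=".run")
                with os.fdopen(fd, "w", encoding=encoding, newline="") as run:
                    run.writelines(chunk)
                run_paths.append(path)

        # Pass 2: merge the sorted runs; only one line per run is held
        # in memory at any time.
        runs = [open(p, "r", encoding=encoding, newline="") for p in run_paths]
        try:
            with open(dst, "w", encoding=encoding, newline="") as outfile:
                outfile.writelines(heapq.merge(*runs, key=sort_key))
        finally:
            for run in runs:
                run.close()
    finally:
        # Remove the temporary run files whether or not the sort succeeded.
        for path in run_paths:
            os.remove(path)
```

Usage would look like external_sort('the_file', 'the_file.sorted', encoding='utf-16'), with the encoding being whatever the repr() output tells you. Lines whose first two characters are equal keep their original relative order. Note that the merge opens every run file at once, so for a very large input chunk_lines has to be big enough to keep the number of runs below the operating system's open-file limit, or the merge has to be done in several passes.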