Paul,
I can pretty much promise you that if you really have 10^8 records, they should be put into a database, and you should let the database do the sorting by creating indexes on the fields you want sorted. Something like MySQL should do nicely, and it's free.
http://www.mysql.org
Python has a good interface to MySQL if you want to do other processing on the records.
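Roughly, that approach looks like this (just a sketch using the MySQLdb driver; the table name, columns and connection details are made up, and the real load would stream from the 15GB file rather than a toy list):

import MySQLdb

# Connection details are placeholders.
conn = MySQLdb.connect(host="localhost", user="paul",
                       passwd="secret", db="sortdemo")
cur = conn.cursor()

# One column for the key (add more columns for the other key maps)
# plus the raw ~100-byte record.
cur.execute("""CREATE TABLE IF NOT EXISTS entries (
                   k1   VARCHAR(32),
                   data VARCHAR(100)
               )""")

# Toy rows; in practice stream them from the 15GB file
# (or use LOAD DATA INFILE, which is much faster for bulk loads).
rows = [("key_b", "record two"), ("key_a", "record one")]
cur.executemany("INSERT INTO entries (k1, data) VALUES (%s, %s)", rows)
conn.commit()

# Build the index once; the server effectively does the sort.
cur.execute("CREATE INDEX idx_k1 ON entries (k1)")

# Read the records back in key order, in manageable chunks.
cur.execute("SELECT k1, data FROM entries ORDER BY k1")
while True:
    chunk = cur.fetchmany(10000)
    if not chunk:
        break
    for key, data in chunk:
        pass  # group/compare "similar" keys here

With 12 key maps you would just add 12 key columns, build 12 indexes, and change the ORDER BY for each pass.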
The alternative is a good high-speed external sort utility such as Syncsort.
Good Luck, Larry Bates
Paul wrote:
Hi all
I have a sorting problem, but my experience with Python is rather limited (3 days), so I am running this by the list first.
I have a large database of 15GB, consisting of 10^8 entries of approximately 100 bytes each. I devised a relatively simple key map on my database, and I would like to order the database with respect to the key.
I expect a few repeats for most of the keys, and that's actually part of what I want to figure out in the end. (Loosely speaking, I want to group all the data entries having "similar" keys. For this I need to sort on the keys first (grouping data entries having the _same_ key), and then figure out which keys are "similar".)
A few thoughts on this:
- Space is not going to be an issue. I have a TB available.
- The Python sort() on a list should be good enough, if I can load the whole database into a list/dict.
- Each data entry is relatively small, so I shouldn't use pointers.
- Keys could be strings, integers with the usual order, whatever is handy; it doesn't matter to me. The choice will probably have to do with what sort() prefers.
- Also I will be happy with any key space size, so I guess 100 * (size of the database) will do.
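As a sketch of what I mean, with a made-up key map and toy entries (the real thing would read the 15GB database and use one of my actual key maps):

from itertools import groupby

# Toy entries and a made-up key map; the real entries come from the
# 15GB database and keymap would be one of the 12 key maps.
entries = ["apple pie", "apricot jam", "banana bread", "apple cake"]

def keymap(entry):
    return entry[:2]   # made-up key: first two characters

# Sort by the key, then group entries sharing the _same_ key.
entries.sort(key=keymap)
for key, group in groupby(entries, key=keymap):
    print(key, list(group))

(This prints the three "ap" entries together and the "ba" entry on its own.)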
Any comments? How long should I hope this sort will take? It may sound weird, but I actually have 12 different key maps, and I want to sort the data with respect to each map, so I will have to sort 12 times.
Paul