Paul,

I can pretty much promise you that if you really have 10^8
records, they should be put into a database; let the database
do the sorting by creating indexes on the fields that you want.
Something like MySQL should do nicely, and it is free.

http://www.mysql.org

Python has a good interface to the MySQL database if you want
to do other processing on the records.
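
Just to give you an idea, here is a rough sketch of that approach
using the MySQLdb module. The connection parameters, table name, and
column names below are made up, so adjust them to your own schema:

    import MySQLdb  # the MySQL-python package

    # All names here (database, table 'entries', columns 'key1'/'payload')
    # are placeholders for illustration only.
    conn = MySQLdb.connect(host="localhost", user="paul",
                           passwd="secret", db="recdb")
    cur = conn.cursor()

    # Build an index on the key column once; after that the database can
    # return the rows already ordered (or grouped) by that key.
    cur.execute("CREATE INDEX idx_key1 ON entries (key1)")
    cur.execute("SELECT key1, payload FROM entries ORDER BY key1")

    # Fetch in batches so 10^8 rows never have to sit in memory at once.
    while True:
        rows = cur.fetchmany(10000)
        if not rows:
            break
        for key1, payload in rows:
            pass  # process each record here, in key order

    conn.close()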

The alternative is a good high-speed external sort utility such as Syncsort.
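
If a commercial sort package isn't an option, an external merge sort
in plain Python does the same kind of job: sort chunks that fit in
memory, write each sorted run to a temporary file, then merge the
runs. A rough sketch, assuming one record per line and a key function
supplied by you:

    import heapq
    import itertools
    import os
    import tempfile

    def external_sort(infile, outfile, keyfunc, chunk_lines=1000000):
        """Sort a big line-oriented file by keyfunc without loading it all."""
        runs = []
        with open(infile) as src:
            while True:
                chunk = list(itertools.islice(src, chunk_lines))
                if not chunk:
                    break
                chunk.sort(key=keyfunc)      # in-memory sort of one run
                run = tempfile.NamedTemporaryFile(mode="w+", delete=False)
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        with open(outfile, "w") as dst:
            # heapq.merge streams the sorted runs together lazily
            # (the key= argument needs Python 3.5+).
            dst.writelines(heapq.merge(*runs, key=keyfunc))
        for run in runs:
            run.close()
            os.unlink(run.name)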

Good Luck,
Larry Bates



Paul wrote:
Hi all

I have a sorting problem, but my experience with Python is rather
limited (3 days), so I am running this by the list first.

I have a large database of 15GB, consisting of 10^8 entries of
approximately 100 bytes each. I devised a relatively simple key map on
my database, and I would like to order the database with respect to the
key.

I expect a few repeats for most of the keys, and that's actually part
of what I want to figure out in the end. (Said loosely, I want to group
all the data entries having "similar" keys. For this I first need to
sort by key (grouping data entries having the _same_ key), and then
figure out which keys are "similar".)
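
In Python terms I imagine something along these lines, where the data
and the key map are just toy stand-ins for my real ones:

    from itertools import groupby

    # toy stand-ins for my real data and key map
    records = ["abc...", "abd...", "abc..."]   # really 10^8 entries of ~100 bytes
    keymap = lambda entry: entry[:3]           # really one of my key functions

    records.sort(key=keymap)                   # sort the whole list by key once

    for key, group in groupby(records, key=keymap):
        same_key = list(group)                 # all entries with this exact key
        # next step: compare neighbouring keys to find "similar" ones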

A few thoughts on this:
- Space is not going to be an issue. I have a TB available.
- The Python sort() on a list should be good enough, if I can load the
whole database into a list/dict.
- Each data entry is relatively small, so I shouldn't need to use pointers.
- Keys could be strings, integers with the usual order, whatever is
handy; it doesn't matter to me. The choice will probably have to do
with what sort() prefers.
- Also, I will be happy with any key-space size, so I guess 100 * the
size of the database will do.

Any comments?
How long should I hope this sort will take? It may sound weird, but I
actually have 12 different key maps and I want to sort with respect to
each map, so I will have to sort 12 times.

Paul
