On Fri, Feb 19, 2010 at 11:27 PM, Jonathan Gardner <jgard...@jonathangardner.net> wrote:
> On Fri, Feb 19, 2010 at 10:36 PM, krishna <krishna.k.0...@gmail.com> wrote:
> >
> > I have to manage a couple of dicts with a huge dataset (larger than
> > is feasible with the memory on my system). Each basically has a key
> > which is a string (actually a tuple converted to a string) and a
> > two-item list as its value, with one element of the list being a
> > count related to the key. At the end I have to sort this dictionary
> > by the count.
> >
> > The platform is Linux. I am planning to implement it by setting a
> > threshold beyond which I write the data into files (3 columns:
> > 'key count some_val') and later merge those files: I plan to sort
> > the individual files by the key column, walk through the files with
> > one pointer per file, and merge them, adding up the counts when
> > entries from two files match by key, with the sorting done by the
> > 'sort' command. Thus the bottleneck is the 'sort' command.
> >
> > Any suggestions, comments?
>
> You should be using BDBs or even something like PostgreSQL. The
> indexes there will give you the scalability you need. I doubt you
> will be able to write anything that will select, update, insert or
> delete data better than what BDBs and PostgreSQL can give you.
>
> --
> Jonathan Gardner
> jgard...@jonathangardner.net

Thank you. I tried BDB; it seems to get very, very slow as you scale.

Thank you,
Krishna
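[The merge-and-add step described above maps quite directly onto
heapq.merge plus itertools.groupby from the standard library. Here is a
minimal sketch, not the poster's actual implementation: it assumes each
run file is already sorted by key and holds tab-separated
'key count some_val' lines, and the function names and file layout are
illustrative.

    import heapq
    import itertools

    def parse(line):
        # Each line is "key<TAB>count<TAB>some_val"; counts are integers.
        key, count, some_val = line.rstrip("\n").split("\t")
        return key, int(count), some_val

    def merge_runs(run_paths, out_path):
        # Open every sorted run file; heapq.merge performs the k-way
        # "one pointer per file" walk, yielding records in global key
        # order without loading everything into memory.
        files = [open(p) for p in run_paths]
        try:
            merged = heapq.merge(*(map(parse, f) for f in files))
            with open(out_path, "w") as out:
                # groupby collapses consecutive records sharing a key, so
                # entries that match across files get their counts added.
                for key, group in itertools.groupby(merged, key=lambda r: r[0]):
                    records = list(group)
                    total = sum(c for _, c, _ in records)
                    some_val = records[0][2]  # keep one representative value
                    out.write("%s\t%d\t%s\n" % (key, total, some_val))
        finally:
            for f in files:
                f.close()

The final sort by count can then stay with GNU sort, e.g.
sort -t "$(printf '\t')" -k2,2nr merged.txt, which itself spills to
temporary files when the input exceeds memory.]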
-- http://mail.python.org/mailman/listinfo/python-list