On Dec 10, 1:28 pm, [EMAIL PROTECTED] wrote: > Seongsu Lee: > > >I have a dictionary with million keys. Each value in the dictionary has a > >list with up to thousand integers.< > > Let's say each integer can be represented with 32 bits (if there are > less numbers then a 3-byte representation may suffice, but this makes > things more complex), that is 2^2 bytes. Let's say there are 2^20 keys > each one associated to 2^10 values. So to represent the values you > need 2^32 bytes. It means 4 GB, so I don't think Python suffices to > store them in RAM, because a Python int object requires quite more > than 4 bytes (only represented inside an array.array it may need just > 4 bytes). > > So if you can use 128 MB RAM to store such data structure you need to > store data on HD too. You probably can use a lower-level language. On > disk you can keep the reverse index, represented as an array of > records/structs, each of such structures keep two 32-bit numbers (so > such array is 8 GB). Such index is sorted according to the first > element of the struct. The first number is the value of the original > dictionary and the second nuber is its key. Inside the RAM you can > keep another sorted array that "summarizes" your whole data. When you > need a number you can do a binary search on the array in RAM, such > array gives you the position where you can read (with a seek) a little > part of the file (512 bytes may suffice), to perform a little binary > search (if the block is very little a linear scan suffices) on it too > to find the number you need. Note that the summarizing data structure > in RAM may be represented with just a Python dict too, so in the end > you can use Python to solve this problem. You may need a lower-level > language to create the 8 GB file on disk (or create it with Python, > but it may take lot of time. You may sort it with the sort unix > command). > > This isn't a complete solution, but I think it may work. > > Bye, > bearophile
Nice. :) Regards, Jordan -- http://mail.python.org/mailman/listinfo/python-list