> Martin MOKREJŠ wrote:
>> Hi, could someone tell me what does and what doesn't copy references in Python? I have found that my script, after reaching some state and taking say 600MB, pushes its internal dictionaries to hard disk. The for loop consumes another 300MB (as gathered by vmstat) to push the data to dictionaries, then releases a little bit less than 300MB, and the program starts to fill up its internal dictionaries again; when "full" it will do the flush again ...
> Almost anything you do copies references.
But what does this do?:
    x = 'xxxxx'
    a = x[2:]
    b = z + len(x)
    dict[a] = b
The point here is that this code takes a lot of extra memory. I believe it's the references problem, and I remember complaints of friends facing the same problem. I'm a newbie, yes, but I don't have this problem with Perl. OK, I want to improve my Python knowledge ... :-))
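For what it's worth, a minimal sketch of the semantics (names made up): slicing allocates a new string, while plain assignment and dictionary insertion only copy references:

    import sys

    x = 'xxxxx'
    a = x[2:]        # slicing builds a NEW string object; data is copied
    b = x            # plain assignment copies only the reference
    d = {}
    d[a] = b         # the dict stores references to a and b, not copies
    print a is x     # False: a is a distinct object
    print b is x     # True: b and x name the same object
    print sys.getrefcount(x)   # includes the temporary reference made by the call

So x[2:] and z + len(x) each create a fresh object; the dictionary insertion itself copies nothing, it just adds two references.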
>> <long code extract snipped>
>> The above routine doesn't release the memory back when it exits.
> That's probably because there isn't any memory it can reasonably be expected to release. What memory would *you* expect it to release?
Those 300MB get allocated/reserved when the posted loop gets executed. When the loop exits, almost all of it is returned/deallocated. Yes, almost. :(
> The member variables are all still accessible as member variables until you run your loop at the end to clear them, so no way could Python release them.
OK, I wanted to know whether there's some assignment using a reference that prevents the internal garbage collector from recycling the memory - for example, the target dictionary still keeping a reference to the temporary dictionary.
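A toy illustration of the kind of lingering reference I mean (hypothetical names):

    temp = {}
    temp['seq'] = 'ACGT' * 1000000   # a few MB of data
    target = {}
    target['snapshot'] = temp        # target references the SAME dict object
    del temp                         # only the name goes away; the object is
                                     # kept alive through target['snapshot'],
                                     # so the collector cannot reclaim it
    target.clear()                   # only now can the memory be recycled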
> Some hints:
> When posting code, try to post complete examples which actually work. I don't know what type the self._dict_on_diskXX variables are supposed to be. It makes a big difference whether they are dictionaries (so you are trying to hold everything in memory at one time) or shelve.Shelf objects, which would store the values on disc in a reasonably efficient manner.
The self._dict_on_diskXX objects are bsddb files; the self._tmpdictXX objects are built-in dictionaries.
> Even if they are Shelf objects, I see no reason here why you have to
I gathered from a previous discussion that it's faster to use bsddb directly, so no shelve.
> process everything at once. Write a simple function which processes one tmpdict object into one dict_on_disk object and then closes the
That's what I do, but in the for loop ...
> dict_on_disk object. If you want to compare results later then do that by
OK, I got your point.
> reopening the dict_on_disk objects when you have deleted all the tmpdicts.
That's what I do (not shown).
> Extract out everything you want to do into a class which has at most one tmpdict and one dict_on_disk. That way your code will be a lot easier to read.
> Make your code more legible by using fewer underscores.
> What on earth is the point of an explicit call to __add__? If Guido had meant us to use __add__ he wouldn't have created '+'.
To invoke addition directly on the object. It's faster than letting Python figure out that I'm summing int() plus int(). It definitely helped a lot when using Decimal(a) + Decimal(b), where I got rid of thousands of calls to Decimal.__new__, __init__ and I think a few other Decimal methods as well - I think getcontext() too.
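If it matters, timeit can settle whether the direct call actually wins for plain ints (a rough sketch; timings vary by build and machine):

    import timeit

    # compare the operator against the explicit method call
    plain  = timeit.Timer('a + b',        'a = 1; b = 2')
    direct = timeit.Timer('a.__add__(b)', 'a = 1; b = 2')
    print 'a + b        :', plain.timeit()
    print 'a.__add__(b) :', direct.timeit()

For built-in ints the extra attribute lookup in a.__add__(b) usually costs more than it saves; the trick pays off mainly for heavyweight types like Decimal.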
> What is the purpose of dict_on_disk? Is it for humans to read the data? If not, then don't store everything as a string. Far better to just store a
It is processed for humans later.
> tuple of your values then you don't have to use split or cast the strings
bsddb complains that I can store only a string as a key or value. I'd love to store a tuple.
    import bsddb
    _words1 = bsddb.btopen('db1.db', 'c')
    _words1['a'] = 1

    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/lib/python2.3/bsddb/__init__.py", line 120, in __setitem__
        self.db[key] = value
    TypeError: Key and Data values must be of type string or None.
How can I record a number then?
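One workaround, as a sketch: since bsddb only stores strings, serialize the value yourself - str()/int() for a single number, or the marshal module for a whole tuple:

    import bsddb
    import marshal

    _words1 = bsddb.btopen('db1.db', 'c')

    # a single number: store its string form, convert back on the way out
    _words1['a'] = str(1)
    count = int(_words1['a'])

    # a whole tuple: marshal handles ints, strings, tuples and the like
    _words1['b'] = marshal.dumps((1, 0, 0, 0))
    count, a, b, expected = marshal.loads(_words1['b'])

    _words1.close()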
> to integers. If you do want humans to read some final output then produce that separately from the working data files.
> You split out 4 values from dict_on_disk and set three of them to 0. Is that really what you meant, or should you be preserving the previous values?
No, overwrite them, i.e. invalidate them. Originally I recorded only the first, but computing the latter numbers is so expensive that I have to store them. As walking through the dictionaries is so slow, I gave up on the idea of storing just the one value and, much later in the program, walking through the dictionary once again to 'fix' it by computing the missing values.
> Here is some (untested) code which might help you:
> import shelve
Why shelve? To have the ability to record a tuple? Isn't it cheaper to convert to a string and back and write to bsddb, compared to this overhead?
> def push_to_disc(data, filename):
>     database = shelve.open(filename)
>     try:
>         for key in data:
>             if database.has_key(key):
>                 count, a, b, expected = database[key]
>                 database[key] = count+data[key], a, b, expected
>             else:
>                 database[key] = data[key], 0, 0, 0
>     finally:
>         database.close()
>
>     data.clear()
> Call that once for each input dictionary and your data will be written out to a disc file and the internal dictionary cleared without any great spike of memory use.
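Assuming the tmpdicts map keys to plain integer counts, usage would look something like this (an untested sketch; the file name is made up):

    tmpdict1 = {'foo': 3, 'bar': 1}       # stands in for self._tmpdict1
    push_to_disc(tmpdict1, 'words1.db')   # writes the data and clears the dict

    # later, reopen to read the merged tuples back
    import shelve
    database = shelve.open('words1.db')
    for key in database.keys():
        count, a, b, expected = database[key]
        print key, count, a, b, expected
    database.close()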
Can I use the mmap() feature on bsddb or any .db file? Most of the time I do updates, not inserts! I don't want to rewrite the whole 300MB file all the time; I want to update it in place. What do I need for that - to know the maximal length of a string value kept in the .db file? Can I get rid of locking support in those huge files?
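As far as I understand, Berkeley DB already updates pages in place rather than rewriting the whole file; what may help more is giving it a bigger cache so hot pages stay in memory between updates. btopen() accepts a cachesize argument in bytes - a sketch, with the size only a guess to tune:

    import bsddb

    # 64MB cache; the library writes dirty pages back as needed
    words = bsddb.btopen('db1.db', 'c', cachesize=64 * 1024 * 1024)
    words['key'] = str(42)   # an update touches only the affected page(s)
    words.sync()             # flush dirty pages explicitly when you choose
    words.close()

I'm not sure the legacy bsddb wrapper exposes a way to disable locking; that would be one for the underlying DB library's documentation.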
Definitely I can improve my algorithm. But I believe I'll always have to work with those huge files.

Martin
--
http://mail.python.org/mailman/listinfo/python-list