Thank you, Werner! I am playing with your object counter, but before I can interpret the results I have to ask a possibly stupid question. Can someone explain the following behavior of Python 2.6 on x86 Linux 2.6.24:
1) Start the interpreter and check the size of the interpreter process:

   $ ps -orss -p24316
     RSS
    4212

   The size is about 4 MB, fine.

2) Allocate a huge chunk of memory:

   >>> x = range(1, 10000000)

   Check the size of the interpreter process again:

   $ ps -orss -p24316
     RSS
   322460

   The RSS grew to 322460 KB, about 315 MB, as I expected.

3) Delete this object:

   >>> del x

   Check the size of the interpreter process:

   $ ps -orss -p24316
     RSS
   244332

   So the memory is far off its original value: the process is still sitting at
   nearly 240 MB, and it never gets released. Why is that?

4) Force a garbage collection:

   >>> import gc
   >>> gc.collect()
   0

   $ ps -orss -p24316
     RSS
   244404

   Not much luck here!

> -----Original Message-----
> From: twisted-python-boun...@twistedmatrix.com [mailto:twisted-python-
> boun...@twistedmatrix.com] On Behalf Of Werner Thie
> Sent: Friday, February 19, 2010 4:10 PM
> To: Twisted general discussion
> Subject: Re: [Twisted-Python] debugging a memory leak
>
> Hi Alec,
>
> ...and they promised you that with a gc'ed language there would never be
> a memory problem again; you could just plain forget about it.
>
> I was stuck in the same position as you, and after lots of probing the
> following approach helped a lot to correct what later proved to be overly
> optimistic coding: holding on to objects inside other objects for
> performance/practical reasons. Producing non-collectable cycles in Twisted
> is about as easy as forgetting about memory when you have Alzheimer's.
>
> Pinning down and working on the problem was only possible on the
> production machine under real load. I went ahead and created a manhole
> service on the production server, allowing me to peek at the Python object
> space without disturbing it too much. The tool I used is the code included
> further down.
>
> After cleaning all the self-produced cycles out, our server processes
> stabilized at roughly 280 to 320 MB per process and have now been running
> stable for months, with more than 20k logins per day and an average
> connect time of about 25 minutes per user playing games delivered by
> nevow/athena LivePages.
>
> As I said before, all the cycles I found in our software were introduced
> by patterns like
>
>     def beforeRender(self, ctx):
>         self.session = inevow.ISession(ctx)
>
> The included code helps to identify how many objects of each kind are
> around. Although it is a primitive tool, it shines a light where it is
> needed, and if you see certain object counts run away you have at least
> identified the neighborhood where the non-collectable cycles are being
> built.
>
> Why didn't I use heapy/guppy and find out that way? I wasn't able to find
> the evidence for what I was suspecting with any of the tools I tried (and
> boy, I tried for WEEKS). Avid users of heapy will most probably disagree
> and tell me it would have been easy, but in a situation like this,
> anything that moves you out of the pothole you're in is the right thing
> to do.
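As a minimal illustration of the kind of cycle Werner is describing, here is a sketch of my own (the class names are made up): an ordinary reference cycle is eventually freed by the collector, but as soon as any object in the cycle defines __del__, Python 2's collector refuses to guess a destruction order, keeps the whole cycle alive and records the offending object in gc.garbage.

    import gc

    class Session(object):            # hypothetical stand-in
        def __del__(self):            # a finalizer inside a cycle: uncollectable in Python 2
            pass

    class Page(object):               # hypothetical stand-in for a long-lived page object
        def __init__(self, session):
            self.session = session    # the page holds the session ...
            session.page = self       # ... and the session holds the page: a cycle

    p = Page(Session())
    del p                             # drop the last outside reference
    gc.collect()
    print 'uncollectable:', gc.garbage   # the Session lands here and is never freed

A counter like the one included below shows exactly this situation as class counts that only ever go up.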
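If you would rather not sit in the manhole calling the helper below by hand, one option is to drive it from a LoopingCall so the counts land in the twistd log on a schedule. This is a sketch only: the module name objcounter is hypothetical (put dumpObjects and exc wherever suits your tree), and the interval and limit are arbitrary.

    from twisted.internet.task import LoopingCall
    from objcounter import dumpObjects, exc   # hypothetical module holding the helper below

    def startObjectCountLogging(interval=300):
        # delta=False prints only the classes whose instance count changed since
        # the previous call; limit=50 skips anything with 50 or fewer instances;
        # exclude=exc drops the noisy built-in types.
        lc = LoopingCall(dumpObjects, delta=False, limit=50, include=[], exclude=exc)
        lc.start(interval, now=True)   # the first call just seeds the baseline
        return lc

Called once at startup, this gives a running record to compare against the RSS figures from ps.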
> HTH, Werner
>
> import gc
>
> # built-in and bookkeeping types to skip when counting everything
> exc = [
>     "function",
>     "type",
>     "list",
>     "dict",
>     "tuple",
>     "wrapper_descriptor",
>     "module",
>     "method_descriptor",
>     "member_descriptor",
>     "instancemethod",
>     "builtin_function_or_method",
>     "frame",
>     "classmethod",
>     "classmethod_descriptor",
>     "_Environ",
>     "MemoryError",
>     "_Printer",
>     "_Helper",
>     "getset_descriptor",
>     "weakref"
> ]
>
> # class names you suspect of leaking; replace with your own
> inc = [
>     'myFirstSuspect',
>     'mySecondSuspect'
> ]
>
> # counts from the previous call, used to compute deltas
> prev = {}
>
> def dumpObjects(delta=True, limit=0, include=inc, exclude=[]):
>     """
>     Count the live objects the garbage collector knows about, grouped by
>     class name, and print '<delta> -- <count> -- <name>' for each.  Use
>     either include (count only these classes) or exclude (count everything
>     except these), not both.  With limit > 0 only classes with more than
>     `limit` instances are shown; with delta=False only changed counts are
>     printed.
>     """
>     global prev
>     if include != [] and exclude != []:
>         print 'cannot use include and exclude at the same time'
>         return
>     print 'working with:'
>     print '   delta: ', delta
>     print '   limit: ', limit
>     print ' include: ', include
>     print ' exclude: ', exclude
>     objects = {}
>     gc.collect()
>     for o in gc.get_objects():
>         if getattr(o, "__class__", None):
>             name = o.__class__.__name__
>             if ((exclude == [] and include == []) or
>                 (exclude != [] and name not in exclude) or
>                 (include != [] and name in include)):
>                 objects[name] = objects.get(name, 0) + 1
>     names = objects.keys()
>     names.sort()
>     for name in names:
>         if limit == 0 or objects[name] > limit:
>             if name not in prev:
>                 prev[name] = objects[name]
>             dt = objects[name] - prev[name]
>             if delta or dt != 0:
>                 print '%0.6d -- %0.6d -- ' % (dt, objects[name]), name
>             prev[name] = objects[name]
>
> def getObjects(oname):
>     """
>     Return a list of all live objects whose class name is `oname`, fished
>     out of the sea of gc'ed objects.
>     """
>     olist = []
>     gc.collect()
>     for o in gc.get_objects():
>         if getattr(o, "__class__", None):
>             if o.__class__.__name__ == oname:
>                 olist.append(o)
>     return olist
>
> # dumpObj lives in a separate `dumpobj` helper module (not included here)
> dumpObject = dumpobj.dumpObj
>
> Alec Matusis wrote:
> > I modified a tolerably leaking (about 40 MB/day) Twisted server: when the
> > new code was pushed, the memory leak became catastrophic (about 100 MB/hr).
> > We could tolerate 40 MB/day, but the new code needs to be debugged.
> > First, I reverted to the old version, the one leaking 40 MB/day.  (The
> > leak rate is actually proportional to the number of new connections per
> > second, which correlates with the CPU utilization of the process: if CPU
> > as measured by top jumps above 90%, the leak can accelerate to 50 MB/hr.)
> > I took two steps to debug the leak:
> >
> > 1) Using guppy/heapy via the manhole:
> >
> > >>> hp = hpy()
> > >>> h = hp.heap()
> > >>> h
> > Partition of a set of 1157084 objects. Total size = 140911144 bytes.
> >  Index  Count   %     Size    % Cumulative  % Kind (class / dict of class)
> >      0 785798  68 48463680   34   48463680  34 str
> >      1   7357   1 24660664   18   73124344  52 dict of service.TagSession
> >      2  11735   1 12298280    9   85422624  61 dict of twisted.internet.base.DelayedCall
> >      3   7377   1  7731096    5   93153720  66 dict of twisted.internet.tcp.Server
> >      4   7375   1  7729000    5  100882720  72 dict of protocols.TagProtocol
> >      5  30925   3  7174600    5  108057320  77 __builtin__.set
> >      6   9193   1  6373336    5  114430656  81 dict (no owner)
> >      7  15557   1  3396904    2  117827560  84 list
> >      8  44833   4  3227976    2  121055536  86 types.BuiltinFunctionType
> >      9  38142   3  3051360    2  124106896  88 types.MethodType
> > <328 more rows. Type e.g. '_.more' to view.>
> >
> > Note that the total size of all objects is 140911144 bytes.
> > The 1st, 3rd and 4th items in this list show the actual number of
> > connected clients.  I wait for about 48 hours, then execute the same
> > command, and I see approximately the same Total size, about 130 MB.
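Heapy only accounts for what the Python heap knows about, so it helps to log the OS view of the process right next to it. Below is a small helper of my own, Linux-specific: it reads VmRSS from /proc/<pid>/status, which is the same figure ps -o rss reports.

    import os

    def process_rss_kb(pid=None):
        """Return the resident set size of `pid` (default: this process) in kB."""
        if pid is None:
            pid = os.getpid()
        with open('/proc/%d/status' % pid) as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])   # /proc reports the value in kB
        return None

Calling it from the same manhole session right after hp.heap() makes the gap between what heapy counts and what the OS charges to the process easy to track over time.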
> > So the total size that Heapy sees via the manhole is stable, fluctuating
> > around 140 MB.
> >
> > The problem is that the total RSS size of the process as seen by the OS
> > is much larger: 871680 KB = 851 MB:
> >
> > $ ps -o pid,vsz,rss,sz,size -p 11236
> >   PID     VSZ    RSS     SZ      SZ
> > 11236 1303180 871680 325795 1174388
> >
> > It is this total RSS size that keeps leaking at 40 MB per day.
> >
> > As far as I understand, this means the problem is not in pure Python
> > code, since Heapy shows that the total size of all Python objects is
> > more or less constant.  Is this a correct assumption?
> >
> > 2) So now I turn to valgrind.  I am no expert in using valgrind, so what
> > I did was based only on general logic/rough guesses.
> > Since I cannot run this under valgrind on a production machine for
> > performance reasons, I recompile Python on the staging machine:
> >
> > ./configure --enable-shared --without-pymalloc --prefix=/nail/encap/python-2.6.4-valgrind
> >
> > I then follow the instructions in
> > http://svn.python.org/projects/python/trunk/Misc/README.valgrind
> > and run the twistd process like this:
> >
> > valgrind --tool=memcheck \
> >   --suppressions=/nail/sys/src/Python-2.6.4/Misc/valgrind-python.supp \
> >   --leak-check=full --log-file=/tmp/valgrind.log \
> >   /usr/local/bin/twistd --no_save --reactor=epoll --pidfile=logs/tagserv.pid \
> >   --logfile=logs/tagserv.log --python=tagserv.py
> >
> > The memory of the process shown by the OS is now 5x normal, and the
> > performance is about 5x worse, since it is running inside valgrind's
> > synthetic CPU.
> > Because this is done on the staging box, where I cannot accurately
> > reproduce the real load, the memory leaks from the simulated load seen by
> > ps -o rss are pretty small, about 1 to 10 MB.  Still, I am interested in
> > finding out what they are.
> >
> > Now I run into the limits of my understanding of how to use valgrind.
> > This may not be the appropriate list for this, but perhaps someone will
> > recognize the problem.
> > When I start the server, about 240 lines are written to the valgrind log
> > file (--log-file=/tmp/valgrind.log).  When I shut it down, it adds
> > another 100 lines.  No matter what I do in between, I always end up with
> > a log file of exactly 343 lines.
> > I can have server runs with a leak of 1 MB or with 10 MB, but in the end
> > I get pretty much the same log file.  Moreover, when I look at all the
> > lost memory reports:
> >
> > $ grep lost valgrind.log.1
> > 17,352 bytes in 31 blocks are possibly lost in loss record 49 of 62
> > 203,312 bytes in 478 blocks are possibly lost in loss record 57 of 62
> >    definitely lost: 0 bytes in 0 blocks.
> >    possibly lost: 220,664 bytes in 509 blocks.
> > 64 bytes in 2 blocks are definitely lost in loss record 12 of 63
> > 17,352 bytes in 31 blocks are possibly lost in loss record 50 of 63
> > 203,824 bytes in 479 blocks are possibly lost in loss record 58 of 63
> >    definitely lost: 64 bytes in 2 blocks.
> >    possibly lost: 221,176 bytes in 510 blocks.
> > 47 bytes in 1 blocks are definitely lost in loss record 8 of 63
> > 128 bytes in 4 blocks are definitely lost in loss record 16 of 63
> > 584 (104 direct, 480 indirect) bytes in 2 blocks are definitely lost in loss record 26 of 63
> > 1,008 bytes in 6 blocks are definitely lost in loss record 31 of 63
> > 22,296 bytes in 41 blocks are possibly lost in loss record 50 of 63
> > 183,368 bytes in 381 blocks are possibly lost in loss record 59 of 63
> >    definitely lost: 1,287 bytes in 13 blocks.
> >    indirectly lost: 480 bytes in 20 blocks.
> >    possibly lost: 205,664 bytes in 422 blocks.
> >
> > If I add up all those numbers, I get less than 1 MB.  How do I track
> > down the 10 MB leak?
> >
> > Are there any alternative strategies for finding this leak?
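One thing worth keeping in mind when reading memcheck logs: the "lost" categories only cover memory that nothing points to any more at exit, while memory the process still references is reported under "still reachable", which the grep above does not pick up and which --leak-check=full only itemizes if you also pass --show-reachable=yes. Here is a sketch of how the summary figures, including that category, could be pulled out of the log; the path is the one used in the valgrind command above.

    import re

    # matches the per-category lines of memcheck's LEAK SUMMARY block
    LINE = re.compile(r'(definitely lost|indirectly lost|possibly lost|'
                      r'still reachable):\s+([\d,]+) bytes in ([\d,]+) blocks')

    def leak_summaries(path='/tmp/valgrind.log'):
        """Return one dict of {category: bytes} per LEAK SUMMARY block in the log."""
        summaries = []
        with open(path) as f:
            for line in f:
                if 'LEAK SUMMARY' in line:
                    summaries.append({})
                m = LINE.search(line)
                if m and summaries:
                    summaries[-1][m.group(1)] = int(m.group(2).replace(',', ''))
        return summaries

    if __name__ == '__main__':
        for i, s in enumerate(leak_summaries()):
            print 'summary %d:' % i
            for kind in sorted(s):
                print '  %-16s %12d bytes' % (kind, s[kind])

If most of the growth turns up as still reachable rather than lost, the records that --show-reachable=yes adds to the log are the ones worth reading.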