27.05.2013 04:20, Yuichi SEINO wrote:
> Hi,
>
> 2013/5/24 Vladislav Bogdanov <bub...@hoster-ok.com>:
>> 24.05.2013 06:34, Andrew Beekhof wrote:
>>> Any help figuring out where the leaks might be would be very much
>>> appreciated :)
>>
>> One (and the only) suspect is unfortunately crmd itself.
>> Its private heap has grown from 2708 to 3680 kB.
>>
>> All other relevant differences are in qb shm buffers, which are
>> controlled and may grow until they reach their configured size.
>>
>> @Yuichi
>> I would recommend running under valgrind on a testing cluster to
>> figure out whether this is a memleak (lost memory) or some history
>> data (referenced memory). The latter may still be a logical memleak,
>> though. You may look in /etc/sysconfig/pacemaker for details.
>
> I ran valgrind for about 2 days, and I attached the valgrind output
> from the ACT node and the SBY node.
I do not see any "direct" memory leaks (repeating 'definitely-lost' allocations) there. So what we see is probably one of: * Cache/history/etc, which grows up to some limit (or expired at the some point in time). * Unlimited/not-expirable lists/hashes of data structures, which are correctly freed at exit (f.e like dlm_controld has(had???) for a debugging buffer or like glibc resolver had in EL3). This cannot be caught with valgrind if you use it in a standard way. I believe we have former one. To prove that, it would be very interesting to run under valgrind *debugger* (--vgdb=yes|full) for some long enough (2-3 weeks) period of time and periodically get memory allocation state from there (with 'monitor leak_check full reachable any' gdb command). I wanted to do that a long time ago, but unfortunately did not have enough spare time to even try that (although I tried to valgrind other programs that way). This is described in valgrind documentation: http://valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.gdbserver We probably do not need to specify '--vgdb-error=0' because we do not need to install watchpoints at the start (and we do not need/want to immediately connect to crmd with gdb to tell it to continue), we just need to periodically get status of memory allocations (stop-leak_check-cont sequence). Probably that should be done in a 'fast' manner, so crmd does not stop for a long time, and the rest of pacemaker does not see it 'hanged'. Again, I did not try that, and I do not know if it's even possible to do that with crmd. And, as pacemaker heavily utilizes glib, which has own memory allocator (slices), it is better to switch it to a 'standard' malloc/free for debugging with G_SLICE=always-malloc env var. Last, I did memleak checks for a 'static' (i.e. no operations except monitors are performed) cluster for ~1.1.8, and did not find any. It would be interesting to see if that is true for an 'active' one, which starts/stops resources, handles failures, etc. > > Sincerely, > Yuichi > >> >>> >>> Also, the measurements are in pages... could you run "getconf PAGESIZE" and >>> let us know the result? >>> I'm guessing 4096 bytes. >>> >>> On 23/05/2013, at 5:47 PM, Yuichi SEINO <seino.clust...@gmail.com> wrote: >>> >>>> Hi, >>>> >>>> I retry the test after we updated packages to the latest tag and OS. >>>> glue and booth is latest. >>>> >>>> * Environment >>>> OS:RHEL 6.4 >>>> cluster-glue:latest(commit:2755:8347e8c9b94f) + >>>> patch[detail:http://www.gossamer-threads.com/lists/linuxha/dev/85787] >>>> resource-agent:v3.9.5 >>>> libqb:v0.14.4 >>>> corosync:v2.3.0 >>>> pacemaker:v1.1.10-rc2 >>>> crmsh:v1.2.5 >>>> booth:latest(commit:67e1208973de728958432aaba165766eac1ce3a0) >>>> >>>> * Test procedure >>>> we regularly switch a ticket. The previous test also used the same way. >>>> And, There was no a memory leak when we tested pacemaker-1.1 before >>>> pacemaker use libqb. >>>> >>>> * Result >>>> As a result, I think that crmd may cause the memory leak. >>>> >>>> crmd smaps(a total of each addresses) >>>> In detail, we attached smaps of start and end. And, I recorded smaps >>>> every 1 minutes. 
>
> Sincerely,
> Yuichi
>
>>
>>> Also, the measurements are in pages... could you run
>>> "getconf PAGESIZE" and let us know the result?
>>> I'm guessing 4096 bytes.
>>>
>>> On 23/05/2013, at 5:47 PM, Yuichi SEINO <seino.clust...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I reran the test after we updated the packages to the latest tags
>>>> and OS. glue and booth are latest.
>>>>
>>>> * Environment
>>>> OS: RHEL 6.4
>>>> cluster-glue: latest (commit:2755:8347e8c9b94f) +
>>>> patch (detail: http://www.gossamer-threads.com/lists/linuxha/dev/85787)
>>>> resource-agents: v3.9.5
>>>> libqb: v0.14.4
>>>> corosync: v2.3.0
>>>> pacemaker: v1.1.10-rc2
>>>> crmsh: v1.2.5
>>>> booth: latest (commit:67e1208973de728958432aaba165766eac1ce3a0)
>>>>
>>>> * Test procedure
>>>> We regularly switch a ticket over; the previous test used the same
>>>> procedure. There was no memory leak when we tested pacemaker-1.1
>>>> before pacemaker used libqb.
>>>>
>>>> * Result
>>>> As a result, I think that crmd may have a memory leak.
>>>>
>>>> crmd smaps (a total over all mappings)
>>>> In detail, we attached the smaps from start and end, and I recorded
>>>> smaps every 1 minute.
>>>>
>>>> Start
>>>> RSS: 7396
>>>> SHR (Shared_Clean + Shared_Dirty): 3560
>>>> Private (Private_Clean + Private_Dirty): 3836
>>>>
>>>> Interval (about 30h later)
>>>> RSS: 18464
>>>> SHR: 14276
>>>> Private: 4188
>>>>
>>>> End (about 70h later)
>>>> RSS: 19104
>>>> SHR: 14336
>>>> Private: 4768
>>>>
>>>> Sincerely,
>>>> Yuichi
>>>>
>>>> 2013/5/15 Yuichi SEINO <seino.clust...@gmail.com>:
>>>>> Hi,
>>>>>
>>>>> I ran the test for about two days.
>>>>>
>>>>> Environment
>>>>>
>>>>> OS: RHEL 6.3
>>>>> pacemaker-1.1.9-devel (commit 138556cb0b375a490a96f35e7fbeccc576a22011)
>>>>> corosync-2.3.0
>>>>> cluster-glue: latest + patch
>>>>> (detail: http://www.gossamer-threads.com/lists/linuxha/dev/85787)
>>>>> libqb-0.14.4
>>>>>
>>>>> There may be a memory leak in crmd and lrmd. I regularly recorded
>>>>> the RSS reported by ps.
>>>>>
>>>>> start-up
>>>>> crmd: 5332
>>>>> lrmd: 3625
>>>>>
>>>>> interval (about 30h later)
>>>>> crmd: 7716
>>>>> lrmd: 3744
>>>>>
>>>>> ending (about 60h later)
>>>>> crmd: 8336
>>>>> lrmd: 3780
>>>>>
>>>>> I have not yet run this test with pacemaker-1.1.10-rc2, so I will
>>>>> run it next.
>>>>>
>>>>> Sincerely,
>>>>> Yuichi
>>>>
>>>> <smaps_log.tar.gz>
>
> --
> Yuichi SEINO
> METROSYSTEMS CORPORATION
> E-mail:seino.clust...@gmail.com

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org