On 7 May 2014, at 8:31 pm, Greg Murphy <greg.mur...@gamesparks.com> wrote:
> Thanks Andrew, much appreciated.
>
> I'll try upgrading to 1.11 and report back with how it goes. At this
> point it may even be worth trying the .12 release candidate.
>
>
> On 07/05/2014 01:20, "Andrew Beekhof" <and...@beekhof.net> wrote:
>
>>
>> On 6 May 2014, at 7:47 pm, Greg Murphy <greg.mur...@gamesparks.com>
>> wrote:
>>
>>> Here you go - I've only run lrmd for 30 minutes since installing
>>> the debug package, but hopefully that's enough - if not, let me
>>> know and I'll do a longer capture.
>>>
>>
>> I'll keep looking, but almost everything so far seems to be from or
>> related to the g_dbus API:
>>
>> ...
>> ==37625==    by 0x6F20E30: g_dbus_proxy_new_for_bus_sync (in
>> /usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.3800.1)
>> ==37625==    by 0x507B90B: get_proxy (upstart.c:66)
>> ==37625==    by 0x507B9BF: upstart_init (upstart.c:85)
>> ==37625==    by 0x507C88E: upstart_job_exec (upstart.c:429)
>> ==37625==    by 0x10CE03: lrmd_rsc_dispatch (lrmd.c:879)
>> ==37625==    by 0x4E5F112: crm_trigger_dispatch (mainloop.c:105)
>> ==37625==    by 0x58A13B5: g_main_context_dispatch (in
>> /lib/x86_64-linux-gnu/libglib-2.0.so.0.3800.1)
>> ==37625==    by 0x58A1707: ??? (in
>> /lib/x86_64-linux-gnu/libglib-2.0.so.0.3800.1)
>> ==37625==    by 0x58A1B09: g_main_loop_run (in
>> /lib/x86_64-linux-gnu/libglib-2.0.so.0.3800.1)
>> ==37625==    by 0x10AC3A: main (main.c:314)
>>
>> That code path is hit every time an upstart job is run (i.e. on
>> every recurring monitor of an upstart resource).
>>
>> There were several problems with that API and we removed all use of
>> it in 1.1.11.
>> I'm quite confident that most, if not all, of the memory issues
>> would go away if you upgraded.
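For illustration, the leak class Andrew describes looks roughly like
the sketch below: a new GDBusProxy is built for every job operation
and its reference is never dropped, so the heap grows on each
recurring monitor. This is only an illustrative sketch, not the actual
upstart.c code or the 1.1.11 fix (which removed the g_dbus API
entirely); the Upstart bus name, object path and interface are
assumptions here. Caching the proxy for the daemon's lifetime avoids
the per-call allocation:

    #include <gio/gio.h>

    /* Cached for the daemon's lifetime instead of being re-created
     * (and leaked) on every job execution. */
    static GDBusProxy *upstart_proxy = NULL;

    static GDBusProxy *
    get_upstart_proxy(GError **error)
    {
        if (upstart_proxy == NULL) {
            upstart_proxy = g_dbus_proxy_new_for_bus_sync(
                G_BUS_TYPE_SYSTEM, G_DBUS_PROXY_FLAGS_NONE,
                NULL,                     /* no GDBusInterfaceInfo */
                "com.ubuntu.Upstart",     /* bus name (assumed) */
                "/com/ubuntu/Upstart",    /* object path (assumed) */
                "com.ubuntu.Upstart0_6",  /* interface (assumed) */
                NULL, error);
        }
        return upstart_proxy;
    }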
>>> On 06/05/2014 10:08, "Andrew Beekhof" <and...@beekhof.net> wrote:
>>>
>>>> Oh, any chance you could install the debug packages? It will make
>>>> the output even more useful :-)
>>>>
>>>> On 6 May 2014, at 7:06 pm, Andrew Beekhof <and...@beekhof.net>
>>>> wrote:
>>>>
>>>>>
>>>>> On 6 May 2014, at 6:05 pm, Greg Murphy
>>>>> <greg.mur...@gamesparks.com> wrote:
>>>>>
>>>>>> Attached are the valgrind outputs from two separate runs of
>>>>>> lrmd with the suggested variables set. Do they help narrow the
>>>>>> issue down?
>>>>>
>>>>> They do somewhat. I'll investigate. But much of the memory is
>>>>> still reachable:
>>>>>
>>>>> ==26203==   indirectly lost: 17,945,950 bytes in 642,546 blocks
>>>>> ==26203==     possibly lost: 2,805 bytes in 60 blocks
>>>>> ==26203==   still reachable: 26,104,781 bytes in 544,782 blocks
>>>>> ==26203==        suppressed: 8,652 bytes in 176 blocks
>>>>> ==26203== Reachable blocks (those to which a pointer was found)
>>>>> are not shown.
>>>>> ==26203== To see them, rerun with: --leak-check=full
>>>>> --show-reachable=yes
>>>>>
>>>>> Could you add --show-reachable=yes to the VALGRIND_OPTS variable?
>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Greg
>>>>>>
>>>>>>
>>>>>> On 02/05/2014 03:01, "Andrew Beekhof" <and...@beekhof.net> wrote:
>>>>>>
>>>>>>>
>>>>>>> On 30 Apr 2014, at 9:01 pm, Greg Murphy
>>>>>>> <greg.mur...@gamesparks.com> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I'm running a two-node Pacemaker cluster on Ubuntu Saucy
>>>>>>>> (13.10), kernel 3.11.0-17-generic and the Ubuntu Pacemaker
>>>>>>>> package, version 1.1.10+git20130802-1ubuntu1.
>>>>>>>
>>>>>>> The problem is that I have no way of knowing what code
>>>>>>> is/isn't included in '1.1.10+git20130802-1ubuntu1'.
>>>>>>> You could try setting the following in your environment before
>>>>>>> starting pacemaker though:
>>>>>>>
>>>>>>> # Variables for running child daemons under valgrind and/or
>>>>>>> # checking for memory problems
>>>>>>> G_SLICE=always-malloc
>>>>>>> MALLOC_PERTURB_=221 # or 0
>>>>>>> MALLOC_CHECK_=3 # or 0,1,2
>>>>>>> PCMK_valgrind_enabled=lrmd
>>>>>>> VALGRIND_OPTS="--leak-check=full --trace-children=no
>>>>>>> --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p
>>>>>>> --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions
>>>>>>> --gen-suppressions=all"
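Combining this listing with the --show-reachable=yes flag Andrew asks
for earlier in the thread gives, for example (wrapped here for
readability; it should be a single line or use shell continuations):

    VALGRIND_OPTS="--leak-check=full --show-reachable=yes
     --trace-children=no --num-callers=25
     --log-file=/var/lib/pacemaker/valgrind-%p
     --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions
     --gen-suppressions=all"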
>>>>>>>> The cluster is configured with a DRBD master/slave set and
>>>>>>>> then a failover resource group containing MySQL (along with
>>>>>>>> its DRBD filesystem) and a Zabbix Proxy and Agent.
>>>>>>>>
>>>>>>>> Since I built the cluster around two months ago I've noticed
>>>>>>>> that on the active node the memory footprint of lrmd
>>>>>>>> gradually grows to quite a significant size. The cluster was
>>>>>>>> last restarted three weeks ago, and now lrmd has over 1GB of
>>>>>>>> mapped memory on the active node and only 151MB on the
>>>>>>>> passive node. Current excerpts from /proc/PID/status are:
>>>>>>>>
>>>>>>>> Active node
>>>>>>>> VmPeak:  1146740 kB
>>>>>>>> VmSize:  1146740 kB
>>>>>>>> VmLck:         0 kB
>>>>>>>> VmPin:         0 kB
>>>>>>>> VmHWM:    267680 kB
>>>>>>>> VmRSS:    188764 kB
>>>>>>>> VmData:  1065860 kB
>>>>>>>> VmStk:       136 kB
>>>>>>>> VmExe:        32 kB
>>>>>>>> VmLib:     10416 kB
>>>>>>>> VmPTE:      2164 kB
>>>>>>>> VmSwap:   822752 kB
>>>>>>>>
>>>>>>>> Passive node
>>>>>>>> VmPeak:   220832 kB
>>>>>>>> VmSize:   155428 kB
>>>>>>>> VmLck:         0 kB
>>>>>>>> VmPin:         0 kB
>>>>>>>> VmHWM:      4568 kB
>>>>>>>> VmRSS:      3880 kB
>>>>>>>> VmData:    74548 kB
>>>>>>>> VmStk:       136 kB
>>>>>>>> VmExe:        32 kB
>>>>>>>> VmLib:     10416 kB
>>>>>>>> VmPTE:       172 kB
>>>>>>>> VmSwap:        0 kB
>>>>>>>>
>>>>>>>> During the last week or so I've taken a couple of snapshots
>>>>>>>> of /proc/PID/smaps on the active node, and the heap
>>>>>>>> particularly stands out as growing (I have the full outputs
>>>>>>>> captured if they'll help):
>>>>>>>>
>>>>>>>> 20140422
>>>>>>>> 7f92e1578000-7f92f218b000 rw-p 00000000 00:00 0   [heap]
>>>>>>>> Size:           274508 kB
>>>>>>>> Rss:            180152 kB
>>>>>>>> Pss:            180152 kB
>>>>>>>> Shared_Clean:        0 kB
>>>>>>>> Shared_Dirty:        0 kB
>>>>>>>> Private_Clean:       0 kB
>>>>>>>> Private_Dirty:  180152 kB
>>>>>>>> Referenced:     120472 kB
>>>>>>>> Anonymous:      180152 kB
>>>>>>>> AnonHugePages:       0 kB
>>>>>>>> Swap:            91568 kB
>>>>>>>> KernelPageSize:      4 kB
>>>>>>>> MMUPageSize:         4 kB
>>>>>>>> Locked:              0 kB
>>>>>>>> VmFlags: rd wr mr mw me ac
>>>>>>>>
>>>>>>>> 20140423
>>>>>>>> 7f92e1578000-7f92f305e000 rw-p 00000000 00:00 0   [heap]
>>>>>>>> Size:           289688 kB
>>>>>>>> Rss:            184136 kB
>>>>>>>> Pss:            184136 kB
>>>>>>>> Shared_Clean:        0 kB
>>>>>>>> Shared_Dirty:        0 kB
>>>>>>>> Private_Clean:       0 kB
>>>>>>>> Private_Dirty:  184136 kB
>>>>>>>> Referenced:      69748 kB
>>>>>>>> Anonymous:      184136 kB
>>>>>>>> AnonHugePages:       0 kB
>>>>>>>> Swap:           103112 kB
>>>>>>>> KernelPageSize:      4 kB
>>>>>>>> MMUPageSize:         4 kB
>>>>>>>> Locked:              0 kB
>>>>>>>> VmFlags: rd wr mr mw me ac
>>>>>>>>
>>>>>>>> 20140430
>>>>>>>> 7f92e1578000-7f92fc01d000 rw-p 00000000 00:00 0   [heap]
>>>>>>>> Size:           436884 kB
>>>>>>>> Rss:            140812 kB
>>>>>>>> Pss:            140812 kB
>>>>>>>> Shared_Clean:        0 kB
>>>>>>>> Shared_Dirty:        0 kB
>>>>>>>> Private_Clean:     744 kB
>>>>>>>> Private_Dirty:  140068 kB
>>>>>>>> Referenced:      43600 kB
>>>>>>>> Anonymous:      140812 kB
>>>>>>>> AnonHugePages:       0 kB
>>>>>>>> Swap:           287392 kB
>>>>>>>> KernelPageSize:      4 kB
>>>>>>>> MMUPageSize:         4 kB
>>>>>>>> Locked:              0 kB
>>>>>>>> VmFlags: rd wr mr mw me ac
>>>>>>>>
>>>>>>>> I noticed in the release notes for 1.1.10-rc1
>>>>>>>> (https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.10-rc1)
>>>>>>>> that there was work done to fix "crmd: lrmd: stonithd: fixed
>>>>>>>> memory leaks", but I'm not sure which particular bug this was
>>>>>>>> related to. (And those fixes should be in the version I'm
>>>>>>>> running anyway.)
>>>>>>>>
>>>>>>>> I've also spotted a few memory leak fixes in
>>>>>>>> https://github.com/beekhof/pacemaker, but I'm not sure
>>>>>>>> whether they relate to my issue (assuming I have a memory
>>>>>>>> leak and this isn't expected behaviour).
>>>>>>>>
>>>>>>>> Is there additional debugging that I can perform to check
>>>>>>>> whether I have a leak, or is there enough evidence to justify
>>>>>>>> upgrading to 1.1.11?
>>>>>>>>
>>>>>>>> Thanks in advance
>>>>>>>>
>>>>>>>> Greg Murphy
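Snapshots like the ones above can also be captured without digging
through /proc by hand. Below is a small illustrative helper (not part
of Pacemaker) that prints the Size, Rss and Swap fields of the [heap]
mapping from /proc/<pid>/smaps, the same fields tracked in the
excerpts. Run periodically against lrmd's PID, it gives the same
growth trail:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char path[64], line[256];
        int in_heap = 0;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
        FILE *fp = fopen(path, "r");
        if (fp == NULL) {
            perror(path);
            return 1;
        }
        while (fgets(line, sizeof(line), fp) != NULL) {
            /* Mapping headers start with a hex address range, e.g.
             * "7f92e1578000-7f92f218b000 rw-p ... [heap]"; the
             * key/value lines that follow (Size:, Rss:, ...) do not
             * contain a '-'. Track whether we are inside [heap]. */
            if (isxdigit((unsigned char) line[0]) && strchr(line, '-')) {
                in_heap = (strstr(line, "[heap]") != NULL);
            } else if (in_heap && (strncmp(line, "Size:", 5) == 0
                                   || strncmp(line, "Rss:", 4) == 0
                                   || strncmp(line, "Swap:", 5) == 0)) {
                fputs(line, stdout);
            }
        }
        fclose(fp);
        return 0;
    }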
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org