07.09.2012 03:03, Andrew Beekhof wrote: ...
>>> cib shows tons of reachable memory in finished child processes. Not
>>> important, but the log is huge, so it is very hard to find real errors.
>>
>> I found two minor (very slow) leaks in cib too:
>>
>> ==1732== 80 bytes in 20 blocks are still reachable in loss record 42 of 66
>> ==1732== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
>> ==1732== by 0x410549: cib_get_operation_id (common.c:225)
>
> This one is (bounded) initialisation code. We should add it to the
> suppressions file.
>
>> ==1732== by 0x40B370: cib_process_request (callbacks.c:679)
>> ==1732== by 0x40D066: cib_common_callback_worker (callbacks.c:265)
>> ==1732== by 0x40D25E: cib_common_callback (callbacks.c:325)
>> ==1732== by 0x7995813: ??? (in /usr/lib64/libqb.so.0.14.1)
>> ==1732== by 0x7995B53: qb_ipcs_dispatch_connection_request (in /usr/lib64/libqb.so.0.14.1)
>> ==1732== by 0x526FA44: gio_read_socket (mainloop.c:353)
>> ==1732== by 0x74D1F0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
>> ==1732== by 0x74D5937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
>> ==1732== by 0x74D5D54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
>> ==1732== by 0x40DB26: cib_init (main.c:551)
>> ==1732== by 0x40E18C: main (main.c:250)
>> ==1732==
>>
>> ==1732== 99 bytes in 3 blocks are still reachable in loss record 43 of 66
>> ==1732== at 0x4C25A28: calloc (vg_replace_malloc.c:467)
>> ==1732== by 0x525AF9E: crm_itoa (utils.c:325)
>> ==1732== by 0x4E3267C: get_node_uuid (cluster.c:166)
>> ==1732== by 0x4E3318F: crm_get_peer (membership.c:229)
>
> This is the uuid cache.
> It will only grow if you add more nodes.

OK, got it.

>
> Probably the cib needs to call empty_uuid_cache() before exit.
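
For the cib_get_operation_id record, a suppression entry could look roughly like the sketch below. The entry name is made up, and I am assuming that the two innermost frames (malloc called from cib_get_operation_id) are enough to identify the record; more fun: lines can be appended to narrow the match.

```
{
   cib-get-operation-id-init
   Memcheck:Leak
   fun:malloc
   fun:cib_get_operation_id
}
```

The file containing it is passed to valgrind with --suppressions=<file>.
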
>
>> ==1732== by 0x4E38524: pcmk_cpg_deliver (corosync.c:538)
>> ==1732== by 0x58A7EAE: cpg_dispatch (cpg.c:412)
>> ==1732== by 0x4E35981: pcmk_cpg_dispatch (corosync.c:577)
>> ==1732== by 0x5270966: mainloop_gio_callback (mainloop.c:533)
>> ==1732== by 0x74D1F0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
>> ==1732== by 0x74D5937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
>> ==1732== by 0x74D5D54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
>> ==1732== by 0x40DB26: cib_init (main.c:551)
>> ==1732== by 0x40E18C: main (main.c:250)
>>
>>>
>>> crmd has some errors; the most annoying are the following two (they repeat
>>> constantly, and I see the first one for other processes too):
>>>
>>> ==1737== Syscall param socketcall.sendto(msg) points to uninitialised byte(s)
>>> ==1737== at 0x7AE6002: send (in /lib64/libc-2.12.so)
>>> ==1737== by 0x869FE7F: qb_ipc_us_send (in /usr/lib64/libqb.so.0.13.0)
>>> ==1737== by 0x86A0FE5: qb_ipcc_us_setup_connect (in /usr/lib64/libqb.so.0.13.0)
>
> Not really a big issue (although I agree it looks like it should be).
> If anything libqb needs to memset a stack variable to 0 before it sets a
> subset of the available fields.
>
> Probably we should just suppress it.

I reported it to the corosync ML; hopefully Angus will see it there.
Ah, I just found that there is a bugzilla on the libqb github page.
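
To illustrate the memset suggestion, here is a minimal sketch in C. It is not libqb's actual code: struct demo_request and demo_request_init() are hypothetical names. The point is only that zeroing the whole struct first leaves no undefined bytes (unset fields, padding) for valgrind to flag when the buffer is later handed to send().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request header; NOT libqb's real structure. */
struct demo_request {
    int32_t id;
    int32_t size;
    char    name[16];   /* usually only partly filled by the caller */
    int32_t flags;      /* never set on this code path */
};

/* Fill a request so that every byte is defined before it is sent. */
void demo_request_init(struct demo_request *req, int32_t id, const char *name)
{
    assert(req != NULL && name != NULL);

    /* Zero everything first: this defines padding and unused fields too. */
    memset(req, 0, sizeof(*req));

    req->id = id;
    req->size = (int32_t) sizeof(*req);

    /* Copy at most 15 characters; the terminator is already zero. */
    strncpy(req->name, name, sizeof(req->name) - 1);
}
```

A caller would declare the struct on the stack, call demo_request_init(&req, 7, "crmd"), and only then pass &req and sizeof(req) to send().
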
>
>>> ==1737== by 0x869DD33: qb_ipcc_connect (in /usr/lib64/libqb.so.0.13.0)
>>> ==1737== by 0x5885459: crm_ipc_connect (ipc.c:341)
>>> ==1737== by 0x589DC27: mainloop_add_ipc_client (mainloop.c:613)
>>> ==1737== by 0x545C842: cib_native_signon_raw (cib_native.c:222)
>>> ==1737== by 0x41802D: do_cib_control (cib.c:167)
>>> ==1737== by 0x407726: s_crmd_fsa_actions (fsa.c:315)
>>> ==1737== by 0x408D85: s_crmd_fsa (fsa.c:250)
>>> ==1737== by 0x4109CD: crm_fsa_trigger (callbacks.c:251)
>>> ==1737== by 0x589C3C2: crm_trigger_dispatch (mainloop.c:105)
>>> ==1737== by 0x81DAF0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1737== by 0x81DE937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1737== by 0x81DED54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1737== by 0x404D9D: crmd_init (main.c:139)
>>> ==1737== by 0x7A1BCDC: (below main) (in /lib64/libc-2.12.so)
>>> ==1737== Address 0x7feffd514 is on thread 1's stack
>>> ==1737==
>>> ==1737== Invalid read of size 8
>>> ==1737== at 0x7B2FD44: __strspn_sse42 (in /lib64/libc-2.12.so)
>>> ==1737== by 0x588AA53: crm_get_msec (utils.c:640)
>
> This I don't understand.

This one is a valgrind issue on EL6:
http://old.nabble.com/Safe-to-suppress-%22Invalid-read-of-size-8%22-in-strspn--td32833092.html

>
>>> ==1737== by 0x588AD28: check_time (utils.c:85)
>>> ==1737== by 0x588C8E8: cluster_option (utils.c:215)
>>> ==1737== by 0x588CB8D: verify_all_options (utils.c:287)
>>> ==1737== by 0x40B9AA: config_query_callback (control.c:779)
>>> ==1737== by 0x5457393: cib_native_callback (cib_utils.c:631)
>>> ==1737== by 0x545CD86: cib_native_dispatch_internal (cib_native.c:120)
>>> ==1737== by 0x589D71F: mainloop_gio_callback (mainloop.c:522)
>>> ==1737== by 0x81DAF0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1737== by 0x81DE937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1737== by 0x81DED54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1737== by 0x404D9D: crmd_init (main.c:139)
>>> ==1737== by 0x7A1BCDC: (below main) (in /lib64/libc-2.12.so)
>>> ==1737== Address 0xc158bd0 is 0 bytes inside a block of size 4 alloc'd
>>> ==1737== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
>>> ==1737== by 0x7A7D871: strdup (in /lib64/libc-2.12.so)
>>> ==1737== by 0x588CA70: cluster_option (utils.c:211)
>>> ==1737== by 0x588CB8D: verify_all_options (utils.c:287)
>>> ==1737== by 0x40B9AA: config_query_callback (control.c:779)
>>> ==1737== by 0x5457393: cib_native_callback (cib_utils.c:631)
>>> ==1737== by 0x545CD86: cib_native_dispatch_internal (cib_native.c:120)
>>> ==1737== by 0x589D71F: mainloop_gio_callback (mainloop.c:522)
>>> ==1737== by 0x81DAF0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1737== by 0x81DE937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1737== by 0x81DED54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1737== by 0x404D9D: crmd_init (main.c:139)
>>> ==1737== by 0x7A1BCDC: (below main) (in /lib64/libc-2.12.so)
>>>
>>> And I suspect some leakage here:
>
> Looks like it.
>
>>>
>>> ==1737== 51,314 (1,968 direct, 49,346 indirect) bytes in 82 blocks are definitely lost in loss record 248 of 249
>>> ==1737== at 0x4C25A28: calloc (vg_replace_malloc.c:467)
>>> ==1737== by 0x5CB77E1: lrmd_key_value_add (lrmd_client.c:100)
>>> ==1737== by 0x41E199: do_lrm_rsc_op (lrm.c:1701)
>>> ==1737== by 0x420779: do_lrm_invoke (lrm.c:1450)
>>> ==1737== by 0x40C423: send_msg_via_ipc (messages.c:939)
>>> ==1737== by 0x40D2DF: relay_message (messages.c:454)
>>> ==1737== by 0x40F827: route_message (messages.c:322)
>>> ==1737== by 0x411A6E: crmd_ha_msg_filter (callbacks.c:96)
>>> ==1737== by 0x40547F: crmd_ais_dispatch (corosync.c:108)
>>> ==1737== by 0x5673569: pcmk_cpg_deliver (corosync.c:551)
>>> ==1737== by 0x65B0EAE: cpg_dispatch (cpg.c:412)
>>> ==1737== by 0x5670981: pcmk_cpg_dispatch (corosync.c:577)
>>> ==1737== by 0x589D966: mainloop_gio_callback (mainloop.c:533)
>>> ==1737== by 0x81DAF0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1737== by 0x81DE937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1737== by 0x81DED54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1737== by 0x404D9D: crmd_init (main.c:139)
>>> ==1737== by 0x7A1BCDC: (below main) (in /lib64/libc-2.12.so)
>>>
>>> lrmd seems not to clean up gio channels properly:
>
> Hmmm. I'll try and get most of these fixed today.
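
A side note on the crmd record above: 1,968 direct bytes in 82 blocks is 24 bytes per block, which would fit a list node holding three pointers on a 64-bit build. The sketch below shows the pattern I suspect is involved, using a hypothetical stand-in list (struct kv_node, kv_add() and kv_free_all() are my own names, not Pacemaker's API): every list built with an add function needs a matching free-all call on each code path, otherwise the nodes and their strings are lost.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a key/value parameter list; not Pacemaker's type. */
struct kv_node {
    char *key;
    char *value;
    struct kv_node *next;
};

/* Copy a string with malloc; returns NULL on allocation failure. */
static char *kv_copy(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (copy != NULL) {
        memcpy(copy, s, len);
    }
    return copy;
}

/* Append a copy of key/value and return the (possibly new) head.
 * On allocation failure the list is returned unchanged. */
struct kv_node *kv_add(struct kv_node *head, const char *key, const char *value)
{
    assert(key != NULL && value != NULL);

    struct kv_node *node = calloc(1, sizeof(*node));
    if (node == NULL) {
        return head;
    }
    node->key = kv_copy(key);
    node->value = kv_copy(value);
    if (node->key == NULL || node->value == NULL) {
        free(node->key);
        free(node->value);
        free(node);
        return head;
    }
    if (head == NULL) {
        return node;
    }
    struct kv_node *tail = head;
    while (tail->next != NULL) {
        tail = tail->next;
    }
    tail->next = node;
    return head;
}

/* Free every node and its strings; returns the number of nodes released. */
size_t kv_free_all(struct kv_node *head)
{
    size_t freed = 0;

    while (head != NULL) {
        struct kv_node *next = head->next;
        free(head->key);
        free(head->value);
        free(head);
        head = next;
        freed++;
    }
    return freed;
}
```

Whether the real leak is a missing free-all call or an ownership mix-up between caller and callee I cannot tell from the trace alone.
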
>
>>>
>>> ==1734== 8,946 (8,520 direct, 426 indirect) bytes in 71 blocks are definitely lost in loss record 147 of 152
>>> ==1734== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
>>> ==1734== by 0x71997D2: g_malloc (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x71C67F4: g_io_channel_unix_new (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x4E52470: mainloop_add_fd (mainloop.c:660)
>>> ==1734== by 0x5067870: services_os_action_execute (services_linux.c:456)
>>> ==1734== by 0x403AA6: lrmd_rsc_dispatch (lrmd.c:696)
>>> ==1734== by 0x4E513C2: crm_trigger_dispatch (mainloop.c:105)
>>> ==1734== by 0x7190F0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x7194937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x7194D54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x402427: main (main.c:302)
>>> ==1734==
>>> ==1734== 8,946 (8,520 direct, 426 indirect) bytes in 71 blocks are definitely lost in loss record 148 of 152
>>> ==1734== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
>>> ==1734== by 0x71997D2: g_malloc (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x71C67F4: g_io_channel_unix_new (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x4E52470: mainloop_add_fd (mainloop.c:660)
>>> ==1734== by 0x50678AE: services_os_action_execute (services_linux.c:465)
>>> ==1734== by 0x403AA6: lrmd_rsc_dispatch (lrmd.c:696)
>>> ==1734== by 0x4E513C2: crm_trigger_dispatch (mainloop.c:105)
>>> ==1734== by 0x7190F0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x7194937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x7194D54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x402427: main (main.c:302)
>>> ==1734==
>>> ==1734== 65,394 (62,280 direct, 3,114 indirect) bytes in 519 blocks are definitely lost in loss record 151 of 152
>>> ==1734== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
>>> ==1734== by 0x71997D2: g_malloc (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x71C67F4: g_io_channel_unix_new (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x4E52470: mainloop_add_fd (mainloop.c:660)
>>> ==1734== by 0x5067870: services_os_action_execute (services_linux.c:456)
>>> ==1734== by 0x50676B4: recurring_action_timer (services_linux.c:212)
>>> ==1734== by 0x719161A: ??? (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x7190F0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x7194937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x7194D54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x402427: main (main.c:302)
>>> ==1734==
>>> ==1734== 65,394 (62,280 direct, 3,114 indirect) bytes in 519 blocks are definitely lost in loss record 152 of 152
>>> ==1734== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
>>> ==1734== by 0x71997D2: g_malloc (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x71C67F4: g_io_channel_unix_new (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x4E52470: mainloop_add_fd (mainloop.c:660)
>>> ==1734== by 0x50678AE: services_os_action_execute (services_linux.c:465)
>>> ==1734== by 0x50676B4: recurring_action_timer (services_linux.c:212)
>>> ==1734== by 0x719161A: ??? (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x7190F0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x7194937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x7194D54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
>>> ==1734== by 0x402427: main (main.c:302)
>>>
>>>
>>>>
>>>> The only one I'm really concerned about is the lrmd.
>>>>
>>>>>
>>>>> Should I do a full cluster restart, or is a rolling one OK?
>>>>
>>>> To have the sysconfig values take effect? Either.
>>>
>>> To correctly find all possible leaks, which depend on the execution path.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> first to rule out glib's funky allocator.
>>>>>>
>>>>>>>
>>>>>>> I can send CIB contents if needed.
>>>>>>>
>>>>>>> Vladislav
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>
>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>> Bugs: http://bugs.clusterlabs.org