2014-03-19 0:55 GMT+09:00 David Vossel <dvos...@redhat.com>:
> ----- Original Message -----
>> From: "Kazunori INOUE" <kazunori.ino...@gmail.com>
>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>> Sent: Tuesday, March 18, 2014 12:30:01 AM
>> Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
>>
>> 2014-03-18 8:03 GMT+09:00 David Vossel <dvos...@redhat.com>:
>> >
>> > ----- Original Message -----
>> >> From: "Kazunori INOUE" <kazunori.ino...@gmail.com>
>> >> To: "The Pacemaker cluster resource manager"
>> >> <pacemaker@oss.clusterlabs.org>
>> >> Sent: Monday, March 17, 2014 4:51:11 AM
>> >> Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
>> >>
>> >> 2014-03-17 16:37 GMT+09:00 Kazunori INOUE <kazunori.ino...@gmail.com>:
>> >> > 2014-03-15 4:08 GMT+09:00 David Vossel <dvos...@redhat.com>:
>> >> >>
>> >> >>
>> >> >> ----- Original Message -----
>> >> >>> From: "Kazunori INOUE" <kazunori.ino...@gmail.com>
>> >> >>> To: "pm" <pacemaker@oss.clusterlabs.org>
>> >> >>> Sent: Friday, March 14, 2014 5:52:38 AM
>> >> >>> Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11
>> >> >>>
>> >> >>> Hi,
>> >> >>>
>> >> >>> When specifying the node name in UPPER case and performing
>> >> >>> crm_resource, crmd was aborted.
>> >> >>> (The real node name is a LOWER case.)
>> >> >>
>> >> >> https://github.com/ClusterLabs/pacemaker/pull/462
>> >> >>
>> >> >> does that fix it?
>> >> >>
>> >> >
>> >> > Since behavior of glib is strange somehow, the result is NO.
>> >> > I tested this branch.
>> >> > https://github.com/davidvossel/pacemaker/tree/lrm-segfault
>> >> > * Red Hat Enterprise Linux Server release 6.4 (Santiago)
>> >> > * glib2-2.22.5-7.el6.x86_64
>> >> >
>> >> > strcase_equal() is not called from g_hash_table_lookup().
>> >> >
>> >> > [x3650h ~]$ gdb /usr/libexec/pacemaker/crmd 17409
>> >> > ...snip...
>> >> > (gdb) b lrm.c:1232
>> >> > Breakpoint 1 at 0x4251d0: file lrm.c, line 1232.
>> >> > (gdb) b strcase_equal
>> >> > Breakpoint 2 at 0x429828: file lrm_state.c, line 95.
>> >> > (gdb) c
>> >> > Continuing.
>> >> >
>> >> > Breakpoint 1, do_lrm_invoke (action=288230376151711744,
>> >> > cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER,
>> >> > msg_data=0x7fff8d679540) at lrm.c:1232
>> >> > 1232 lrm_state = lrm_state_find(target_node);
>> >> > (gdb) s
>> >> > lrm_state_find (node_name=0x1d4c650 "X3650H") at lrm_state.c:267
>> >> > 267 {
>> >> > (gdb) n
>> >> > 268 if (!node_name) {
>> >> > (gdb) n
>> >> > 271 return g_hash_table_lookup(lrm_state_table, node_name);
>> >> > (gdb) p g_hash_table_size(lrm_state_table)
>> >> > $1 = 1
>> >> > (gdb) p (char*)((GList*)g_hash_table_get_keys(lrm_state_table))->data
>> >> > $2 = 0x1c791a0 "x3650h"
>> >> > (gdb) p node_name
>> >> > $3 = 0x1d4c650 "X3650H"
>> >> > (gdb) n
>> >> > 272 }
>> >> > (gdb) n
>> >> > do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE,
>> >> > cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540)
>> >> > at lrm.c:1234
>> >> > 1234 if (lrm_state == NULL && is_remote_node) {
>> >> > (gdb) n
>> >> > 1240 CRM_ASSERT(lrm_state != NULL);
>> >> > (gdb) n
>> >> >
>> >> > Program received signal SIGABRT, Aborted.
>> >> > 0x0000003787e328a5 in raise () from /lib64/libc.so.6
>> >> > (gdb)
>> >> >
>> >> >
>> >> > I wonder why... so I will continue investigation.
>> >> >
>> >> >
>> >>
>> >> I read the code of g_hash_table_lookup().
>> >> Key is compared by the hash value generated by crm_str_hash before
>> >> strcase_equal() is performed.
>> >
>> > good catch. I've updated the patch in this pull request. Can you give it a
>> > go?
>> >
>> > https://github.com/ClusterLabs/pacemaker/pull/462
>> >
>> fail-count is not cleared only in this.
>>
>> $ crm_resource -C -r p1 -N X3650H
>> Cleaning up p1 on X3650H
>> Waiting for 1 replies from the CRMd. OK
>>
>> $ grep fail-count /var/log/ha-log
>> Mar 18 13:53:36 x3650g attrd[3610]: debug: attrd_client_message:
>> Broadcasting fail-count-p1[X3650H] = (null)
>> $
>>
>> $ crm_mon -rf1
>> Last updated: Tue Mar 18 13:54:51 2014
>> Last change: Tue Mar 18 13:53:36 2014 by hacluster via crmd on x3650h
>> Stack: corosync
>> Current DC: x3650h (3232261384) - partition with quorum
>> Version: 1.1.10-83553fa
>> 2 Nodes configured
>> 1 Resources configured
>>
>>
>> Online: [ x3650g x3650h ]
>>
>> Full list of resources:
>>
>> p1 (ocf::pacemaker:Dummy): Stopped
>>
>> Migration summary:
>> * Node x3650h:
>> p1: migration-threshold=1 fail-count=1 last-failure='Tue Mar 18
>> 13:53:19 2014'
>> * Node x3650g:
>> $
>>
>>
>> So this change also seems to be necessary.
>
> yep, added your patch to the pull request
> https://github.com/davidvossel/pacemaker/commit/c118ac5b5244890c19e4c7b2f5a39208d362b61d
>
> I found another one in stonith that I fixed.
>
> https://github.com/ClusterLabs/pacemaker/pull/462
>
> Are we good for merging this now?
>
> -- Vossel
>
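For reference, the hash-before-equality behavior discussed above can be sketched in plain C. This is only an illustration under assumptions: it does not use glib, and hash_case_sensitive(), hash_case_insensitive(), equal_case_insensitive() and lookup_matches() are hypothetical stand-ins, not crm_str_hash(), strcase_equal() or g_hash_table_lookup(). It also assumes a POSIX system for strcasecmp().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical stand-ins for illustration only; these are NOT the
 * Pacemaker or glib implementations. */

/* Case-sensitive string hash (djb2 style). */
unsigned int hash_case_sensitive(const char *s)
{
    unsigned int h = 5381;
    for (; *s != '\0'; s++) {
        h = h * 33 + (unsigned char)*s;
    }
    return h;
}

/* Same hash, but every byte is lower-cased before it is mixed in. */
unsigned int hash_case_insensitive(const char *s)
{
    unsigned int h = 5381;
    for (; *s != '\0'; s++) {
        h = h * 33 + (unsigned int)tolower((unsigned char)*s);
    }
    return h;
}

/* Case-insensitive key comparison: 1 if equal, 0 otherwise. */
int equal_case_insensitive(const char *a, const char *b)
{
    return strcasecmp(a, b) == 0;
}

typedef unsigned int (*hash_fn)(const char *);

/* One-entry "table": the hash values are compared first, and the
 * equality function is only reached when they match. This is the
 * ordering described in the thread. */
int lookup_matches(hash_fn hash, const char *stored_key, const char *wanted_key)
{
    assert(stored_key != NULL && wanted_key != NULL);
    if (hash(stored_key) != hash(wanted_key)) {
        return 0;   /* equality function is never called */
    }
    return equal_case_insensitive(stored_key, wanted_key);
}
```

With the stored key "x3650h" and the requested key "X3650H", lookup_matches() returns 0 when given hash_case_sensitive and 1 when given hash_case_insensitive, so a case-insensitive equality function on its own is not enough.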
I think you can merge it, since no defects have been found so far.

P.S.
I will now start testing the major commands that accept a node name.
This will take a week or more.
* crm_standby
* crm_resource
* crm_failcount
* and everything else.

If I find a defect, I will report it.

Thanks.

> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org