----- Original Message -----
> From: "Kazunori INOUE" <kazunori.ino...@gmail.com>
> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> Sent: Friday, January 10, 2014 5:23:04 AM
> Subject: Re: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1
>
> 2014/1/9 Andrew Beekhof <and...@beekhof.net>:
> >
> > On 8 Jan 2014, at 9:15 pm, Kazunori INOUE <kazunori.ino...@gmail.com>
> > wrote:
> >
> >> 2014/1/8 Andrew Beekhof <and...@beekhof.net>:
> >>>
> >>> On 18 Dec 2013, at 9:50 pm, Kazunori INOUE <kazunori.ino...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi David,
> >>>>
> >>>> 2013/12/18 David Vossel <dvos...@redhat.com>:
> >>>>>
> >>>>> That's a really weird one... I don't see how it is possible for op->id
> >>>>> to be NULL there. You might need to give valgrind a shot to detect
> >>>>> whatever is really going on here.
> >>>>>
> >>>>> -- Vossel
> >>>>>
> >>>> Thank you for the advice. I will try it.
> >>>
> >>> Any update on this?
> >>>
> >>
> >> We are still investigating the cause. It was not reproduced when I ran
> >> it under valgrind.
> >> And it was reproduced in RC3.
> >
> > So it happened with RC3 - valgrind, but not RC3 + valgrind?
> > That's concerning.
> >
> > Nothing in the valgrind output?
> >
>
> The cause was found.
>
> 230 gboolean
> 231 operation_finalize(svc_action_t * op)
> 232 {
> 233     int recurring = 0;
> 234
> 235     if (op->interval) {
> 236         if (op->cancel) {
> 237             op->status = PCMK_LRM_OP_CANCELLED;
> 238             cancel_recurring_action(op);
> 239         } else {
> 240             recurring = 1;
> 241             op->opaque->repeat_timer = g_timeout_add(op->interval,
> 242                                                      recurring_action_timer, (void *)op);
> 243         }
> 244     }
> 245
> 246     if (op->opaque->callback) {
> 247         op->opaque->callback(op);
> 248     }
> 249
> 250     op->pid = 0;
> 251
> 252     if (!recurring) {
> 253         /*
> 254          * If this is a recurring action, do not free explicitly.
> 255          * It will get freed whenever the action gets cancelled.
> 256          */
> 257         services_action_free(op);
> 258         return TRUE;
> 259     }
> 260     return FALSE;
> 261 }
>
> When op->id is not 0, the cancel_recurring_action function (l.238) does
> not remove op from the hash table.
> However, op is freed in the services_action_free function (l.257). That
> is, the freed data remains in the hash table.
> A later g_hash_table_lookup call may then look up the freed data.
>
> Therefore, I added a change so that whenever g_hash_table_replace has
> been called (in the services_action_async function), g_hash_table_remove
> is certain to be called as well.
> As of now, the segfault has not happened again.
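
To make the failure mode described above concrete, here is a minimal sketch of the general rule "unregister an object from every lookup table before freeing it". It is not the patch from this thread. All names in it (action_t, recurring_replace, action_free, run_scenario) are hypothetical, and a small fixed-size array stands in for the GLib hash table so that the sketch compiles without GLib.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for svc_action_t. */
typedef struct action {
    char *id;      /* key under which the action is registered */
    int interval;  /* non-zero means the action is recurring */
} action_t;

/* Hypothetical stand-in for the table of recurring actions. */
#define MAX_ACTIONS 8
static action_t *recurring[MAX_ACTIONS];

/* Return the registered action with this id, or NULL if there is none. */
static action_t *recurring_lookup(const char *id)
{
    for (int i = 0; i < MAX_ACTIONS; i++) {
        if (recurring[i] != NULL && strcmp(recurring[i]->id, id) == 0) {
            return recurring[i];
        }
    }
    return NULL;
}

/* Drop the entry for this id. Returns 1 if an entry was removed, else 0. */
static int recurring_remove(const char *id)
{
    for (int i = 0; i < MAX_ACTIONS; i++) {
        if (recurring[i] != NULL && strcmp(recurring[i]->id, id) == 0) {
            recurring[i] = NULL;
            return 1;
        }
    }
    return 0;
}

/* Register op, replacing any entry with the same id. Returns -1 if full. */
static int recurring_replace(action_t *op)
{
    recurring_remove(op->id);
    for (int i = 0; i < MAX_ACTIONS; i++) {
        if (recurring[i] == NULL) {
            recurring[i] = op;
            return 0;
        }
    }
    return -1;
}

/* Allocate an action with its own copy of id. Returns NULL on failure. */
static action_t *action_new(const char *id, int interval)
{
    action_t *op = calloc(1, sizeof(*op));
    if (op == NULL) {
        return NULL;
    }
    op->id = malloc(strlen(id) + 1);
    if (op->id == NULL) {
        free(op);
        return NULL;
    }
    strcpy(op->id, id);
    op->interval = interval;
    return op;
}

/* Free op, first making sure the table no longer points at it. */
static void action_free(action_t *op)
{
    if (op == NULL) {
        return;
    }
    if (recurring_lookup(op->id) == op) {
        recurring_remove(op->id);  /* without this, the table keeps a dangling pointer */
    }
    free(op->id);
    free(op);
}

/* Register a recurring action, free it, then look it up again.
 * Returns 0 if the lookup correctly finds nothing. */
int run_scenario(void)
{
    action_t *op = action_new("rsc_monitor_10000", 10000);
    if (op == NULL || recurring_replace(op) != 0) {
        action_free(op);
        return -1;
    }
    action_free(op);
    return recurring_lookup("rsc_monitor_10000") == NULL ? 0 : 1;
}
```

Calling run_scenario() from a small main() exercises the register, free, lookup sequence. If the recurring_remove call in action_free is deleted, the later lookup dereferences freed memory, which is the kind of bug a tool such as valgrind is meant to flag.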

Awesome, thanks for tracking this down. I created a modified version of
your patch and put it up for review as a Pacemaker pull request.

https://github.com/ClusterLabs/pacemaker/pull/408

-- Vossel