Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

renayama19661014 Thu, 20 Oct 2011 18:34:59 -0700

Hi Alan,

Thank you for your comment.


We are also trying to reproduce the problem and will send a report.
However, it has not reappeared so far.

Best Regards,
Hideo Yamauchi.

--- On Thu, 2011/10/20, Alan Robertson <al...@unix.sh> wrote:

> Hi,
> 
> I've seen a very similar problem in a recent release.  In fact, I'm in the 
> process of reproducing it so that it can be properly logged and so on.  When 
> I get the right data for the bug report, I'll attach it to the bug.
> 
> FWIW: I'm pretty sure that the signal was properly received by attrd.  I 
> haven't looked at the attrd code, but my guess is that either it didn't issue 
> the correct function call for exiting from mainloop - or that the mainloop 
> code didn't actually exit.  FWIW - it probably doesn't matter at all what the 
> priority for signal handling is - since attrd consumes nearly no CPU.  Too 
> bad it doesn't log receiving the signal or beginning the process of exiting...
> 
> Another random thought - I suppose attrd could be clobbering some memory 
> which mainloop needs to properly process an exit.  Doesn't seem likely - but 
> neither of the above options seem very likely either.
> 
> 
> ----------------------------
> An historical note on an early bug that had similar symptoms (but affected 
> every process - not just attrd).
> 
> First - what caused such a problem (a very long time ago):
>     There is a window between checking for signals and going to sleep in 
> the poll call during which a signal might be ignored for a while.
> 
>     The glib mainloop code has three entry points per event source, which 
> it calls on each iteration of the loop:
>             prepare, check, dispatch.
> 
> There is a poll call which occurs between the prepare and check steps.  If a 
> signal comes in after the prepare call returns, but before the code goes to 
> sleep in the poll system call, it will be ignored until
> the poll system call returns.  It will get caught on the next iteration of 
> the loop.
> 
> The fix was fairly simple - the signal handling code instructs the mainloop 
> infrastructure to call poll with an argument which prevents it from staying 
> asleep longer than a second.
> 
> Then the code processes the signal correctly.
> 
> 
> On 10/17/2011 07:19 PM, renayama19661...@ybb.ne.jp wrote:
> > Hi,
> > 
> > Stopping attrd sometimes fails.
> > 
> > Step 1. Start a cluster with 2 nodes.
> > Step 2. Stop the first node (/etc/init.d/heartbeat stop).
> > Step 3. Stop the second node a little later (/etc/init.d/heartbeat stop).
> > 
> > The attrd catches the TERM signal, but does not stop.
> > 
> > (snip)
> > Oct  5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
> > Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to 12238 is not connected
> > Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel: Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
> > Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> > Oct  5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing /usr/lib64/heartbeat/attrd process group 12237 with signal 15
> > Oct  5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 operations (4123.00us average, 0% utilization) in the last 10min
> > Oct  5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (>  100 ms)
> > Oct  5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (>  100 ms)
> > Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) before being called (GSource: 0xd28010)
> > Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431583547 should have started at 431583444
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (>  1010 ms) before being called (GSource: 0xd27dd0)
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) before being called (GSource: 0xd28010)
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> > Oct  5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working on write child took 1010 ms (>  100 ms)
> > Oct  5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on Heartbeat API channel took 1010 ms (>  100 ms)
> > Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (>  1010 ms) before being called (GSource: 0xd27dd0)
> > Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> > Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) before being called (GSource: 0xd28010)
> > Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> > (snip)
> > 
> > We are trying to reproduce the problem, but it rarely recurs.
> > 
> > The same problem was reported in the following email thread.
> > However, that discussion ended without reaching a conclusion.
> > 
> >   * http://www.gossamer-threads.com/lists/linuxha/pacemaker/62147
> > 
> > The problem occurred with the following combination of packages.
> >   * pacemaker-1.0.11
> >   * resource-agents-3.9.2
> >   * cluster-glue-1.0.7
> >   * heartbeat-3.0.5
> > 
> > I have filed this in Bugzilla.
> >   * http://bugs.clusterlabs.org/show_bug.cgi?id=5004
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: 
> > http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> > 
> 
> 
> --     Alan Robertson<al...@unix.sh>
> 
> "Openness is the foundation and preservative of friendship...  Let me claim 
> from you at all times your undisguised opinions." - William Wilberforce
> 
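
For readers following the historical note quoted above, here is a minimal
Python sketch of that race. It is not the heartbeat or glib code. The names
run_loop and fake_signal_handler and the idle pipe are hypothetical, and the
"signal" is simulated by setting a flag at the exact point of the race
window, which is an assumption made so the example is deterministic.

```python
import os
import select
import time


def run_loop(poll_timeout_ms):
    """Run a prepare/poll/check loop until the simulated signal is noticed.

    Returns the seconds between the flag being set and the loop seeing it.
    """
    flag = {"pending": False, "set_at": None}

    def fake_signal_handler():
        # Stand-in for a real handler, which would only record the signal.
        flag["pending"] = True
        flag["set_at"] = time.monotonic()

    # Idle descriptor: nothing is ever written, so poll only ends by timeout.
    read_fd, write_fd = os.pipe()
    poller = select.poll()
    poller.register(read_fd, select.POLLIN)
    try:
        while True:
            # prepare: is a signal already pending?
            if flag["pending"]:
                break
            # Race window: the signal lands after prepare, before poll.
            if flag["set_at"] is None:
                fake_signal_handler()
            # poll: sleeps until a descriptor is ready or the timeout expires.
            poller.poll(poll_timeout_ms)
            # check: the flag is only noticed once poll has returned.
            if flag["pending"]:
                break
        return time.monotonic() - flag["set_at"]
    finally:
        poller.unregister(read_fd)
        os.close(read_fd)
        os.close(write_fd)
```

Calling run_loop(1000) should return roughly one second. The signal is still
missed inside the window, but the poll timeout bounds how long it can go
unnoticed, which is the idea behind the fix described in the note. With no
timeout and no other activity, the loop would sleep indefinitely.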
