Hi Alan,

Thank you for your comment.
We are also trying to reproduce the problem and will send a report.
However, it has not reappeared so far.

Best Regards,
Hideo Yamauchi.

--- On Thu, 2011/10/20, Alan Robertson <al...@unix.sh> wrote:

> Hi,
>
> I've seen a very similar problem in a recent release. In fact, I'm in the
> process of reproducing it so that it can be properly logged and so on. When
> I get the right data for the bug report, I'll attach it to the bug.
>
> FWIW: I'm pretty sure that the signal was properly received by attrd. I
> haven't looked at the attrd code, but my guess is that either it didn't issue
> the correct function call for exiting from mainloop - or that the mainloop
> code didn't actually exit. FWIW - it probably doesn't matter at all what the
> priority for signal handling is - since attrd consumes nearly no CPU. Too
> bad it doesn't log receiving the signal or beginning the process of exiting...
>
> Another random thought - I suppose attrd could be clobbering some memory
> which mainloop needs to properly process an exit. Doesn't seem likely - but
> neither of the above options seems very likely either.
>
>
> ----------------------------
> An historical note on an early bug that had similar symptoms (but affected
> every process - not just attrd).
>
> First - what caused such a problem (a very long time ago):
> There is a window between checking for signals and going to sleep in the
> poll call during which a signal might be ignored for a while.
>
> The glib mainloop code has three entry points called on each iteration of
> the loop: prepare, check, dispatch.
>
> There is a poll call which occurs between the prepare and check steps. If a
> signal comes in after the prepare call returns, but before the code goes to
> sleep in the poll system call, it will be ignored until the poll system call
> returns. It will get caught on the next iteration of the loop.
>
> The fix was fairly simple - the signal handling code instructs the mainloop
> infrastructure to call poll with an argument which prevents it from staying
> asleep longer than a second.
>
> Then the code processes the signal correctly.
>
>
> On 10/17/2011 07:19 PM, renayama19661...@ybb.ne.jp wrote:
> > Hi,
> >
> > Stopping attrd sometimes fails.
> >
> > Step1. Start a cluster with 2 nodes.
> > Step2. Stop the first node (/etc/init.d/heartbeat stop).
> > Step3. Stop the second node a little later (/etc/init.d/heartbeat stop).
> >
> > attrd catches the TERM signal, but does not stop.
> >
> > (snip)
> > Oct 5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
> > Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to 12238 is not connected
> > Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel: Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
> > Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> > Oct 5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing /usr/lib64/heartbeat/attrd process group 12237 with signal 15
> > Oct 5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 operations (4123.00us average, 0% utilization) in the last 10min
> > Oct 5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> > Oct 5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> > Oct 5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> > Oct 5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431583547 should have started at 431583444
> > Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: WARN:
> > Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd27dd0)
> > Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> > Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> > Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> > Oct 5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working on write child took 1010 ms (> 100 ms)
> > Oct 5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on Heartbeat API channel took 1010 ms (> 100 ms)
> > Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd27dd0)
> > Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> > Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> > Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> > (snip)
> >
> > We are trying to reproduce the problem, but it rarely reappears.
> >
> > The same problem is reported in the following email.
> > However, that discussion ended without a conclusion.
> >
> > * http://www.gossamer-threads.com/lists/linuxha/pacemaker/62147
> >
> > The problem occurred with the following combination:
> > * pacemaker-1.0.11
> > * resource-agents-3.9.2
> > * cluster-glue-1.0.7
> > * heartbeat-3.0.5
> >
> > I have registered this in Bugzilla.
> > * http://bugs.clusterlabs.org/show_bug.cgi?id=5004
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> >
>
>
> --
> Alan Robertson <al...@unix.sh>
>
> "Openness is the foundation and preservative of friendship... Let me claim
> from you at all times your undisguised opinions." - William Wilberforce
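
The window described in the historical note (a signal arriving after prepare
returns but before poll goes to sleep) and the one-second cap on the poll
timeout can be illustrated with a small simulation on a fake clock. This is
only a sketch under assumptions: SignalSource, max_sleep_ms and
delay_until_signal_handled are hypothetical names invented for the example,
and none of this is the actual attrd, cluster-glue or glib code.

```python
import math

INFINITE = math.inf  # stands in for poll's "block until an fd is ready" timeout


class SignalSource:
    """Hypothetical model of a mainloop signal source (not real attrd/glue code)."""

    def __init__(self, max_sleep_ms=INFINITE):
        self.pending = False              # set by the (simulated) signal handler
        self.max_sleep_ms = max_sleep_ms  # longest time poll is allowed to sleep
        self.handled_at_ms = None

    def prepare(self):
        # Report whether a signal is already pending and how long poll may sleep.
        return self.pending, self.max_sleep_ms

    def check(self):
        # Called after poll returns: is there a signal waiting now?
        return self.pending

    def dispatch(self, now_ms):
        # Handle the signal and record when that happened.
        self.pending = False
        self.handled_at_ms = now_ms


def delay_until_signal_handled(source, next_fd_event_ms):
    """Run one loop iteration; the signal lands right after prepare returns."""
    now_ms = 0
    ready, timeout_ms = source.prepare()  # nothing pending yet
    source.pending = True                 # signal arrives inside the window
    if not ready:
        # poll sleeps until an fd event or the timeout, whichever comes first
        now_ms += min(timeout_ms, next_fd_event_ms)
    if source.check():
        source.dispatch(now_ms)
    return source.handled_at_ms


uncapped = delay_until_signal_handled(SignalSource(), next_fd_event_ms=600_000)
capped = delay_until_signal_handled(SignalSource(max_sleep_ms=1000),
                                    next_fd_event_ms=600_000)
print(uncapped, capped)
```

In this made-up scenario the next file descriptor event is ten minutes away.
Without a cap, the signal is only noticed when that event wakes poll up; with
the cap, it is noticed after at most one second.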