On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>
>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>
>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>
>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>
>>>>> Hi, ALL.
>>>>>
>>>>> I'm still trying to work out why, after being fenced, a node hangs in
>>>>> "pending".
>>>>
>>>> Please define "pending". Where did you see this?
>>>
>>> In crm_mon:
>>> ......
>>> Node dev-cluster2-node2 (172793105): pending
>>> ......
>>>
>>> The experiment was like this:
>>> Four nodes in the cluster.
>>> On one of them I kill corosync or pacemakerd (signal 4, 6 or 11).
>>> After that, the remaining nodes keep rebooting it under various
>>> pretexts ("not a cluster member!" and so on).
>>> Then "Too many failures ...." appears in the log.
>>> All this time the node's status in crm_mon is "pending".
>>> Depending on the wind direction, it changes to "UNCLEAN".
>>> Much time has passed and I cannot describe that behaviour precisely any more...
>>>
>>> Here is where I am now:
>>> I tried to locate the problem, and this is what I found.
>>> I set a large value for the stonith-timeout property ("600s").
>>> And got the following behaviour:
>>> 1. pkill -4 corosync
>>> 2. The DC node calls my fence agent "sshbykey".
>>> 3. It reboots the victim and waits until it comes back to life.
>>
>> Hmmm.... what version of pacemaker?
>> This sounds like a timing issue that we fixed a while back
>
> It was version 1.1.11 from December 3.
> I will now do a full update and retest.

That should be recent enough. Can you create a crm_report the next time you reproduce?

>>> Once the script has made sure that the victim has rebooted and is
>>> reachable again via ssh, it exits with 0.
>>> All commands are logged on both the victim and the killer - everything looks right.
>>> 4. A little later, the status of the victim node in crm_mon changes to
>>> online.
>>> 5. BUT... not a single resource starts! Even though "crm_simulate -sL"
>>> shows the correct resource to start:
>>> * Start pingCheck:3 (dev-cluster2-node2)
>>> 6. We spend the next 600 seconds in this state.
>>> When this timeout expires, another node (not the DC) decides to kill
>>> our victim again.
>>> All commands are again logged on both the victim and the killer - all
>>> documented :)
>>> 7. NOW all resources start in the right sequence.
>>>
>>> I am almost happy, but I don't like the two reboots and the 10 minutes of
>>> waiting ;)
>>> And if something happens on another node, this behaviour is superimposed
>>> on the old one, and no resources start until the last node has rebooted
>>> twice.
>>>
>>> I tried to understand this behaviour.
>>> As I understand it:
>>> 1. Ultimately, ./lib/fencing/st_client.c calls
>>> internal_stonith_action_execute().
>>> 2. It forks and creates a pipe to the child.
>>> 3. In the async case it calls mainloop_child_add with
>>> stonith_action_async_done as the callback.
>>> 4. It adds a timeout with g_timeout_add for the TERM and KILL signals.
>>>
>>> If everything goes right, stonith_action_async_done should be called and
>>> the timeout removed. For some reason this does not happen. I sit and
>>> think ....
>>>>> At this time, there are constant re-elections.
>>>>> Also, I noticed a difference in how pacemaker starts up.
>>>>> At a normal startup:
>>>>> * corosync
>>>>> * pacemakerd
>>>>> * attrd
>>>>> * pengine
>>>>> * lrmd
>>>>> * crmd
>>>>> * cib
>>>>>
>>>>> When it hangs at startup:
>>>>> * corosync
>>>>> * pacemakerd
>>>>> * attrd
>>>>> * pengine
>>>>> * crmd
>>>>> * lrmd
>>>>> * cib.
>>>> Are you referring to the order of the daemons here?
>>>> The cib should not be at the bottom in either case.
>>>>
>>>>> Who knows who runs lrmd?
>>>>
>>>> Pacemakerd.
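
For anyone following along, below is a minimal, self-contained GLib sketch of the pattern described in the quoted steps above: spawn a child, watch it asynchronously, and arm a guard timeout that is removed only when the completion callback actually runs. It deliberately uses plain g_child_watch_add()/g_timeout_add() instead of pacemaker's internal mainloop_child_add(), omits the output pipe, and all names in it are illustrative - it is not the actual st_client.c implementation.

/* Sketch only: async child execution with a guard timeout (GLib).
 * Not pacemaker code; names and values are made up for illustration. */
#include <glib.h>
#include <stdio.h>

typedef struct {
    GMainLoop *loop;
    guint      timer_id;   /* guard timeout; removed on completion */
} action_ctx_t;

/* Runs when the child exits - the analogue of stonith_action_async_done() */
static void
child_done(GPid pid, gint status, gpointer user_data)
{
    action_ctx_t *ctx = user_data;

    printf("child %d finished, status %d\n", (int) pid, status);
    if (ctx->timer_id) {
        g_source_remove(ctx->timer_id);   /* cancel the pending timeout */
        ctx->timer_id = 0;
    }
    g_spawn_close_pid(pid);
    g_main_loop_quit(ctx->loop);
}

/* Fires only if child_done() was never called in time */
static gboolean
action_timed_out(gpointer user_data)
{
    action_ctx_t *ctx = user_data;

    fprintf(stderr, "action timed out - this is where TERM/KILL would be sent\n");
    ctx->timer_id = 0;
    g_main_loop_quit(ctx->loop);
    return G_SOURCE_REMOVE;
}

int
main(void)
{
    action_ctx_t ctx = { 0 };
    GPid pid;
    gchar *argv[] = { "/bin/sleep", "2", NULL };   /* stand-in for the fence agent */

    ctx.loop = g_main_loop_new(NULL, FALSE);

    if (!g_spawn_async(NULL, argv, NULL, G_SPAWN_DO_NOT_REAP_CHILD,
                       NULL, NULL, &pid, NULL)) {
        fprintf(stderr, "failed to spawn child\n");
        return 1;
    }

    g_child_watch_add(pid, child_done, &ctx);                        /* async completion */
    ctx.timer_id = g_timeout_add(10 * 1000, action_timed_out, &ctx); /* 10s guard */

    g_main_loop_run(ctx.loop);
    g_main_loop_free(ctx.loop);
    return 0;
}

If child_done() is never invoked (for example because the child watch was never registered or the child was reaped elsewhere), only action_timed_out() ever fires - which would look a lot like the 600-second wait described in the quoted report.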
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org