On 25 Feb 2014, at 8:30 pm, Andrey Groshev <gre...@yandex.ru> wrote:
> 21.02.2014, 12:04, "Andrey Groshev" <gre...@yandex.ru>:
>> 21.02.2014, 05:53, "Andrew Beekhof" <and...@beekhof.net>:
>>
>>> On 19 Feb 2014, at 7:53 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>> 19.02.2014, 09:49, "Andrew Beekhof" <and...@beekhof.net>:
>>>>> On 19 Feb 2014, at 4:18 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>> 19.02.2014, 09:08, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>> On 19 Feb 2014, at 4:00 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>> 19.02.2014, 06:48, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>> On 18 Feb 2014, at 11:05 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>> Hi, ALL and Andrew!
>>>>>>>>>>
>>>>>>>>>> Today was a good day - I killed a lot, and a lot was shot back at me.
>>>>>>>>>> In general I am happy (almost like an elephant) :)
>>>>>>>>>> Apart from the resources, eight processes on the node matter to me:
>>>>>>>>>> corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd.
>>>>>>>>>> I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>>>> The behaviour does not depend on the signal number - that's good.
>>>>>>>>>> If STONITH sends a reboot to the node, it reboots and rejoins the
>>>>>>>>>> cluster - also good.
>>>>>>>>>> But the behaviour differs depending on which daemon is killed.
>>>>>>>>>>
>>>>>>>>>> They fall into four groups:
>>>>>>>>>> 1. corosync, cib - STONITH works 100%.
>>>>>>>>>>    Killing with any signal triggers STONITH and a reboot.
>>>>>>>>>>
>>>>>>>>>> 2. lrmd, crmd - strange STONITH behaviour.
>>>>>>>>>>    Sometimes STONITH is called, with the corresponding reaction.
>>>>>>>>>>    Sometimes the daemon restarts and the resources restart, with a
>>>>>>>>>>    large delay for MS:pgsql.
>>>>>>>>>>    Once, after crmd restarted, pgsql did not restart at all.
>>>>>>>>>>
>>>>>>>>>> 3. stonithd, attrd, pengine - no STONITH needed.
>>>>>>>>>>    These daemons simply restart; the resources stay running.
>>>>>>>>>>
>>>>>>>>>> 4. pacemakerd - nothing happens.
>>>>>>>>>>    After that I can kill any process of the third group and they do
>>>>>>>>>>    not restart. Generally it does not touch corosync, cib, and maybe
>>>>>>>>>>    lrmd, crmd.
>>>>>>>>>>
>>>>>>>>>> What do you think about this?
>>>>>>>>>> The main question of this topic we have resolved, but this varied
>>>>>>>>>> behaviour is another big problem.
>>>>>>>>>>
>>>>>>>>>> Forgot the logs: http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>>>>>>> Which of the various conditions above do the logs cover?
>>>>>>>> All of the variants, over the course of the day.
>>>>>>> Are you trying to torture me?
>>>>>>> Can you give me a rough idea what happened when?
>>>>>> No - that is 8 processes times 4 signals, plus repeated experiments with
>>>>>> unknown outcomes :)
>>>>>> It is easier to run new experiments and collect separate new logs.
>>>>>> Which variant is more interesting?
>>>>> The long delay in restarting pgsql.
>>>>> Everything else seems correct.
>>>> It didn't even try to start pgsql.
>>>> The logs contain three tests (kill -s4 <lrmd pid>):
>>>> 1. STONITH
>>>> 2. STONITH
>>>> 3. hangs
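(For reference, the kill-signal sweep described in the quoted report above can be scripted roughly as below. This is only a sketch: the daemon names and signal numbers come from the report, while the use of pgrep and the pause between iterations are assumptions.)

    # Sketch of the test described above: send each signal to each
    # Pacemaker/Corosync daemon in turn and watch the cluster's reaction.
    DAEMONS="corosync pacemakerd cib stonithd lrmd attrd pengine crmd"
    SIGNALS="4 6 11 9"

    for sig in $SIGNALS; do
        for daemon in $DAEMONS; do
            pid=$(pgrep -x "$daemon" | head -n1)    # first matching PID, if any
            [ -n "$pid" ] && kill -s "$sig" "$pid"
            sleep 300    # assumed pause: give STONITH or a respawn time to happen
        done
    done

(Each iteration should end in either a STONITH reboot or a local respawn, which is what the four behaviour groups above are describing.)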
>>> It's waiting on a value for default_ping_set.
>>>
>>> It seems we're calling monitor for pingCheck but for some reason it's not
>>> performing an update:
>>>
>>> # grep 2632.*lrmd.*pingCheck /Users/beekhof/Downloads/pcmk-Wed-19-Feb-2014/dev-cluster2-node2.unix.tensor.ru/corosync.log
>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_get_rsc_info: Resource 'pingCheck' not found (3 active resources)
>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_get_rsc_info: Resource 'pingCheck:3' not found (3 active resources)
>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_rsc_register: Added 'pingCheck' to the rsc list (4 active resources)
>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_execute: executing - rsc:pingCheck action:monitor call_id:19
>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658 - exited with rc=0
>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658:stderr [ -- empty -- ]
>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658:stdout [ -- empty -- ]
>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_finished: finished - rsc:pingCheck action:monitor call_id:19 pid:2658 exit-code:0 exec-time:2039ms queue-time:0ms
>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_execute: executing - rsc:pingCheck action:monitor call_id:20
>>> Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_10000:2816 - exited with rc=0
>>> Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_10000:2816:stderr [ -- empty -- ]
>>> Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_10000:2816:stdout [ -- empty -- ]
>>>
>>> Could you add:
>>>
>>> export OCF_TRACE_RA=1
>>>
>>> to the top of the ping agent and retest?
>>
>> Today it worked on the fourth try.
>> I even wondered whether the difference is in how I kill it
>> (kill -s 4 <pid> vs. pkill -4 lrmd).
>> Logs: http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2
> Hi,
> Have you looked at it yet?

Not yet. I've been hitting ACLs with a large hammer.

Where are we up to with this?
Do I disregard this one and look at the most recent email?
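(To follow up on the tracing suggestion in the quoted message above, the two steps could look roughly like this. It is a sketch only: the path /usr/lib/ocf/resource.d/pacemaker/ping is assumed to be where the agent behind pingCheck lives, and crm_attribute is used here as one way to check whether default_ping_set has been written for the local node.)

    # Enable OCF resource-agent tracing for the ping agent, as suggested above.
    # The path is an assumption - adjust it if the agent lives elsewhere.
    AGENT=/usr/lib/ocf/resource.d/pacemaker/ping
    grep -q '^export OCF_TRACE_RA=1' "$AGENT" ||
        sed -i '2i export OCF_TRACE_RA=1' "$AGENT"

    # After the next pingCheck monitor run, check whether the transient
    # node attribute the cluster is waiting on has actually been set.
    crm_attribute --type status --node "$(uname -n)" \
                  --name default_ping_set --query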