Re: [Pacemaker] pacemaker shutdown waits for a failover

Andrew Beekhof Tue, 05 Aug 2014 20:08:19 -0700

On 3 Aug 2014, at 4:07 pm, Liron Amitzi <lir...@imperva.com> wrote:

>>>>> When I run "service pacemaker stop" it takes a long time, I see that it 
>>>>> stops all the resources, then starts them on the other node, and only 
>>>>> then the "stop" command is completed.
>>>> 
>>>> Ahhh! It was the DC.
>>>> 
>>>> It appears to be deliberate, I found this commit from 2008 where the 
>>>> behaviour was introduced:
>>>> https://github.com/beekhof/pacemaker/commit/7bf55f0
>>>> 
>>>> I could change it, but I'm no longer sure this would be a good idea as it 
>>>> would increase service downtime.
>>>> (Electing and bootstrapping a new DC introduces additional delays before 
>>>> the cluster can bring up any resources).
>>>> 
>>>> I assume there is a particular resource that takes a long time to start?
>>>> 
>>> Yes, mainly the JavaSrv takes quite a lot of time...
>> 
>> Do you have any resources that need to start after JavaSrv?
>> If not there might be some magic you can use...
> 
> No I don't, the Java is the last one. If I manage to do a "magic" it will 
> help me a lot...


1. You _may_ be able to set op_no_wait as a meta-attribute for your Java
resource.
2. You could change the agent's start action to return early and set a large
start-delay for the recurring monitor operation (we usually recommend the exact
opposite).
3. You could set start-delay > START_DELAY_THRESHOLD (i.e. 5 * 60 * 1000 ms,
or 5 minutes).

#2 might be the least worst
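
To make #2 a bit more concrete, here is a minimal sketch of a start action
that returns early. It assumes JavaSrv is driven by an LSB-style init script
(as OracleDB is in the log below); JAVASRV_CMD, PIDFILE and the javasrv_*
function names are hypothetical placeholders, not taken from the real agent.

```shell
#!/bin/sh
# Sketch only: JAVASRV_CMD and PIDFILE are hypothetical placeholders.
JAVASRV_CMD="${JAVASRV_CMD:-sleep 30}"          # stand-in for the real launch command
PIDFILE="${PIDFILE:-/tmp/javasrv-sketch.pid}"

javasrv_status() {
    # LSB convention: 0 = running, 3 = not running.
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        return 0
    fi
    return 3
}

javasrv_start() {
    # Starting an already-running service must succeed (LSB requirement).
    if javasrv_status; then
        return 0
    fi
    # Launch in the background and return immediately, instead of
    # blocking until the application has finished initialising.
    # $JAVASRV_CMD is deliberately unquoted so it splits into command + args.
    nohup $JAVASRV_CMD >/dev/null 2>&1 &
    echo $! > "$PIDFILE"
    return 0
}

javasrv_stop() {
    # A real script should also wait for the process to exit.
    if javasrv_status; then
        kill "$(cat "$PIDFILE")"
    fi
    rm -f "$PIDFILE"
    return 0
}
```

The trade-off is that the cluster now considers the resource started before
it is actually usable, which is why the recurring monitor needs a large
start-delay: otherwise a monitor that checks real application health would
fail while the service is still coming up.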

> 
>>> So you say this is by design since the server I'm rebooting is the DC, and 
>>> I suffer because my resources take long time to start?
>> 
>> Essentially, yes.
>> 
>>> Got it, thanks a lot for your response.
>>> 
>>>> 
>>>>> I have 3 resources, IP, OracleDB and JavaSrv
>>>>> 
>>>>> This is the output on the screen:
>>>>> [root@ha1 ~]# service pacemaker stop
>>>>> Signaling Pacemaker Cluster Manager to terminate:          [  OK  ]
>>>>> Waiting for cluster services to unload:.....[...].....     [  OK  ]
>>>>> [root@ha1 ~]#
>>>>> 
>>>>> And these are parts of the log (/var/log/cluster/corosync.log):
>>>>> Jun 29 15:14:15 [28031] ha1    pengine:   notice: stage6:  Scheduling 
>>>>> Node ha1 for shutdown
>>>>> Jun 29 15:14:15 [28031] ha1    pengine:   notice: LogActions:      Move   
>>>>>  ip_resource     (Started ha1 -> ha2)
>>>>> Jun 29 15:14:15 [28031] ha1    pengine:   notice: LogActions:      Move   
>>>>>  OracleDB        (Started ha1 -> ha2)
>>>>> Jun 29 15:14:15 [28031] ha1    pengine:   notice: LogActions:      Move   
>>>>>  JavaSrv    (Started ha1 -> ha2)
>>>>> Jun 29 15:14:15 [28032] ha1       crmd:     info: te_rsc_command:  
>>>>> Initiating action 12: stop JavaSrv_stop_0 on ha1 (local)
>>>>> Jun 29 15:14:15 ha1 lrmd: [28029]: info: rsc:JavaSrv:16: stop
>>>>> ...
>>>>> Jun 29 15:14:41 [28032] ha1       crmd:     info: process_lrm_event:      
>>>>>  LRM operation JavaSrv_stop_0 (call=16, rc=0, cib-update=447, 
>>>>> confirmed=true) ok
>>>>> Jun 29 15:14:41 [28032] ha1       crmd:     info: te_rsc_command:  
>>>>> Initiating action 9: stop OracleDB_stop_0 on ha1 (local)
>>>>> Jun 29 15:14:41 ha1 lrmd: [28029]: info: cancel_op: operation monitor[13] 
>>>>> on lsb::ha-dbora::OracleDB for client 28032, its parameters: 
>>>>> CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_timeout=[600000] 
>>>>> CRM_meta_interval=[60000]  cancelled
>>>>> Jun 29 15:14:41 ha1 lrmd: [28029]: info: rsc:OracleDB:17: stop
>>>>> ...
>>>>> Jun 29 15:15:08 [28032] ha1       crmd:     info: process_lrm_event:      
>>>>>  LRM operation OracleDB_stop_0 (call=17, rc=0, cib-update=448, 
>>>>> confirmed=true) ok
>>>>> Jun 29 15:15:08 [28032] ha1       crmd:     info: te_rsc_command:  
>>>>> Initiating action 7: stop ip_resource_stop_0 on ha1 (local)
>>>>> ...
>>>>> Jun 29 15:15:08 [28032] ha1       crmd:     info: process_lrm_event:      
>>>>>  LRM operation ip_resource_stop_0 (call=18, rc=0, cib-update=449, 
>>>>> confirmed=true) ok
>>>>> Jun 29 15:15:08 [28032] ha1       crmd:     info: te_rsc_command:  
>>>>> Initiating action 8: start ip_resource_start_0 on ha2
>>>>> Jun 29 15:15:08 [28032] ha1       crmd:     info: te_crm_command:  
>>>>> Executing crm-event (21): do_shutdown on ha1
>>>>> Jun 29 15:15:08 [28032] ha1       crmd:     info: te_crm_command:  
>>>>> crm-event (21) is a local shutdown
>>>>> Jun 29 15:15:09 [28032] ha1       crmd:     info: te_rsc_command:  
>>>>> Initiating action 10: start OracleDB_start_0 on ha2
>>>>> Jun 29 15:15:51 [28032] ha1       crmd:     info: te_rsc_command:  
>>>>> Initiating action 11: monitor OracleDB_monitor_60000 on ha2
>>>>> Jun 29 15:15:51 [28032] ha1       crmd:     info: te_rsc_command:  
>>>>> Initiating action 13: start JavaSrv_start_0 on ha2
>>>>> ...
>>>>> Jun 29 15:27:09 [28023] ha1 pacemakerd:     info: pcmk_child_exit:        
>>>>>  Child process cib exited (pid=28027, rc=0)
>>>>> Jun 29 15:27:09 [28023] ha1 pacemakerd:   notice: pcmk_shutdown_worker:   
>>>>>  Shutdown complete
>>>>> Jun 29 15:27:09 [28023] ha1 pacemakerd:     info: main:    Exiting 
>>>>> pacemakerd
>>>>> 
>>> 
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
> 
> 
