Hi All, We came across a mysterious phenomenon on a test of the drbd environment.
It is the following procedure. Step 1) Start two nodes. Step 2) Cause the hang of the kernel in an active node. Step 3) In a standby node, the cancellation of the monitor of drbd is carried out. The cancellation of the monitor of drbd is completed, but a timer occurs. Because it was completed, the cancellation of the monitor of drbd should stop the timer. (snip) Jul 25 23:06:16 de02 crmd: [18190]: info: te_rsc_command: Initiating action 23: cancel prmDrPostgreSQLDB1:1_monitor_20000 on de02 (local) Jul 25 23:06:16 de02 lrmd: [18187]: info: cancel_op: operation monitor[53] on ocf::drbd::prmDrPostgreSQLDB1:1 for client 18190, its parameters: CRM_meta_interval=[20000] CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_active_resource=[ ] CRM_meta_notify_slave_resource=[ ] CRM_meta_notify_start_resource=[prmDrPostgreSQLDB1:0 prmDrPostgreSQLDB1:1 prmDrCRM_meta_notify_active_uname=[ ] CRM_meta_notify_promote_resource=[ ] CRM_meta_notify_stop_uname=[ ] drbd_resource=[r0] CRM_meta_notify_master_uname=[ ] CRM_meta_master_max=[1] CRM_meta_no cancelled Jul 25 23:06:16 de02 crmd: [18190]: info: send_direct_ack: ACK'ing resource op prmDrPostgreSQLDB1:1_monitor_20000 from 23:10:0:beda4e4c-4386-45f6-b984-26fb71ae0fdf: lrm_invoke-lrmd-1311602776-148 Jul 25 23:06:16 de02 crmd: [18190]: info: process_te_message: Processing (N)ACK lrm_invoke-lrmd-1311602776-148 from de02 Jul 25 23:06:16 de02 crmd: [18190]: info: match_graph_event: Action prmDrPostgreSQLDB1:1_monitor_20000 (23) confirmed on de02 (rc=0) Jul 25 23:06:16 de02 crmd: [18190]: info: te_rsc_command: Initiating action 24: cancel prmDrPostgreSQLDB2:1_monitor_20000 on de02 (local) (snip) Jul 25 23:06:16 de02 crmd: [18190]: info: match_graph_event: Action prmDrPostgreSQLDB1:1_monitor_20000 (23) confirmed on de02 (rc=0) Jul 25 23:06:16 de02 crmd: [18190]: info: te_rsc_command: Initiating action 24: cancel prmDrPostgreSQLDB2:1_monitor_20000 on de02 (local) Jul 25 23:06:16 de02 lrmd: [18187]: info: cancel_op: operation monitor[54] on ocf::drbd::prmDrPostgreSQLDB2:1 for client 18190, its parameters: CRM_meta_interval=[20000] CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_active_resource=[ ] CRM_meta_notify_slave_resource=[prmDrPostgreSQLDB1:0 prmDrPostgreSQLDB1:1 ] CRM_meta_notify_start_resource=[prmDrPostgreSQLDB2:0 prmDrPostgreSQLDB2:1 ] CRM_meta_notify_active_uname=[ ] CRM_meta_notify_promote_resource=[prmDrPostgreSQLDB1:1 prmDrPostgreSQLDB2:1 ] CRM_meta_notify_stop_uname=[ ] drbd_res cancelled Jul 25 23:06:16 de02 crmd: [18190]: info: send_direct_ack: ACK'ing resource op prmDrPostgreSQLDB2:1_monitor_20000 from 24:10:0:beda4e4c-4386-45f6-b984-26fb71ae0fdf: lrm_invoke-lrmd-1311602776-150 Jul 25 23:06:16 de02 crmd: [18190]: info: process_te_message: Processing (N)ACK lrm_invoke-lrmd-1311602776-150 from de02 Jul 25 23:06:16 de02 crmd: [18190]: info: match_graph_event: Action prmDrPostgreSQLDB2:1_monitor_20000 (24) confirmed on de02 (rc=0) (snip) Jul 25 23:07:36 de02 crmd: [18190]: WARN: action_timer_callback: Timer popped (timeout=20000, abort_level=0, complete=false) Jul 25 23:07:36 de02 crmd: [18190]: ERROR: print_elem: Aborting transition, action lost: [Action 23]: Completed (id: prmDrPostgreSQLDB1:1_monitor_20000, loc: de02, priority: 0) Jul 25 23:07:36 de02 crmd: [18190]: info: abort_transition_graph: action_timer_callback:493 - Triggered transition abort (complete=0) : Action lost Jul 25 23:07:36 de02 crmd: [18190]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000 Jul 25 23:07:36 de02 crmd: [18190]: info: update_abort_priority: Abort action done superceeded by restart Jul 25 23:07:36 de02 crmd: [18190]: WARN: action_timer_callback: Timer popped (timeout=20000, abort_level=1000000, complete=false) Jul 25 23:07:36 de02 crmd: [18190]: ERROR: print_elem: Aborting transition, action lost: [Action 24]: Completed (id: prmDrPostgreSQLDB2:1_monitor_20000, loc: de02, priority: 0) (snip) crmd fails in stop_te_timer processing, and a timer seems to work. Only this one time of phenomenon occurred. It did not occur on several times of other tests. I registered this problem with Bugzilla. * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2639 Best Regards, Hideo Yamauchi. _______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker