Hi,

We have a two-node Pacemaker cluster (Pacemaker 1.1.7 on Fedora 16) that runs a PostgreSQL database on a DRBD device, plus Apache, Tomcat, and a few custom services. The load is pretty low (a few HTTP clients, a few database clients, infrequent database updates, every hour or so). It worked perfectly for more than a year, surviving three hardware failures without problems.
Now something strange happened. It started with these messages:

Mar 23 07:21:13 srvplan1 lrmd: [1125]: WARN: ping:0:monitor process (PID 10011) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 07:21:13 srvplan1 lrmd: [1125]: info: ping:0:monitor process (PID -10011) died before killing (try 1)
Mar 23 07:21:17 srvplan1 lrmd: [1125]: WARN: TcpPlanServer:0:monitor process (PID 10012) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 07:21:17 srvplan1 lrmd: [1125]: info: TcpPlanServer:0:monitor process (PID -10012) died before killing (try 1)
Mar 23 07:21:17 srvplan1 lrmd: [1125]: WARN: PlanServer:0:monitor process (PID 10013) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 07:21:17 srvplan1 lrmd: [1125]: info: PlanServer:0:monitor process (PID -10013) died before killing (try 1)
Mar 23 07:21:17 srvplan1 lrmd: [1125]: WARN: PlanUpdateClient:0:monitor process (PID 10014) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 07:21:17 srvplan1 lrmd: [1125]: info: PlanUpdateClient:0:monitor process (PID -10014) died before killing (try 1)

This all looks weird to me: if the processes had hung, it is very unlikely that they would all suddenly have died before being killed. Then there was silence for more than 45 minutes, not even the usual "monitor" messages. Moreover, by this point the web services were already unresponsive.
Then this:

Mar 23 08:06:38 srvplan1 lrmd: [1125]: info: rsc:postgresql:47: monitor
Mar 23 08:06:38 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation monitor[47] on ocf::pgsql::postgresql for client 1128, its parameters: crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor] CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0] pgdata=[/hd2/pgsql] stayed in operation list for 2730970 ms (longer than 10000 ms)
Mar 23 08:06:38 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation monitor[29] on ocf::tomcat::tomcat:0 for client 1128, its parameters: java_home=[/usr/lib/jvm/jre] tomcat_user=[tomcat] catalina_home=[/usr/share/tomcat] CRM_meta_timeout=[30000] CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] CRM_meta_interval=[30000] script_log=[/home/tmo/log/tomcat.log] statusurl=[http://127.0.0.1:8080/status/] CRM_meta_globally_unique=[false] stayed in operation list for 2717870 ms (longer than 10000 ms)
Mar 23 08:06:38 srvplan1 lrmd: [1125]: info: rsc:DBIP:45: monitor
Mar 23 08:06:38 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation monitor[45] on ocf::IPaddr2::DBIP for client 1128, its parameters: cidr_netmask=[24] crm_feature_set=[3.0.6] CRM_meta_name=[monitor] CRM_meta_interval=[10000] CRM_meta_timeout=[20000] ip=[xxx.xxx.xxx.185] stayed in operation list for 2737850 ms (longer than 10000 ms)
Mar 23 08:06:39 srvplan1 lrmd: [1125]: info: rsc:drbd_pgdrive:0:42: monitor
Mar 23 08:06:39 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation monitor[42] on ocf::drbd::drbd_pgdrive:0 for client 1128, its parameters: drbd_resource=[pgdrive] CRM_meta_role=[Master] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_start_resource=[drbd_pgdrive:0 ] CRM_meta_notify_inactive_resource=[drbd_pgdrive:0 ] CRM_meta_notify_master_uname=[srvplan2 ] CRM_meta_timeout=[60000] CRM_meta_name=[monitor] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_promote_uname=[ ] crm_feature_set=[3.0.6] CRM_meta_notify=[true] CRM_meta_notify_start_uname=[srvplan1 stayed in operation list for 2738840 ms (longer than 10000 ms)
Mar 23 08:06:40 srvplan1 lrmd: [1125]: info: rsc:apache:0:26: monitor
Mar 23 08:06:40 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation monitor[26] on ocf::apache::apache:0 for client 1128, its parameters: testregex=[</html>] CRM_meta_timeout=[30000] CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false] configfile=[/etc/httpd/conf/httpd.conf] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] CRM_meta_interval=[10000] CRM_meta_globally_unique=[false] stayed in operation list for 2739830 ms (longer than 10000 ms)
Mar 23 08:06:47 srvplan1 lrmd: [1125]: info: cancel_op: operation monitor[29] on ocf::tomcat::tomcat:0 for client 1128, its parameters: java_home=[/usr/lib/jvm/jre] tomcat_user=[tomcat] catalina_home=[/usr/share/tomcat] CRM_meta_timeout=[30000] CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] CRM_meta_interval=[30000] script_log=[/home/tmo/log/tomcat.log] statusurl=[http://127.0.0.1:8080/status/] CRM_meta_globally_unique=[false] cancelled

These 2738840 ms match the 45-something minutes between these and the previous messages.
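As a sanity check on that figure, the delay reported by lrmd converts to roughly the observed gap:

```shell
# Convert the "stayed in operation list" delay from milliseconds to minutes
awk 'BEGIN { printf "%.1f minutes\n", 2738840 / 60000 }'
# prints "45.6 minutes"
```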
Then the situation evolved into this:

Mar 23 08:06:47 srvplan1 lrmd: [1125]: info: rsc:tomcat:0:58: stop
Mar 23 08:06:47 srvplan1 lrmd: [1125]: info: rsc:tomcat:0:59: start
Mar 23 08:06:48 srvplan1 lrmd: [1125]: info: rsc:TcpPlanServer:0:22: monitor
Mar 23 08:06:48 srvplan1 lrmd: [1125]: info: rsc:PlanServer:0:32: monitor
Mar 23 08:06:48 srvplan1 lrmd: [1125]: info: rsc:PlanUpdateClient:0:24: monitor
Mar 23 08:06:51 srvplan1 lrmd: [1125]: info: rsc:ping:0:25: monitor
Mar 23 08:06:52 srvplan1 lrmd: [1125]: info: rsc:tomcat:0:60: monitor
Mar 23 08:17:53 srvplan1 lrmd: [1125]: WARN: TcpPlanServer:0:monitor process (PID 20113) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 08:17:53 srvplan1 lrmd: [1125]: info: TcpPlanServer:0:monitor process (PID -20113) died before killing (try 1)
Mar 23 08:17:57 srvplan1 lrmd: [1125]: WARN: apache:0:monitor process (PID 20108) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 08:17:59 srvplan1 lrmd: [1125]: WARN: tomcat:0:monitor process (PID 20109) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 08:18:02 srvplan1 lrmd: [1125]: WARN: apache:0:monitor process (PID 20108) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 08:18:04 srvplan1 lrmd: [1125]: WARN: tomcat:0:monitor process (PID 20109) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 08:18:07 srvplan1 lrmd: [1125]: ERROR: TrackedProcTimeoutFunction: apache:0:monitor process (PID 20108) will not die!
Mar 23 08:18:09 srvplan1 lrmd: [1125]: ERROR: TrackedProcTimeoutFunction: tomcat:0:monitor process (PID 20109) will not die!
Mar 23 08:18:27 srvplan1 lrmd: [1125]: WARN: drbd_pgdrive:0:monitor process (PID 20107) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 08:18:32 srvplan1 lrmd: [1125]: WARN: drbd_pgdrive:0:monitor process (PID 20107) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 08:18:37 srvplan1 lrmd: [1125]: ERROR: TrackedProcTimeoutFunction: drbd_pgdrive:0:monitor process (PID 20107) will not die!

So the first time they died too soon, and this time they refuse to die? And this was already an hour into the problem! Then another 45 minutes passed:

Mar 23 09:03:20 srvplan1 lrmd: [1125]: WARN: operation monitor[42] on ocf::drbd::drbd_pgdrive:0 for client 1128, its parameters: drbd_resource=[pgdrive] CRM_meta_role=[Master] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_start_resource=[drbd_pgdrive:0 ] CRM_meta_notify_inactive_resource=[drbd_pgdrive:0 ] CRM_meta_notify_master_uname=[srvplan2 ] CRM_meta_timeout=[60000] CRM_meta_name=[monitor] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_promote_uname=[ ] crm_feature_set=[3.0.6] CRM_meta_notify=[true] CRM_meta_notify_start_uname=[srvplan1 : pid [20107] timed out
Mar 23 09:03:20 srvplan1 lrmd: [1125]: WARN: operation monitor[26] on ocf::apache::apache:0 for client 1128, its parameters: testregex=[</html>] CRM_meta_timeout=[30000] CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false] configfile=[/etc/httpd/conf/httpd.conf] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] CRM_meta_interval=[10000] CRM_meta_globally_unique=[false] : pid [20108] timed out
Mar 23 09:03:20 srvplan1 lrmd: [1125]: WARN: operation monitor[60] on ocf::tomcat::tomcat:0 for client 1128, its parameters: java_home=[/usr/lib/jvm/jre] tomcat_user=[tomcat] catalina_home=[/usr/share/tomcat] CRM_meta_timeout=[30000] CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] CRM_meta_interval=[30000] script_log=[/home/tmo/log/tomcat.log] statusurl=[http://127.0.0.1:8080/status/] CRM_meta_globally_unique=[false] : pid [20109] timed out
Mar 23 09:03:20 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation monitor[24] on ocf::anything::PlanUpdateClient:0 for client 1128, its parameters: logfile=[/home/tmo/planupdateclient/PlanUpdateClient.log.txt] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] binfile=[/home/tmo/planupdateclient/PlanUpdateClient] crm_feature_set=[3.0.6] CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] pidfile=[/home/tmo/planupdateclient/PlanUpdateClient.pid] CRM_meta_interval=[10000] user=[tmo] CRM_meta_globally_unique=[false] stayed in operation list for 2746640 ms (longer than 10000 ms)
Mar 23 09:03:20 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation monitor[47] on ocf::pgsql::postgresql for client 1128, its parameters: crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor] CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0] pgdata=[/hd2/pgsql] stayed in operation list for 2746620 ms (longer than 10000 ms)
Mar 23 09:03:20 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation monitor[45] on ocf::IPaddr2::DBIP for client 1128, its parameters: cidr_netmask=[24] crm_feature_set=[3.0.6] CRM_meta_name=[monitor] CRM_meta_interval=[10000] CRM_meta_timeout=[20000] ip=[xxx.xxx.xxx.185] stayed in operation list for 2746570 ms (longer than 10000 ms)
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: cancel_op: operation monitor[47] on ocf::pgsql::postgresql for client 1128, its parameters: crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor] CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0] pgdata=[/hd2/pgsql] cancelled
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: flush_op: process for operation monitor[47] on ocf::pgsql::postgresql for client 1128 still running, flush delayed
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: on_msg_cancel_op: operation 47 running, cancel pending
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: perform_op:2906: operation monitor[47] on ocf::pgsql::postgresql for client 1128, its parameters: crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor] CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0] pgdata=[/hd2/pgsql] for rsc is already running.
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: perform_op:2916: postponing all ops on resource postgresql by 1000 ms
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: cancel_op: operation monitor[60] on ocf::tomcat::tomcat:0 for client 1128, its parameters: java_home=[/usr/lib/jvm/jre] tomcat_user=[tomcat] catalina_home=[/usr/share/tomcat] CRM_meta_timeout=[30000] CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] CRM_meta_interval=[30000] script_log=[/home/tmo/log/tomcat.log] statusurl=[http://127.0.0.1:8080/status/] CRM_meta_globally_unique=[false] cancelled
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: cancel_op: operation monitor[26] on ocf::apache::apache:0 for client 1128, its parameters: testregex=[</html>] CRM_meta_timeout=[30000] CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false] configfile=[/etc/httpd/conf/httpd.conf] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] CRM_meta_interval=[10000] CRM_meta_globally_unique=[false] cancelled
Mar 23 09:03:23 srvplan1 lrmd: [1125]: info: perform_op:2906: operation monitor[47] on ocf::pgsql::postgresql for client 1128, its parameters: crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor] CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0] pgdata=[/hd2/pgsql] for rsc is already running.
Mar 23 09:03:23 srvplan1 lrmd: [1125]: info: perform_op:2916: postponing all ops on resource postgresql by 1000 ms
Mar 23 09:03:24 srvplan1 lrmd: [1125]: info: perform_op:2906: operation monitor[47] on ocf::pgsql::postgresql for client 1128, its parameters: crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor] CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0] pgdata=[/hd2/pgsql] for rsc is already running.
Mar 23 09:03:24 srvplan1 lrmd: [1125]: info: perform_op:2916: postponing all ops on resource postgresql by 1000 ms

It goes on like this for some time, and then at some point Pacemaker decides to stop PostgreSQL:

Mar 23 09:03:50 srvplan1 lrmd: [1125]: WARN: operation monitor[45] on ocf::IPaddr2::DBIP for client 1128, its parameters: cidr_netmask=[24] crm_feature_set=[3.0.6] CRM_meta_name=[monitor] CRM_meta_interval=[10000] CRM_meta_timeout=[20000] ip=[xxx.xxx.xxx.185] : pid [20736] timed out
Mar 23 09:03:50 srvplan1 lrmd: [1125]: WARN: operation monitor[24] on ocf::anything::PlanUpdateClient:0 for client 1128, its parameters: logfile=[/home/tmo/planupdateclient/PlanUpdateClient.log.txt] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] binfile=[/home/tmo/planupdateclient/PlanUpdateClient] crm_feature_set=[3.0.6] CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] pidfile=[/home/tmo/planupdateclient/PlanUpdateClient.pid] CRM_meta_interval=[10000] user=[tmo] CRM_meta_globally_unique=[false] : pid [20734] timed out
Mar 23 09:03:50 srvplan1 lrmd: [1125]: WARN: postgresql:monitor process (PID 20735) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:03:50 srvplan1 lrmd: [1125]: info: perform_op:2906: operation monitor[47] on ocf::pgsql::postgresql for client 1128, its parameters: crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor] CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0] pgdata=[/hd2/pgsql] for rsc is already running.
Mar 23 09:03:51 srvplan1 lrmd: [1125]: info: perform_op:2916: postponing all ops on resource postgresql by 1000 ms
Mar 23 09:03:51 srvplan1 lrmd: [1125]: info: rsc:drbd_pgdrive:0:62: notify
Mar 23 09:03:51 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation notify[62] on ocf::drbd::drbd_pgdrive:0 for client 1128, its parameters: drbd_resource=[pgdrive] CRM_meta_notify_stop_resource=[drbd_pgdrive:0 ] CRM_meta_notify_start_resource=[drbd_pgdrive:0 ] CRM_meta_notify_inactive_resource=[ ] CRM_meta_notify_master_uname=[srvplan1 ] CRM_meta_notify_promote_uname=[srvplan1 ] CRM_meta_timeout=[20000] CRM_meta_name=[start] CRM_meta_notify_demote_resource=[drbd_pgdrive:0 ] CRM_meta_notify_start_uname=[srvplan1 ] crm_feature_set=[3.0.6] CRM_meta_clone_node_max=[ stayed in operation list for 28170 ms (longer than 10000 ms)
Mar 23 09:03:51 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation monitor[32] on ocf::anything::PlanServer:0 for client 1128, its parameters: logfile=[/home/tmo/PlanServer/PlanServer.log.txt] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] binfile=[/home/tmo/PlanServer/PlanServer] crm_feature_set=[3.0.6] CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] pidfile=[/home/tmo/PlanServer/PlanServer.pid] CRM_meta_interval=[10000] user=[tmo] CRM_meta_globally_unique=[false] stayed in operation list for 2776430 ms (longer than 10000 ms)
Mar 23 09:03:51 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation monitor[25] on ocf::ping::ping:0 for client 1128, its parameters: CRM_meta_timeout=[20000] multiplier=[1000] CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] host_list=[xxx.xxx.xxx.2] CRM_meta_clone_max=[2] CRM_meta_interval=[10000] CRM_meta_globally_unique=[false] stayed in operation list for 2774360 ms (longer than 10000 ms)
Mar 23 09:03:51 srvplan1 lrmd: [1125]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 110 ms (> 100 ms) before being called (GSource: 0x12bcc00)
Mar 23 09:03:51 srvplan1 lrmd: [1125]: info: G_SIG_dispatch: started at 2234731673 should have started at 2234731662
Mar 23 09:03:51 srvplan1 lrmd: [1125]: WARN: operation monitor[47] on ocf::pgsql::postgresql for client 1128, its parameters: crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor] CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0] pgdata=[/hd2/pgsql] : pid [20735] timed out
Mar 23 09:03:51 srvplan1 lrmd: [1125]: info: rsc:postgresql:61: stop
Mar 23 09:03:51 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation stop[61] on ocf::pgsql::postgresql for client 1128, its parameters: crm_feature_set=[3.0.6] CRM_meta_name=[stop] CRM_meta_timeout=[60000] pgdata=[/hd2/pgsql] stayed in operation list for 28170 ms (longer than 10000 ms)
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: drbd_pgdrive:0:notify process (PID 20776) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: PlanServer:0:monitor process (PID 20777) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: ping:0:monitor process (PID 20778) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: operation notify[62] on ocf::drbd::drbd_pgdrive:0 for client 1128, its parameters: drbd_resource=[pgdrive] CRM_meta_notify_stop_resource=[drbd_pgdrive:0 ] CRM_meta_notify_start_resource=[drbd_pgdrive:0 ] CRM_meta_notify_inactive_resource=[ ] CRM_meta_notify_master_uname=[srvplan1 ] CRM_meta_notify_promote_uname=[srvplan1 ] CRM_meta_timeout=[20000] CRM_meta_name=[start] CRM_meta_notify_demote_resource=[drbd_pgdrive:0 ] CRM_meta_notify_start_uname=[srvplan1 ] crm_feature_set=[3.0.6] CRM_meta_clone_node_max=[: pid [20776] timed out
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation monitor[42] on ocf::drbd::drbd_pgdrive:0 for client 1128, its parameters: drbd_resource=[pgdrive] CRM_meta_role=[Master] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_start_resource=[drbd_pgdrive:0 ] CRM_meta_notify_inactive_resource=[drbd_pgdrive:0 ] CRM_meta_notify_master_uname=[srvplan2 ] CRM_meta_timeout=[60000] CRM_meta_name=[monitor] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_promote_uname=[ ] crm_feature_set=[3.0.6] CRM_meta_notify=[true] CRM_meta_notify_start_uname=[srvplan1 stayed in operation list for 40500 ms (longer than 10000 ms)
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: operation monitor[32] on ocf::anything::PlanServer:0 for client 1128, its parameters: logfile=[/home/tmo/PlanServer/PlanServer.log.txt] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] binfile=[/home/tmo/PlanServer/PlanServer] crm_feature_set=[3.0.6] CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] pidfile=[/home/tmo/PlanServer/PlanServer.pid] CRM_meta_interval=[10000] user=[tmo] CRM_meta_globally_unique=[false] : pid [20777] timed out
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: operation monitor[25] on ocf::ping::ping:0 for client 1128, its parameters: CRM_meta_timeout=[20000] multiplier=[1000] CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] host_list=[xxx.xxx.xxx.2] CRM_meta_clone_max=[2] CRM_meta_interval=[10000] CRM_meta_globally_unique=[false] : pid [20778] timed out
Mar 23 09:04:11 srvplan1 lrmd: [1125]: info: rsc:tomcat:0:63: stop
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation stop[63] on ocf::tomcat::tomcat:0 for client 1128, its parameters: java_home=[/usr/lib/jvm/jre] tomcat_user=[tomcat] catalina_home=[/usr/share/tomcat] CRM_meta_timeout=[120000] CRM_meta_name=[stop] crm_feature_set=[3.0.6] CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] script_log=[/home/tmo/log/tomcat.log] statusurl=[http://127.0.0.1:8080/status/] CRM_meta_globally_unique=[false] stayed in operation list for 48100 ms (longer than 10000 ms)
Mar 23 09:04:11 srvplan1 lrmd: [1125]: info: rsc:apache:0:64: stop
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation stop[64] on ocf::apache::apache:0 for client 1128, its parameters: testregex=[</html>] CRM_meta_timeout=[60000] CRM_meta_name=[stop] crm_feature_set=[3.0.6] CRM_meta_notify=[false] configfile=[/etc/httpd/conf/httpd.conf] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] CRM_meta_globally_unique=[false] stayed in operation list for 48100 ms (longer than 10000 ms)
Mar 23 09:04:51 srvplan1 lrmd: [1125]: WARN: postgresql:stop process (PID 20779) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:04:51 srvplan1 lrmd: [1125]: WARN: operation stop[61] on ocf::pgsql::postgresql for client 1128, its parameters: crm_feature_set=[3.0.6] CRM_meta_name=[stop] CRM_meta_timeout=[60000] pgdata=[/hd2/pgsql] : pid [20779] timed out
Mar 23 09:04:51 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the operation operation monitor[32] on ocf::anything::PlanServer:0 for client 1128, its parameters: logfile=[/home/tmo/PlanServer/PlanServer.log.txt] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] binfile=[/home/tmo/PlanServer/PlanServer] crm_feature_set=[3.0.6] CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2] pidfile=[/home/tmo/PlanServer/PlanServer.pid] CRM_meta_interval=[10000] user=[tmo] CRM_meta_globally_unique=[false] stayed in operation list for 30010 ms (longer than 10000 ms)
Mar 23 09:05:11 srvplan1 lrmd: [1125]: WARN: drbd_pgdrive:0:monitor process (PID 20790) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:05:11 srvplan1 lrmd: [1125]: WARN: apache:0:stop process (PID 20792) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:05:11 srvplan1 lrmd: [1125]: WARN: PlanServer:0:monitor process (PID 20797) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:05:16 srvplan1 lrmd: [1125]: WARN: drbd_pgdrive:0:monitor process (PID 20790) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 09:05:16 srvplan1 lrmd: [1125]: WARN: apache:0:stop process (PID 20792) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 09:05:16 srvplan1 lrmd: [1125]: WARN: PlanServer:0:monitor process (PID 20797) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 09:05:21 srvplan1 lrmd: [1125]: ERROR: TrackedProcTimeoutFunction: drbd_pgdrive:0:monitor process (PID 20790) will not die!
Mar 23 09:05:21 srvplan1 lrmd: [1125]: ERROR: TrackedProcTimeoutFunction: apache:0:stop process (PID 20792) will not die!
Mar 23 09:05:21 srvplan1 lrmd: [1125]: ERROR: TrackedProcTimeoutFunction: PlanServer:0:monitor process (PID 20797) will not die!
Mar 23 09:06:11 srvplan1 lrmd: [1125]: WARN: tomcat:0:stop process (PID 20791) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:06:16 srvplan1 lrmd: [1125]: WARN: tomcat:0:stop process (PID 20791) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 09:06:21 srvplan1 lrmd: [1125]: ERROR: TrackedProcTimeoutFunction: tomcat:0:stop process (PID 20791) will not die!
Mar 23 09:06:23 srvplan1 lrmd: [1125]: info: perform_op:2906: operation monitor[42] on ocf::drbd::drbd_pgdrive:0 for client 1128, its parameters: drbd_resource=[pgdrive] CRM_meta_role=[Master] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_start_resource=[drbd_pgdrive:0 ] CRM_meta_notify_inactive_resource=[drbd_pgdrive:0 ] CRM_meta_notify_master_uname=[srvplan2 ] CRM_meta_timeout=[60000] CRM_meta_name=[monitor] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_promote_uname=[ ] crm_feature_set=[3.0.6] CRM_meta_notify=[true] CRM_meta_notify_start_uname=[srvplan1 for rsc is already running.
Mar 23 09:06:23 srvplan1 lrmd: [1125]: info: perform_op:2916: postponing all ops on resource drbd_pgdrive:0 by 1000 ms

At this point the second node finally realized something was wrong, fenced the first node, and took over. After the reboot, everything looks fine. Needless to say, 1 hour 45 minutes is a bit too long for a recovery. Any ideas where to look?

Basically, I'd like Pacemaker to detect whatever happened and migrate to the other node before trying to monitor, restart, or do whatever else it tried to do with those resources. As far as I understand, Pacemaker is supposed to restart a service as soon as its monitor operation fails (provided I didn't specify on-fail for the monitor action). Why didn't it try to restart any resources until 45 minutes later?
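For reference, this is roughly how the postgresql resource is defined (a sketch in crm shell syntax; pgdata and the 30-second monitor interval/timeout match the values visible in the logs above, while the ocf:heartbeat provider is my assumption, and note there is no on-fail on the monitor op, so the default should apply):

```
# Sketch, not a verbatim dump of our CIB.
primitive postgresql ocf:heartbeat:pgsql \
    params pgdata="/hd2/pgsql" \
    op monitor interval="30s" timeout="30s"
```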
I expected to see something like this: monitor fails -> restart fails -> STONITH.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org