Hi all, is there a better way to detect a failed resource than to run "crm_mon -1 -r"?
Example, I just 'created' a failed resource:

crm_mon -1 -r

Failed actions:
    ost_janlus_27_start_0 (node=vm3, call=108, rc=2, status=complete): invalid parameter

This cannot easily be parsed using 'grep', as "Failed actions:" is a complete section. Well, using a python or perl script it still wouldn't be too difficult. But how can I figure out the resource name there? I cannot run "crm resource cleanup ost_janlus_27_start_0", as that is obviously not the resource name. And I cannot simply cut off "start_0" either, as other actions might fail as well. In fact, the crm_mon output here is already annoying for a human: for a clean-up, a simple copy-and-paste with the mouse does not work, as I always have to cut off the action first...

Cleaning up dozens to hundreds of resources manually is not an option, so we have a script that goes over all resources and does that. However, on larger clusters that can easily take up to 90 minutes. For a small-size cluster:

[r...@vm3 ~]# time crm resource cleanup ost_janlus_27
Cleaning up ost_janlus_27 on vm6
Cleaning up ost_janlus_27 on vm7
Cleaning up ost_janlus_27 on vm8
Cleaning up ost_janlus_27 on vm1
Cleaning up ost_janlus_27 on vm2
Cleaning up ost_janlus_27 on vm3

real    0m7.129s
user    0m0.471s
sys     0m0.106s

[r...@vm3 ~]# time crm resource cleanup ost_janlus_27 vm6
Cleaning up ost_janlus_27 on vm6

real    0m1.348s
user    0m0.203s
sys     0m0.071s

[r...@vm3 ~]# time cluster_resources cleanup
resource: mds-janlus-grp
Cleaning up vg_janlus on vm6
Cleaning up mgs on vm6
Cleaning up mdt_janlus on vm6
Cleaning up vg_janlus on vm7
Cleaning up mgs on vm7
[...]

real    3m35.463s
user    0m13.704s
sys     0m3.440s

(cluster_resources is a small front end for crm that runs it for all of our resources.)

So about 1.35 s per resource and node. No problem for a few resources on all nodes of a 3-node system, but already an annoying 3+ minutes for 28 resources and 6 nodes on our default small-size systems. And it is definitely not an option anymore on an 18-node cluster with 230 resources (calculated time: 230 resources * 18 nodes * 1.35 s = 5589 s ≈ 1.5 *hours*). And cleaning up 230 resources manually if something went wrong on the cluster is also no fun, and not really fast either.

So I'm looking for *any* sane way to clean up resources, or at least for a well parse-able way to get the failed resources and the corresponding nodes.

Thanks,
Bernd

--
Bernd Schubert
DataDirect Networks
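
P.S. To illustrate what I mean by "parse-able": below is a rough, untested sketch (python) that pulls failed operations straight out of the CIB via "cibadmin -Q" instead of scraping crm_mon text. There the lrm_resource id already is the real resource name and the node comes from node_state, so nothing has to be cut off an op key. The rc-code whitelist (0 = ok, 7 = not running, 8 = running master) is only a heuristic, and attribute details may differ between Pacemaker versions.

#!/usr/bin/env python
# Rough sketch: list failed operations from the CIB instead of parsing
# crm_mon text output.  Must run on a cluster node where cibadmin works.
import subprocess
import xml.etree.ElementTree as ET

# Heuristic: treat these rc-codes as "not a failure"
# (0 = ok, 7 = not running, 8 = running master).
OK_RC_CODES = ("0", "7", "8")

def failed_ops():
    cib = subprocess.check_output(["cibadmin", "-Q"])
    root = ET.fromstring(cib)
    failures = []
    for node_state in root.findall("./status/node_state"):
        node = node_state.get("uname")
        for lrm_rsc in node_state.findall(".//lrm_resource"):
            rsc = lrm_rsc.get("id")              # the real resource name
            for op in lrm_rsc.findall("lrm_rsc_op"):
                if op.get("rc-code") not in OK_RC_CODES:
                    failures.append((rsc, node, op.get("operation"),
                                     op.get("rc-code")))
    return failures

if __name__ == "__main__":
    for rsc, node, op, rc in failed_ops():
        print("%s: %s failed on %s (rc=%s)" % (rsc, op, node, rc))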
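
P.P.S. And a sketch of the kind of parallel clean-up I have in mind, reusing the "crm resource cleanup <rsc> <node>" call from the timings above. The resource and node lists are just placeholders (they could come from the CIB sketch), and I have no idea yet how well the CRM copes with many concurrent cleanups, so this would need testing on a non-production cluster first.

#!/usr/bin/env python
# Rough sketch: run per-resource/per-node cleanups in parallel instead of
# one after the other.
import subprocess
from multiprocessing.pool import ThreadPool

RESOURCES = ["ost_janlus_27"]                        # placeholder list
NODES = ["vm1", "vm2", "vm3", "vm6", "vm7", "vm8"]   # placeholder list

def cleanup(rsc_node):
    rsc, node = rsc_node
    # Same command as in the timings above: one resource on one node.
    return subprocess.call(["crm", "resource", "cleanup", rsc, node])

pool = ThreadPool(8)            # 8 cleanups in flight at a time
pool.map(cleanup, [(r, n) for r in RESOURCES for n in NODES])
pool.close()
pool.join()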