Hi,

we have all seen plenty of cases where people, when asked what is wrong with their resources, post the pengine log messages (unpack_rsc_op).
Now, we all know that this is not the real error message, but just the PE analyzing the state of the cluster, based on the error to exit code mapping by the RA. Apparently, this is extremely hard to understand, and people find it very hard to locate the "real" error, which in turn makes it very hard for them to fix their clusters.

This is a real problem, one we've seen on the mailing lists, on IRC, and in quite a few customer incidents.

Thinking about this, I have two suggestions.

1. Unique operation id

The transition graph already includes a unique identifier for each action. If this were made a bit shorter, perhaps, and provided to the RA as part of the environment, the RA could include it in each log message. If it were then also recorded in the CIB, crm_mon/pengine could show the key, which users could feed to grep to find out much more quickly what exactly has been going wrong.

The LRM could log "Operation <id> start" .. "Operation <id> end", and then a simple grep would suffice to grab everything in-between, narrowing down the log section considerably. This would help even with RAs that have not been modified to include the op key in their logging.

2. Verbose error reporting

The PE et al only care about and interpret the exit code. While the exit code is differentiated enough to categorize the error and lets the cluster figure out how to respond, it is not sufficient for users to figure out what is wrong. Case in point: "not installed" - what, exactly, is not installed?

A possible approach would be for the RA to print a one-line summary to stderr, and to record this in the CIB along with the machine-readable encoded error. This would only be used for reporting to users.

I think 1) is fairly easily implemented, and would be a big step forward. 2) is more complicated, but would make reporting via the GUI etc much more helpful.

Comments?

Regards,
    Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
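
To make suggestion 1 a little more concrete, here is a rough sketch of the "grab everything in-between" step, assuming the LRM really did log "Operation <id> start" and "Operation <id> end" markers. Nothing here exists today: the marker format, the sample log lines, the id "7f3a" and the function name extract_operation are all made up for illustration.

```python
def extract_operation(log_lines, op_id):
    """Return the log lines from 'Operation <op_id> start' through
    'Operation <op_id> end', inclusive.

    The marker format is an assumption taken from the proposal above,
    not something the LRM currently emits.
    """
    start_marker = f"Operation {op_id} start"
    end_marker = f"Operation {op_id} end"
    inside = False
    selected = []
    for line in log_lines:
        if not inside and start_marker in line:
            inside = True
        if inside:
            selected.append(line)
            if end_marker in line:
                break
    return selected


# Hypothetical log excerpt with two interleaved operations.
sample_log = [
    "lrmd: Operation 7f3a start",
    "some-ra: [op 7f3a] ERROR: required binary not found",
    "lrmd: Operation 9c1d start",
    "lrmd: Operation 7f3a end",
    "lrmd: Operation 9c1d end",
]

for line in extract_operation(sample_log, "7f3a"):
    print(line)
```

Lines from other operations that happen to fall inside the window are kept on purpose, since an unmodified RA would not tag its own messages with the id; an RA that does tag them could be filtered further on the id alone.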