On Aug 1, 2009, at 12:56 PM, Lars Marowsky-Bree wrote:
Hi,
so we all have seen plenty of cases where people, when asked what is
wrong with their resources, report the pengine logfile
(unpack_rsc_op).
Now, we all know that this is not the real error message, but just the
PE analyzing the state of the cluster, based on the error to exit code
mapping by the RA.
Apparently, this is extremely hard to understand, and it seems very hard
for people to find the "real" error. Which in turn makes it very hard
for them to fix their clusters. This is a real problem, one we've seen
on the mailing lists, IRC, and quite a few customer incidents.
Thinking about this, I have two suggestions.
1. Unique operation id
The transition graph already includes a unique identifier for each
action. If this were made a bit shorter and provided to the RA as part
of the environment, the RA could include it in each log message. If it
were then also included in the CIB, crm_mon/pengine could provide the
key, which users could feed to grep to find out much more quickly what
exactly has been going wrong.
That would be transition_key, which tells you which crmd instance,
graph, action number, and expected result every action has.
Just log it at the various places you want.
Though I don't see the point; grepping for the resource id is usually
just as effective.
The LRM could log this as "Operation <id> start" ... "Operation <id> end",
and then a simple grep would suffice to grab everything in between,
narrowing down the log section considerably. This would enhance even RAs
which were not modified to include the op key in their logging.
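A rough sketch of that extraction, assuming the LRM really did write
marker lines of the form "Operation <id> start" and "Operation <id> end".
The function name, the sample log lines, and the op id "42" are made up
for illustration; nothing here is existing LRM behaviour.

```python
def extract_operation(lines, op_id):
    """Return the lines from 'Operation <op_id> start' through
    'Operation <op_id> end', inclusive (hypothetical marker format)."""
    start_marker = f"Operation {op_id} start"
    end_marker = f"Operation {op_id} end"
    selected = []
    inside = False
    for line in lines:
        if not inside and start_marker in line:
            inside = True
        if inside:
            selected.append(line)
            if end_marker in line:
                break
    return selected


if __name__ == "__main__":
    # Invented log lines, only to show the shape of the input.
    sample = [
        "lrmd: Operation 41 end",
        "lrmd: Operation 42 start",
        "RA: ERROR: device eth9 not found",
        "lrmd: Operation 42 end",
        "lrmd: Operation 43 start",
    ]
    for line in extract_operation(sample, "42"):
        print(line)
```

Everything logged between the two markers is kept, whether or not the RA
itself mentions the op id, which is the point of having the LRM write the
markers.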
2. Verbose error reporting
The PE et al. only care about and interpret the exit code. While the
exit code is differentiated enough to categorize the error and allows
the cluster to figure out how to respond, it is not sufficient for users
to figure out what is wrong. Case in point: "not installed" - what,
exactly, is not installed?
Entirely dependent on the RA as you well know.
A possible thought would be for the RA to print a one-line summary to
stderr, and record this in the CIB along with the machine-readable
encoded error. This would only be used for reporting to users.
No.
We already log error output when an action fails.
Again, easily found by grepping for the resource ID.
I'd suggest focusing on improving the error logging that most RAs have
rather than adding yet more mechanisms for achieving the same thing.
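As a sketch of what better RA-side logging for the "not installed" case
might look like: the check below names the missing binaries and the
resource id in a single line on stderr, so a grep for the resource id
finds it. The function name and the message wording are assumptions for
illustration, not taken from any existing RA; the exit code 5 is the OCF
"not installed" code.

```python
import shutil
import sys

OCF_SUCCESS = 0
OCF_ERR_INSTALLED = 5  # OCF exit code for "not installed"


def check_installed(resource_id, binaries, err=sys.stderr):
    """Return OCF_ERR_INSTALLED and say exactly what is missing,
    or OCF_SUCCESS if every binary is found on PATH."""
    missing = [name for name in binaries if shutil.which(name) is None]
    if missing:
        # One line, including the resource id, so it is easy to grep for.
        print(
            f"{resource_id}: ERROR: not installed: missing {', '.join(missing)}",
            file=err,
        )
        return OCF_ERR_INSTALLED
    return OCF_SUCCESS
```

The exit code still tells the cluster how to react; the log line is only
there for the person reading the logs.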
I think 1) is fairly easily implemented, and would be a big step
forward. 2) is more complicated, but would make reporting via the GUI
etc much more helpful.
Comments?
Regards,
Lars
--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
_______________________________________________
Pacemaker mailing list
[email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
-- Andrew