Hi folks,

I'd like to use crm_mon for monitoring and email notifications, but I've hit a snag incorporating it into the crm configuration. When I run crm_mon manually from the command line (with no ClusterMon resource configured in the cluster), it all works great, but obviously running crm_mon by hand on every cluster member would produce a litany of duplicated messages for each resource migration, which is why I want the cluster to manage a single instance.
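
For reference, the manual invocation that works is essentially the following (same pidfile, htmlfile, and mail options I later hand to the ClusterMon agent; I'm reconstructing the exact flags from memory, so treat it as approximate):

crm_mon -d -p /var/run/crm_mon.pid -h /var/tmp/crm_mon.html \
    -T o...@example.com -F 'Cluster Monitor <clustermoni...@example.com>' \
    -H smtp.example.com:25 -P '[LDAP Cluster]: Resource Changes Detected'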

Unfortunately, the exact same crm_mon configuration, when entered into the CIB, fails to work, and does so without logging any errors. To get the configuration into the CIB, I first tried the scriptable crm utility, but it didn't seem to like that very much:

# crm configure primitive ResourceMonitor ocf:pacemaker:ClusterMon \
    params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" \
    extra_options="-T o...@example.com -F 'Cluster Monitor <clustermoni...@example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]: Resource Changes Detected'" \
    op monitor interval="10s" timeout="20s"
element nvpair: Relax-NG validity error : Type ID doesn't allow value 'resourcemonitor-instance_attributes-...@example.com'
element nvpair: Relax-NG validity error : Element nvpair failed to validate attributes
Relax-NG validity error : Extra element nvpair in interleave
element nvpair: Relax-NG validity error : Element instance_attributes failed to validate content
Relax-NG validity error : Extra element instance_attributes in interleave
element cib: Relax-NG validity error : Element cib failed to validate content
crm_verify[1762]: 2010/12/05_19:23:03 ERROR: main: CIB did not pass DTD/schema validation
Errors found during check: config not valid
ERROR: ResourceMonitor: parameter -F does not exist
ERROR: ResourceMonitor: parameter [LDAP Cluster]: Resource Changes Detected does not exist
ERROR: ResourceMonitor: parameter Cluster Monitor <clustermoni...@example.com> does not exist
ERROR: ResourceMonitor: parameter -H does not exist
ERROR: ResourceMonitor: parameter smtp.example.com:25 does not exist
ERROR: ResourceMonitor: parameter o...@example.com does not exist
ERROR: ResourceMonitor: parameter -P does not exist
WARNING: ResourceMonitor: default timeout 20s for start is smaller than the advised 90
WARNING: ResourceMonitor: default timeout 20s for stop is smaller than the advised 100

I know those are valid options, since the exact same command works from the CLI. My suspicion is that the outer shell strips the double quotes around extra_options before crm re-tokenizes its arguments, so crm ends up seeing -F, -H, -P, and the mail addresses as separate (nonexistent) resource parameters, which is exactly what the errors above complain about.
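
If that diagnosis is right, feeding the definition to crm on standard input, so the login shell never gets a chance to re-tokenize it, ought to stay scriptable. I haven't verified this, but I'd expect something like the following to work (crm is supposed to read commands from stdin when it isn't attached to a terminal, and the quoted 'EOF' delimiter stops the shell from touching anything inside the here-document):

crm configure <<'EOF'
primitive ResourceMonitor ocf:pacemaker:ClusterMon \
    params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" \
    extra_options="-T o...@example.com -F 'Cluster Monitor <clustermoni...@example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]: Resource Changes Detected'" \
    op monitor interval="10s" timeout="20s"
commit
EOF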

Hoping it was just a quoting/interpolation issue, I went through the interactive crm shell instead, and that approach appeared to work (albeit with timeout threshold warnings):

crm(live)configure# primitive ResourceMonitor ocf:pacemaker:ClusterMon \
    params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" \
    extra_options="-T o...@example.com -F 'Cluster Monitor <clustermoni...@example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]: Resource Changes Detected'" \
    op monitor interval="10s" timeout="20s"
WARNING: ResourceMonitor: default timeout 20s for start is smaller than the advised 90
WARNING: ResourceMonitor: default timeout 20s for stop is smaller than the advised 100
crm(live)configure# commit
WARNING: ResourceMonitor: default timeout 20s for start is smaller than the advised 90
WARNING: ResourceMonitor: default timeout 20s for stop is smaller than the advised 100
crm(live)configure# exit
bye
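
As the 'crm configure show' output further down confirms, the extra_options string did survive the commit intact. A quicker way to check it straight from the CIB (assuming I have the cibadmin flags right) is something like:

cibadmin -Q -o resources | grep extra_options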


After adding it via the crm shell, the crm_mon daemon is definitely running (and migrates to another node if I shut down or restart corosync on the node currently hosting it), but no email ever arrives. My mail server logs confirm the messages never get there when crm_mon is managed by the cluster, yet the identical command works when run manually, and there are no errors or warnings in the logs, so I'm not sure what to attribute the problem to.
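
The next thing I plan to check is the exact command line of the cluster-spawned daemon, to see whether the ClusterMon agent passes extra_options through to crm_mon verbatim:

ps axww | grep '[c]rm_mon'

(The [c] just keeps grep from matching itself.)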

Here are the cluster log messages resulting from a simple resource migration on the host running the crm_mon daemon that was spawned by the cluster:

Dec  5 20:05:00 ldap3 external/ipmi[7032]: [7041]: debug: ipmitool output: Chassis Power is on
Dec  5 20:05:03 ldap3 cib: [6496]: info: cib_process_request: Operation complete: op cib_delete for section constraints (origin=ldap4/crm_resource/3, version=0.78.4): ok (rc=0)
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - <cib admin_epoch="0" epoch="78" num_updates="4" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - <configuration >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - <constraints >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - <rsc_location id="cli-prefer-ClusterIP" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - <rule id="cli-prefer-rule-ClusterIP" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - <expression value="ldap4" id="cli-prefer-expr-ClusterIP" />
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - </rule>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - </rsc_location>
Dec  5 20:05:03 ldap3 crmd: [6500]: info: abort_transition_graph: need_abort:59 - Triggered transition abort (complete=1) : Non-status change
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - </constraints>
Dec  5 20:05:03 ldap3 crmd: [6500]: info: need_abort: Aborting on change to admin_epoch
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - </configuration>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - </cib>
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + <cib admin_epoch="0" epoch="79" num_updates="1" >
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + <configuration >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + <constraints >
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_pe_invoke: Query 63: Requesting the current CIB: S_POLICY_ENGINE
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + <rsc_location id="cli-prefer-ClusterIP" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + <rule id="cli-prefer-rule-ClusterIP" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + <expression value="ldap3" id="cli-prefer-expr-ClusterIP" />
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + </rule>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + </rsc_location>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + </constraints>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + </configuration>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + </cib>
Dec  5 20:05:03 ldap3 cib: [6496]: info: cib_process_request: Operation complete: op cib_modify for section constraints (origin=ldap4/crm_resource/4, version=0.79.1): ok (rc=0)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_pe_invoke_callback: Invoking the PE: query=63, ref=pe_calc-dc-1291597503-34, seq=88, quorate=1
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  5 20:05:03 ldap3 pengine: [6499]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Dec  5 20:05:03 ldap3 pengine: [6499]: info: determine_online_status: Node ldap3 is online
Dec  5 20:05:03 ldap3 pengine: [6499]: info: determine_online_status: Node ldap4 is online
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: native_print: ClusterIP (ocf::heartbeat:IPaddr2): Started ldap4
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: native_print: ldap3-stonith (stonith:external/ipmi): Started ldap4
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: native_print: ldap4-stonith (stonith:external/ipmi): Started ldap3
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: native_print: ResourceMonitor (ocf::pacemaker:ClusterMon): Started ldap3
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: RecurringOp: Start recurring monitor (10s) for ClusterIP on ldap3
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Move resource ClusterIP (Started ldap4 -> ldap3)
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Leave resource ldap3-stonith (Started ldap4)
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Leave resource ldap4-stonith (Started ldap3)
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Leave resource ResourceMonitor (Started ldap3)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Dec  5 20:05:03 ldap3 crmd: [6500]: info: unpack_graph: Unpacked transition 3: 4 actions in 4 synapses
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_te_invoke: Processing graph 3 (ref=pe_calc-dc-1291597503-34) derived from /var/lib/pengine/pe-input-100.bz2
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_rsc_command: Initiating action 9: stop ClusterIP_stop_0 on ldap4
Dec  5 20:05:03 ldap3 cib: [7044]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-64.raw
Dec  5 20:05:03 ldap3 pengine: [6499]: info: process_pe_message: Transition 3: PEngine Input stored in: /var/lib/pengine/pe-input-100.bz2
Dec  5 20:05:03 ldap3 cib: [7044]: info: write_cib_contents: Wrote version 0.79.0 of the CIB to disk (digest: 8689b11ceba2dad1a9d93d704ff47580)
Dec  5 20:05:03 ldap3 cib: [7044]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.DN94W1 (digest: /var/lib/heartbeat/crm/cib.vicJ0i)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: match_graph_event: Action ClusterIP_stop_0 (9) confirmed on ldap4 (rc=0)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_rsc_command: Initiating action 10: start ClusterIP_start_0 on ldap3 (local)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_lrm_rsc_op: Performing key=10:3:0:32629690-f4fb-43b9-a251-8f6b25e60220 op=ClusterIP_start_0 )
Dec  5 20:05:03 ldap3 lrmd: [6497]: info: rsc:ClusterIP:13: start
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_pseudo_action: Pseudo action 5 fired and confirmed
Dec  5 20:05:03 ldap3 IPaddr2[7045]: INFO: ip -f inet addr add 10.1.1.163/32 brd 10.1.1.163 dev eth1
Dec  5 20:05:03 ldap3 IPaddr2[7045]: INFO: ip link set eth1 up
Dec  5 20:05:03 ldap3 IPaddr2[7045]: INFO: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-10.1.1.163 eth1 10.1.1.163 auto not_used not_used
Dec  5 20:05:03 ldap3 crmd: [6500]: info: process_lrm_event: LRM operation ClusterIP_start_0 (call=13, rc=0, cib-update=64, confirmed=true) ok
Dec  5 20:05:03 ldap3 crmd: [6500]: info: match_graph_event: Action ClusterIP_start_0 (10) confirmed on ldap3 (rc=0)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_rsc_command: Initiating action 11: monitor ClusterIP_monitor_10000 on ldap3 (local)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_lrm_rsc_op: Performing key=11:3:0:32629690-f4fb-43b9-a251-8f6b25e60220 op=ClusterIP_monitor_10000 )
Dec  5 20:05:03 ldap3 lrmd: [6497]: info: rsc:ClusterIP:14: monitor
Dec  5 20:05:03 ldap3 crmd: [6500]: info: process_lrm_event: LRM operation ClusterIP_monitor_10000 (call=14, rc=0, cib-update=65, confirmed=false) ok
Dec  5 20:05:03 ldap3 crmd: [6500]: info: match_graph_event: Action ClusterIP_monitor_10000 (11) confirmed on ldap3 (rc=0)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: run_graph: ====================================================
Dec  5 20:05:03 ldap3 crmd: [6500]: notice: run_graph: Transition 3 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-100.bz2): Complete
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_graph_trigger: Transition 3 is now complete
Dec  5 20:05:03 ldap3 crmd: [6500]: info: notify_crmd: Transition 3 status: done - <null>
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: Starting PEngine Recheck Timer

Note that, apart from the pengine deciding to leave ResourceMonitor where it is, nothing from ClusterMon or crm_mon appears in those logs at all.

Here is the output of 'crm configure show':

node ldap3
node ldap4
primitive ClusterIP ocf:heartbeat:IPaddr2 \
    params ip="10.1.1.163" cidr_netmask="32" \
    op monitor interval="10s"
primitive ResourceMonitor ocf:pacemaker:ClusterMon \
    params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" extra_options="-T o...@example.com -F 'Cluster Monitor <clustermoni...@example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]: Resource Changes Detected'" \
    op monitor interval="10s" timeout="20s"
primitive ldap3-stonith stonith:external/ipmi \
    params hostname="ldap3" ipaddr="10.1.0.5" userid="****" passwd="****" interface="lan" \
    op monitor interval="60s" timeout="30s"
primitive ldap4-stonith stonith:external/ipmi \
    params hostname="ldap4" ipaddr="10.1.0.6" userid="****" passwd="****" interface="lan" \
    op monitor interval="60s" timeout="30s"
location cli-prefer-ClusterIP ClusterIP \
    rule $id="cli-prefer-rule-ClusterIP" inf: #uname eq ldap3
location ldap3-stonith-cmdsrc ldap3-stonith -inf: ldap3
location ldap4-stonith-cmdsrc ldap4-stonith -inf: ldap4
property $id="cib-bootstrap-options" \
    dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    stonith-enabled="true" \
    no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100"

Other than the monitoring, everything seems to work pretty well, but I don't want to deploy this in production without a good real-time monitor of the resource changes, so I'd appreciate any suggestions as to why crm_mon works when run manually, but not when configured in the cluster. For reference, I'm running on Ubuntu Server 10.04 LTS (Lucid), and these are the packages I'm using:

cluster-agents     1:1.0.3-2ubuntu1
cluster-glue       1.0.5-1
corosync           1.2.0-0ubuntu1
libcluster-glue    1.0.5-1
libcorosync-dev    1.2.0-0ubuntu1
libcorosync4       1.2.0-0ubuntu1
libopenais3        1.1.2-0ubuntu1
openais            1.1.2-0ubuntu1
pacemaker          1.0.8+hg15494-2ubuntu2
pacemaker-dev      1.0.8+hg15494-2ubuntu2

Thanks!
