Hi folks,

I'd like to use crm_mon for monitoring and email notifications, but I've hit a snag incorporating it into the crm configuration. When I run crm_mon manually from the command line (with no ClusterMon resource configured in the cluster), it all works great, but obviously running crm_mon by hand on every cluster member would produce a litany of duplicated messages for each resource migration, which is why I want the cluster to manage a single instance.
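
For reference, the manual invocation that works is essentially the following (same pidfile, htmlfile, and mail options I later hand to the ClusterMon agent; I'm reconstructing the exact flags from memory, so treat it as approximate):

crm_mon -d -p /var/run/crm_mon.pid -h /var/tmp/crm_mon.html \
    -T o...@example.com -F 'Cluster Monitor <clustermoni...@example.com>' \
    -H smtp.example.com:25 -P '[LDAP Cluster]: Resource Changes Detected'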

Unfortunately, the exact same crm_mon configuration, when entered into the CIB, fails to work, and does so without logging any errors. To get the configuration into the CIB, I first tried the scriptable crm utility, but it didn't seem to like that very much:

# crm configure primitive ResourceMonitor ocf:pacemaker:ClusterMon \
    params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" \
    extra_options="-T o...@example.com -F 'Cluster Monitor <clustermoni...@example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]: Resource Changes Detected'" \
    op monitor interval="10s" timeout="20s"
element nvpair: Relax-NG validity error : Type ID doesn't allow value 'resourcemonitor-instance_attributes-...@example.com'
element nvpair: Relax-NG validity error : Element nvpair failed to validate attributes
Relax-NG validity error : Extra element nvpair in interleave
element nvpair: Relax-NG validity error : Element instance_attributes failed to validate content
Relax-NG validity error : Extra element instance_attributes in interleave
element cib: Relax-NG validity error : Element cib failed to validate content
crm_verify[1762]: 2010/12/05_19:23:03 ERROR: main: CIB did not pass DTD/schema validation
Errors found during check: config not valid
ERROR: ResourceMonitor: parameter -F does not exist
ERROR: ResourceMonitor: parameter [LDAP Cluster]: Resource Changes Detected does not exist
ERROR: ResourceMonitor: parameter Cluster Monitor <clustermoni...@example.com> does not exist
ERROR: ResourceMonitor: parameter -H does not exist
ERROR: ResourceMonitor: parameter smtp.example.com:25 does not exist
ERROR: ResourceMonitor: parameter o...@example.com does not exist
ERROR: ResourceMonitor: parameter -P does not exist
WARNING: ResourceMonitor: default timeout 20s for start is smaller than the advised 90
WARNING: ResourceMonitor: default timeout 20s for stop is smaller than the advised 100

I know those are valid options, since the exact same command works from the CLI. My suspicion is that the outer shell strips the double quotes around extra_options before crm re-tokenizes its arguments, so crm ends up seeing -F, -H, -P, and the mail addresses as separate (nonexistent) resource parameters, which is exactly what the errors above complain about.
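
If that diagnosis is right, feeding the definition to crm on standard input, so the login shell never gets a chance to re-tokenize it, ought to stay scriptable. I haven't verified this, but I'd expect something like the following to work (crm is supposed to read commands from stdin when it isn't attached to a terminal, and the quoted 'EOF' delimiter stops the shell from touching anything inside the here-document):

crm configure <<'EOF'
primitive ResourceMonitor ocf:pacemaker:ClusterMon \
    params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" \
    extra_options="-T o...@example.com -F 'Cluster Monitor <clustermoni...@example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]: Resource Changes Detected'" \
    op monitor interval="10s" timeout="20s"
commit
EOF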

Hoping it was just a quoting/interpolation issue, I went through the interactive crm shell instead, and that approach appeared to work (albeit with timeout threshold warnings):

crm(live)configure# primitive ResourceMonitor ocf:pacemaker:ClusterMon \
    params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" \
    extra_options="-T o...@example.com -F 'Cluster Monitor <clustermoni...@example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]: Resource Changes Detected'" \
    op monitor interval="10s" timeout="20s"
WARNING: ResourceMonitor: default timeout 20s for start is smaller than the advised 90
WARNING: ResourceMonitor: default timeout 20s for stop is smaller than the advised 100
crm(live)configure# commit
WARNING: ResourceMonitor: default timeout 20s for start is smaller than the advised 90
WARNING: ResourceMonitor: default timeout 20s for stop is smaller than the advised 100
crm(live)configure# exit
bye
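
As the 'crm configure show' output further down confirms, the extra_options string did survive the commit intact. A quicker way to check it straight from the CIB (assuming I have the cibadmin flags right) is something like:

cibadmin -Q -o resources | grep extra_options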


After adding it via the crm shell, the crm_mon daemon is definitely running (and migrates to another node if I shut down or restart corosync on the node currently hosting it), but no email ever arrives. My mail server logs confirm the messages never get there when crm_mon is managed by the cluster, yet the identical command works when run manually, and there are no errors or warnings in the logs, so I'm not sure what to attribute the problem to.
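
The next thing I plan to check is the exact command line of the cluster-spawned daemon, to see whether the ClusterMon agent passes extra_options through to crm_mon verbatim:

ps axww | grep '[c]rm_mon'

(The [c] just keeps grep from matching itself.)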

Here are the cluster log messages resulting from a simple resource migration on the host running the crm_mon daemon that was spawned by the cluster:

Dec  5 20:05:00 ldap3 external/ipmi[7032]: [7041]: debug: ipmitool output: Chassis Power is on
Dec  5 20:05:03 ldap3 cib: [6496]: info: cib_process_request: Operation complete: op cib_delete for section constraints (origin=ldap4/crm_resource/3, version=0.78.4): ok (rc=0)
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - <cib admin_epoch="0" epoch="78" num_updates="4" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - <configuration >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - <constraints >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - <rsc_location id="cli-prefer-ClusterIP" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - <rule id="cli-prefer-rule-ClusterIP" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - <expression value="ldap4" id="cli-prefer-expr-ClusterIP" />
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - </rule>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - </rsc_location>
Dec  5 20:05:03 ldap3 crmd: [6500]: info: abort_transition_graph: need_abort:59 - Triggered transition abort (complete=1) : Non-status change
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - </constraints>
Dec  5 20:05:03 ldap3 crmd: [6500]: info: need_abort: Aborting on change to admin_epoch
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - </configuration>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - </cib>
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + <cib admin_epoch="0" epoch="79" num_updates="1" >
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + <configuration >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + <constraints >
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_pe_invoke: Query 63: Requesting the current CIB: S_POLICY_ENGINE
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + <rsc_location id="cli-prefer-ClusterIP" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + <rule id="cli-prefer-rule-ClusterIP" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + <expression value="ldap3" id="cli-prefer-expr-ClusterIP" />
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + </rule>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + </rsc_location>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + </constraints>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + </configuration>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + </cib>
Dec  5 20:05:03 ldap3 cib: [6496]: info: cib_process_request: Operation complete: op cib_modify for section constraints (origin=ldap4/crm_resource/4, version=0.79.1): ok (rc=0)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_pe_invoke_callback: Invoking the PE: query=63, ref=pe_calc-dc-1291597503-34, seq=88, quorate=1
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  5 20:05:03 ldap3 pengine: [6499]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Dec  5 20:05:03 ldap3 pengine: [6499]: info: determine_online_status: Node ldap3 is online
Dec  5 20:05:03 ldap3 pengine: [6499]: info: determine_online_status: Node ldap4 is online
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: native_print: ClusterIP (ocf::heartbeat:IPaddr2): Started ldap4
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: native_print: ldap3-stonith (stonith:external/ipmi): Started ldap4
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: native_print: ldap4-stonith (stonith:external/ipmi): Started ldap3
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: native_print: ResourceMonitor (ocf::pacemaker:ClusterMon): Started ldap3
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: RecurringOp: Start recurring monitor (10s) for ClusterIP on ldap3
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Move resource ClusterIP (Started ldap4 -> ldap3)
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Leave resource ldap3-stonith (Started ldap4)
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Leave resource ldap4-stonith (Started ldap3)
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Leave resource ResourceMonitor (Started ldap3)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Dec  5 20:05:03 ldap3 crmd: [6500]: info: unpack_graph: Unpacked transition 3: 4 actions in 4 synapses
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_te_invoke: Processing graph 3 (ref=pe_calc-dc-1291597503-34) derived from /var/lib/pengine/pe-input-100.bz2
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_rsc_command: Initiating action 9: stop ClusterIP_stop_0 on ldap4
Dec  5 20:05:03 ldap3 cib: [7044]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-64.raw
Dec  5 20:05:03 ldap3 pengine: [6499]: info: process_pe_message: Transition 3: PEngine Input stored in: /var/lib/pengine/pe-input-100.bz2
Dec  5 20:05:03 ldap3 cib: [7044]: info: write_cib_contents: Wrote version 0.79.0 of the CIB to disk (digest: 8689b11ceba2dad1a9d93d704ff47580)
Dec  5 20:05:03 ldap3 cib: [7044]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.DN94W1 (digest: /var/lib/heartbeat/crm/cib.vicJ0i)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: match_graph_event: Action ClusterIP_stop_0 (9) confirmed on ldap4 (rc=0)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_rsc_command: Initiating action 10: start ClusterIP_start_0 on ldap3 (local)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_lrm_rsc_op: Performing key=10:3:0:32629690-f4fb-43b9-a251-8f6b25e60220 op=ClusterIP_start_0 )
Dec  5 20:05:03 ldap3 lrmd: [6497]: info: rsc:ClusterIP:13: start
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_pseudo_action: Pseudo action 5 fired and confirmed
Dec  5 20:05:03 ldap3 IPaddr2[7045]: INFO: ip -f inet addr add 10.1.1.163/32 brd 10.1.1.163 dev eth1
Dec  5 20:05:03 ldap3 IPaddr2[7045]: INFO: ip link set eth1 up
Dec  5 20:05:03 ldap3 IPaddr2[7045]: INFO: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-10.1.1.163 eth1 10.1.1.163 auto not_used not_used
Dec  5 20:05:03 ldap3 crmd: [6500]: info: process_lrm_event: LRM operation ClusterIP_start_0 (call=13, rc=0, cib-update=64, confirmed=true) ok
Dec  5 20:05:03 ldap3 crmd: [6500]: info: match_graph_event: Action ClusterIP_start_0 (10) confirmed on ldap3 (rc=0)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_rsc_command: Initiating action 11: monitor ClusterIP_monitor_10000 on ldap3 (local)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_lrm_rsc_op: Performing key=11:3:0:32629690-f4fb-43b9-a251-8f6b25e60220 op=ClusterIP_monitor_10000 )
Dec  5 20:05:03 ldap3 lrmd: [6497]: info: rsc:ClusterIP:14: monitor
Dec  5 20:05:03 ldap3 crmd: [6500]: info: process_lrm_event: LRM operation ClusterIP_monitor_10000 (call=14, rc=0, cib-update=65, confirmed=false) ok
Dec  5 20:05:03 ldap3 crmd: [6500]: info: match_graph_event: Action ClusterIP_monitor_10000 (11) confirmed on ldap3 (rc=0)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: run_graph: ====================================================
Dec  5 20:05:03 ldap3 crmd: [6500]: notice: run_graph: Transition 3 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-100.bz2): Complete
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_graph_trigger: Transition 3 is now complete
Dec  5 20:05:03 ldap3 crmd: [6500]: info: notify_crmd: Transition 3 status: done - <null>
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: Starting PEngine Recheck Timer

Note that, apart from the pengine deciding to leave ResourceMonitor where it is, nothing from ClusterMon or crm_mon appears in those logs at all.

Here is the output of 'crm configure show':

node ldap3
node ldap4
primitive ClusterIP ocf:heartbeat:IPaddr2 \
    params ip="10.1.1.163" cidr_netmask="32" \
    op monitor interval="10s"
primitive ResourceMonitor ocf:pacemaker:ClusterMon \
    params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" extra_options="-T o...@example.com -F 'Cluster Monitor <clustermoni...@example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]: Resource Changes Detected'" \
    op monitor interval="10s" timeout="20s"
primitive ldap3-stonith stonith:external/ipmi \
    params hostname="ldap3" ipaddr="10.1.0.5" userid="****" passwd="****" interface="lan" \
    op monitor interval="60s" timeout="30s"
primitive ldap4-stonith stonith:external/ipmi \
    params hostname="ldap4" ipaddr="10.1.0.6" userid="****" passwd="****" interface="lan" \
    op monitor interval="60s" timeout="30s"
location cli-prefer-ClusterIP ClusterIP \
    rule $id="cli-prefer-rule-ClusterIP" inf: #uname eq ldap3
location ldap3-stonith-cmdsrc ldap3-stonith -inf: ldap3
location ldap4-stonith-cmdsrc ldap4-stonith -inf: ldap4
property $id="cib-bootstrap-options" \
    dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    stonith-enabled="true" \
    no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100"

Other than the monitoring, everything seems to work pretty well, but I don't want to deploy this in production without a good real-time monitor of the resource changes, so I'd appreciate any suggestions as to why crm_mon works when run manually, but not when configured in the cluster. For reference, I'm running on Ubuntu Server 10.04 LTS (Lucid), and these are the packages I'm using:

cluster-agents     1:1.0.3-2ubuntu1
cluster-glue       1.0.5-1
corosync           1.2.0-0ubuntu1
libcluster-glue    1.0.5-1
libcorosync-dev    1.2.0-0ubuntu1
libcorosync4       1.2.0-0ubuntu1
libopenais3        1.1.2-0ubuntu1
openais            1.1.2-0ubuntu1
pacemaker          1.0.8+hg15494-2ubuntu2
pacemaker-dev      1.0.8+hg15494-2ubuntu2

Thanks!
