On 10/05/2013, at 8:08 PM, Steven Bambling <smbambl...@arin.net> wrote:
>
> On May 10, 2013, at 5:35 AM, Steven Bambling <smbambl...@arin.net> wrote:
>
>>
>> On May 9, 2013, at 8:05 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>
>>>
>>> On 10/05/2013, at 12:40 AM, Steven Bambling <smbambl...@arin.net> wrote:
>>>
>>>> I'm having some issues with getting some cluster monitoring setup and
>>>> configured on a 3 node multi-state cluster. I'm using Florian's blog as
>>>> an example
>>>> http://floriancrouzat.net/2013/01/monitor-a-pacemaker-cluster-with-ocfpacemakerclustermon-andor-external-agent/.
>>>>
>>>> When I create the primitive resource it starts on one of my nodes but
>>>> spawns multiple instances of crm_mon. I don't see any reason that would
>>>> cause it to spawn multiple instances, its very odd behavior.
>>>
>>> If you run:
>>>
>>> /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>>>
>>> manually a few times, what happens? Multiple processes?
>>
>> Yep for some reason its spawning multiple processes.
>>
>> root@pgdb3 ~]# ps aux | grep crm_mon
>> root 30678 0.0 0.0 103244 856 pts/0 S+ 05:30 0:00 grep crm_mon
>> [root@pgdb3 ~]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> [root@pgdb3 ~]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> [root@pgdb3 ~]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> [root@pgdb3 ~]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> [root@pgdb3 ~]# ps aux | grep crm_mon
>> root 30772 0.0 0.0 82744 2816 ? S 05:30 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> root 30781 0.0 0.0 82744 2668 ? S 05:30 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> root 30784 0.0 0.0 82744 2476 ? S 05:30 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> root 31134 0.0 0.0 103244 856 pts/0 S+ 05:30 0:00 grep crm_mon
>>
>> Put the .pid file in the tmp dir only lists 1 pid
>> [root@pgdb3 ~]# cat /tmp/ClusterMon_SNMPMon.pid
>> 30772
>
> I take that back I doubled checked and the SNMPMon resource was still started
> which was creating the multiple processes. After I stopped the resource I
> pkill'd all the crm_mon process and then ran the command again manually. Now
> it seems to squash the additional processes and only allows 1 process to be
> running.
>
> [root@pgdb3 tmp]# ps aux | grep crm_mon
> root 30955 0.0 0.0 82492 2632 pts/0 S 06:05 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> root 31991 0.0 0.0 103244 852 pts/0 S+ 06:05 0:00 grep crm_mon
> [root@pgdb3 tmp]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> [root@pgdb3 tmp]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> [root@pgdb3 tmp]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> [root@pgdb3 tmp]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> [root@pgdb3 tmp]# ps aux | grep crm_mon
> root 30955 0.0 0.0 82492 2632 pts/0 S 06:05 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> root 32545 0.0 0.0 103244 856 pts/0 S+ 06:06 0:00 grep crm_mon

It's possibly due to a stale pid file.
I believe the following patches should fix the problem.

+ Andrew Beekhof (12 days ago) 479c5cc: Fix: crm_mon: Check if a process can be daemonized before forking so the parent can report an error
+ Andrew Beekhof (12 days ago) e549770: Fix: crm_mon: Ensure stale pid files are updated when a new process is started

https://github.com/beekhof/pacemaker/commit/e549770
https://github.com/beekhof/pacemaker/commit/479c5cc

Weird that it works from the command line but not the resource agent.
Are the permissions on /tmp/ClusterMon_SNMPMon.pid ok?
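
For anyone wanting to check the stale pid file theory by hand, here is a minimal sketch. It assumes a POSIX shell and that you run it as root (or as the user owning the process), since kill -0 also fails when you lack permission to signal the pid. The function name pidfile_state is hypothetical and is not part of Pacemaker or the ClusterMon agent; it only classifies a pid file as missing, stale, or live.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Pacemaker: report whether a pid
# file is "missing", "stale" (no such process), or "live".
pidfile_state() {
    pidfile=$1

    # No file at all.
    [ -f "$pidfile" ] || { echo missing; return 0; }

    pid=$(cat "$pidfile" 2>/dev/null)

    # Empty or non-numeric contents cannot name a running process.
    case $pid in
        ''|*[!0-9]*) echo stale; return 0 ;;
    esac

    # kill -0 sends no signal; it only tests that the pid can be signalled.
    # Caveat: a recycled pid belonging to another program also counts as live.
    if kill -0 "$pid" 2>/dev/null; then
        echo live
    else
        echo stale
    fi
}
```

Calling pidfile_state /tmp/ClusterMon_SNMPMon.pid while the resource is stopped and no crm_mon is running should print "stale" if a leftover file is the culprit. Comparing the pid in the file against the ps output, together with ls -l on the file, covers the permissions question as well.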
>
> STEVE
>
>>
>>>
>>>>
>>>> I was also looking for some clarification on what this resource
>>>> provides….it looks to me that it kicks off a crm_mon in daemon mode that
>>>> will update a .html file and with -E it will run an external script. But
>>>> the resource itself doesn't trigger anything if another resource changes
>>>> state only if the crm_mon process ( monitored with PID ) fails and it has
>>>> to restart.
>>>
>>> Correct, it just updates the html file which you can see in your browser.
>>> Or, with -E, it can send an email or snmp alert.
>>>
>>>> If this is correct what is the best practice for monitoring additional
>>>> resource states?
>>>
>>> Define "additional"?
>>> If the resource fails we'll normally recover it automatically.
>> An example of an additional resource would be a vip using ( IPaddr2 ). Also
>> I have a multi-state pgsql resource, so if the resource fails it will either
>> try to restart or promote another node in the cluster to Master.
>>
>> v/r
>>
>> STEVE
>>
>>>
>>>>
>>>> v/r
>>>>
>>>> STEVE
>>>>
>>>>
>>>> Below are some additional data points.
>>>>
>>>>
>>>> Creating the Resource
>>>>
>>>> [root@pgdb2 tmp]# crm configure primitive SNMPMon ocf:pacemaker:ClusterMon \
>>>> > params user="root" update="30" extra_options="-E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net" \
>>>> > op monitor on-fail="restart" interval="60"
>>>>
>>>>
>>>> Manual crm_mon output
>>>>
>>>> Last updated: Thu May 9 10:24:30 2013
>>>> Last change: Thu May 9 10:20:49 2013 via cibadmin on pgdb2.example.com
>>>> Stack: cman
>>>> Current DC: pgdb1.example.com - partition with quorum
>>>> Version: 1.1.8-7.el6-394e906
>>>> 3 Nodes configured, unknown expected votes
>>>> 6 Resources configured.
>>>>
>>>>
>>>> Node pgdb1.example.com: standby
>>>> Online: [ pgdb2.example.com pgdb3.example.com ]
>>>>
>>>>  PG_REP_VIP (ocf::heartbeat:IPaddr2): Started pgdb2.example.com
>>>>  PG_CLI_VIP (ocf::heartbeat:IPaddr2): Started pgdb2.example.com
>>>>  Master/Slave Set: msPGSQL [PGSQL]
>>>>      Masters: [ pgdb2.example.com ]
>>>>      Slaves: [ pgdb3.example.com ]
>>>>      Stopped: [ PGSQL:2 ]
>>>>  SNMPMon (ocf::pacemaker:ClusterMon): Started pgdb3.example.com
>>>>
>>>> PS to check for process on pgdb3
>>>>
>>>> [root@pgdb3 tmp]# ps aux | grep crm_mon
>>>> root 16097 0.0 0.0 82624 2784 ? S 10:20 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>>>> root 16099 0.0 0.0 82624 2660 ? S 10:20 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>>>> root 16104 0.0 0.0 82624 2448 ? S 10:20 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>>>> root 16515 0.0 0.0 103244 852 pts/0 S+ 10:21 0:00 grep crm_mon
>>>>
>>>> Output from corosync.log
>>>>
>>>> May 09 10:20:51 [3100] pgdb3.cha.arin.net lrmd: info: process_lrmd_get_rsc_info: Resource 'SNMPMon' not found (3 active resources)
>>>> May 09 10:20:51 [3100] pgdb3.cha.arin.net lrmd: info: process_lrmd_rsc_register: Added 'SNMPMon' to the rsc list (4 active resources)
>>>> May 09 10:20:52 [3103] pgdb3.cha.arin.net crmd: info: services_os_action_execute: Managed ClusterMon_meta-data_0 process 16010 exited with rc=0
>>>> May 09 10:20:52 [3103] pgdb3.cha.arin.net crmd: notice: process_lrm_event: LRM operation SNMPMon_monitor_0 (call=61, rc=7, cib-update=28, confirmed=true) not running
>>>> May 09 10:20:52 [3103] pgdb3.cha.arin.net crmd: notice: process_lrm_event: LRM operation SNMPMon_start_0 (call=64, rc=0, cib-update=29, confirmed=true) ok
>>>> May 09 10:20:52 [3103] pgdb3.cha.arin.net crmd: notice: process_lrm_event: LRM operation SNMPMon_monitor_60000 (call=67, rc=0, cib-update=30, confirmed=false) ok
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org