Thanks guys... I'll check out mcollective. Yeah, entering the root password 60 times is a bit painful, but the ssh loop would help. If I remember right, there is a REST API call in Foreman that will give me a list of the unresponsive hosts.
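
Something like this might pull that list (a sketch only; the hostname, credentials, search syntax, and JSON support all vary by Foreman version, so treat the URL and parameters below as assumptions and check your version's API docs):

# Sketch: ask Foreman which hosts haven't reported lately.  Hostname,
# credentials, and the search expression are assumptions.
curl -sk -u admin:changeme -G 'https://foreman.example.com/hosts' \
  --data-urlencode 'search=last_report < "1 hour ago"' \
  --data-urlencode 'format=json'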

The problem here is that puppet was in memory and running. It just wasn't responsive, perhaps waiting for something to happen that never did. So checks for the process (monit/snmp/pgrep, etc.) would report that puppet is fine.

Are there any more bullet-proof ways of watch-dogging Puppet specifically? Could we kill the process if the catalog lock is more than 30 minutes old? Or are locks on the catalog even a reality? Is this something Puppet could do on its own, in a separate thread, or would it need a separate process? I'm just throwing out an idea or two. Unfortunately, I lack the deep understanding of Puppet internals to know whether I'm barking up the wrong tree.
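
Something cron-able along these lines is what I'm imagining (an untested sketch; the lock path is the 2.6-era default and is an assumption, and the same file doubles as the --disable flag file, so a deliberate long disable would trip it):

#!/bin/sh
# Watchdog sketch: if the agent's run lock is over 30 minutes old, assume
# the run is wedged and bounce the daemon.
# /var/lib/puppet/state/puppetdlock is the 2.6-era default (an assumption);
# note it is also created by "puppetd --disable", so a disable lasting
# longer than 30 minutes would trigger this check.
LOCK=/var/lib/puppet/state/puppetdlock

if [ -e "$LOCK" ] && [ -n "$(find "$LOCK" -mmin +30)" ]; then
    logger -t puppet-watchdog "lock older than 30 minutes, restarting puppet"
    /etc/init.d/puppet stop
    sleep 5
    # a wedged daemon may ignore SIGTERM, so make sure it is gone
    pkill -9 -f puppetd
    rm -f "$LOCK"
    /etc/init.d/puppet start
fi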

--Kyle

On 01/27/2012 03:14 PM, Christopher Wood wrote:
While you're logging into every host to install mcollective, there are some other things to think about (all easily puppetizable):

-remote syslogging, so that lots of logs don't cause application hosts to clot
-file system monitoring for your hosts, so you get an alert before things fill up
-trend analysis (graphs) on the hosts, so you get an alert when something's trending to fill up (by inode as well, depending on the host)
-something monitoring critical processes, so that if they stop responding it'll restart them (here I plug monit for simplicity's sake, but snmp agents and similar tools can do this too)
-something monitoring the logs, which can alarm when something is absent/present when it shouldn't be

As to your immediate problem, try an ssh loop if you can run init scripts via 
sudo. Use -t so that sudo will have a tty. For security's sake you'll have to 
enter your password 60 times, but the experience will incentivize you to 
monitor for this problem.

cat <<XX >/tmp/h
host1
host2
XX

for h in $(cat /tmp/h); do ssh -t "$h" sudo /etc/init.d/puppet restart; done
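
If some of the 60 are down entirely, a variant like this won't hang on them and will keep a list of the stragglers (ConnectTimeout is a standard OpenSSH option; paths are the same as above):

# Same loop, but give up on dead hosts after 5 seconds and log failures.
for h in $(cat /tmp/h); do
  ssh -t -o ConnectTimeout=5 "$h" sudo /etc/init.d/puppet restart \
    || echo "$h" >>/tmp/h.failed
done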


Good luck.


On Sat, Jan 28, 2012 at 08:53:37AM +1100, Denmat wrote:
Hi,
Puppet's sister project, MCollective, would do it. An alternative would be something like Rundeck.
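
For example, something like this one-liner (it assumes the optional 'service' agent plugin is deployed alongside mcollective on each node; that plugin is not part of core mcollective):

# Sketch: restart the puppet agent fleet-wide via mcollective; requires
# the separately-installed 'service' agent plugin on every node.
mco rpc service restart service=puppet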

Den

On 28/01/2012, at 3:52, Kyle Mallory <kyle.mall...@utah.edu> wrote:

I am experiencing a curious event and wondering if others have seen this... I also have a related question.

This morning I noticed in my puppet summary report from Foreman that 60 of my 160 hosts all stopped reporting at nearly the exact same time, and have not reported since. Investigating, it appears that my puppetmaster temporarily ran out of disk space on the /var volume, probably due in part to logging. I have log rotation running, which eventually freed up some disk space, but the 60 hosts have not resumed reporting.

If I dig into the logs on one of the failing agents, there are no messages from puppet past 4am (here is a snippet of my logs):

Jan 27 02:44:25 kmallory3 puppet-agent[15340]: Using cached catalog
Jan 27 02:44:25 kmallory3 puppet-agent[15340]: Could not retrieve catalog; skipping run
Jan 27 03:14:30 kmallory3 puppet-agent[15340]: Could not retrieve catalog from remote server: Error 400 on SERVER: No space left on device - /var/lib/puppet/yaml/facts/kmallory3.xxx.xxx.xxx.yaml
Jan 27 03:14:30 kmallory3 puppet-agent[15340]: Using cached catalog
Jan 27 03:14:30 kmallory3 puppet-agent[15340]: Could not retrieve catalog; skipping run
Jan 27 03:47:30 kmallory3 puppet-agent[15340]: Could not retrieve plugin: execution expired
Jan 27 04:01:02 kmallory3 puppet-agent[15340]: Could not retrieve catalog from remote server: execution expired
Jan 27 04:01:02 kmallory3 puppet-agent[15340]: Using cached catalog
Jan 27 04:01:02 kmallory3 puppet-agent[15340]: Could not retrieve catalog; skipping run

Forcing a run of puppet, I get the following message:

kmallory3:/var/log# puppetd --onetime --test
notice: Ignoring --listen on onetime run
notice: Run of Puppet configuration client already in progress; skipping

After stopping and restarting the puppet service, the agent started running properly. It appears that the failure on the server caused the agent to hang, and it was not able to recover gracefully. Has anyone experienced this before? We are running 2.6.1 on the large majority of our hosts, including this one. Many failed, but two-thirds keep running properly.

Now, on to my question... Anyone got some bright ideas for how I could force Puppet to restart itself on 60 machines, when Puppet isn't responding?? I'm not really excited by the prospect of logging into 60 machines and running a sudo command... sigh.


--Kyle
