On Jan 7, 9:40 pm, Andreas N <d...@pseudoterminal.org> wrote:
> On Friday, January 6, 2012 5:31:34 PM UTC+1, jcbollinger wrote:
>
> > Nothing in your log suggests that the Puppet agent is doing any work
> > when it fails.  It appears to apply a catalog successfully, then
> > create a report successfully, then nothing else.  That doesn't seem
> > like a problem in a module.  Nevertheless, you could try removing
> > classes from the affected node's configuration and testing whether
> > Puppet still freezes.
>
> John, thanks for your reply. I'll be deploying a node that includes no
> modules at all and see if a zombie process appears again.
>
> > You said the agent runs for several hours before it hangs.  Does it
> > perform multiple successful runs during that time?  That also would
> > tend to contraindicate a problem in your manifests.
>
> Yes, the agents perform several runs (with no changes to the catalog) and
> then simply freeze up, waiting for the defunct sh process to return.
>
> > I'm suspicious that something else on your systems is interfering with
> > the Puppet process; some kind of service manager, for example.  You'll
> > have to say whether that's a reasonable guess.  Alternatively, you may
> > have a system-level bug; there have been a few Ruby bugs and kernel
> > regressions that interfered with Puppet operation.
>
> Those are all pretty plain Ubuntu 10.04.3 server installations (both i386
> and x86_64), especially the ones I deployed this week, which aren't in
> production yet. What kind of service manager could there even be that
> interferes?


I was thinking along the lines of an intrusion detection system, or
perhaps a monitoring / management tool such as Nagios.  That's not to
say that I suspect Nagios in particular -- a lot of people seem to use
it together with Puppet with great success.  It sounds like nothing of
that sort is in play on your systems, however.


> > You could try using strace to determine where the failure happens,
> > though that's not as simple as it may sound.
>
> Simply trying to strace the zombie process only results in an "Operation
> not permitted". The agent process shows these lines repeatedly:
>
> Process 3741 attached - interrupt to quit
> select(8, [7], NULL, NULL, {1, 723393}) = 0 (Timeout)
> sigprocmask(SIG_BLOCK, NULL, [])        = 0
> sigprocmask(SIG_BLOCK, NULL, [])        = 0
> select(8, [7], NULL, NULL, {2, 0})      = 0 (Timeout)
> sigprocmask(SIG_BLOCK, NULL, [])        = 0
> sigprocmask(SIG_BLOCK, NULL, [])        = 0
> ...
>
> That doesn't tell me anything other than that the puppet agent is blocking
> on select() with a timeout of two seconds.


I kinda meant tracing a new agent process, so as to catch whatever
happens when it transitions to the non-functional state.  Nevertheless,
the trace does yield a bit of information.  In particular, it shows
that the agent is not fully blocked.  Given that, the fact that it has
a defunct child process it has not reaped makes me suspect a Ruby bug
even more.  I am also a bit curious what the open file descriptor 7
that Puppet is selecting on might be, but I don't think that's directly
related to your issue.

I suggest you compare the Ruby and kernel versions installed on the
affected nodes to those installed on unaffected nodes.  It may also be
useful to compare the Puppet configuration (/etc/puppet/puppet.conf)
on failing nodes to that on non-failing nodes to see whether any
options are set differently.  I am especially curious as to whether
the 'listen' option might be enabled when it does not need to be (or
does it?), but there might be other significant differences.
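
Something along these lines, run on a failing node and on a non-failing
one, should be enough for that comparison; the --configprint call is an
assumption about your Puppet version, but it should work on any
reasonably recent 2.6/2.7 agent:

  # Versions worth lining up across nodes:
  ruby --version
  uname -r
  puppet --version

  # Effective settings in the config file, with comments and blanks stripped:
  grep -Ev '^[[:space:]]*(#|$)' /etc/puppet/puppet.conf

  # What the agent itself thinks 'listen' is set to:
  puppet agent --configprint listen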


John
