On Jan 7, 9:40 pm, Andreas N <d...@pseudoterminal.org> wrote:
> On Friday, January 6, 2012 5:31:34 PM UTC+1, jcbollinger wrote:
>
> > Nothing in your log suggests that the Puppet agent is doing any work
> > when it fails. It appears to apply a catalog successfully, then
> > create a report successfully, then nothing else. That doesn't seem
> > like a problem in a module. Nevertheless, you could try removing
> > classes from the affected node's configuration and testing whether
> > Puppet still freezes.
>
> John, thanks for your reply. I'll be deploying a node that includes no
> modules at all and see if a zombie process appears again.
>
> > You said the agent runs for several hours before it hangs. Does it
> > perform multiple successful runs during that time? That also would
> > tend to counterindicate a problem in your manifests.
>
> Yes, the agents perform several runs (with no changes to the catalog) and
> then simply freeze up, waiting for the defunct sh process to return.
>
> > I'm suspicious that something else on your systems is interfering with
> > the Puppet process; some kind of service manager, for example. You'll
> > have to say whether that's a reasonable guess. Alternatively, you may
> > have a system-level bug; there have been a few Ruby bugs and kernel
> > regressions that interfered with Puppet operation.
>
> Those are all pretty plain Ubuntu 10.04.3 server installations (both i386
> and x86_64), especially the ones I deployed this week, which aren't in
> production yet. What kind of service manager could there even be that
> interferes?
I was thinking along the lines of an intrusion detection system, or perhaps a monitoring / management tool such as Nagios. That's not to say that I suspect Nagios in particular -- a lot of people seem to use it together with Puppet with great success. It sounds like such a thing is not in your picture, however.

> > You could try using strace to determine where the failure happens,
> > though that's not as simple as it may sound.
>
> Simply trying to strace the zombie process only results in an "Operation
> not permitted". The agent process shows these lines repeatedly:
>
> Process 3741 attached - interrupt to quit
> select(8, [7], NULL, NULL, {1, 723393}) = 0 (Timeout)
> sigprocmask(SIG_BLOCK, NULL, []) = 0
> sigprocmask(SIG_BLOCK, NULL, []) = 0
> select(8, [7], NULL, NULL, {2, 0}) = 0 (Timeout)
> sigprocmask(SIG_BLOCK, NULL, []) = 0
> sigprocmask(SIG_BLOCK, NULL, []) = 0
> ...
>
> That doesn't tell me anything other than that the puppet agent is blocking
> on select() with a timeout of two seconds.

What I actually meant was to trace a new agent process, so as to catch whatever happens when it transitions to the non-functional state. Nevertheless, the trace does yield a bit of information. In particular, it shows that the agent is not fully blocked. Given that, the fact that it has a defunct child process it has not collected makes me suspect a Ruby bug even more. I am also a bit curious what the open FD 7 that Puppet is selecting on might be, but I don't think that's directly related to your issue.

I suggest you compare the Ruby and kernel versions installed on the affected nodes to those installed on unaffected nodes. It may also be useful to compare the Puppet configuration (/etc/puppet/puppet.conf) on failing nodes to that on non-failing nodes to see whether any options are set differently. I am especially curious as to whether the 'listen' option might be enabled when it does not need to be (or does it?), but there might be other significant differences.

John
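P.S. In case it helps, here is roughly what I have in mind for the tracing. This is only a sketch: the output path and the <agent PID> are placeholders, and the exact flags may need adjusting for your setup.

    # attach to an already-running agent and follow any children it
    # forks, with timestamps, writing the trace to a file
    strace -f -tt -o /tmp/puppet-agent.trace -p <agent PID>

or, for a one-off foreground run traced from startup:

    strace -f -tt -o /tmp/puppet-agent.trace puppet agent --test --debug

The -f flag is the important part: the defunct sh child suggests the interesting activity happens around a fork, and without -f you only see the parent. To find out what FD 7 actually is on a hung agent, something like

    ls -l /proc/<agent PID>/fd/7

(or lsof -p <agent PID>) should name the file, pipe, or socket behind it.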
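For the version and configuration comparison, simple commands like these, run on both a failing and a non-failing node, should capture the relevant bits (again, just illustrative):

    ruby --version
    uname -r
    dpkg -l | grep -iE 'puppet|facter|ruby'
    puppet agent --configprint listen

The last one assumes your Puppet version supports --configprint; if not, grepping /etc/puppet/puppet.conf for 'listen' works too. Diffing puppet.conf between a failing node and a non-failing one should then show whether any other options differ.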