hello,

----- Original Message -----
> Hi everyone,
> 
> I run MCollective 1.2.1 together with ActiveMQ 5.5 under Scientific
> Linux 6.1 on Amazon EC2. Overall it works like a
> charm, but sometimes (e.g. 1 in 30 runs) discovery fails. The exit code
> of mco is still 0, which is a problem for me as I
> use MCollective e.g. to trigger deployments from Jenkins.

does mco ping exhibit the same behavior if you run it often?  Does it
tend to happen more after a period of the collective being idle or just
really randomly?
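
One way to answer that is to run mco ping in a loop and record how many
nodes reply each time. A rough Python sketch, assuming mco is on the PATH
and that every responding node is printed on a line containing
"time=<n> ms" (check that against the output of your version). The
function names are made up for illustration:

```python
import re
import subprocess
import time

# Assumption: each reply line of `mco ping` contains "time=<number> ms".
REPLY_RE = re.compile(r"time=\d+(?:\.\d+)?\s*ms")


def count_replies(output):
    """Count the lines in mco ping output that look like a node reply."""
    return sum(1 for line in output.splitlines() if REPLY_RE.search(line))


def failure_rate(counts, expected):
    """Fraction of runs that saw fewer than `expected` replies."""
    if not counts:
        return 0.0
    return sum(1 for c in counts if c < expected) / len(counts)


def sample(runs, pause, expected):
    """Run `mco ping` `runs` times, sleeping `pause` seconds in between."""
    counts = []
    for _ in range(runs):
        result = subprocess.run(["mco", "ping"], capture_output=True, text=True)
        counts.append(count_replies(result.stdout))
        print(time.strftime("%H:%M:%S"), "replies:", counts[-1])
        time.sleep(pause)
    print("failure rate: {:.1%}".format(failure_rate(counts, expected)))
    return counts
```

Calling sample(runs=100, pause=5, expected=5) and then again with a much
longer pause (say 600) should show whether the failures cluster after idle
periods or are spread out randomly.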

> 
> I would like to ask for some feedback on the following ideas, that
> could fix this problem.
> 
> a) Increase discovery timeout
> mco offers an option to tweak the discovery timeout. What is your
> experience with increasing this value? When running
> "mco ping", I see ping times of 130ms, so 2 seconds (the default)
> should be enough, right?
> Is there a way to configure it globally?

it's not global, it's a client setting, but with those ping times the
default should be sufficient.  Discovery does exactly what mco ping does,
so it's a good way to diagnose this.

It might be worth enabling verbose GC logging on your ActiveMQ. It's possible
that during these times it just did a big full garbage collection, which
would block it, and that might indicate some tuning is needed.

> 
> b) Mco should exit != 0 when no nodes are found
> I would like to see a "--batch" or "--non-interactive" mode where
> mco has an exit code different from 0 when no nodes
> are found.

ok, you can file tickets for this

> c) Add "expected count" to mco command
> I think there are some situations where one knows the number of
> MCollective nodes. So what about adding an
> "--expect-number" option to mco, where I can give either a count or
> a range of expected nodes?

mcollective 1.3.x, which will soon become 2.0, has a new mode of
communication where you can provide it a host list etc. and it will bypass
discovery. This is meant to be used for things like deployers, where you
know what machines you wish to affect, and it will probably help.
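
Until then, the expected-count idea can be approximated outside mco with a
small guard that Jenkins runs before the real command. This is only a
sketch, not an existing mco feature: the "5" / "3-5" range syntax, the
function names and the assumption that each mco ping reply line contains
"time=<n> ms" are all mine.

```python
import re
import subprocess

# Assumption: each reply line of `mco ping` contains "time=<number> ms".
REPLY_RE = re.compile(r"time=\d+(?:\.\d+)?\s*ms")


def parse_expected(spec):
    """Turn "5" into (5, 5) and "3-5" into (3, 5)."""
    low, sep, high = spec.partition("-")
    low = int(low)
    high = int(high) if sep else low
    if low < 0 or high < low:
        raise ValueError("bad expected range: " + spec)
    return low, high


def exit_code(found, expected):
    """Return 0 when `found` is inside the inclusive range, 1 otherwise."""
    low, high = expected
    return 0 if low <= found <= high else 1


def main(argv):
    """Hypothetical CLI entry: main(["3-5"]) pings and returns a status."""
    expected = parse_expected(argv[0])
    result = subprocess.run(["mco", "ping"], capture_output=True, text=True)
    found = sum(1 for line in result.stdout.splitlines() if REPLY_RE.search(line))
    print("found {} node(s), expected {}-{}".format(found, *expected))
    return exit_code(found, expected)
```

To use it as a script, end the file with sys.exit(main(sys.argv[1:])) (plus
import sys) and let the Jenkins job stop when it exits non-zero. Note that
this only narrows the window: the real command still runs its own discovery
afterwards.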

> 
> d) Is this normal at all?
> I have no experience with MCollective in a datacenter, so: Is this
> problem cloud/EC2 related or does it happen in
> non-cloud setups too? How could I debug what makes the discovery
> fail?

it shouldn't happen, but I've seen it happen when:

 - activemq is doing long full garbage collections
 - the network is interrupted after long periods of idle time
 - activemq has been idle for a long time and was swapped out etc
 - you have very busy machines that do not respond at all - unlikely in
   your case

there are probably other reasons too, but these are roughly the likely causes.
Amazon has a pretty aggressive idle connection timeout though, so you might
enable registration just to keep the stomp connections from being idle too long.
