Hi Mike,

Yes, I've got this behaviour in two Solaris zones in a large roll-out.
I'm on cfengine 3.0.4, Solaris 10. It hangs during network discovery and
just spins. I have yet to figure out what's going on with it. Similarly,
I can't see what's different about the network setup for these two
compared to the other ~300 servers.

I'm interested in solutions, or in suggestions for collecting more debug info.

Simon

From: help-cfengine-boun...@cfengine.org
[mailto:help-cfengine-boun...@cfengine.org] On Behalf Of Mike Svoboda
Sent: 20 November 2010 00:34
To: help-cfengine@cfengine.org
Subject: Cfengine 3.0.5p1 daemons spinning CPU to 100% on 1 host out of 800

I've deployed Cfengine 3.0.5p1 across 800 hosts.  I have an issue with
the Cfengine daemons on only one box, where it appears I am hitting a bug.
On this machine, cf-agent spins a single core to 100% user-space CPU
utilization.  Here are the details.


$ /var/cfengine/bin/cf-agent -v
....
...
f3
------------------------------------------------------------------------
cf3 # Extended system discovery is only available in version Nova and above
cf3 Additional hard class defined as: 32_bit
cf3 Additional hard class defined as: sunos_5_10
cf3 Additional hard class defined as: sunos_i86pc
cf3 Additional hard class defined as: sunos_i86pc_5_10
cf3 Additional hard class defined as: i386
cf3 Additional hard class defined as: i86pc
cf3 GNU autoconf class from compile time: compiled_on_solaris2_10
cf3 Address given by nameserver: 172.17.134.80
cf3 Interface 1: lo0
cf3 Interface 2: e1000g0
cf3 Adding alias loghost..
cf3  !! Cannot discover hardware IP, using DNS value
^C


So at the "Cannot discover hardware IP" point, it hangs and spins the
CPU to 100%.  The prstat -Lm output below shows the cf-agent thread
spending 100% of its time in user space:


$ prstat -Lm
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 16398 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.3   0 190   0   0 cf-agent/1


Putting cf-agent into super debug mode, I see this....

Broken host:
$ /var/cfengine/bin/cf-agent -ddd
....
....
GetVariable(sys,ipv4_1[172_17_134_80]) type=(to be determined)
IsExpandable(ipv4_1[172_17_134_80]) - syntax verify
Found 0 variables in (ipv4_1[172_17_134_80])
Looking for sys.ipv4_1[172_17_134_80]
Searching for scope context sys
Found scope reference sys
GetVariable(sys,ipv4_1[172_17_134_80]): using scope 'sys' for variable 'ipv4_1[172_17_134_80]'



At which point, cf-agent hangs.  Comparing this to a working host, this
is what I see.

Working host:
GetVariable(sys,ipv4_1[172_17_134_81]) type=(to be determined)
IsExpandable(ipv4_1[172_17_134_81]) - syntax verify
Found 0 variables in (ipv4_1[172_17_134_81])
Looking for sys.ipv4_1[172_17_134_81]
Searching for scope context sys
Found scope reference sys
GetVariable(sys,ipv4_1[172_17_134_81]): using scope 'sys' for variable 'ipv4_1[172_17_134_81]'
No such variable found sys.ipv4_1[172_17_134_81]
AddVariableHash(sys.ipv4_1[172_17_134_81]=172 (string) rtype=s)
Searching for scope context sys
Found scope reference sys
CopyRvalItem(s)
ScanScalar([172])
DeleteRvalItem(l)
DeleteRval NULL
DeleteRvalItem(l)
DeleteRval NULL
Added Variable ipv4_1[172_17_134_81] at hash address 60 in scope sys with value (omitted)
Trying to locate my IPv6 address
Unappending Trying to locate my IPv6 address
Unix_cf_popen(/sbin/ifconfig -a)
Unix_cf_pclose(pp)
cf_pwait - Waiting for process 12411
Looking for environment from cf-monitor...
Unappending Looking for environment from cf-monitor...
Searching for scope context mon
Found scope reference mon
No variable matched
NewScalar(mon,env_time,Sat Nov 20 00:28:23 2010)


So the broken host never gets to the "No such variable found
sys.ipv4_1[172_17_134_80]" statement.
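
The comparison above can be automated when the traces are long. The following is a minimal Python sketch, assuming both `cf-agent -ddd` traces have been captured as strings with any mail-wrapped lines rejoined. The function names and the `<ip>` placeholder are hypothetical and not part of Cfengine. It masks each host's own address so the two traces line up, then reports the step the working host performed right after the last line the hung host printed.

```python
def canon(line, ip):
    """Strip a trace line and mask the host's own IP (dotted or underscored)."""
    underscored = ip.replace(".", "_")
    return line.strip().replace(ip, "<ip>").replace(underscored, "<ip>")


def next_expected_step(broken_trace, working_trace, broken_ip, working_ip):
    """Return the working-host line that follows the hung host's last line.

    Returns None if the last line of the hung trace does not occur in the
    working trace, or if it is the final line there. The first occurrence
    is used, so a repeated line could give an ambiguous answer.
    """
    broken = [canon(l, broken_ip) for l in broken_trace.splitlines() if l.strip()]
    working = [canon(l, working_ip) for l in working_trace.splitlines() if l.strip()]
    if not broken:
        return None
    last = broken[-1]
    if last not in working:
        return None
    idx = working.index(last)
    return working[idx + 1] if idx + 1 < len(working) else None
```

Called with the two traces and the addresses 172.17.134.80 and 172.17.134.81, this should point at the "No such variable found" step, which narrows the hang to whatever runs between those two debug statements.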

So I know this is a problem with Cfengine parsing the network
interfaces.  The only thing is, I cannot see any difference at all
between the working and non-working machines.


Broken machine's ifconfig output:
$ ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 2
        inet 172.17.134.80 netmask ffffff00 broadcast 172.17.134.255
        groupname primary
        ether 0:14:4f:9e:cf:fe
e1000g0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
e1000g1: flags=69000842<BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 0 index 3
        inet 0.0.0.0 netmask 0
        groupname primary
        ether 0:14:4f:9e:cf:ff



Working machine's ifconfig output
$ ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 2
        inet 172.17.134.81 netmask ffffff00 broadcast 172.17.134.255
        groupname primary
        ether 0:14:4f:83:31:ac
e1000g0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
e1000g1: flags=69000842<BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 0 index 3
        inet 0.0.0.0 netmask 0
        groupname primary
        ether 0:14:4f:83:31:ad



So other than the inet address of e1000g0 and the ethernet addresses,
the output is exactly the same.  If I unplumb the interfaces e1000g0:1
and e1000g1 on the broken machine, the Cfengine daemons operate again.
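
The claim that the two outputs differ only in the inet and ether lines can be double-checked mechanically rather than by eye. Below is a minimal Python sketch, assuming each `ifconfig -a` output has been saved as a string with wrapped lines rejoined. The function name is hypothetical. It uses the standard-library `difflib` module and ignores trailing whitespace, which is easy to miss when comparing by hand.

```python
import difflib


def differing_lines(broken_output, working_output):
    """Return only the lines that differ between two ifconfig outputs.

    Lines removed from the first output are prefixed with "- " and lines
    added in the second with "+ ". Trailing whitespace is ignored.
    """
    a = [line.rstrip() for line in broken_output.splitlines()]
    b = [line.rstrip() for line in working_output.splitlines()]
    return [d for d in difflib.ndiff(a, b) if d.startswith(("- ", "+ "))]
```

If the result contains anything other than the e1000g0 inet line and the two ether lines, that extra line is worth a closer look. If it does not, the difference between the hosts is probably not visible in `ifconfig -a` at all.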


Has anyone run into this bug before, or can help suggest anything?

Thanks!
Mike



_______________________________________________
Help-cfengine mailing list
Help-cfengine@cfengine.org
https://cfengine.org/mailman/listinfo/help-cfengine
