Forum: CFEngine Help
Subject: cf-execd ignores splaytime on syntax errors / unleash the stampede
Author: msvob...@linkedin.com
Link to topic: https://cfengine.com/forum/read.php?3,24225,24225#msg-24225

Hey Cfengineers

I just filed the following bug, and it seems like a pretty major issue.  I'm 
not sure if this is a bug, or if this is the default behavior of how Cfengine 
is supposed to execute.  Either way, I figured I'd bring it to attention:
https://cfengine.com/bugtracker/view.php?id=884

At LinkedIn, we execute Cfengine via cron by firing cf-execd in foreground 
mode.  We then set splaytime very high (50 minutes).  The crontab entry is 
simple and looks like:

0 * * * * /var/cfengine/bin/cf-execd -F


This is our body executor control statement, which configures cf-execd.  Note 
the value of splaytime and the schedule.  cf-execd will always be fired in 
foreground mode at the Min00 class via cron, but we also enforce that in the 
schedule here.  If cf-execd is started for some reason at a time other than 
minute zero, it will wait until minute zero before starting its splaytime of 50m.


body executor control
{
# We don't have a schedule interval because cf-execd is executed from root's crontab hourly.
# Allow splaytime to be set at a high value so we spread the load out on the master policy
# server over the hour.
splaytime        =>     "50";
mailmaxlines     =>     "100";
smtpserver       =>     "localhost";
schedule         =>     { "Min00", };
executorfacility        =>      "LOG_DAEMON";

# This is the command that actually drives cf-execd to execute cf-agent on the schedule above.
exec_command     =>     "${sys.workdir}/bin/cf-agent -f failsafe.cf && ${sys.workdir}/bin/cf-agent";
}


This gives us two advantages:
1.  A single master policy server can support thousands of clients checking in 
over the hour.
2.  When memory leaks were an issue back in the 3.0 and 3.1 releases, we didn't 
have to worry about a long running cf-execd daemon.  Every execution was "fresh"


This has two major disadvantages:
1.  It takes us an hour to push a change through our infrastructure at the cost 
of allowing more clients to hit our MPS over that hour.
2.  The issue I'm about to describe below


So, we've been running Cfengine for several months.  We hit our first syntax 
error last Friday, and an issue arose that I didn't expect.  When cf-execd is 
started in foreground mode and a syntax error is present, it's detected by 
cf-promises.  This, in turn, immediately causes cf-execd to fire its 
exec_command with no splay at all (note "Sleeping for splaytime 0 seconds" 
near the end of the verbose output):


     2  cf3> Cfengine - autonomous configuration engine - commence 
self-diagnostic prelude
     3  cf3> 
------------------------------------------------------------------------
     4  cf3> Work directory is /var/cfengine
     5  cf3> Making sure that locks are private...
     6  cf3> Checking integrity of the state database
     7  cf3> Checking integrity of the module directory
     8  cf3> Checking integrity of the PKI directory
     9  cf3> Looking for a source of entropy in /var/cfengine/randseed
    10  cf3>  -> Loaded private key /var/cfengine/ppkeys/localhost.priv
    11  cf3>  -> Loaded public key /var/cfengine/ppkeys/localhost.pub
    12  cf3> Setting cfengine default port to 5308 = 5308
    13  cf3> Reference time set to Wed Dec  7 19:20:53 2011
    14  cf3> CFEngine Core 3.2.0
    15  cf3> 
------------------------------------------------------------------------
    16  cf3> Host name is: esv4-linux-test04.corp.linkedin.com
    17  cf3> Operating System Type is linux
    18  cf3> Operating System Release is 2.6.32-131.2.1.el6.x86_64
    19  cf3> Architecture = x86_64
    20  cf3> Using internal soft-class linux for host 
esv4-linux-test04.corp.linkedin.com
    21  cf3> The time is now Wed Dec  7 19:20:53 2011
    22  cf3> 
------------------------------------------------------------------------
    23  cf3> # Extended system discovery is only available in version Nova and 
above
    24  cf3> Additional hard class defined as: 64_bit
    25  cf3> Additional hard class defined as: linux_2_6_32_131_2_1_el6_x86_64
    26  cf3> Additional hard class defined as: linux_x86_64
    27  cf3> Additional hard class defined as: 
linux_x86_64_2_6_32_131_2_1_el6_x86_64
    28  cf3> GNU autoconf class from compile time: compiled_on_linux_gnu
    29  cf3> Address given by nameserver: 172.18.41.51
    30  cf3> Interface 1: lo
    31  cf3> Interface 2: bond0
    32  cf3> Adding alias esv4-linux-test04.corp.linkedin.com..
    33  cf3> Trying to locate my IPv6 address
    34  cf3> Found IPv6 address fe80::221:28ff:fea5:8c80
    35  cf3> Looking for environment from cf-monitord...
    36  cf3> Unable to detect environment from cf-monitord
    37  cf3> This appears to be a redhat (or redhat-based) system.
    38  cf3> Looking for redhat linux info in "Red Hat Enterprise Linux Server 
release 6.1 (Santiago)"
    39  cf3> ***********************************************************
    40  cf3>  Loading persistent classes
    41  cf3> ***********************************************************
    42  cf3>  Persistent class cfreport_executed for 34 more minutes
    43  cf3>  Adding persistent class cfreport_executed to heap
    44  cf3> ***********************************************************
    45  cf3>  Loaded persistent memory
    46  cf3> ***********************************************************
    47  cf3>  -> No policy server (hub) watch yet registered
    48  cf3>  >> Detected change in /var/cfengine/inputs
    49  cf3>  -> Quick search detected file changes
    50  cf3>  -> Input file is changed since last validation, validating it
    51  cf3>  -> Verifying the syntax of the inputs...
    52  cf3> Checking policy with command "/var/cfengine/bin/cf-promises -f 
"/var/cfengine/inputs/promises.cf""
    53  cf3> /var/cfengine/inputs/check_snmpd.cf:10,20: syntax error, near 
token 'rocommunity_string'
    54  cf3> /var/cfengine/inputs/check_ntp.cf:1,23: Something defined outside 
of a block or missing punctuation in 
...
...
.....
    63  cf3> /var/cfengine/inputs/check_ntp.cf:1,31: syntax error, near token 
'~'
    64  Fatal cfengine error: Too many errors
    65  cf3> cf-agent was not able to get confirmation of promises from 
cf-promises, so going to failsafe
    66  cf3>   > Parsing file /var/cfengine/inputs/failsafe.cf
    67  cf3> Initiate variable convergence...
    68  cf3>  -> Checking common class promises...
    69  cf3> Executing and using module 
    70  cf3> Module context: module_site_env
    71  cf3> Activated classes: CORP
    72  cf3> Module context: module_site_env
    73  cf3> Activated classes: ESV4
    74  cf3> Module context: module_site_env
    75  cf3>  ?> defining additional global class no_site_env_defined
    76  cf3>  ?> defining additional global class guppies
    77  cf3>   > Parsing file 
/var/cfengine/inputs_site_specific/failsafe_global.cf
    78  cf3> Initiate variable convergence...
    79  cf3>  -> Checking common class promises...
    80  cf3> Executing and using module 
    81  cf3> Module context: module_site_env
    82  cf3> Activated classes: CORP
    83  cf3> Module context: module_site_env
    84  cf3> Activated classes: ESV4
    85  cf3> Module context: module_site_env
    86  cf3>   > Parsing file /var/cfengine/inputs/update.cf
    87  cf3> Initiate variable convergence...
    88  cf3>  -> Checking common class promises...
    89  cf3> Executing and using module 
    90  cf3> Module context: module_site_env
    91  cf3> Activated classes: CORP
    92  cf3> Module context: module_site_env
    93  cf3> Activated classes: ESV4
    94  cf3> Module context: module_site_env
    95  cf3> Initiate variable convergence...
    96  cf3>  -> Checking common class promises...
    97  cf3> Executing and using module 
    98  cf3> Module context: module_site_env
    99  cf3> Activated classes: CORP
   100  cf3> Module context: module_site_env
   101  cf3> Activated classes: ESV4
   102  cf3> Module context: module_site_env
   103  cf3> # Knowledge map reporting feature is only available in version 
Nova and above
   104  cf3>  -> Defined classes = { 172_18_41_51 64_bit CORP Day7 December 
ESV4 Evening GMT_Hr19 Hr19 Hr19_Q2 Lcycle_1 Min20 Min20_25 
PK_MD5_a26205cfde5272e6ddb5114f811e0458 Q2 Wednesday Yr2011 any cfengine 
cfengine_3 cfengine_3_2 cfengine_3_2_0 cfreport_executed com community_edition 
compiled_on_linux_gnu corp_linkedin_com esv4_linux_test04 
esv4_linux_test04_corp_linkedin_com 
esv4_linux_test04_corp_linkedin_com_linkedin_com executor 
fe80__221_28ff_fea5_8c80 guppies ipv4_172 ipv4_172_18 ipv4_172_18_41 
ipv4_172_18_41_51 linkedin_com linux linux_2_6_32_131_2_1_el6_x86_64 
linux_x86_64 linux_x86_64_2_6_32_131_2_1_el6_x86_64 
linux_x86_64_2_6_32_131_2_1_el6_x86_64__1_SMP_Wed_May_18_07_07_37_EDT_2011 
net_iface_bond0 no_site_env_defined redhat redhat_6 redhat_6_1 redhat_s 
redhat_s_6 redhat_s_6_1 verbose_mode x86_64 }
   105  cf3>  -> Negated Classes = { }
   106  cf3> Executing and using module 
   107  cf3> Module context: module_site_env
   108  cf3> Activated classes: CORP
   109  cf3> Module context: module_site_env
   110  cf3> Activated classes: ESV4
   111  cf3> Module context: module_site_env
   112  cf3> Initiate variable convergence...
   113  cf3>  -> Checking common class promises...
   114  cf3> Executing and using module 
   115  cf3> Module context: module_site_env
   116  cf3> Activated classes: CORP
   117  cf3> Module context: module_site_env
   118  cf3> Activated classes: ESV4
   119  cf3> Module context: module_site_env
   120  cf3> ***********************************************************
   121  cf3>  Starting executor
   122  cf3> ***********************************************************
   123  cf3> Sleeping for splaytime 0 seconds
   124  cf3> ------------------------------------------------------------------
   125  cf3>   LocalExec(not scheduled) at Wed Dec  7 19:20:53 2011
   126  cf3> ------------------------------------------------------------------
   127  cf3>  -> Command => "/var/cfengine/bin/cf-agent" -f failsafe.cf && 
"/var/cfengine/bin/cf-agent" -Dfrom_cfexecd
   128  cf3>  -> Command is executing..."/var/cfengine/bin/cf-agent" -f 
failsafe.cf && "/var/cfengine/bin/cf-agent" -Dfrom_cfexecd
   129  cf3>  -> Command is complete
   130  cf3>  -> No output




The bad part is that when you have thousands of clients executing from cron 
and they all ignore splaytime, a single master policy server becomes 
overwhelmed trying to service all of those clients at once.
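
To make the scale concrete, here is a rough Python sketch of the difference. It assumes each client derives its delay from a hash of its hostname; the hash function, the host names, and the fleet size of 5000 are all made up for illustration, and this is not cf-execd's actual algorithm.

```python
import hashlib
from collections import Counter


def splay_offset_seconds(hostname: str, splay_minutes: int) -> int:
    """Deterministic per-host delay in [0, splay_minutes * 60) seconds."""
    if splay_minutes <= 0:
        return 0  # no splay: every host starts immediately
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (splay_minutes * 60)


def peak_arrivals_per_minute(hostnames, splay_minutes: int) -> int:
    """Largest number of hosts that start within the same minute."""
    per_minute = Counter(
        splay_offset_seconds(host, splay_minutes) // 60 for host in hostnames
    )
    return max(per_minute.values())


# Hypothetical fleet of 5000 clients.
hosts = [f"host{i:04d}.example.com" for i in range(5000)]

peak_with_splay = peak_arrivals_per_minute(hosts, 50)
peak_without_splay = peak_arrivals_per_minute(hosts, 0)
print("peak per minute with 50m splay:", peak_with_splay)
print("peak per minute with splay 0:  ", peak_without_splay)
```

With a splay of 0, every host lands in the same minute, so the peak equals the whole fleet. With a 50 minute splay, the same fleet is spread over 50 one-minute buckets.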

cf-serverd went into a tailspin.  


# truss -p 18622
 /1:    pollsys(0x08047390, 1, 0x08047410, 0x00000000)  = 1
 /1:    accept(5, 0x08047460, 0x08047454, SOV_DEFAULT)  Err#24 EMFILE
 /1:    pollsys(0x08047390, 1, 0x08047410, 0x00000000)  = 1
 /1:    accept(5, 0x08047460, 0x08047454, SOV_DEFAULT)  Err#24 EMFILE
 /1:    pollsys(0x08047390, 1, 0x08047410, 0x00000000)  = 1
 /1:    accept(5, 0x08047460, 0x08047454, SOV_DEFAULT)  Err#24 EMFILE
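
For anyone not used to reading truss output: Err#24 EMFILE on accept() means the process has hit its per-process limit on open file descriptors, so cf-serverd could not accept any new connections. A small Python sketch (not part of CFEngine; it inspects the Python process itself, purely as an illustration) shows the error number and how to read that limit:

```python
import errno
import os
import resource

# EMFILE is the "too many open files" error for a single process.
print(errno.EMFILE, os.strerror(errno.EMFILE))

# Soft and hard limits on open file descriptors for the current process.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft_limit, "hard limit:", hard_limit)
```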


Restarting cf-serverd recovered the daemon from the pain that happened at 
minute 0, but clients still weren't auto-recovering via execution of cf-execd.

I had to manually execute cf-agent -f failsafe.cf against thousands of 
machines before they were "recovered" and went back to their normal execution.  
Otherwise, when minute 0 came around again, they all slammed the master policy 
server at the same time.


So, I guess running out of cron probably isn't the most reliable way of 
executing.  I'm not sure that running as a long-running daemon would fix my 
problem, though.

Even if I adjusted my schedule to execute at 5 minute intervals with a 50m 
splaytime, not all clients are going to get the full 50m splay.  So, for a 
client that evaluates to a 10m splaytime, what would end up happening is:

minute 3  --> splaytime 10m --> execute at minute 13
minute 13 --> splaytime 10m --> execute at minute 23
minute 23 --> splaytime 10m --> execute at minute 33


Theoretically, if we set our schedule at 5 minute intervals and we have a 
syntax error, then clients are checking in staggered across their 5m intervals 
instead of all at minute 0.  That spreads the stampeding herd over twelve 
slots per hour, so each slot should see roughly 1/12 of the load we had when 
everything executed at minute 0.
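
As a back-of-the-envelope check of that 1/12 figure, here is a small Python sketch. It assumes the clients end up spread evenly across the schedule slots, which is an idealization, and the fleet size of 6000 is hypothetical.

```python
def clients_per_slot(total_clients: int, interval_minutes: int) -> float:
    """Average clients per schedule slot if they spread evenly over an hour."""
    slots_per_hour = 60 // interval_minutes
    return total_clients / slots_per_hour


total = 6000  # hypothetical fleet size

hourly = clients_per_slot(total, 60)      # one slot: everything at minute 0
five_minute = clients_per_slot(total, 5)  # twelve 5-minute slots
print(hourly, five_minute)
```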

I don't want this behavior.  I only want cf-agent to execute policies and pull 
policies from the master policy servers once an hour, but I don't see how to 
do this with the schedule at 5 minute intervals.

I guess I'm in a catch-22 here... I want to protect my master policy servers 
from being slammed all at once when splaytime is ignored, but I also want 
hourly execution.

Can anyone offer any suggestions?   Is splaytime designed to be ignored in the 
case of a syntax error?  Is this a bug in 3.2.0, or, is this by design?

Thanks
Mike

