Forum: CFEngine Help
Subject: cf-exced ignores splaytime on syntax errors / unleash the stampede
Author: msvob...@linkedin.com
Link to topic: https://cfengine.com/forum/read.php?3,24225,24225#msg-24225

Hey Cfengineers,

I just filed the following bug, and it seems like a pretty major issue. I'm not sure if this is a bug or if this is the default behavior of how Cfengine is supposed to execute. Either way, I figured I'd bring it to attention:

https://cfengine.com/bugtracker/view.php?id=884

At LinkedIn, we execute Cfengine via cron by firing cf-execd in foreground mode. We then set splaytime very high (50 minutes). The crontab entry is simple and looks like:

    0 * * * * /var/cfengine/bin/cf-execd -F

This is our body executor control statement, which configures cf-execd. Note the value of splaytime and the schedule. cf-execd will always be fired in foreground mode at the Min00 class via cron, but we also enforce that in the class schedule here. If cf-execd is started for some reason at a time other than minute zero, it will wait until minute zero before starting the splaytime of 50m.

    body executor control
    {
        # We don't have a schedule interval because cf-execd is executed from root's crontab hourly. Allow splaytime
        # to be set at a high value so we spread the load out on the master policy server over the hour.
        splaytime => "50";
        mailmaxlines => "100";
        smtpserver => "localhost";
        schedule => { "Min00", };
        executorfacility => "LOG_DAEMON";

        # This is the command that actually drives cf-execd to execute cf-agent on the schedule above.
        exec_command => "${sys.workdir}/bin/cf-agent -f failsafe.cf && ${sys.workdir}/bin/cf-agent";
    }

This gives us two advantages:

1. A single master policy server can support thousands of clients checking in over the hour.
2. When memory leaks were an issue back in the 3.0 and 3.1 releases, we didn't have to worry about a long-running cf-execd daemon. Every execution was "fresh".

This has two major disadvantages:

1. It takes us an hour to push a change through our infrastructure, at the cost of allowing more clients to hit our MPS over that hour.
2. The issue I'm about to describe below.

So, we've been running Cfengine for several months.
We hit our first syntax error last Friday, and an issue arose that I didn't expect. When cf-execd is started in foreground mode and a syntax error is present, it's detected by cf-promises. This, in turn, immediately causes cf-execd to fire its command statement.

cf3> Cfengine - autonomous configuration engine - commence self-diagnostic prelude
cf3> ------------------------------------------------------------------------
cf3> Work directory is /var/cfengine
cf3> Making sure that locks are private...
cf3> Checking integrity of the state database
cf3> Checking integrity of the module directory
cf3> Checking integrity of the PKI directory
cf3> Looking for a source of entropy in /var/cfengine/randseed
cf3> -> Loaded private key /var/cfengine/ppkeys/localhost.priv
cf3> -> Loaded public key /var/cfengine/ppkeys/localhost.pub
cf3> Setting cfengine default port to 5308 = 5308
cf3> Reference time set to Wed Dec 7 19:20:53 2011
cf3> CFEngine Core 3.2.0
cf3> ------------------------------------------------------------------------
cf3> Host name is: esv4-linux-test04.corp.linkedin.com
cf3> Operating System Type is linux
cf3> Operating System Release is 2.6.32-131.2.1.el6.x86_64
cf3> Architecture = x86_64
cf3> Using internal soft-class linux for host esv4-linux-test04.corp.linkedin.com
cf3> The time is now Wed Dec 7 19:20:53 2011
cf3> ------------------------------------------------------------------------
cf3> # Extended system discovery is only available in version Nova and above
cf3> Additional hard class defined as: 64_bit
cf3> Additional hard class defined as: linux_2_6_32_131_2_1_el6_x86_64
cf3> Additional hard class defined as: linux_x86_64
cf3> Additional hard class defined as: linux_x86_64_2_6_32_131_2_1_el6_x86_64
cf3> GNU autoconf class from compile time: compiled_on_linux_gnu
cf3> Address given by nameserver: 172.18.41.51
cf3> Interface 1: lo
cf3> Interface 2: bond0
cf3> Adding alias esv4-linux-test04.corp.linkedin.com..
cf3> Trying to locate my IPv6 address
cf3> Found IPv6 address fe80::221:28ff:fea5:8c80
cf3> Looking for environment from cf-monitord...
cf3> Unable to detect environment from cf-monitord
cf3> This appears to be a redhat (or redhat-based) system.
cf3> Looking for redhat linux info in "Red Hat Enterprise Linux Server release 6.1 (Santiago)"
cf3> ***********************************************************
cf3> Loading persistent classes
cf3> ***********************************************************
cf3> Persistent class cfreport_executed for 34 more minutes
cf3> Adding persistent class cfreport_executed to heap
cf3> ***********************************************************
cf3> Loaded persistent memory
cf3> ***********************************************************
cf3> -> No policy server (hub) watch yet registered
cf3> >> Detected change in /var/cfengine/inputs
cf3> -> Quick search detected file changes
cf3> -> Input file is changed since last validation, validating it
cf3> -> Verifying the syntax of the inputs...
cf3> Checking policy with command "/var/cfengine/bin/cf-promises -f "/var/cfengine/inputs/promises.cf""
cf3> /var/cfengine/inputs/check_snmpd.cf:10,20: syntax error, near token 'rocommunity_string'
cf3> /var/cfengine/inputs/check_ntp.cf:1,23: Something defined outside of a block or missing punctuation in ... ... .....
cf3> /var/cfengine/inputs/check_ntp.cf:1,31: syntax error, near token '~'
Fatal cfengine error: Too many errors
cf3> cf-agent was not able to get confirmation of promises from cf-promises, so going to failsafe
cf3> > Parsing file /var/cfengine/inputs/failsafe.cf
cf3> Initiate variable convergence...
cf3> -> Checking common class promises...
cf3> Executing and using module
cf3> Module context: module_site_env
cf3> Activated classes: CORP
cf3> Module context: module_site_env
cf3> Activated classes: ESV4
cf3> Module context: module_site_env
cf3> ?> defining additional global class no_site_env_defined
cf3> ?> defining additional global class guppies
cf3> > Parsing file /var/cfengine/inputs_site_specific/failsafe_global.cf
cf3> Initiate variable convergence...
cf3> -> Checking common class promises...
cf3> Executing and using module
cf3> Module context: module_site_env
cf3> Activated classes: CORP
cf3> Module context: module_site_env
cf3> Activated classes: ESV4
cf3> Module context: module_site_env
cf3> > Parsing file /var/cfengine/inputs/update.cf
cf3> Initiate variable convergence...
cf3> -> Checking common class promises...
cf3> Executing and using module
cf3> Module context: module_site_env
cf3> Activated classes: CORP
cf3> Module context: module_site_env
cf3> Activated classes: ESV4
cf3> Module context: module_site_env
cf3> Initiate variable convergence...
cf3> -> Checking common class promises...
cf3> Executing and using module
cf3> Module context: module_site_env
cf3> Activated classes: CORP
cf3> Module context: module_site_env
cf3> Activated classes: ESV4
cf3> Module context: module_site_env
cf3> # Knowledge map reporting feature is only available in version Nova and above
cf3> -> Defined classes = { 172_18_41_51 64_bit CORP Day7 December ESV4 Evening GMT_Hr19 Hr19 Hr19_Q2 Lcycle_1 Min20 Min20_25 PK_MD5_a26205cfde5272e6ddb5114f811e0458 Q2 Wednesday Yr2011 any cfengine cfengine_3 cfengine_3_2 cfengine_3_2_0 cfreport_executed com community_edition compiled_on_linux_gnu corp_linkedin_com esv4_linux_test04 esv4_linux_test04_corp_linkedin_com esv4_linux_test04_corp_linkedin_com_linkedin_com executor fe80__221_28ff_fea5_8c80 guppies ipv4_172 ipv4_172_18 ipv4_172_18_41 ipv4_172_18_41_51 linkedin_com linux linux_2_6_32_131_2_1_el6_x86_64 linux_x86_64 linux_x86_64_2_6_32_131_2_1_el6_x86_64 linux_x86_64_2_6_32_131_2_1_el6_x86_64__1_SMP_Wed_May_18_07_07_37_EDT_2011 net_iface_bond0 no_site_env_defined redhat redhat_6 redhat_6_1 redhat_s redhat_s_6 redhat_s_6_1 verbose_mode x86_64 }
cf3> -> Negated Classes = { }
cf3> Executing and using module
cf3> Module context: module_site_env
cf3> Activated classes: CORP
cf3> Module context: module_site_env
cf3> Activated classes: ESV4
cf3> Module context: module_site_env
cf3> Initiate variable convergence...
cf3> -> Checking common class promises...
cf3> Executing and using module
cf3> Module context: module_site_env
cf3> Activated classes: CORP
cf3> Module context: module_site_env
cf3> Activated classes: ESV4
cf3> Module context: module_site_env
cf3> ***********************************************************
cf3> Starting executor
cf3> ***********************************************************
cf3> Sleeping for splaytime 0 seconds
cf3> ------------------------------------------------------------------
cf3> LocalExec(not scheduled) at Wed Dec 7 19:20:53 2011
cf3> ------------------------------------------------------------------
cf3> -> Command => "/var/cfengine/bin/cf-agent" -f failsafe.cf && "/var/cfengine/bin/cf-agent" -Dfrom_cfexecd
cf3> -> Command is executing..."/var/cfengine/bin/cf-agent" -f failsafe.cf && "/var/cfengine/bin/cf-agent" -Dfrom_cfexecd
cf3> -> Command is complete
cf3> -> No output

So, the bad thing is that when you have thousands of clients executing from cron, and they all decide to ignore splaytime, a single master policy server becomes overwhelmed trying to service all those clients at once. cf-serverd went into a tailspin.

# truss -p 18622
/1: pollsys(0x08047390, 1, 0x08047410, 0x00000000) = 1
/1: accept(5, 0x08047460, 0x08047454, SOV_DEFAULT) Err#24 EMFILE
/1: pollsys(0x08047390, 1, 0x08047410, 0x00000000) = 1
/1: accept(5, 0x08047460, 0x08047454, SOV_DEFAULT) Err#24 EMFILE
/1: pollsys(0x08047390, 1, 0x08047410, 0x00000000) = 1
/1: accept(5, 0x08047460, 0x08047454, SOV_DEFAULT) Err#24 EMFILE

Restarting cf-serverd recovered the daemon from the pain that happened at minute 0, but clients still weren't auto-recovering via execution of cf-execd. I had to manually execute cf-agent -f failsafe.cf against thousands of machines before they were "recovered" and went back to their normal execution. Otherwise, when minute 0 came around again, they all slammed the master policy server at the same time.

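To make the size of the stampede concrete, here is a rough Python sketch of the load difference between a working 50-minute splay and the "Sleeping for splaytime 0 seconds" case from the log. It is only an illustration under stated assumptions: the host names are made up, the fleet size of 5000 is arbitrary, and SHA-256 of the host name stands in for whatever hash cf-execd actually uses to pick a per-host delay. The only property the sketch relies on is that each host gets a stable delay that is roughly uniform across the splay window.

```python
import hashlib
from collections import Counter


def splay_minutes(hostname: str, splaytime: int) -> int:
    """Return a stable per-host delay in [0, splaytime) minutes.

    Assumption: SHA-256 of the host name is a stand-in for the real
    cf-execd hash; it is used here only because it is stable and
    spreads hosts roughly evenly.
    """
    if splaytime <= 0:
        # Models the "Sleeping for splaytime 0 seconds" case: no delay at all.
        return 0
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % splaytime


def peak_checkins(hosts, splaytime):
    """Largest number of hosts that contact the policy server in the same minute."""
    per_minute = Counter(splay_minutes(h, splaytime) for h in hosts)
    return max(per_minute.values())


# Hypothetical fleet; names and size are illustrative only.
hosts = [f"host{i:04d}.example.com" for i in range(5000)]

normal_peak = peak_checkins(hosts, 50)   # splay honoured
stampede_peak = peak_checkins(hosts, 0)  # splay ignored

print("peak check-ins per minute with splaytime 50:", normal_peak)
print("peak check-ins per minute with splaytime 0: ", stampede_peak)
```

With the splay honoured, the busiest minute sees only a small fraction of the fleet. With the splay ignored, every host lands in the same minute, which is consistent with cf-serverd running out of file descriptors (EMFILE) in the truss output above.
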
So, I guess running out of cron probably isn't the most reliable way of executing. I'm not sure that running as a long-standing daemon would fix my problem, though. Even if I adjusted my schedule to execute at 5 minute intervals with a 50m splaytime, not all clients are going to have that splaytime of 50m. So, what will end up happening with a client that evaluates to a 10m splaytime would be:

    minute 3  --> splaytime 10m --> execute at minute 13
    minute 13 --> splaytime 10m --> execute at minute 23
    minute 23 --> splaytime 10m --> execute at minute 33

Theoretically, if we set our schedule at 5 minute intervals and we have a syntax error, then clients are checking in at their 5m interval, staggered, instead of all at minute 0. This gives us a 1/12 better distribution of the stampeding herd than what we had before executing at minute 0. I don't want this behavior. I only want cf-agent to execute policies and pull policies from the master policy servers once an hour, but I don't see how to do this with the schedule at 5 minute intervals.

I guess I'm in a catch-22 here... I want to protect my master policy servers from being slammed all at once when splaytime is ignored, but I also want hourly execution.

Can anyone offer any suggestions? Is splaytime designed to be ignored in the case of a syntax error? Is this a bug in 3.2.0, or is this by design?

Thanks
Mike

_______________________________________________
Help-cfengine mailing list
Help-cfengine@cfengine.org
https://cfengine.org/mailman/listinfo/help-cfengine