Forum: CFEngine Help
Subject: Integrate Cfengine in your environment / The Cfengine challenge throw down
Author: msvob...@linkedin.com
Link to topic: https://cfengine.com/forum/read.php?3,24968,24968#msg-24968

LinkedIn had the pleasure of hosting Mark Burgess last month at the BayLISA meeting at our headquarters in Mountain View. At the end of Mark's presentation, I asked what the best / most reliable way is to pull in "external information" from outside your Cfengine SVN/CVS/Git repository in a safe way. For all advanced Cfengine administrators, this issue has come up in your environment in one way or another. You have to pull in an external datasource outside of your source code repository and deliver it to clients. How can you do this reliably? How can we keep with Cfengine's methodology that if the network link is busted, no damage can happen to the client?

So, the use case that we have in this example is a technology that Yahoo! open sourced called range: https://github.com/yahoo/range

Range is a very simple HTTP lookup method that keeps metadata about machines. The server side is just a mod_range.so Apache module that you stick into httpd.conf. If you execute a range query for a host, you can return all sorts of metadata about it. This is very useful, because it allows machines to be categorized into groups (What machines are running as webservers? What JVMs are running on machineXYZ?)

We can already define global classes in Cfengine in promises.cf or local classes in any other policy -- but that's not the point. The business should define what systems belong in which classes. The Cfengine administrator should build policy. Once the Cfengine administrator is left to manually define classes within policy, you become a bottleneck. What if you had the ability to write a policy and use a class within that policy -- but the business could control which machines were associated with that class external to Cfengine? You end up with an extremely powerful configuration management system. You aren't the bottleneck, and you give your business a whole new flexibility.
The business can deploy an application to a machine, and Cfengine automatically responds in its policy because new classes have been set on the client automatically. Other business units can create classes of machines, and file a ticket for you to implement XYZ against the class they created. You are allowing your business to work the way it wants to.

So, range is an excellent example of an external datasource in our environment that I need to set classes from. This could be anything in your environment -- Zookeeper, Ning's Galaxy, or whatever else you use to manage applications on your systems. If you need to know what application / service is defined on specific machines, this information does not reside in your Cfengine repository. You have to go out and grab it. This leaves us with some problems:

1. What if the client is off the network?
2. What if the external source is unavailable? Load balancer down / intermediate network issues / etc.
3. I need to time out my request. My module can't hang forever, or cf-agent / cf-promises will just hang.
4. The classes I define have to be canonified.
5. I need persistence, which doesn't have a defined time. It could be hours / days / months before this machine gets communication restored to the range servers.

Here's my solution to this problem. I think it would be cool if Cfengine could implement something like this in the product itself, but that's what is awesome about the usemodule function: if you can define what you want to do programmatically, you can create whatever client-side behavior you want. In short, here's how it works:

1. I define my FQDN, and perform a lookup against it for all "range clusters". A range cluster is basically just a group that this machine belongs to.
2. I perform 2 additional lookups for any running applications on this machine. These are called "tag" lookups. You could create a tag for anything. CONTAINER and SERVICE are just the tags we used.
3. Canonify all of the above to happy Cfengine classes.
4. Dump the data to JSON format on the filesystem.
5. If I can't reach the range servers, or my request times out, read the JSON file on the filesystem and raise a class that we'll report on. This class basically means that I read from disk instead of performing a live query. This is class persistence for an unlimited amount of time. Hopefully administrators are watching the reports and someone figures out WTF is going on with the communication problems between the client and range servers.

So, this satisfies all of the above. Here's my code:

# cat -n module_define_range_classes.py
     1  #!/usr/bin/python2.6
     6
     7  import os
     8  import sys
     9  import signal
    10  import json
    11  import time
    12  import platform
    13  import subprocess
    14  import re
    15  from optparse import OptionParser
    16  import site
    17  site.addsitedir('/usr/local/linkedin/lib/python2.6/site-packages')
    18  import seco.range
    19
    20  class timeout_exception(Exception):
    21      pass
    22
    23  ####################################################################################################################
    24  def process_range_data(range_data_array, range_string):
    25      flush = 0
    26      # We need to canonify the strings so that we can set Cfengine classes on these.  We could probably also set global variables,
    27      # but right now we just need classes set.  Replace every period or dash with an underscore, then confirm that every
    28      # character in the result is alphanumeric or an underscore.
    29      # Include any other possible bad characters in the pattern below, and they will be replaced too.
    30      p = re.compile(r'[-.]')
    31      holder = ""
    32
    33      # Append the arrays, regardless of whether they are empty or not, since we are requiring the mapping of arrays to indexes.
    34      # If we make a successful query, then set a flag which will cause us to return a true value.  This determines if we should
    35      # flush our current results out to range_classes.conf or if we should still read from what is on disk.
    36      temp_array = []
    37      if range_data_array:
    38          for item in range_data_array:
    39              # range_string identifies if this is a range cluster, container, or service.
    40              holder = range_string + p.sub('_',item)
    41              # Make sure our item is now alphanumeric after we substituted underscores for the periods and dashes
    42              if re.match(r'^[A-Za-z0-9_]+$',holder):
    43                  temp_array.append(holder)
    44                  # Set the global class by printing a + sign with the name that we've verified is canonical
    45                  print "+" + holder
    46                  flush = 1
    47              else:
    48                  print "+invalid_range_data"
    49      range_classes.append(temp_array)
    50
    51      if flush:
    52          return 1
    53  ####################################################################################################################
    54  def execute_range_query():
    55      flush = 0
    56      range_clusters = []
    57      range_containers = []
    58      range_services = []
    59
    60      def timeout_handler(signum, frame):
    61          raise timeout_exception()
    62
    63      old_handler = signal.signal(signal.SIGALRM, timeout_handler)
    64      # set a 5 second alarm
    65      signal.alarm(5)
    66
    67      try:
    68          # Grab all range clusters.  This is the expensive query to run on the range servers, because they have to search all of their
    69          # maps for the clusters specific for this host (bottom up search.)  Once we have the clusters, we perform "tag" lookups which
    70          # is a cheap operation on the range infrastructure.
    71          range_object = seco.range.Range(options.url)
    72          try:
    73              range_clusters = range_object.expand('?' + fqdn)
    74          except seco.range.RangeException, e:
    75              if "NO_CLUSTER" in str(e)\
    76              or "NOCLUSTER" in str(e):
    77                  print "+no_range_clusters"
    78
    79          # We define range_clusters, range_containers, and range_services as local arrays because we append to a two dimensional array,
    80          # range_classes, which is global in scope in process_range_data.
    81          if range_clusters:
    82              for cluster in range_clusters:
    83                  try:
    84                      for container in range_object.expand('%{'+ cluster +'}:CONTAINER'):
    85                          range_containers.append(container)
    86                  except seco.range.RangeException, e:
    87                      if "NO_CLUSTER" in str(e)\
    88                      or "NOCLUSTER" in str(e):
    89                          print "+no_range_containers"
    90
    91                  try:
    92                      for service in range_object.expand('%{'+ cluster +'}:SERVICE'):
    93                          range_services.append(service)
    94                  except seco.range.RangeException, e:
    95                      if "NO_CLUSTER" in str(e)\
    96                      or "NOCLUSTER" in str(e):
    97                          print "+no_range_services"
    98
    99          # Now that we have the arrays which contain all of the clusters, containers, and services, process through them to make them
   100          # Cfengine-happy strings.  Flush the data out if we see any modifications.
   101          if process_range_data(range_clusters, "range_clusters_"):
   102              flush = 1
   103          if process_range_data(range_containers, "range_containers_"):
   104              flush = 1
   105          if process_range_data(range_services, "range_services_"):
   106              flush = 1
   107
   108      except timeout_exception:
   109          print "+invalid_range_data"
   110      finally:
   111          signal.alarm(0)  # cancel any pending alarm before restoring the previous handler
   112          signal.signal(signal.SIGALRM, old_handler)
   113
   114      if flush:
   115          return 1
   116      else:
   117          return 0
   118  ####################################################################################################################
   119  def print_previous_results():
   120      try:
   121          # Pop until the list is exhausted; pop() on an empty list raises IndexError, which ends the loop.
   122          # An empty array (for example, no services) must not stop us from printing the arrays that remain.
   123          while True:
   124              temp_array = previous_range_classes.pop()
   125              for item in temp_array:
   126                  print "+" + item
   127      except IndexError:
   128          pass
   129  ####################################################################################################################
   130  if __name__ == '__main__':
   131      """
   132      Query the range servers and set global classes within Cfengine based upon their output.  When complete, dump data into a JSON file.
   133      If for whatever reason we can't query the range servers, then read the range class data from this file instead of querying the range
   134      servers directly.  This allows for "persistent classes".
   142      """
   143      parser = OptionParser(usage ="usage: %prog ",
   144                            version ="%prog 1.0")
   145      parser.add_option("-v", "--verbose",
   146                        action = "store_true",
   147                        dest = "verbose",
   148                        default = False,
   149                        help = "Enable verbose execution")
   150      parser.add_option("-u", "--url",
   151                        action = "store",
   152                        dest = "url",
   153                        help = "Which URL to query against?  PROD/STG load balancers.  REQUIRED")
   154      parser.add_option("-f", "--file",
   155                        action = "store",
   156                        dest = "file",
   157                        help = "Which file should we read / write range classes to for persistence?  (In case the range servers are not answering)")
   158
   159      (options, args) = parser.parse_args()
   160
   161      if options.url is None:
   162          print "A URL is required to execute this script.  Exiting."
   163          sys.exit(1)
   164
   165      if options.file is None:
   166          options.file = "/etc/range_classes.conf"
   167
   168      previous_range_classes = []
   169      range_classes = []
   170      file_created = 0
   171      fqdn = ""
   172
   173      # The below statement sets a global class that we use to key off of in promises.cf so only one execution of the script occurs
   174      # per Cfengine execution.  Otherwise, we'd hit this script like 5 times and overload the range servers.
   175      print "+module_define_range_classes_executed"
   176
   177      if "linkedin.com" not in platform.node():
   178          fqdn = platform.node() + ".linkedin.com"
   179      else:
   180          fqdn = platform.node()
   181
   182      try:
   183          if os.path.exists(options.file):
   184              with open(options.file) as fh:
   185                  previous_range_classes = json.load(fh)
   186          else:
   187              file_created = 1
   188      except:
   189          file_created = 1
   190
   191      # execute_range_query() returns 1 ("flush") if we should flush our results out to disk.  If not, then just set classes based off of
   192      # what we found in range_classes.conf (our way of making persistent classes if the range servers don't respond)
   193      if execute_range_query() or file_created:
   194          try:
   195              with open(options.file, mode="w") as fh:
   196                  json.dump(range_classes, fh, sort_keys=True, indent=2)
   197          except:
   198              print "We tried to dump data to the JSON files, but for whatever reason, we couldn't.  Sorry"
   199      else:
   200          # We didn't successfully poll the range servers, so, loop through and print our previous range classes found in range_classes.conf
   201          # There is no need to dump data back into a JSON file since we didn't read anything new.
   202          print_previous_results()

So, when we execute this script, this is what we get...

$ /var/cfengine/modules/module_define_range_classes.py -u range.servers.url.linkedin.com
+module_define_range_classes_executed
+range_clusters_alpha_agent_1
+range_clusters_alpha_fuse_usagecontrol_1
+range_clusters_alpha_genie_services_1
+range_clusters_alpha_languagepack_1
+range_clusters_alpha_liar_life_1
+range_clusters_alpha_profile_services_1
+range_clusters_alpha_tether_1
+range_containers_agent
+range_containers_fuse_usagecontrol
+range_containers_genie_services
+range_containers_languagepack
+range_containers_liar_life
+range_containers_profile_services
+range_containers_tether
+range_services_agent
+range_services_fuse_usagecontrol
+range_services_genie_services
+range_services_language_pack_cs_CZ
+range_services_language_pack_da_DK
+range_services_language_pack_de_DE
+range_services_language_pack_en_US
+range_services_language_pack_es_ES
+range_services_language_pack_fr_FR
+range_services_language_pack_in_ID
+range_services_language_pack_it_IT
+range_services_language_pack_ja_JP
+range_services_language_pack_ko_KR
+range_services_language_pack_ms_MY
+range_services_language_pack_nl_NL
+range_services_language_pack_no_NO
+range_services_language_pack_pl_PL
+range_services_language_pack_pt_BR
+range_services_language_pack_ro_RO
+range_services_language_pack_ru_RU
+range_services_language_pack_sv_SE
+range_services_language_pack_tr_TR
+range_services_liar_life
+range_services_profile_services
+range_services_tether

All of this data is dumped in JSON format into /etc/range_classes.conf. We read from this file if our network path to the range servers is busted. This allows the clients to continue to set these classes until range returns data to set them otherwise.

$ cat /etc/range_classes.conf
[
  [
    "range_clusters_alpha_agent_1",
    "range_clusters_alpha_fuse_usagecontrol_1",
    "range_clusters_alpha_genie_services_1",
    "range_clusters_alpha_languagepack_1",
    "range_clusters_alpha_liar_life_1",
    "range_clusters_alpha_profile_services_1",
    "range_clusters_alpha_tether_1"
  ],
  [
    "range_containers_agent",
    "range_containers_fuse_usagecontrol",
    "range_containers_genie_services",
    "range_containers_languagepack",
    "range_containers_liar_life",
    "range_containers_profile_services",
    "range_containers_tether"
  ],
  [
    "range_services_agent",
    "range_services_fuse_usagecontrol",
    "range_services_genie_services",
    "range_services_language_pack_cs_CZ",
    "range_services_language_pack_da_DK",
    "range_services_language_pack_de_DE",
    "range_services_language_pack_en_US",
    "range_services_language_pack_es_ES",
    "range_services_language_pack_fr_FR",
    "range_services_language_pack_in_ID",
    "range_services_language_pack_it_IT",
    "range_services_language_pack_ja_JP",
    "range_services_language_pack_ko_KR",
    "range_services_language_pack_ms_MY",
    "range_services_language_pack_nl_NL",
    "range_services_language_pack_no_NO",
    "range_services_language_pack_pl_PL",
    "range_services_language_pack_pt_BR",
    "range_services_language_pack_ro_RO",
    "range_services_language_pack_ru_RU",
    "range_services_language_pack_sv_SE",
    "range_services_language_pack_tr_TR",
    "range_services_liar_life",
    "range_services_profile_services",
    "range_services_tether"
  ]
]

Here, I'll introduce a time.sleep(10) at line 71 just to simulate the range servers not responding:

$ time /var/cfengine/modules/module_define_range_classes.py -u range.servers.url.linkedin.com
+module_define_range_classes_executed
+invalid_range_data
+range_services_agent
+range_services_fuse_usagecontrol
+range_services_genie_services
+range_services_language_pack_cs_CZ
+range_services_language_pack_da_DK
+range_services_language_pack_de_DE
+range_services_language_pack_en_US
+range_services_language_pack_es_ES
+range_services_language_pack_fr_FR
+range_services_language_pack_in_ID
+range_services_language_pack_it_IT
+range_services_language_pack_ja_JP
+range_services_language_pack_ko_KR
+range_services_language_pack_ms_MY
+range_services_language_pack_nl_NL
+range_services_language_pack_no_NO
+range_services_language_pack_pl_PL
+range_services_language_pack_pt_BR
+range_services_language_pack_ro_RO
+range_services_language_pack_ru_RU
+range_services_language_pack_sv_SE
+range_services_language_pack_tr_TR
+range_services_liar_life
+range_services_profile_services
+range_services_tether
+range_containers_agent
+range_containers_fuse_usagecontrol
+range_containers_genie_services
+range_containers_languagepack
+range_containers_liar_life
+range_containers_profile_services
+range_containers_tether
+range_clusters_alpha_agent_1
+range_clusters_alpha_fuse_usagecontrol_1
+range_clusters_alpha_genie_services_1
+range_clusters_alpha_languagepack_1
+range_clusters_alpha_liar_life_1
+range_clusters_alpha_profile_services_1
+range_clusters_alpha_tether_1

real    0m5.098s
user    0m0.070s
sys     0m0.031s

So, this behaved exactly like we were expecting. Once we passed the 5 second timeout, we read from the JSON file to set the classes instead of performing the live query. We also raised the global class invalid_range_data, so we will report that we've read from JSON.

The only other takeaway from this script is at line 175. We raise a class, module_define_range_classes_executed, so we only execute this script a single time.
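
Before getting to how that class is used, here is the timeout-and-fallback behaviour from the run above boiled down to a minimal sketch. This is not the module itself: it is written for Python 3 rather than 2.6, and the names classes_with_fallback, QueryTimeout, and the live_query callable are hypothetical stand-ins for the range lookups. It assumes a Unix host (it relies on SIGALRM) and that it runs in the main thread.

```python
import json
import signal


class QueryTimeout(Exception):
    """Raised by the alarm handler when the live lookup takes too long."""


def _on_alarm(signum, frame):
    raise QueryTimeout()


def classes_with_fallback(live_query, cache_path, timeout_seconds=5):
    """Return (classes, from_cache).

    live_query is a hypothetical zero-argument callable that returns a list
    of already-canonified class names. If it times out, raises OSError, or
    returns nothing, the last list written to cache_path is returned instead.
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_seconds)
    try:
        classes = live_query()
    except (QueryTimeout, OSError):
        classes = None
    finally:
        # Cancel the alarm first, then restore whatever handler was there.
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)

    if classes:
        # Live data wins and refreshes the on-disk copy.
        with open(cache_path, "w") as fh:
            json.dump(classes, fh, sort_keys=True, indent=2)
        return classes, False

    try:
        with open(cache_path) as fh:
            return json.load(fh), True
    except (OSError, ValueError):
        # No usable cache either: define no classes rather than guess.
        return [], True
```

When live_query raises OSError or sleeps past the timeout, the function returns whatever was last written to the cache file together with True, which is the cue to raise a class like invalid_range_data for reporting.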
When cf-agent runs through all of the modules in promises.cf, it executes the modules something like 5-6 times, which in turn would end up hammering our range servers. So in promises.cf, I call this script using the below code:

!module_define_range_classes_executed::
    "discover_range_crud" expression => usemodule("module_define_range_classes.py -u range.servers.url.linkedin.com", "");

I hope this helps someone else trying to figure out how to pull external datasources / information into Cfengine client execution.

If you are a Cfengine guru / expert, then please consider sharing some of your code / policies. I'm not sharing this for my well being. I'm sharing this with you because I want (I need) you to share what you are doing within your organization. I need automation ideas. I want to expand what we're doing at LinkedIn, and I need your help doing so. This list isn't just for the n00bs asking basic Cfengine questions. I want to learn something from you, and it'll help the n00bs too.

Share what you're automating and amaze me. If I end up implementing some uber cool automation idea you've shared, I'll send you a cookie (seriously, it'll be one of those big chocolate chip cookies that you see at the mall.)

Thanks
Mike