As an update on the nature of the error on the Chef side: it appears my nodes entered the problem state a while ago as the result of a chef-client run failure while we were debugging network performance issues with the Intel 10G ixgbe driver packaged with Ubuntu 12.04.1. During those tests we took the network interfaces down on these nodes, though I'm not convinced that alone would cause the following chain of errors.
In the /var/log/crowbar-join-chef.log on both of the nodes currently in the problem state, we see the information from the original provisioning date in April 2013 and then no information (as would be expected) until the dates where the problem state manifests. This seems to be where the missing /etc/chef/client.pem originated.

What follows is my best interpretation of the sequence that introduced the error state. It relates to a failed batch of chef-client runs caused by a 500 Internal Server Error on the Chef server. The sequence ends with the client reporting that it no longer has an /etc/chef/client.pem and therefore trying to re-register itself with the Chef validator key:

[Thu, 19 Dec 2013 02:50:29 -0600] ERROR: Server returned error for http://172.16.128.101:4000/nodes/da0-36-9f-04-83-40.os.uabgrid.uab.edu, retrying 5/5 in 50s
[Thu, 19 Dec 2013 02:51:21 -0600] INFO: HTTP Request Returned 500 Internal Server Error: Connection refused - connect(2)

================================================================================
Recipe Compile Error in /var/cache/chef/cookbooks/provisioner/recipes/base.rb
================================================================================

Net::HTTPFatalError
-------------------
500 "Internal Server Error"

Cookbook Trace:
---------------
  /var/cache/chef/cookbooks/provisioner/recipes/base.rb:56:in `from_file'

Relevant File Content:
----------------------
/var/cache/chef/cookbooks/provisioner/recipes/base.rb:

 49:  search(:node, "roles:provisioner-server AND provisioner_config_environment:#{node[:provisioner][:config][:environment]}") do |n|
 50:    pkey = n["crowbar"]["ssh"]["root_pub_key"] rescue nil
 51:    if !pkey.nil? and pkey != node["crowbar"]["ssh"]["access_keys"][n.name]
 52:      node["crowbar"]["ssh"]["access_keys"][n.name] = pkey
 53:      node_modified = true
 54:    end
 55:  end
 56>> node.save if node_modified
 57:
 58:  template "/root/.ssh/authorized_keys" do
 59:    owner "root"
 60:    group "root"
 61:    mode "0700"
 62:    action :create
 63:    source "authorized_keys.erb"
 64:    variables(:keys => node["crowbar"]["ssh"]["access_keys"])

[Thu, 19 Dec 2013 02:51:21 -0600] ERROR: Running exception handlers
[Thu, 19 Dec 2013 02:51:21 -0600] FATAL: Saving node information to /var/cache/chef/failed-run-data.json
[Thu, 19 Dec 2013 02:51:21 -0600] ERROR: Exception handlers complete
[Thu, 19 Dec 2013 02:51:21 -0600] FATAL: Stacktrace dumped to /var/cache/chef/chef-stacktrace.out
[Thu, 19 Dec 2013 02:51:21 -0600] FATAL: Net::HTTPFatalError: 500 "Internal Server Error"

--------

2013-12-19 02:53:12 -0600: Running chef-client
[Thu, 19 Dec 2013 02:53:12 -0600] INFO: *** Chef 10.14.4 ***
[Thu, 19 Dec 2013 02:53:13 -0600] INFO: Client key /etc/chef/client.pem is not present - registering
[Thu, 19 Dec 2013 02:53:13 -0600] INFO: HTTP Request Returned 409 Conflict: Client already exists
[Thu, 19 Dec 2013 02:53:13 -0600] INFO: HTTP Request Returned 403 Forbidden: You are not allowed to take this action.

================================================================================
Chef encountered an error attempting to create the client "da0-36-9f-04-83-40.os.uabgrid.uab.edu"
================================================================================

Authorization Error:
--------------------
Your validation client is not authorized to create the client for this node (HTTP 403).

From what I can tell, the script that could potentially delete /etc/chef/client.pem is cookbooks/provisioner/templates/default/crowbar_join.ubuntu.sh.erb.
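For what it's worth, my reading of the 409/403 pair above is that the node's old client object still exists on the Chef server, so the validation key is not allowed to create a replacement client. Below is a minimal sketch, only my understanding and not a verified procedure, of how that conflict could be cleared from a machine where knife has admin credentials; the node name is simply the one from the log above:

    # Sketch only - assumes knife is configured with admin credentials,
    # e.g. on the Chef/Crowbar admin node.
    NODE=da0-36-9f-04-83-40.os.uabgrid.uab.edu

    # Confirm the stale client object is what is blocking registration.
    knife client show "$NODE"

    # Option 1: regenerate the client's key pair on the server side and
    # copy the new private key to /etc/chef/client.pem on the node.
    knife client reregister "$NODE" > /tmp/"$NODE"-client.pem

    # Option 2: delete the client object so the validation key can
    # recreate it on the node's next chef-client run.
    #   knife client delete "$NODE"

The point either way is that removing /etc/chef/client.pem on the node only helps if the server-side client object is regenerated or removed as well; otherwise the validator hits the 409/403 shown in the log.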
Coming back to crowbar_join.ubuntu.sh.erb: that script appears to have been triggered at some point in the above sequence. The snippet of interest, which appears to trigger the "problem" state, is:

    # Only transition to problem state if the second run fails.
    echo "Running Chef Client (pass 2)"
    if ! log_to chef chef-client ; then
      log_to ifup ifup -a
      post_state $HOSTNAME "recovering"
      echo "Error Path"
      echo "Syncing Time (pass 3)"
      sync_time
      echo "Removing Chef Cache"
      rm -rf /var/cache/chef/*
      echo "Checking Install Integrity"
      log_to apt /usr/bin/apt-get -q --force-yes -y install
      echo "Running Chef Client (pass 3) - apt/cache cleanup"
      if ! log_to chef chef-client ; then
        log_to ifup ifup -a
        echo "Error Path"
        echo "Syncing Time (pass 4)"
        sync_time
        echo "Removing Chef Cache"
        rm -rf /var/cache/chef/*
        echo "Checking Install Integrity"
        log_to apt /usr/bin/apt-get -q --force-yes -y install
        echo "Checking Keys"
        rm -f /etc/chef/client.pem
        post_state $HOSTNAME "hardware-updated"
        echo "Running Chef Client (pass 4) - password cleanup"
        if ! log_to chef chef-client ; then
          log_to ifup ifup -a
          echo "chef-client run failed four times, giving up."
          echo "Failed"
          printf "Our IP address is: %s\n" "$(ip addr show)"
          final_state="problem"
        fi
      fi
    fi

This points to the cause of the "problem" state and suggests how my nodes could have lost their Chef client.pem files (note the rm -f /etc/chef/client.pem in the "Checking Keys" step before pass 4). I'm still investigating how to resolve this error. I did successfully create a new client.pem, and Chef doesn't think this node is a duplicate, at least insofar as the clients in the problem state are able to connect to the Chef server successfully using their client.pem credential.

There are some differences between chef-client runs on these nodes and on similar nodes in the fabric that are in the ready state: the nodes in the problem state are executing fewer recipes than the ready nodes. I'm tracing down those differences to see if there is something simple that might be set to clear the problem state (a rough sketch of the comparison I'm doing is at the end of this message, after the quoted replies).

Any thoughts or suggestions on how to proceed from this point are appreciated. Is it possible to recover from a client credential recreate? It's odd that the original credential could be destroyed by a recipe action.

Thanks for the insights so far,

~jpr

On 05/04/2014 06:19 PM, John-Paul Robinson wrote:
> What I missed is that both of the nodes that are reporting the Problem
> state in Crowbar have the state set to "problem" in the json data stream
> coming from Chef.
>
> So at least this points upstream to Chef state and not everything being
> happy there.
>
> ~jpr
>
> On 05/04/2014 11:54 AM, John-Paul Robinson wrote:
>> When I look at the data that Crowbar is pulling out of Chef on the TCP
>> channel, the json object sent to Crowbar from Chef includes the correct
>> uptime value for both nodes. From what I've understood from the code,
>> missing uptime data is a trigger for the Problem state.
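Here is the comparison sketch mentioned above, showing roughly how I'm diffing a ready node against a problem node with knife. It is only a sketch: it assumes admin knife credentials on the admin node, the "ready" node name is a hypothetical placeholder, and the state attribute path is a guess on my part.

    # Sketch only - run where knife has admin credentials.
    READY=ready-node.example.com                   # hypothetical - substitute a node in the ready state
    PROBLEM=da0-36-9f-04-83-40.os.uabgrid.uab.edu  # the problem node from the log above

    # Compare run lists - the problem nodes are executing fewer recipes.
    diff <(knife node show "$READY" -r) <(knife node show "$PROBLEM" -r)

    # Compare the uptime attribute mentioned in the earlier messages.
    knife node show "$READY" -a uptime
    knife node show "$PROBLEM" -a uptime

    # State attribute (path is a guess - the value Crowbar reads as
    # "problem" may live elsewhere in the node object).
    knife node show "$PROBLEM" -a state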
_______________________________________________
Crowbar mailing list
Crowbar@dell.com
https://lists.us.dell.com/mailman/listinfo/crowbar
For more information: http://crowbar.github.com/