As an update on the nature of the error on the Chef side:

It appears my nodes entered the problem state a while ago as a result of
a chef-client run failure during our debug of network performance issues
with the Intel 10G ixgbe driver that was packaged with Ubuntu 12.04.1. 
During these tests we took our network interfaces down on these nodes,
though I'm not convinced that alone would cause the following chain of
errors.

In the /var/log/crowbar-join-chef.log on both of the nodes currently in
the problem state we see the information from the original provisioning
date in April 2013 and then no information (as would be expected) until
the dates where we see the problem state manifest.  This seems to be
where the missing /etc/chef/client.pem originated. 

What follows is my best interpretation of the sequence that introduced
the error state.  It relates to a failed batch of chef-client runs due to
a 500 internal server error on the chef server.  The sequence ends in a
report that the client no longer has a /etc/chef/client.pem and
therefore tries to re-register itself with the chef validator.pem.

    [Thu, 19 Dec 2013 02:50:29 -0600] ERROR: Server returned error for
    http://172.16.128.101:4000/nodes/da0-36-9f-04-83-40.os.uabgrid.uab.edu,
    retrying 5/5 in 50s
    [Thu, 19 Dec 2013 02:51:21 -0600] INFO: HTTP Request Returned 500
    Internal Server Error: Connection refused - connect(2)

    
    ================================================================================
    Recipe Compile Error in
    /var/cache/chef/cookbooks/provisioner/recipes/base.rb
    ================================================================================

    Net::HTTPFatalError
    -------------------
    500 "Internal Server Error"

    Cookbook Trace:
    ---------------
      /var/cache/chef/cookbooks/provisioner/recipes/base.rb:56:in `from_file'

    Relevant File Content:
    ----------------------
    /var/cache/chef/cookbooks/provisioner/recipes/base.rb:

     49:  search(:node, "roles:provisioner-server AND provisioner_config_environment:#{node[:provisioner][:config][:environment]}") do |n|
     50:    pkey = n["crowbar"]["ssh"]["root_pub_key"] rescue nil
     51:    if !pkey.nil? and pkey != node["crowbar"]["ssh"]["access_keys"][n.name]
     52:      node["crowbar"]["ssh"]["access_keys"][n.name] = pkey
     53:      node_modified = true
     54:    end
     55:  end
     56>> node.save if node_modified
     57: 
     58:  template "/root/.ssh/authorized_keys" do
     59:    owner "root"
     60:    group "root"
     61:    mode "0700"
     62:    action :create
     63:    source "authorized_keys.erb"
     64:    variables(:keys => node["crowbar"]["ssh"]["access_keys"])

    [Thu, 19 Dec 2013 02:51:21 -0600] ERROR: Running exception handlers
    [Thu, 19 Dec 2013 02:51:21 -0600] FATAL: Saving node information to
    /var/cache/chef/failed-run-data.json
    [Thu, 19 Dec 2013 02:51:21 -0600] ERROR: Exception handlers complete
    [Thu, 19 Dec 2013 02:51:21 -0600] FATAL: Stacktrace dumped to
    /var/cache/chef/chef-stacktrace.out
    [Thu, 19 Dec 2013 02:51:21 -0600] FATAL: Net::HTTPFatalError: 500
    "Internal Server Error"


    --------

    2013-12-19 02:53:12 -0600: Running chef-client
    [Thu, 19 Dec 2013 02:53:12 -0600] INFO: *** Chef 10.14.4 ***
    [Thu, 19 Dec 2013 02:53:13 -0600] INFO: Client key
    /etc/chef/client.pem is not present - registering
    [Thu, 19 Dec 2013 02:53:13 -0600] INFO: HTTP Request Returned 409
    Conflict: Client already exists
    [Thu, 19 Dec 2013 02:53:13 -0600] INFO: HTTP Request Returned 403
    Forbidden: You are not allowed to take this action.

    
    ================================================================================
    Chef encountered an error attempting to create the client
    "da0-36-9f-04-83-40.os.uabgrid.uab.edu"
    ================================================================================

    Authorization Error:
    --------------------
    Your validation client is not authorized to create the client for
    this node (HTTP 403).

From what I can tell, the recipe that could potentially delete
/etc/chef/client.pem is
cookbooks/provisioner/templates/default/crowbar_join.ubuntu.sh.erb.  So
this recipe appears to have been triggered at some point in the above
sequence.

The snippet of interest from this recipe that appears to trigger the
"problem" state is:

    # Only transition to problem state if the second run fails.
    echo "Running Chef Client (pass 2)"
    if ! log_to chef chef-client ; then
        log_to ifup ifup -a
        post_state $HOSTNAME "recovering"
        echo "Error Path"
        echo "Syncing Time (pass 3)"
        sync_time
        echo "Removing Chef Cache"
        rm -rf /var/cache/chef/*
        echo "Checking Install Integrity"
        log_to apt /usr/bin/apt-get -q --force-yes -y install
        echo "Running Chef Client (pass 3) - apt/cache cleanup"
        if ! log_to chef chef-client ; then
            log_to ifup ifup -a
            echo "Error Path"
            echo "Syncing Time (pass 4)"
            sync_time
            echo "Removing Chef Cache"
            rm -rf /var/cache/chef/*
            echo "Checking Install Integrity"
            log_to apt /usr/bin/apt-get -q --force-yes -y install
            echo "Checking Keys"
            rm -f /etc/chef/client.pem
            post_state $HOSTNAME "hardware-updated"
            echo "Running Chef Client (pass 4) - password cleanup"
            if ! log_to chef chef-client ; then
                log_to ifup ifup -a
                echo "chef-client run failed four times, giving up."
                echo "Failed"
                printf "Our IP address is: %s\n" "$(ip addr show)"
                final_state="problem"
            fi
        fi
    fi


This really just points to the cause of the "problem" state and suggests
how my nodes could have lost their chef client.pem files.
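
To check other nodes for the same symptom, something like the following
could be run against each node's join log.  This is only a sketch: the
function name find_key_loss is my own invention, and it assumes the log
records the "is not present - registering" message on a single line, as
in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Crowbar): list every chef-client run in a
# join log that started without a client key and so tried to re-register.
# Usage: find_key_loss /var/log/crowbar-join-chef.log
find_key_loss() {
    log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    # -F matches the message literally; -n prints the line number so the
    # surrounding run can be inspected by hand.
    grep -nF 'client.pem is not present - registering' "$log_file"
}
```

A non-zero exit status with no output means the log holds no
re-registration attempt, so the key was probably never removed on that
node.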

I'm still investigating how to resolve this error.  I did successfully
create a new client.pem, and Chef doesn't think that this node is a
duplicate, at least insofar as the chef clients with the problem state
are able to successfully connect to the chef server using their
client.pem credential.

There are some differences between chef-client runs on similar nodes in
the fabric that are in the ready state. The nodes in the problem state 
are executing fewer recipes than those in the ready state.  I'm tracing
down those differences to see if there is something simple that might be
set to clear the problem state.
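
One way to make that comparison mechanical, assuming the recipes each
node executed have already been extracted into plain text files with one
recipe name per line (the file names and the missing_recipes function
are hypothetical, not anything Crowbar provides):

```shell
#!/bin/sh
# Hypothetical helper: print recipes that appear in the ready node's list
# but not in the problem node's list.
# Usage: missing_recipes ready-node.txt problem-node.txt
missing_recipes() {
    ready_list="$1"
    problem_list="$2"
    ready_sorted=$(mktemp) || return 1
    problem_sorted=$(mktemp) || return 1
    # comm requires sorted input; -u also drops repeated recipe names.
    sort -u "$ready_list" > "$ready_sorted"
    sort -u "$problem_list" > "$problem_sorted"
    # -23 suppresses lines unique to the second file and lines common to both,
    # leaving only what the ready node has and the problem node lacks.
    comm -23 "$ready_sorted" "$problem_sorted"
    rm -f "$ready_sorted" "$problem_sorted"
}
```

Swapping the two arguments shows the reverse: anything the problem node
runs that the ready node does not.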

Any thoughts or suggestions on how to proceed from this point are
appreciated.  Is it possible to recover from a client credential
re-creation?  It's odd that the original credential could be destroyed by a
recipe action.

Thanks for the insights so far,

~jpr

On 05/04/2014 06:19 PM, John-Paul Robinson wrote:
> What I missed is that both of the nodes that are reporting the Problem
> state in Crowbar have the state set to "problem" in the json data stream
> coming  from Chef.
>
> So at least this points upstream to Chef state and not everything being
> happy there.
>
> ~jpr
>
> On 05/04/2014 11:54 AM, John-Paul Robinson wrote:
>> When I look at the data that Crowbar is pulling out of Chef on the TCP
>> channel the json object sent to Crowbar from Chef includes the correct
>> uptime value for both nodes.  From what I've understood from the code,
>> missing uptime data is a trigger for the Problem state.
> _______________________________________________
> Crowbar mailing list
> Crowbar@dell.com
> https://lists.us.dell.com/mailman/listinfo/crowbar
> For more information: http://crowbar.github.com/
