I've got two nodes reporting the "Problem" state in the Node view of the Crowbar UI, with a blinking red LED in their group summary.
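For reference, here is roughly how I've been cross-checking what Chef itself holds for the affected nodes, and how I regenerated the client key (a sketch, assuming `knife` is configured on the admin node; the node/client name is hypothetical):

```shell
# Show the uptime attribute Chef has for one of the affected nodes
# (this is the value Crowbar should be picking up).
knife node show node1.example.com -a uptime

# Regenerate the client's private key; the new key then has to be
# copied to /etc/chef/client.pem on the node itself.
knife client reregister node1.example.com -f client.pem
```

If `knife node show` reports a sane uptime while Crowbar still shows "Unavailable", that would support the cached/stale-state theory rather than a genuine node problem.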
Both nodes report their uptime as "Unavailable" in their node view, and I think this points to the source of the problem. No, not that the machines are down, but that Crowbar isn't happy about their state. ;) Both nodes appear normal in the Chef node view and in fact display their correct uptimes there. They are also functioning normally by all other indicators (Nagios and Ganglia included). When I look at the data Crowbar pulls out of Chef on the TCP channel, the JSON object Chef sends to Crowbar includes the correct uptime value for both nodes. From what I've understood of the code, missing uptime data is a trigger for the Problem state.

As background, a few weeks ago I regenerated the Chef client.pem for these two nodes to try to fix this same Problem state. At the time, the chef-client wasn't able to authenticate to the Chef server; for some reason both nodes had lost their client.pem. That caused the two nodes to be reported as unreachable in the Chef console, and I thought that was why they were in the Problem state in Crowbar. After regenerating the client.pem, Chef returned to normal operation, as noted above, but Crowbar continues to report them in the Problem state.

I've restarted Chef and Crowbar (and the full admin node bluepill stack) in an attempt to get Crowbar to see the "happy" state from Chef. Either there really is still some sort of problem, or the problem is that Crowbar won't refresh its state from the data it gets from Chef.

Is it possible for node state to get out of sync between Chef and Crowbar? Is there a way to tell Crowbar to ignore such cached state and start clean from Chef?

Thanks for any pointers,

~jpr

_______________________________________________
Crowbar mailing list
Crowbar@dell.com
https://lists.us.dell.com/mailman/listinfo/crowbar
For more information: http://crowbar.github.com/