During the session today, I mentioned there's a proposal on the table, but so far it exists mostly in some threads and some heads. This is an attempt to capture some of the mountain of items that make up attribute injection ("AttribInj"). You've been warned - this is long.
(Note to the reader: the CB2.0 references are WIP, and not intended to be presented as a done deal, even if the language occasionally reads that way.)

What's the problem:

Chef is designed to be an eventually consistent system - clients will eventually get to the desired state. This typically works great. There are two elements that make Chef eventually consistent:

* Failures. Chef deals with these by having idempotent resources. Consider a scenario where some resources succeed in changing the system state, and others fail. Since chef-client stops on the first failure and reports it, all resources past the failure will not have been applied. However, on the next attempt, the already-successful resources are idempotent (they won't attempt to redo their work), so they e.g. won't needlessly restart services or copy over existing files. The failed resource is attempted again. Once all the run-list items have had their resources applied successfully, the node has converged.

* Search. Behind the scenes, Chef uses Solr, a full-text search engine. This is in general a Good Thing, since it provides a very rich query capability. However, the full-text index can take quite a while to reflect changes in nodes (deploying a large cluster dedicated to search might improve that... but we're not Google). Recipes that depend on query results for correct operation might work incorrectly, because partial (or no) results are returned while the index is still being updated.

In general, eventual consistency is not a bad thing. But for an orchestration engine it is. Here's an example (different from the one in the meeting, since it's a bit simpler): when deploying Nova, it needs to identify the MySQL server and the Keystone server. In a search-based approach, the recipes perform a search for the nodes with the appropriate roles. However, if the Solr index has yet to update, the searches might return 0 results. If the index lags for long enough across repeated chef-client attempts, Crowbar might decide that the deployment failed. This is aggravated by Crowbar's ability to queue up multiple changes to the environment and process the queue as quickly as possible. This leads to a situation where MySQL is deployed, followed quickly by Keystone, and once Keystone is deployed, the Nova deployment is started. The speed of the runs, together with the large batches of changes in the cluster, almost guarantees that the Solr index has not kept up.

What's the problem, part (b):

The OpenStack deployment cookbooks/recipes developed by the Dell Crowbar team have been termed "crowbarized" by some folks. Here's a quick example of why: Chef natively assumes that a node has a single IP address, which matches what you'd get from EC2, for example. Crowbar supports a much richer set of networking capabilities that you might encounter when using real hardware (e.g. multi-homed, bonded, VLANs, bridges). The OpenStack components' recipes use this network richness by calling the appropriate libraries to get the 'right' networking info - e.g. the admin network IP address vs. the storage, public or nova_fixed one, depending on context. While this sounds great, it does present a challenge - access to this information is through a Crowbar library injected into Chef, and that library looks up information in node data structures that are maintained by Crowbar... making the recipes somewhat tied to Crowbar.
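To make both parts of the problem concrete, here is a minimal sketch of the two patterns described above. The fragments are illustrative only - the role name, attribute keys, and the exact shape of the Crowbar library call are my assumptions, not lifted from the actual cookbooks:

    # Part (a): search-based discovery, subject to Solr index lag.
    # If the index hasn't caught up, this can return an empty result set
    # even though a mysql-server node has already been deployed.
    mysql_nodes = search(:node, "roles:mysql-server")
    raise "mysql-server not found (index lag?)" if mysql_nodes.empty?
    mysql_address = mysql_nodes.first["ipaddress"]

    # Part (b): Crowbar-specific network lookup (approximate, CB1.0-era
    # barclamp library call). Correct on real, multi-homed hardware, but it
    # only works when the Crowbar library and node data are present.
    admin_address = Barclamp::Inventory.get_network_by_type(node, "admin").address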
The Solution(?):

Attribute injection attempts to offer an approach that can solve, or at least alleviate, these problems. Here's the general idea, then how it helps, and then some more details.

In general, each recipe will define a set of attributes that are required for it to run successfully. These attributes provide the information that either search or Crowbar would have provided. The attributes can be populated in different ways, depending on the environment in which the recipes are used:

a) With pure Chef - via the default attributes in the cookbook, Chef's environments facility, or roles.
b) In Crowbar - Crowbar will compute and inject override values for these attributes at runtime, via Chef APIs.
c) In other deployment systems - in a manner like a) or b), or telepathically.

How does this help? For part (a) of the problem - we can strive to avoid searches in recipes that go against the slow Solr index, and instead search against the Crowbar DB, which is boring old consistent SQL. This is somewhat natural, since Crowbar assigns roles and configuration data to nodes based on information from this DB. For part (b) - recipes/cookbooks now effectively have an API that can be discussed. Different deployment systems can satisfy this API - i.e. provide values for these attributes - in their own way, without requiring changes to the underlying deployment code in the recipes.

The nitty-gritty details (heavily based on Crowbar 1.0, with some tweaks):

Within Crowbar there are 3 types of attributes that can be tied to a node:

1. Node-specific discovered information - this includes ohai plugin attributes [2], and additional ones that Crowbar writes to the node (e.g. what services are running on the node, what RAID controllers are present, etc.).
2. Node-specific configuration - e.g. IP address, hostname, BIOS configuration.
3. Configuration for a given barclamp's deployment - e.g. for NTP, what's the upstream server.

Important but slightly different is:

4. A node's run-list - a per-node list of roles and recipes that the chef-client execution will attempt to converge on the node. In a typical Chef deployment, this list is managed by users (or their tooling). Crowbar manages this list based on its state machine and deployment process.

All these values are stored in the Chef server, and acted upon by chef-client.

Access for reading: when a chef-client executes, all the values from above are "flattened" into the node's attributes, and are accessible as the "node hash". For more details, see [3].

Update lifecycle: the attributes in 1. are handled by chef-client - whenever a chef-client run completes, they are pushed to the Chef server, overwriting what was there before. Attribute types 2., 3. and 4. are managed by Crowbar, and updated based on either user actions or internal state transitions within Crowbar. As may be obvious, these can't be stored on the node itself - there exists a race condition between Crowbar updating the values and a chef-client attempting to run (or at least there was the potential for one in Crowbar 1.0). So, rather than being stored on the node, these attributes are represented in Chef as roles - the node-specific configuration (including the run-list) lives in a role that applies to a single node, while the barclamp configuration role is assigned to all the nodes that are included in the configuration.
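As a concrete illustration of the attributes-as-API pattern, here is a sketch (all names hypothetical) of a cookbook shipping conservative defaults that a recipe consumes, and of Crowbar injecting override values through a role it manages via the Chef API:

    # cookbooks/nova/attributes/default.rb - defaults any deployment can override
    default["nova"]["db"]["host"] = "localhost"
    default["nova"]["keystone"]["host"] = "localhost"

    # cookbooks/nova/recipes/config.rb - the recipe only reads node attributes;
    # it doesn't care whether they came from cookbook defaults, an environment,
    # a role, or a Crowbar-managed override.
    template "/etc/nova/nova.conf" do
      source "nova.conf.erb"
      variables(
        :db_host       => node["nova"]["db"]["host"],
        :keystone_host => node["nova"]["keystone"]["host"]
      )
    end

    # Crowbar side (sketch, run in a context with Chef configured, e.g. via
    # knife exec): compute the real values from the Crowbar DB and push them
    # as override attributes on the role Crowbar maintains for the node or
    # deployment. mysql_address/keystone_address come from the Crowbar DB.
    role = Chef::Role.load("nova-config-default")
    role.override_attributes(
      "nova" => {
        "db"       => { "host" => mysql_address },
        "keystone" => { "host" => keystone_address }
      }
    )
    role.save

Because override attributes take precedence over cookbook defaults [1], the same recipe works unchanged with or without Crowbar.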
Changes for CB2.0:

Much of the above is the same as CB1.0. What is different?

1) Default attributes as API, based on patterns agreed/discussed with the wider OpenStack community (or pilfered from other folks who've open-sourced their cookbooks). A follow-up email will describe the patterns we discussed during the Jan/22 hack day [4] and with the general community.

2) Search result injection - there are a few options here:

a. Postpone this, since we're short on time, and things do eventually work - just retry forever if needed, but be explicit about reporting status (and offer an option to cancel). Keep recipes using search where it is currently used.

b. The Jig/Job orchestration engine allows barclamps to define custom "Jobs" that are executed as the proposal is rolled out (see below).

Some background about jobs, and how search can be handled (since there wasn't a whole lot of ML activity around jobs): when deploying a snapshot, there are many required steps - from allocating the nodes, to deploying each role in the right element order to the subset of nodes on which it applies, then deploying the next set of roles. A simple example might be Nova. When a user commits a snapshot for deployment:

* The nodes that are used in the proposal are automatically allocated, if they're not already.
* The nova controller is deployed first.
* Once the controller is deployed (which also creates the MySQL schema), the nova-compute and nova-volume nodes (if any) are deployed.

Each one of these steps is a job (one per node x step), and together the jobs form a DAG (persisted in the DB). A job can be Crowbar-framework-provided logic (e.g. allocate node), or barclamp-specific logic. When events occur in the system (a node transitions state, a chef-client run finishes), the set of pending jobs is evaluated - all the jobs whose prerequisites are met are 'performed' (the 'perform' method is executed on them).

Back to search. There are 2 cases for search, via (partial) examples:

a) Nova - Nova needs to locate the Keystone server. This is an inter-barclamp dependency, and affects not just search, but the order of deployment as well (the Nova proposal won't be deployed until the referenced Keystone one is).

b) Swift - when running chef-client on the ring-compute node, there's a need to find all the disks on all the storage nodes that have already been deployed. This is an intra-barclamp search and dependency - the deployment of the ring-compute role will only start once the storage nodes are deployed.

Note the connection between search and dependencies ;) (jobs represent dependencies). Both cases can be handled very similarly, the only differences being the source of the dependency information and the granularity of the dependency.

In case a), the Nova barclamp (taken from the CB1.0 NovaService < ServiceObject) implements something like:

    def proposal_dependencies(role)
      answer = []
      answer << { "barclamp" => "mysql",
                  "inst" => role.default_attributes["nova"]["db"]["mysql_instance"] }
      answer << { "barclamp" => "keystone",
                  "inst" => role.default_attributes["nova"]["keystone_instance"] }
      answer << { "barclamp" => "glance",
                  "inst" => role.default_attributes["nova"]["glance_instance"] }
      answer
    end

The above declares that the proposal depends on the respective instances of MySQL, Keystone and Glance (instance == deployment name in CB2.0 speak). The Jig engine will ensure (via a JobWaitForSnapshotDeployment injected into the DAG) that before nova-controller starts deployment, these dependencies have been met. The job identifies (via the type/key fields) the deployment it's waiting on. It's marked as done when the deployment of the whole snapshot is completed, triggering the re-evaluation of the Nova jobs.
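To make the job mechanism a bit more concrete, here is a rough sketch of what a JobWaitForSnapshotDeployment could look like. Only the job name comes from the proposal above; the class shape, the Event struct and the ready?/perform split are assumptions for illustration, not the actual Jig code:

    # Sketch only - in the real engine, jobs would be persisted in the DB
    # and linked into the deployment DAG.
    Event = Struct.new(:type, :barclamp, :instance)

    class JobWaitForSnapshotDeployment
      def initialize(barclamp, instance)
        @barclamp = barclamp   # e.g. "keystone"
        @instance = instance   # the deployment ("inst") name from proposal_dependencies
      end

      # Re-evaluated whenever an event occurs (e.g. a snapshot deployment
      # completes); the job is ready once the referenced deployment is applied.
      def ready?(event)
        event.type == :snapshot_deployed &&
          event.barclamp == @barclamp &&
          event.instance == @instance
      end

      def perform
        # Nothing to do here - this job exists purely to gate its dependents
        # (e.g. the nova-controller deployment jobs) in the DAG.
      end
    end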
In case b) (Swift storage nodes), the source of the dependency is the run order defined by the Swift barclamp (bc-template-swift.json). The dependency in this case is that all the nodes to which the storage-node role has been assigned have finished their deployment of that role (i.e. just a single role from within the barclamp, not the whole snapshot). The job for this is JobWaitForRoleOnNode. When a chef-client run completes, all pending jobs waiting for a role are evaluated, and those that match are marked as done.

Ok, your eyes are bleeding and my fingers ache. Hope this is somewhat sane, and is at least a good enough start for the convo to follow.

[1] Attribute precedence: http://docs.opscode.com/essentials_cookbook_attribute_files_attribute_precedence.html
[2] Ohai attributes: http://docs.opscode.com/ohai.html
[3] Computing attributes: http://docs.opscode.com/breaking_changes_chef_11.html#computing-attributes-from-attributes
[4] Chef for OpenStack Hack Day - Boston: http://www.eventbrite.com/event/4395344594