Now that CB 2.0 is capable of bootstrapping itself and of partially bootstrapping other nodes into Sledgehammer, I figured it would be a good time to provide a brain dump going over the main components of Crowbar and how they interact with each other. Feedback and questions welcome!
There are 3 basic objects that everything else in Crowbar 2.0 relies on -- node objects, roles, and noderole objects.

Node objects encapsulate machine-specific state -- they have unique programmatically generated names (which must also be the machine's FQDN in DNS), track whether Crowbar is allowed to manage the machine, and track whether the machine is alive and reachable. In the CB 2.0 framework, nodes are the things that a jig performs operations on, as directed by a role through a noderole.

Roles are the primary unit of functionality in CB 2.0 -- they provide the code that the jigs use to effect change on the nodes in accordance with the desired state stored in the noderole graph. Roles form a dependency graph of their own (each role must declare which other roles it depends on), each role must declare the jig that will be used to do things to a node, and roles can have flags that affect how the annealer will handle certain aspects of building the noderole graph and initial node bootstrap.

Noderoles represent a binding of a role to a node. Each noderole tracks any state that needs to be communicated from the user or the Crowbar framework to a node (and vice versa), and the overall noderole graph determines both the order in which roles are enacted on nodes and which attributes are visible from other noderoles when a noderole runs.

On top of those 3 basic object types, we have 3 more that are used to help keep cluster administrators from dying of information overload when staring at a noderole graph with 10,000 edges. These are deployments, deployment roles, and snapshots.

A deployment is an administratively convenient logical grouping of nodes along with a set of default role configurations (the deployment roles) relevant to whatever workload is being run in the deployment. Every deployment has a parent deployment except for the system deployment, which Crowbar manages and which is where all newly-discovered nodes wind up.
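To make those relationships concrete, here is an illustrative sketch of the three core object types as plain Ruby structs. The field names here are my own shorthand, not the actual Crowbar 2.0 schema -- the real versions are Rails models with considerably more state:

```ruby
# Illustrative stand-ins for the three core Crowbar 2.0 objects.
Node = Struct.new(:name, :admin, :alive, :available)
Role = Struct.new(:name, :jig, :depends_on, :flags)
NodeRole = Struct.new(:node, :role, :state) # binds one role to one node

# A hypothetical "ntp-client" role implemented via the chef jig, bound to
# a freshly-discovered (non-admin) node.
ntp  = Role.new("ntp-client", "chef", [], [])
node = Node.new("d52-54-00-12-34-56.example.com", false, true, true)
binding = NodeRole.new(node, ntp, :proposed)
```

The noderole carries the per-node, per-role state; the node and role objects themselves stay reusable across many bindings.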
A snapshot consists of a collection of node-role bindings created in a particular deployment. A deployment has a linear history of snapshots, and each snapshot can be proposed (in which case the user can edit noderole attributes and state, and the annealer will ignore them), committed (where user edits to the noderoles are not allowed, and the annealer can use them), or archived (where the noderoles are neither editable nor visible to the annealer).

Roles, in a little more detail:

Roles have a lot to do. Their dependency graph is used as a template to build the noderole DAG, they need to provide their jig with all the code and data it will need to effect the changes that the role wants on a node, they need to ensure that the noderole graph is built properly, and in some cases they need to track state that should not be represented directly in the noderole graph. Most roles should not need any state outside of the state stored in the noderole graph, but there are some (primarily those provided by the network, dns, and provisioner barclamps) that need to maintain a significant amount of state outside the noderole graph, or that need to be able to react to noderole and node state transitions. To give them a formal method of doing so, you can override the base Role model with one that responds to several events that happen in the noderole and node lifecycles.

The Two Rules for Events:

1: Events run synchronously, so they must be fast. If your event takes more than a few milliseconds to run, or you want to do something on a remote machine, you should make it a role of its own and bind it to that node as a noderole instead.
2: Events must be idempotent. If the work you were going to do has already been done, don't do it again.
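As a hypothetical illustration of an event hook that obeys both rules, here is roughly what a barclamp-provided Role override might look like. The class names mirror the pattern described below, but the base class and the hook body are stand-ins, not the real Crowbar code -- in particular, the ||= guard is what makes the hook idempotent, and no remote work happens inside it:

```ruby
# Hypothetical sketch only: the real base Role is a Rails model, so a
# plain class stands in for it here.
class Role
  def on_proposed(noderole); end # base class: events default to no-ops
end

module BarclampNetwork
  class Role < ::Role
    # Runs synchronously when a noderole enters PROPOSED, so it must be
    # fast (Rule 1) and idempotent (Rule 2) -- the ||= guard means a
    # second invocation changes nothing.
    def on_proposed(noderole)
      noderole[:allocated_ip] ||= "10.10.10.1/24" # illustrative allocation
    end
  end
end

nr = { state: :proposed }
BarclampNetwork::Role.new.on_proposed(nr)
```

Running the hook a second time on the same noderole leaves the allocation untouched, which is exactly the behavior Rule 2 demands.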
How To Respond to Events:

The base Role model has a mechanism for letting Rails dynamically subclass it if there is an appropriately named model in the Rails engine that the barclamp provides -- you can provide an override for a specific role by declaring class BarclampFoo::RoleName < Role, and you can provide a general Role override for your barclamp by declaring class BarclampFoo::Role < Role.

Noderole Events:

You can provide event hooks on your roles that work with noderoles at 6 points in their lifecycle: on_proposed, on_todo, on_blocked, on_transition, on_error, and on_active. Each method will be called with the noderole that just completed its state transition, after the noderole has transitioned to the state and all of its child noderoles have had their state updated accordingly. For an example, the network barclamp provides an override to on_proposed that automatically allocates an IP address when a role for a network is bound to a node: https://github.com/crowbar/barclamp-network/blob/master/crowbar_engine/barclamp_network/app/models/barclamp_network/role.rb

Node Events:

You can provide event hooks on your roles that work with nodes at 2 points in their lifecycle for now: on_node_create will be called after the node is created and the default set of noderoles has been bound to it, and on_node_delete will be called just before the node is destroyed.

Right now, roles have 4 flags that the Crowbar framework knows how to handle:

1: Discovery, which means that this role will be automatically bound to all freshly-created non-admin nodes if the role's jig is active.
2: Bootstrap, which means that this role will be automatically bound to all freshly-created admin nodes. This flag is primarily used by the Crowbar framework to bootstrap the initial Crowbar admin node into existence.
3: Implicit, which signals that this role can be implicitly created and bound to a node as part of the dependency resolution process, and that it must be bound to the same node as the role that depends on it.
4: Library, which is not used by anything right now and may be removed.

Role dependency rules:

Each role must declare which other roles it directly depends on, and those dependencies are not allowed to be cyclic -- a role cannot directly or indirectly depend on itself. A role should not declare a dependency on a role it only indirectly depends on, as that makes the dependency graph needlessly complicated. A role depends on another role if that other role must be deployed somewhere in the cluster before the current role.

How the noderole graph is built:

Right now, all noderoles are ultimately added to the noderole graph via the add_to_node_in_snapshot function on role objects. You pass it a node and a snapshot, and it either creates a noderole bound to an appropriate place in the graph or dies with an exception. In detail:

1: Verify that the jig that implements the role is active.
2: Create a deployment role for the deployment that the snapshot is a part of, or do nothing if one was already created.
3: Check to see if this role has already been bound to this node. If it has, return that noderole.
4: For each of our parent roles:
   1: Check to see if there is a noderole binding for the parent role on the same node. If there is, save it on a list and move on to the next parent.
   2: Check to see if there is a noderole binding anywhere for the parent role. If there is, save it on a list and move on to the next parent.
   3: Check to see if the parent is an implicit role. If it is, call add_to_node_in_snapshot on the parent role with our node and snapshot, save the returned noderole on the list, and move on to the next parent.
   4: Throw an exception.
5: Create a new noderole binding this role to the requested node in the snapshot, and create parent/child relationships between the new noderole and the parents we found. The noderole will be created in the PROPOSED state.
6: Call the on_proposed event hook for this role with the new noderole.
7: Return the new noderole to the caller.

This function will need to grow more ornate when we want to start supporting more than just the system deployment -- right now it does not respect deployment-level scoping. Adding that is a fairly straightforward extension to the tests in step 4. This function is also arguably one of the more critical pieces of code in the Crowbar framework -- it determines the shape and connectedness of the noderole graph, and hence it plays a large part in determining whether what we are deploying makes sense.

What is in a noderole:

1: Pointers to its parents and children in the noderole graph.
2: The state of the noderole.
3: A blob of JSON that the user can edit. This blob is seeded from the deployment role data, which in turn is seeded from the role template.
4: A blob of data that the Crowbar framework can edit. This is used by the roles to pass system-generated data to the jigs, and is usually seeded by one of the noderole events.
5: A blob of data that we get back at the end of a jig run.

What happens in CB 2.0 to create the admin node:

All nodes have roughly the same creation process:

1: An API request comes in with the requested name of the new node and a flag that indicates whether it is an admin node.
2: The requested name is checked to make sure it is a valid FQDN in the cluster's administrative DNS domain and that it is unique. If either check fails, the request fails; otherwise we create the node object.
3: We get all of the bootstrap roles, solve their dependencies to create a list of roles sorted in dependency order, and add them to the freshly-created node in the current committed snapshot of the system deployment.
4: Once all the noderoles are added, the system will automatically recommit the snapshot. After that, the annealer takes over to bootstrap Crowbar.

The NodeRole state machine, the framework-driven parts:

All noderoles start in the PROPOSED state, and they stay there until the snapshot they are a part of is committed. From PROPOSED, a noderole can go to TODO (if the noderole has no parents or all its parents are ACTIVE) or BLOCKED (if it has any non-ACTIVE parents). From BLOCKED, a noderole can go to TODO when all of its parents are ACTIVE.

The annealer looks for noderoles in TODO that meet the following conditions:

1: The jig that is associated with the noderole via the role half of the binding is active.
2: The snapshot that the noderole belongs to is COMMITTED.
3: The node that the noderole binds to is alive and available.
4: There is no noderole on that node that is in TRANSITION.

It takes all the noderoles that meet those conditions, sets them to TRANSITION, and kicks off a delayed job that will wind up setting each noderole to either ACTIVE or ERROR. When a noderole is set to ACTIVE, it sets each of its BLOCKED children to TODO if the rest of that child's parents are ACTIVE. When a noderole is set to ERROR, it transitions all of its children to BLOCKED if they were not already blocked.

How we determine what information is visible to a node during a jig run:

Right now, we use the dumbest method possible that still obeys scoping rules. We deep-merge all the JSON blobs from all the ACTIVE noderoles on this node, deep-merge that with all the JSON blobs from the noderoles and deployment roles that are parents of the current noderole (starting from the most distant set and working in to the closest), and then deep-merge that with the JSON blobs from the current noderole. That gets handed to the jig, which does its jiggy thing with it and whatever scripts/cookbooks/modules/whatever, and we get a blob of JSON back.
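A minimal sketch of that deep merge, assuming later (closer-scoped) sources win on conflicting scalar keys -- the real Crowbar merge semantics may differ in detail:

```ruby
# Recursively merge two attribute hashes: nested hashes merge key-by-key,
# anything else is overwritten by the later source.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Merge order per the text above: the node's ACTIVE noderole blobs, then
# parent noderole/deployment role blobs (most distant first), then the
# current noderole's own blobs. The data below is purely illustrative.
active_on_node = { "dns" => { "servers" => ["10.0.0.1"] }, "domain" => "a.com" }
parent_blobs   = { "dns" => { "search" => ["a.com"] } }
current_blob   = { "domain" => "b.com" }
attribs = [active_on_node, parent_blobs, current_blob].reduce { |a, b| deep_merge(a, b) }
```

In this toy run, the nested "dns" hashes combine rather than clobber each other, while the current noderole's "domain" wins over the node-level default.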
We deep diff that blob with the blob we sent to the jig, and that is what winds up on the noderole's wall.

--
Victor Lowther
Dell CloudEdge Solutions
Continuous Integration and Build Automation Czar
_______________________________________________
Crowbar mailing list
Crowbar@dell.com
https://lists.us.dell.com/mailman/listinfo/crowbar
For more information: http://crowbar.github.com/