* Dana Quinn <dqu...@gmail.com> [2014-05-19 10:29:43-0700]:
> For those of you sharing these tales of woe (I have some similar ones
> I could share) - can you share what you discussed in your post mortems
> as to protect against these issues in the future?

Yes. There are a few techniques that are more or less considered
best practice at this point for protecting against config management
amplification mistakes (one bad change pushed automatically to every host):

    - Use a version control tool as a backend for the config management
      repo. This should go without saying: if you don't have a proper
      VCS backend, stop everything and get one now! (git/mercurial/svn
      are all fine choices.)

    - Use a validation testing suite. Every commit you make to your
      config management tool's VCS repo should kick off these tests. I'm
      talking about things like sanity checks on the repo itself (e.g.
      bcfg2-lint, ansible-playbook --syntax-check, cf-promises),
      syntax checking and validation of XML/YAML files, passwd file
      validity, nagios -v, and so on.

      Testing is a sore subject for some people and balance is needed
      here. The best analogy I've heard is that testing is like armor --
      too much of it and you won't be able to move at all, which makes
      it useless. Too little and it's the same as not having anything.
      Ideally you want your testing suite to catch all the silly
      mistakes you would otherwise have made.

      Most people tend to use the "hooks" subsystem of their VCS for
      this; that's fine, but I think a good CI tool (Jenkins, Bamboo,
      etc.) is better suited to the task. Either way, something is
      better than nothing. (There's a small sketch of such a check
      script after this list.)

    - Use a code review tool. The testing suite above will catch syntax
      errors ("Hey, you're missing a closing tag in this XML file."), but
      it can never catch the other class of mistakes that humans make
      ("Hey, why are you pushing this feature to production? It hasn't
      been tested properly yet!")

      Code review is basically having other humans look at your change
      and having them say, "Yeah, this looks okay to me."

      There are a bunch of tools that do this; GitHub's pull requests
      are probably the most popular, but there's also Gerrit (open
      source) and similar tools.

      I can tell you without hesitation that using a code review tool is
      THE SINGLE GREATEST THING you can do to protect against these
      mistakes, bar none. Trust me on that.

Just a note here: the combination of a testing suite plus a code review
tool will catch something like 95% of the mistakes that would otherwise
have been made.

> Did you come up with any general approaches or rules for these type of
> rollouts? For example only rolling out to 5% of servers at a time.

This is known as a canary deployment and is a release methodology. Yes,
it's a great thing to do.

In conjunction with canary deployments, you probably also want to take a
look at feature flags (aka "dark releases"). They pair well with canary
rollouts and make QA easier in general (especially if you're doing things
like A/B testing); there's a small sketch below.
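
Here is a minimal feature-flag sketch; the flag names, the percentages,
and the hostname-hashing scheme are illustrative assumptions rather than
any particular flag library. The new code path ships dark, and raising
the percentage rolls it out to a stable subset of hosts, canary-style:

    import hashlib

    # Hypothetical flag table -- in practice this would live in your
    # config management data or a flag service, not hard-coded here.
    FLAGS = {
        # flag name -> percentage of hosts (0-100) the feature is live on
        "new_dns_resolver": 5,   # canary: live on ~5% of hosts
        "shiny_rewrite": 0,      # dark: code is deployed but off everywhere
    }

    def flag_enabled(flag, hostname):
        """Return True if `flag` is on for `hostname`.

        Hashing the hostname gives every host a stable bucket in
        [0, 100), so the same ~5% of hosts stay in the canary from
        run to run.
        """
        rollout = FLAGS.get(flag, 0)
        digest = hashlib.sha256(f"{flag}:{hostname}".encode()).hexdigest()
        return int(digest, 16) % 100 < rollout

    # Usage: gate the new behaviour behind the flag.
    if flag_enabled("new_dns_resolver", "web042.example.com"):
        pass  # new, canaried code path
    else:
        pass  # existing, known-good code path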

> Just curious what learnings and what approaches you've taken from
> these incidents (an incident is a terrible thing to waste!).

Now would be a good time to reflect on Fred Brooks' awesome essay "No
Silver Bullet". There is no single technique that will make all your
problems go away -- BUT there are a number of things you can do to
mitigate the kind of re-imaging incident that happened at Emory. You
want to refine your current process and keep refining it as time goes
on. You'll never reach utopia, but after a couple of iterations, these
kinds of mistakes should fall into the class of "unlikely to happen" at
your organization. I hope.

They haven't released many technical details of the incident yet -- I
believe they're still cleaning up -- but kudos to them for being as open
about it as they have been so far. I think that once the dust has
settled and they examine what happened, it will probably come out that
a confluence of failures on multiple levels led to the problem, and
using any one of the techniques mentioned above might have prevented
this disaster. And hopefully we'll all learn from it.

In closing, I will leave you with the eternal wisdom of @DEVOPS_BORAT:

    To make error is human. To propagate error to all server in
    automatic way is #devops.

hth,
Thomas