* Dana Quinn <dqu...@gmail.com> [2014-05-19 10:29:43-0700]:
> For those of you sharing these tales of woe (I have some similar ones
> I could share) - can you share what you discussed in your post mortems
> as to protect against these issues in the future?
Yes. There are a few techniques that are more or less considered best
practice at this point for protecting against config management
amplification mistakes:

- Use a version control tool as a backend to the config management repo.
  This should go without saying: if you don't have a proper VCS backend,
  stop everything and get one now! (git/mercurial/svn are all fine
  choices.)

- Use a validation testing suite. Every commit you make to your config
  management tool's VCS repo should kick off these tests. I'm talking
  about things like sanity checks on the repo (e.g. bcfg2-lint,
  ansible-playbook --syntax-check, cf-promises, etc.), syntax checking
  and validation of XML/YAML files, passwd file validity, nagios -v, and
  so on. Testing is a sore subject for some people, and balance is needed
  here. The best analogy I've heard is that testing is like armor -- too
  much of it and you won't be able to move at all, which makes it
  useless; too little and it's the same as having nothing. Ideally you
  want your testing suite to catch all the stupid, silly mistakes you
  would otherwise have made. Most people tend to use the "hooks"
  subsystem of their VCS for this; that's fine, but I think a good CI
  tool (Jenkins, Bamboo, etc.) is better for the task. Either way,
  something is better than nothing. (There's a rough sketch of what such
  a check might look like further down.)

- Use a code review tool. The testing suite above will catch syntax
  errors ("Hey, you're missing a closing tag in this XML file."), but it
  can never catch the other class of mistakes that humans make ("Hey,
  why are you pushing this feature to production? It hasn't been tested
  properly yet!"). Code review is basically having other humans look at
  your change and say, "Yeah, this looks okay to me." There are a bunch
  of tools that do this; GitHub's pull requests are probably the most
  popular, but there's also Gerrit (open source) and similar tools. I
  can tell you without hesitation that using a code review tool is THE
  SINGLE GREATEST THING you can do to protect against these mistakes,
  bar none. Trust me on that.

Just a note here: the combination of a testing suite plus a code review
tool will catch 95% of the mistakes that would otherwise have been made.

> Did you come up with any general approaches or rules for these type of
> rollouts? For example only rolling out to 5% of servers at a time.

This is known as a canary deployment, and it's a release methodology.
Yes, it's a great thing to do. In conjunction with canary deployments,
you probably also want to take a look at feature flags (aka "dark
releases"). The technique lends itself well to canary deployments and
makes QA easier in general (especially if you're doing things like A/B
testing). There's a rough sketch of a canary rollout further down as
well.

> Just curious what learnings and what approaches you've taken from
> these incidents (an incident is a terrible thing to waste!).

Now would be a good time to reflect on Fred Brooks' awesome essay "No
Silver Bullet". There is no single technique that will make all your
problems go away -- BUT there are a number of things you can do to
mitigate against the kind of re-imaging incident that happened at Emory.
You want to refine your current process and keep refining it as time
goes on. You'll never reach utopia, but after a couple of iterations
these kinds of mistakes should fall into the class of "unlikely to
happen" at your organization. I hope.

They haven't released too many technical details of the incident yet --
I believe they're still cleaning up -- but kudos to them for being as
open about it as they have been so far.
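To make the validation idea concrete, here's a minimal sketch of the
kind of pre-commit hook (or CI job) I mean. It's untested, the file
names and paths (site.yml, promises.cf, /etc/nagios/nagios.cfg) are
placeholders, and you'd keep only the checks that match your stack:

    #!/bin/sh
    # Sketch of a pre-commit hook / CI step for a config management repo.
    # Every path and file name below is a site-specific placeholder.
    set -e

    changed=$(git diff --cached --name-only --diff-filter=ACM)

    # Parse-check any YAML that changed (needs PyYAML installed)
    for f in $(echo "$changed" | grep -E '\.ya?ml$'); do
        python -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1]))' "$f"
    done

    # Well-formedness check on any XML that changed
    for f in $(echo "$changed" | grep '\.xml$'); do
        xmllint --noout "$f"
    done

    # Tool-specific sanity checks -- uncomment the ones for your stack
    ansible-playbook --syntax-check site.yml
    # bcfg2-lint
    # cf-promises -f ./promises.cf
    # nagios -v /etc/nagios/nagios.cfg

    echo "all checks passed"

Whether this runs from a VCS hook or from Jenkins on every push is up to
you; the point is that nothing reaches your servers without passing it.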
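And a minimal sketch of the canary idea, assuming an Ansible-style setup
with a small "canary" group in the inventory. The group name, playbook,
soak time, and health check are all made up -- substitute your own:

    #!/bin/sh
    # Sketch of a canary rollout with Ansible.  The "canary" group, the
    # playbook name, and the health check are hypothetical examples.
    set -e

    # 1. Push the change to the small canary group only (your ~5%).
    ansible-playbook --limit canary site.yml

    # 2. Let it soak, then make sure the canaries are still healthy.
    sleep 600
    ansible canary -m command -a "curl -fsS http://localhost/healthz"

    # 3. Only if that succeeded do we touch everything else.
    ansible-playbook --limit 'all:!canary' site.yml

Feature flags are the complementary piece: the new code lands on every
box, but the risky path only turns on where a flag says so, which means
"rollback" is flipping the flag rather than re-deploying.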
Anyway, I think that once the dust has settled and they examine what
happened, it will probably come out that there was a confluence of
failures on multiple levels that led to the problem, and that using any
one of the techniques mentioned above might have prevented the disaster.
And hopefully we'll all learn from it.

In closing, I will leave you with the eternal wisdom of @DEVOPS_BORAT:

    To make error is human. To propagate error to all server in
    automatic way is #devops.

hth,
Thomas

_______________________________________________
Discuss mailing list
Discuss@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/