Incident Management (was: Re: pointless mail, (was Re: android-build's are failing...))

Michael Hudson-Doyle Thu, 10 May 2012 17:13:28 -0700

On Fri, 11 May 2012 00:30:26 +0200, Alexander Sack <a...@linaro.org> wrote:
> On Fri, May 11, 2012 at 12:24 AM, Ricardo Salveti
> > Sure, I just think there are better places for it :-) Based on issues
> > we had with LAVA and Jenkins at the previous cycle, if I had one email
> > for every issue, I'd send at least 20 of them, which is useful but
> > that still doesn't make me send them to the list.
> 
> Actually, I think LAVA outage was announced. I poked for getting more
> status updates, so more mails would have been great.
> 
> Same goes for ci.linaro.org ... if our CI service used for everything
> but android is not available, I want to get a mail that this is the
> case.


So, what this discussion points to is: we need a process for handling
disruptions to the services we provide.  When the **** hits the fan, the
last thing you want people to be doing is _thinking_, or at least,
thinking about things that could have been thought through ahead of
time and are not totally specific to the incident at hand.

Just recently within the LAVA team, we've started following such a
process:

    https://wiki.linaro.org/Internal/LAVA/Incidents

(apologies to the non-Linaro insiders for the internal link).  The
process will look very familiar to anyone who works at Canonical...

Creating a wiki page for each incident can feel a bit heavyweight, but
having some kind of defined place for recording details has two major
benefits:

 1. It means there's a canonical place to go for information while the
    incident is still in progress.[1]

 2. It means that at the end of the month or quarter or whatever you can
    look back and have _actual data_ for how often various issues come
    up, rather than relying on vague feelings like "it seems we run out
    of disk space a lot".
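
As a rough illustration of the second point, here is a minimal sketch in
Python of tallying recorded incidents by cause.  The field names, the
helper function and the sample records are all invented for the example;
this is not something the LAVA team actually runs.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Incident:
    """One recorded incident (hypothetical fields, for illustration only)."""
    day: date
    service: str
    cause: str


def causes_by_frequency(incidents, since):
    """Return (cause, count) pairs for incidents on or after `since`,
    most frequent cause first."""
    counts = Counter(i.cause for i in incidents if i.day >= since)
    return counts.most_common()


# Invented sample records, not real incident data.
incidents = [
    Incident(date(2012, 3, 20), "lava", "disk full"),
    Incident(date(2012, 4, 2), "lava", "disk full"),
    Incident(date(2012, 4, 15), "android-build", "network outage"),
    Incident(date(2012, 5, 1), "ci.linaro.org", "disk full"),
]

for cause, count in causes_by_frequency(incidents, since=date(2012, 4, 1)):
    print(f"{count:3d}  {cause}")
```

The same tally could just as well come from a spreadsheet export; the
point is only that a consistent record per incident makes the count
trivial to produce.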

I created a Google spreadsheet & form for adding details to it in an
attempt to reduce the overhead of recording an incident, but after
exactly two incidents, we already have an incident that was recorded in
a wiki page but not the spreadsheet, so maybe that was premature
optimization on my part.

It's only early days but I already feel happier for having this process
in place.  I'm happy to donate this policy to the wider set of services
Linaro runs if there is consensus it would be useful :-)

There is already a page on a related topic:

    https://wiki.linaro.org/Internal/Process/DealingWithCrisis

but that seems to me to be aimed at bigger issues than android-build or
LAVA being unreachable for an hour.

One thing this thread points out to me though is that our policy does
not really cover communication, either within the team or with our
users.  I'll work on a proposal for that today.

Cheers,
mwh

[1] In particular, if an incident goes on for long enough to require
    handovers between the people working on it, then a wiki page like
    this is downright essential.

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev
