[Ubuntu-phone] LP: #1260712 post-mortem and improving our processes

Barry Warsaw Fri, 13 Dec 2013 14:26:31 -0800

I'm writing this email in response to LP: #1260712, mostly as a post-mortem
and so we as a team can discuss ways to improve our processes so that we avoid
such problems in the future, not just with system-image and system-settings,
but across all our mobile components.  Apologies in advance for the length.


First some background:

LP: #1217098 was filed quite some time ago, describing the need to avoid
re-downloading data files in system-image when the length of time between
downloading an upgrade and applying it caused the system-image-dbus process to
timeout and commit suicide.  When s-i would get re-dbus-activated, it wouldn't
be able to use the already downloaded files, and thus it would have to
re-download them, consuming valuable bandwidth and time.

The fix for this bug required a minor D-Bus API change, described in LP:
#1247215 and further detailed in a message to this mailing list[1].  That
message was also long, but it described a change in the ApplyUpdate() method,
from synchronous-returning-string to asynchronous-returning-nothing-but-sending
-a-signal.  In a follow up message, I described manually testing locally built
system-image debs on my device with no observable adverse effects.  There were
no other replies to that thread.

When s-i 2.0.3 finally landed, LP: #1260712 was reported, and fixed, by my
more easterly colleagues.  No functionality was affected - i.e. the device was
still successfully rebooted into recovery, where the update was applied - but
the u/i did produce a brief confusing error message which worried the OP.  The
proximate cause was that system-settings wasn't prepared not to see a string
returned from ApplyUpdate(), and the reason this happened was because changes
to s-i and s-s weren't properly coordinated, for which I take responsibility.
My thanks to Didier for his quick patch to s-i to work around the problem and
avoid the error message.

Let's dig deeper to try to find the root cause though.  This is important IMHO
in order to identify bugs in our process, which should be addressed so that we
can maintain consistently improving quality.

Firstly, the bug proposing the D-Bus API change wasn't tracked by all the
necessary participants.  It would have been better for me to include a
system-settings bug task, so that at least both components would have a record
of the effect of the change.  This would be similar to LP: #1215586, which has
been simmering for a while, and which describes the re-enabling of i18n
descriptions for update images, passed across the D-Bus API.  Unlike the bug
this morning, landing the fix for LP: #1215586 without a corresponding s-s fix
*would* break everyone's updates via the u/i, so at least we're a little ahead
of the game there.  I've also taken this approach with LP: #1260768, which
describes a further change to ApplyUpdate() requested by Didier for some
future release.

Secondly, I clearly did not communicate well enough the D-Bus API change to
Didier or others working on s-s.  Probably the mailing list message was too
long to read (like this one :/ ), and silence is not assent.  I should have
requested positive acknowledgment of the proposed change from the devs working
on system-settings.  If there is a better way to coordinate and communicate
these changes across teams than the mailing list, let's find out.

Thirdly, the landing spreadsheet did not capture this potential problem.  I
can't go back and check because the Google spreadsheet doesn't version control
afaict, and I don't have write permission on it as it is, but I'm sure I did
not identify this API change as potentially (but minimally) disruptive.  While
I did make a mental note about the change, I should have clearly described it
in the landing request so that special attention could be given.

It was mentioned that maybe a code review would have caught this, but I doubt
it.  Code reviews are A Good Thing, but it's not the responsibility of the
reviewer to understand all the ramifications of such a change.  Maybe this
would have caught the eye of a prescient reviewer, but that is far from
guaranteed.  And if it had been, what steps would the reviewer have
recommended to address the potential problem?  Note that all of this code is
very well covered by unit tests; a reviewer would have seen this and would
have done their job by noticing the coverage.

Manual testing obviously didn't catch the problem.  As the OP for LP: #1260712
observed, the error message disappears pretty quickly since the error has no
effect on the issuance of a proper and effective reboot.  In my manual testing
on my own device, I simply never noticed the error message because the reboot
happens so quickly, and once the update completed successfully, there is no
record of any problem.  I tried it *again* while composing this message and
still didn't see it! ;)

AFAIK, system-image and system-settings are not autopilot tested, but again
it's not clear that such testing would have caught *this* issue.  Not that
autopiloting the upgrade sequence isn't still a good idea!  We need to talk
about that!  But if the error message disappears so quickly to a human with no
functional effect, it's unlikely that an autopilot test would have caught the
problem.

So, what steps can be taken to avoid such problems in the future, and what
lessons were learned that might be of use across the project?

Steve brought up a very good point in my private discussions with him:
versioning the API.  In this case, it might have been better to leave
ApplyUpdate() alone and introduce a new D-Bus API method with the new
behavior.  Decoupling s-i and s-s in this way would allow an unchanged
system-settings to have continued to work, and s-s could change to the new,
improved API at its leisure.  Of course, this would mean that an unchanged
system-settings would still be affected by the re-downloading bug, since that
bug could not be fixed without the change in the API.

(There's a possible middle ground in *this* instance, where the old API could
have been faked closely enough to fool s-s, but that isn't a general solution.
It's essentially what Didier's hot fix implements.)

API versioning is an important principle in general, and especially important
in our touch images since we already have such a difficult time coordinating
all the ongoing changes into a clean image that can be promoted.  An example
here would be future changes to the D-Bus API for ubuntu-download-manager, for
which there are multiple clients.  Is it even feasible to expect that all
those components are released or promoted together so that nothing at all
breaks?

Can we use packaging metadata to help enforce coordinated promotion?  E.g. if
we knew that s-i 2.0 "broke" system-settings, a Breaks could have been added,
but that would still require better coordination between the two projects so
that we knew what version of s-s would satisfy the API requirements of s-i
2.0.

How could we have written automated tests to catch this?  Even with
autopkgtests in s-i, without autopilot tests of the u/i, this would have gone
unnoticed.  What could I have done to better Q/A the change, or to have better
helped the Q/A team to test or notice such a change?

This email is already long enough, and I have probably raised more questions
than I've answered.  I hope this is taken in the spirit in which it is written
-- to openly investigate the problem, discuss solutions, and drive together as
a team toward better processes which improve overall quality and confidence in
our images.  It's not to cast any blame, nor to shirk my responsibility for
this bug.  I'm just glad that the fire was quickly put out, and that we have
the opportunity to discuss this in the light of a small flame.

In any case, I don't anticipate any more changes to system-image before next
year, so enjoy the coming holidays with many happy and successful updates to
your mobile devices. :)

Cheers,
-Barry

[1] https://lists.launchpad.net/ubuntu-phone/msg04913.html

signature.asc
Description: PGP signature

-- 
Mailing list: https://launchpad.net/~ubuntu-phone
Post to     : ubuntu-phone@lists.launchpad.net
Unsubscribe : https://launchpad.net/~ubuntu-phone
More help   : https://help.launchpad.net/ListHelp

[Ubuntu-phone] LP: #1260712 post-mortem and improving our processes

Reply via email to