I'm writing this email in response to LP: #1260712, mostly as a post-mortem and so we as a team can discuss ways to improve our processes so that we avoid such problems in the future, not just with system-image and system-settings, but across all our mobile components. Apologies in advance for the length.
First some background: LP: #1217098 was filed quite some time ago, describing the need to avoid re-downloading data files in system-image when the length of time between downloading an upgrade and applying it caused the system-image-dbus process to timeout and commit suicide. When s-i would get re-dbus-activated, it wouldn't be able to use the already downloaded files, and thus it would have to re-download them, consuming valuable bandwidth and time. The fix for this bug required a minor D-Bus API change, described in LP: #1247215 and further detailed in a message to this mailing list[1]. That message was also long, but it described a change in the ApplyUpdate() method, from synchronous-returning-string to asynchronous-returning-nothing-but-sending -a-signal. In a follow up message, I described manually testing locally built system-image debs on my device with no observable adverse effects. There were no other replies to that thread. When s-i 2.0.3 finally landed, LP: #1260712 was reported, and fixed, by my more easterly colleagues. No functionality was affected - i.e. the device was still successfully rebooted into recovery, where the update was applied - but the u/i did produce a brief confusing error message which worried the OP. The proximate cause was that system-settings wasn't prepared not to see a string returned from ApplyUpdate(), and the reason this happened was because changes to s-i and s-s weren't properly coordinated, for which I take responsibility. My thanks to Didier for his quick patch to s-i to work around the problem and avoid the error message. Let's dig deeper to try to find the root cause though. This is important IMHO in order to identify bugs in our process, which should be addressed so that we can maintain consistently improving quality. Firstly, the bug proposing the D-Bus API change wasn't tracked by all the necessary participants. It would have been better for me to include a system-settings bug task, so that at least both components would have a record of the effect of the change. This would be similar to LP: #1215586, which has been simmering for a while, and which describes the re-enabling of i18n descriptions for update images, passed across the D-Bus API. Unlike the bug this morning, landing the fix for LP: #1215586 without a corresponding s-s fix *would* break everyone's updates via the u/i, so at least we're a little ahead of the game there. I've also taken this approach with LP: #1260768, which describes a further change to ApplyUpdate() requested by Didier for some future release. Secondly, I clearly did not communicate well enough the D-Bus API change to Didier or others working on s-s. Probably the mailing list message was too long to read (like this one :/ ), and silence is not assent. I should have requested positive acknowledgment of the proposed change from the devs working on system-settings. If there is a better way to coordinate and communicate these changes across teams than the mailing list, let's find out. Thirdly, the landing spreadsheet did not capture this potential problem. I can't go back and check because the Google spreadsheet doesn't version control afaict, and I don't have write permission on it as it is, but I'm sure I did not identify this API change as potentially (but minimally) disruptive. While I did make a mental note about the change, I should have clearly described it in the landing request so that special attention could be given. It was mentioned that maybe a code review would have caught this, but I doubt it. Code reviews are A Good Thing, but it's not the responsibility of the reviewer to understand all the ramifications of such a change. Maybe this would have caught the eye of a prescient reviewer, but that is far from guaranteed. And if it had been, what steps would the reviewer have recommended to address the potential problem? Note that all of this code is very well covered by unit tests; a reviewer would have seen this and would have done their job by noticing the coverage. Manual testing obviously didn't catch the problem. As the OP for LP: #1260712 observed, the error message disappears pretty quickly since the error has no effect on the issuance of a proper and effective reboot. In my manual testing on my own device, I simply never noticed the error message because the reboot happens so quickly, and once the update completed successfully, there is no record of any problem. I tried it *again* while composing this message and still didn't see it! ;) AFAIK, system-image and system-settings are not autopilot tested, but again it's not clear that such testing would have caught *this* issue. Not that autopiloting the upgrade sequence isn't still a good idea! We need to talk about that! But if the error message disappears so quickly to a human with no functional effect, it's unlikely that an autopilot test would have caught the problem. So, what steps can be taken to avoid such problems in the future, and what lessons were learned that might be of use across the project? Steve brought up a very good point in my private discussions with him: versioning the API. In this case, it might have been better to leave ApplyUpdate() alone and introduce a new D-Bus API method with the new behavior. Decoupling s-i and s-s in this way would allow an unchanged system-settings to have continued to work, and s-s could change to the new, improved API at its leisure. Of course, this would mean that an unchanged system-settings would still be affected by the re-downloading bug, since that bug could not be fixed without the change in the API. (There's a possible middle ground in *this* instance, where the old API could have been faked closely enough to fool s-s, but that isn't a general solution. It's essentially what Didier's hot fix implements.) API versioning is an important principle in general, and especially important in our touch images since we already have such a difficult time coordinating all the ongoing changes into a clean image that can be promoted. An example here would be future changes to the D-Bus API for ubuntu-download-manager, for which there are multiple clients. Is it even feasible to expect that all those components are released or promoted together so that nothing at all breaks? Can we use packaging metadata to help enforce coordinated promotion? E.g. if we knew that s-i 2.0 "broke" system-settings, a Breaks could have been added, but that would still require better coordination between the two projects so that we knew what version of s-s would satisfy the API requirements of s-i 2.0. How could we have written automated tests to catch this? Even with autopkgtests in s-i, without autopilot tests of the u/i, this would have gone unnoticed. What could I have done to better Q/A the change, or to have better helped the Q/A team to test or notice such a change? This email is already long enough, and I have probably raised more questions than I've answered. I hope this is taken in the spirit in which it is written -- to openly investigate the problem, discuss solutions, and drive together as a team toward better processes which improve overall quality and confidence in our images. It's not to cast any blame, nor to shirk my responsibility for this bug. I'm just glad that the fire was quickly put out, and that we have the opportunity to discuss this in the light of a small flame. In any case, I don't anticipate any more changes to system-image before next year, so enjoy the coming holidays with many happy and successful updates to your mobile devices. :) Cheers, -Barry [1] https://lists.launchpad.net/ubuntu-phone/msg04913.html
signature.asc
Description: PGP signature
-- Mailing list: https://launchpad.net/~ubuntu-phone Post to : ubuntu-phone@lists.launchpad.net Unsubscribe : https://launchpad.net/~ubuntu-phone More help : https://help.launchpad.net/ListHelp