On Thu, Mar 20, 2014 at 03:43:02PM -0300, Gustavo Boiko wrote:
> The only problem is that this doesn't scale. While one big feature lands
> (say, Qt 5.2), there are at least five or more others being developed and
> maybe even proposed. So we pick one of those to land, and then while we are
> in the "small cycle" for that one, there are already four other in the
> wait, and more being developed. It takes way too much time to get
> everything in, which means features get released much later than they
> could, which in turn means they will have less testing time in the end.
Right.  Particularly if you have anything that involves a non-trivial dependency stack (I have some things where I can't really work out what I'll need to do next until I've landed the stage before), then the current process results in things taking weeks longer than they should.  I'm quite confident that I would have "click chroot" working smoothly by now for 14.04 frameworks if not for this, for example - right now it doesn't install all of the necessary qtdeclarative plugins, and in my judgement the only sane way to fix this involves going through correcting multiarch metadata in lots of library packages throughout the stack so that I can simply have it install "ubuntu-sdk-libs-dev:armhf" rather than hardcoding a huge pile of package names in click.

Despite the best efforts and good will of the landing team, this sort of thing, which carries extremely low runtime risk, is very poorly served by the process that's in place at the moment; I'm afraid it feels unnecessarily obstructive.  Every time we multiply what could have been a couple of hours of delay into a couple of days, it destroys productivity just as surely as an edit-compile-test cycle that takes hours rather than minutes.

I get that having a working image that users can upgrade to is important.  I really do, especially as we move towards shipping devices.  But if you set the level of acceptable risk to zero, then you also cripple velocity; much though we need to keep things working so that we can dogfood, we also still have a lot of catching up to do before we surpass (say) Android's usability.  I don't think we can afford this in the long term.  One of our key development assets is significant parallelisation across a wide range of projects, drawing on the whole free software community.  Serialising all this into a narrow bottleneck of landings throws away that asset in the cause of risk aversion.  It is not clear to me that it is worth the trade.

Any time somebody brings this up, a frequent response is "well, you can go and help out with the known regressions".  This is the fallacy of the interchangeable developer, and it is terrible that we keep perpetuating it.  It isn't sensible for several dozen people who are blocked on landings to all try to teach themselves enough about (say) the innards of Qt from scratch in order to work out what's going wrong.  Some of them will waste their time flailing, some of them may waste the time of the people who are actually qualified to fix the regressions by asking overly basic questions or being generally confused, and hopefully some of them will pretend they never saw this and get on with something they can actually do.  Maybe one of them might contribute something helpful, but probably only if not quite the right people were on the problem to begin with.  (I include myself in all this; I'm not deprecating my colleagues' abilities, just recognising that superhuman masters-of-all-trades don't really exist.)  Knowledge sharing is good, yes, but a firedrill isn't a good time to do it, and it probably shouldn't be everyone-to-everyone anyway.

Surely, what should happen is:

* Management should identify the people who are qualified to fix the regressions in question, make sure they're working on it and have what they need (hopefully they'll do it themselves organically, but this isn't guaranteed), and make sure this is communicated.  The point of this is mainly to make sure that serious problems don't fall through the cracks because nobody thinks it's their problem to solve.
* Engineers should be particularly responsive to requests for help during times when there are known regressions, and should be alert for problems that touch on their areas of expertise.

* Work that overlaps with the regressing areas should be treated with care, so that we don't pile problem upon problem.

* Unrelated work should be able to proceed as normal, with caution but without undue fear.

* People who are not qualified to work on the regressions should not be told that that's what they need to do if they want to get their code landed.

I understand that people are scared that if we don't serialise landings when things are going wrong then we'll have a series of successive out-of-phase bugs and we'll end up never being able to promote an image again.  I think we've drawn too broad a lesson from past problems.  Yes, we need to be careful not to aggregate risk from lots of nearby overlapping changes.  But I don't believe that after ten years we don't have the institutional expertise to spot when two changes truly have nothing to do with each other - and our stack is big enough that this is still frequently the case.

Even when our archive was a complete free-for-all with no automatic installability gatewaying (and so upgrades were often broken), we still managed to produce pretty decent desktop milestone images with only a couple of days of heavy lifting, and most of that was typically cleaning up after problems that proposed-migration and other parts of the daily quality initiative now keep out of the archive for us.  I am not at all convinced that the phone stack is so much more complex that we can't loosen our grip a bit and still have things workable, especially now that we have some much better technology for catching many frequent categories of regressions (certainly a worthwhile benefit of all this hard experience over the last couple of years); and as a bonus we might not burn out some of our best engineers trying to do ultra-tight coordination of everything all the time.

Given the choice, which is better: to have slightly more frequent breakage, but have key engineers be fresh and able to work on urgent problems that come their way every so often; or to have our key engineers concentrate hard every day to make sure as few regressions as possible slip in, at the cost that when difficult problems show up they're too tired and demotivated to deal with them properly?  I'm worried that risk aversion means we tend to aim for the latter.

Cheers,

-- 
Colin Watson                                       [cjwat...@ubuntu.com]

-- 
Mailing list: https://launchpad.net/~ubuntu-phone
Post to     : ubuntu-phone@lists.launchpad.net
Unsubscribe : https://launchpad.net/~ubuntu-phone
More help   : https://help.launchpad.net/ListHelp