These stats are *awesome*! I've been wanting them for a long time, but never got around to generating them myself. Can we track these on an ongoing basis?
On 11/05/2013 07:09 AM, Ed Morley wrote:
> On 05 November 2013 14:44:27, David Burns wrote:
>> We appear to be doing 1 backout for every 15 pushes on a rough
>> average[4].
>
> I've been thinking about this some more - and I believe the ratio is
> probably actually even worse than the numbers suggest, since:

Yeah, 1 backout for every 15 pushes sounds quite a bit better than I'd expect.

> * Depending on how the backouts are performed, the backout of several
> changesets/bugs are sometimes folded into one commit.

Can this be factored into the stats? As in, parse the backout commit messages, gather the bug numbers (or infer them from the changeset if not given), then map them back to the pushes for that bug? It still won't be 100% right, but it'll be closer.

qbackout does a little bit of this when it tries to find the right commit message to reuse when you run with --apply. But it doesn't have access to (nor need) the pushlog, which would be required for this.

> * The 'total commits' figure includes merges & other automated/non-dev
> commits.

Can this be fixed?

> * Sometimes breakage is fixed in-place with a commit message such as
> "Bug 123456 - Followup...", which was still for a landing that broke
> the tree, but wouldn't be counted.

Could this be inferred from the starring comments? I guess it looks like the stats dburns posted don't involve the starring comments (yet?). I guess the rule would be that a changeset is a bustage fix if it or the tip changeset in its push appears in a star comment.

Can we please structure the starring comments yet? I started a little bit down this path a while back (providing buttons on tbpl for common starring reasons), but then put it off to wait for tbpl2. It would be really, really good if the sheriffs could feed in metadata saying whether something is a backout or intermittent or whatever. I feel like we have knowledge in the heads of the human parts of our system that's getting dropped on the floor.
We could be making much better use of it, since they're already figuring these things out anyway. It's not just for computing goodness metrics, either; it could make it much easier to implement autolanding (aka landing queues), probabilistic coalescing (what useless jobs can you skip to make way for important ones), and other goodies.

>
>
> On 05 November 2013 14:57:17, Kyle Huey wrote:
>> What is your proposal for doing that? What are the costs involved?
>
> For one: devs building/testing locally before pushing. Many cases of
> failures would have been caught by just a simple single-platform
> build+run of a single directory's worth of tests.

If you know the right directory, sure. Though even then, local tests can be very disruptive to run.

>
> The benefits of this approach are:
> * Available local compute time scales linearly with the number of devs
> hired, unlike our Tryserver automation.

That doesn't seem like a fundamental property to me. At least theoretically, much of the tryserver automation scales with the Amazon cloud (aka it scales with the load on some corporate credit card that I'm glad I don't have to see the statements for). Again theoretically, we could be buying a local build/test box for every dev hire & active volunteer, and setting up automation that bridges the gap between a dev's main box and the try server. (More on this below.)

> * Local dep builds are much quicker than Try clobber builds.

Let's split that up into builds vs tests.

For the stuff I work on, building is normally not a problem. But it can be during heavy times, because doing builds means losing push races. With wide-ranging stuff (where the probability of failures due to rebases is high), this means you either have to push without a final build or get repeatedly bumped to a later day. This should get better with the current build system improvements, so perhaps this isn't much of a problem anymore, but I'm running into it a fair amount right now.

For tests, it depends on the test suite. But many of them just really suck to run locally. mach magic to identify a minimal subset of tests to run would help a lot with this, but that's going to be a substantial amount of work. For the most part, I think the try server is the way to go for tests.

As for resource usage, my personal opinion is that if you restrict the tests to a single platform (a "T push", which you can generate by selecting something under "Restrict tests to platform(s)" on http://trychooser.pub.build.mozilla.org/ ), then you're fine. I'd rather people run tests on one try platform than whittle down the specific tests to be run. (Well, for the first push. If you're working through a particular issue on try, it makes sense to just test that one test suite.)

In short: use the try server. Build on everything. Test on one platform. Run all the tests. If any fail, iterate on just the failed test suites (unless you think your changes may break others).

I don't have the data to prove it, but my guess is that this would result in the lowest overall load. (Backouts are expensive! Especially in hard-to-measure people time.)

>
> I'm hopeful that with the build peer's ongoing overhaul of our build
> system, dep build times for an average patch are going to be short
> enough that there really is no excuse not to build locally. Add to
> that ongoing work on improving mach commands to ease running just a
> subset of the tests (for bonus points making use of the applied MQs to
> guess which ones), and it really shouldn't be too onerous of a request.

Other ideas:

Would it be possible to restrict the statistics to only the active times of day? It sucks when the tree is closed on a weekend or in the middle of my night, but it's way, way less of a problem when only a few devs are impacted. The problem I see is tree closures when lots of people need to land. Tree closures at other times are a different problem, and can be addressed separately if needed.

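For what it's worth, here is a minimal sketch of such an active-hours filter, in Python. Everything in it is an assumption for illustration: the (timestamp, is_backout) record shape, the function names, and the choice of weekdays 09:00 to 18:00 at a fixed UTC-8 offset (PST, ignoring daylight saving) as the busy window. None of it comes from the actual stats scripts.

```python
from datetime import datetime, time, timedelta, timezone

# Assumed "active" window: weekdays, 09:00-18:00, fixed UTC-8 (PST, no DST).
ACTIVE_TZ = timezone(timedelta(hours=-8))
ACTIVE_START = time(9, 0)
ACTIVE_END = time(18, 0)


def is_active_time(ts):
    """Return True if a timezone-aware timestamp falls in the assumed busy window."""
    local = ts.astimezone(ACTIVE_TZ)
    is_weekday = local.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and ACTIVE_START <= local.time() < ACTIVE_END


def backout_ratio(pushes, active_only=True):
    """Fraction of pushes that are backouts.

    `pushes` is an iterable of (timestamp, is_backout) pairs, a made-up
    record shape. With active_only=True, pushes outside the busy window
    are ignored. Returns None if no pushes are left after filtering.
    """
    selected = [
        is_backout
        for ts, is_backout in pushes
        if not active_only or is_active_time(ts)
    ]
    if not selected:
        return None
    return sum(1 for is_backout in selected if is_backout) / len(selected)
```

Comparing backout_ratio(pushes) against backout_ratio(pushes, active_only=False) would show how much the off-hours pushes skew the headline number.
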
(You could even say "backouts don't matter if there's no queue in front of any test machines", which isn't true when you consider human cost, but it's a better approximation than weighting a 3am Sunday PST backout the same as a middle-of-the-workday one.)

I'd also like devs to have easier access to a set of buildbot-like test slaves. Debugging via try sucks. And the overhead in requesting access to a slave and then figuring out how to use it is too high, so people just don't. (This is being worked on, btw.)

These boxes could double as distributed compute servers, though that might require colocating them with devs. Perhaps we should look into rack-mounted devs, so we could put them directly in the data center near the build/test boxes. Wait, ignore that last. It was a joke.

It would be even better if these test boxes could make use of my local builds. So either copy the build over to the test box and run the tests there, or have mach secretly synchronize the source code whenever you do a build and do a shadow build on the test box at the same time.

Does the orange factor DB have enough granularity to identify which tests failed for a given push? Could it, without burdening sheriffs too much? It'd be great to have per-test statistics on the number of failures that a test caught, so we could compare the cost of running a test (mostly in time) to the benefit it provides, and reshuffle test suites to run the low-cost high-reward ones all the time, and the high-cost low-reward ones only occasionally. (You might need to tweak this metric to reduce the estimated reward for intermittents.)

Anyway, I'll shut up now. I always have way more ideas than time to implement them or ability to market them. There are a lot of things we could do to improve our current setup.

_______________________________________________
dev-platform mailing list
https://lists.mozilla.org/listinfo/dev-platform

