On Jul 24, 2014, at 12:08 PM, Anita Kuno <ante...@anteaya.info> wrote:
> On 07/24/2014 12:40 PM, Daniel P. Berrange wrote:
>> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
>>
>>> ==Future changes==
>>>
>>> ===Fixing Faster===
>>>
>>> We introduce bugs to OpenStack at some constant rate, which pile up
>>> over time. Our systems currently treat all changes as equally risky and
>>> important to the health of the system, which makes landing code changes
>>> to fix key bugs slow when we're at a high reset rate. We've got a manual
>>> process of promoting changes today to get around this, but that's
>>> actually quite costly in people time, and takes getting all the right
>>> people together at once to promote changes. You can see a number of the
>>> changes we promoted during the gate storm in June [3], and it was no
>>> small number of fixes to get us back to a reasonably passing gate. We
>>> think that optimizing this system will help us land fixes to critical
>>> bugs faster.
>>>
>>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>>
>>> The basic idea is to use the data from elastic recheck to identify that
>>> a patch is fixing a critical gate-related bug. When one of these is
>>> found in the queues it will be given higher priority, including bubbling
>>> up to the top of the gate queue automatically. The manual promote
>>> process should no longer be needed, and instead bugs fixing elastic
>>> recheck tracked issues will be promoted automatically.
>>>
>>> At the same time we'll also promote review on critical gate bugs by
>>> making them visible in a number of different channels (like on elastic
>>> recheck pages, review day, and in the gerrit dashboards). The idea here
>>> again is to make the reviews that fix key bugs pop to the top of
>>> everyone's views.
>>
>> In some of the harder gate bugs I've looked at (especially the infamous
>> 'live snapshot' timeout bug), it has been damn hard to actually figure
>> out what's wrong. AFAIK, no one has ever been able to reproduce it
>> outside of the gate infrastructure. I've even gone as far as setting up
>> Ubuntu VMs identical to the ones used in the gate on a local cloud, and
>> running the tempest tests multiple times, but still can't reproduce what
>> happens on the gate machines themselves :-( As such we're relying on
>> code inspection and the collected log messages to try and figure out
>> what might be wrong.
>>
>> The gate collects a lot of info and publishes it, but in this case I
>> have found the published logs to be insufficient - I needed to get
>> the more verbose libvirtd.log file. devstack has the ability to turn
>> this on via an environment variable, but it is disabled by default
>> because it would add 3% to the total size of logs collected per gate
>> job.
>>
>> There's no way for me to get that environment variable for devstack
>> turned on for a specific review I want to test with. In the end I
>> uploaded a change to nova which abused rootwrap to elevate privileges,
>> install extra deb packages, reconfigure libvirtd logging and restart
>> the libvirtd daemon.
>>
>> https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
>> https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py
>>
>> This let me get further, but still not resolve it. My next attack is
>> to build a custom QEMU binary and hack nova further so that it can
>> download my custom QEMU binary from a website onto the gate machine
>> and run the test with it. Failing that I'm going to be hacking things
>> to try to attach to QEMU in the gate with GDB and get stack traces.
>> Anything is doable thanks to rootwrap giving us a way to elevate
>> privileges from Nova, but it is a somewhat tedious approach.
>>
>> I'd like us to think about whether there is anything we can do to make
>> life easier in these kinds of hard debugging scenarios where the regular
>> logs are not sufficient.
>>
>> Regards,
>> Daniel
>>
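(For what it's worth, when someone does manage to get access to a held
node, the kind of thing Daniel describes comes down to a handful of
manual steps. Very much a rough sketch -- it assumes a root shell on the
held test VM, a single QEMU process on the box, and the Ubuntu
'libvirt-bin' service name; this is what you'd do by hand on the node,
not something the gate job does for you:

    # Bump libvirtd to verbose logging (log_filters/log_outputs are
    # standard libvirtd.conf settings), then restart the daemon so it
    # takes effect.
    echo 'log_filters="1:libvirt 1:qemu 1:security"' >> /etc/libvirt/libvirtd.conf
    echo 'log_outputs="1:file:/var/log/libvirt/libvirtd.log"' >> /etc/libvirt/libvirtd.conf
    service libvirt-bin restart

    # Attach gdb to the (assumed single) QEMU process and dump
    # backtraces from every thread, then detach.
    gdb -p "$(pidof qemu-system-x86_64)" -batch -ex 'thread apply all bt'

Getting the same effect from inside a gate job is exactly the tedious
rootwrap dance Daniel describes above.)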
> For really, really difficult bugs that can't be reproduced outside the
> gate, we do have the ability to hold VMs if we know they are
> displaying the bug, if they are caught before the VM in question is
> scheduled for deletion. In this case, make your intentions known in a
> discussion with a member of infra-root. A conversation will ensue
> involving what to do to get you what you need to continue debugging.
>

Why? Is space really that expensive? It boggles my mind a little that
we have a well-financed foundation (afaik, correct me if I am wrong...)
yet can't save 'all' the things in a smart manner (saving all the VM
snapshots doesn't mean saving hundreds/thousands of gigabytes when you
are using de-duping cinder/glance... backends). Expire those VMs after
a week if that helps, but it feels like we shouldn't be so conservative
about developers' need to have access to all the VMs that the gate
used/created... It's not like developers are trying to 'harm' OpenStack
by investigating root issues that raw access to the VM images can help
resolve (in fact, quite the contrary).

> It doesn't work in all cases, but some have found it helpful. Keep in
> mind you will be asked to demonstrate you have tried all other avenues
> before this one is exercised.
>
> Thanks,
> Anita.

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev