On Jul 24, 2014, at 12:08 PM, Anita Kuno <ante...@anteaya.info> wrote:
> On 07/24/2014 12:40 PM, Daniel P. Berrange wrote:
>> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
>>
>>> ==Future changes==
>>>
>>> ===Fixing Faster===
>>>
>>> We introduce bugs to OpenStack at some constant rate, which pile up
>>> over time. Our systems currently treat all changes as equally risky and
>>> important to the health of the system, which makes landing code changes
>>> to fix key bugs slow when we're at a high reset rate. We've got a manual
>>> process of promoting changes today to get around this, but that's
>>> actually quite costly in people time, and takes getting all the right
>>> people together at once to promote changes. You can see a number of the
>>> changes we promoted during the gate storm in June [3], and it was no
>>> small number of fixes to get us back to a reasonably passing gate. We
>>> think that optimizing this system will help us land fixes to critical
>>> bugs faster.
>>>
>>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>>
>>> The basic idea is to use the data from elastic recheck to identify that
>>> a patch is fixing a critical gate-related bug. When one of these is
>>> found in the queues it will be given higher priority, including bubbling
>>> up to the top of the gate queue automatically. The manual promote
>>> process should no longer be needed, and instead bugs fixing elastic
>>> recheck tracked issues will be promoted automatically.
>>>
>>> At the same time we'll also promote review on critical gate bugs by
>>> making them visible in a number of different channels (like on elastic
>>> recheck pages, review day, and in the gerrit dashboards). The idea here
>>> again is to make the reviews that fix key bugs pop to the top of
>>> everyone's views.
>>
>> In some of the harder gate bugs I've looked at (especially the infamous
>> 'live snapshot' timeout bug), it has been damn hard to actually figure
>> out what's wrong. AFAIK, no one has ever been able to reproduce it
>> outside of the gate infrastructure. I've even gone as far as setting up
>> Ubuntu VMs identical to the ones used in the gate on a local cloud, and
>> running the tempest tests multiple times, but still can't reproduce what
>> happens on the gate machines themselves :-( As such we're relying on
>> code inspection and the collected log messages to try and figure out
>> what might be wrong.
>>
>> The gate collects a lot of info and publishes it, but in this case I
>> have found the published logs to be insufficient - I needed to get
>> the more verbose libvirtd.log file. devstack has the ability to turn
>> this on via an environment variable, but it is disabled by default
>> because it would add 3% to the total size of logs collected per gate
>> job.
>>
>> There's no way for me to get that environment variable for devstack
>> turned on for a specific review I want to test with. In the end I
>> uploaded a change to nova which abused rootwrap to elevate privileges,
>> install extra deb packages, reconfigure libvirtd logging and restart
>> the libvirtd daemon.
>>
>> https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
>> https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py
>>
>> This let me get further, but still not resolve it. My next attack is
>> to build a custom QEMU binary and hack nova further so that it can
>> download my custom QEMU binary from a website onto the gate machine
>> and run the test with it. Failing that I'm going to be hacking things
>> to try to attach to QEMU in the gate with GDB and get stack traces.
>> Anything is doable thanks to rootwrap giving us a way to elevate
>> privileges from Nova, but it is a somewhat tedious approach.
>>
>> I'd like us to think about whether there is anything we can do to make
>> life easier in these kinds of hard debugging scenarios where the regular
>> logs are not sufficient.
>>
>> Regards,
>> Daniel
>>
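(For what it's worth, when someone does manage to get access to a held
node, the kind of thing Daniel describes comes down to a handful of
manual steps. Very much a rough sketch -- it assumes a root shell on the
held test VM, a single QEMU process on the box, and the Ubuntu
'libvirt-bin' service name; this is what you'd do by hand on the node,
not something the gate job does for you:

    # Bump libvirtd to verbose logging (log_filters/log_outputs are
    # standard libvirtd.conf settings), then restart the daemon so it
    # takes effect.
    echo 'log_filters="1:libvirt 1:qemu 1:security"' >> /etc/libvirt/libvirtd.conf
    echo 'log_outputs="1:file:/var/log/libvirt/libvirtd.log"' >> /etc/libvirt/libvirtd.conf
    service libvirt-bin restart

    # Attach gdb to the (assumed single) QEMU process and dump
    # backtraces from every thread, then detach.
    gdb -p "$(pidof qemu-system-x86_64)" -batch -ex 'thread apply all bt'

Getting the same effect from inside a gate job is exactly the tedious
rootwrap dance Daniel describes above.)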
> For really, really difficult bugs that can't be reproduced outside the
> gate, we do have the ability to hold VMs if we know they are
> displaying the bug, if they are caught before the VM in question is
> scheduled for deletion. In this case, make your intentions known in a
> discussion with a member of infra-root. A conversation will ensue
> involving what to do to get you what you need to continue debugging.
>

Why? Is space really that expensive? It boggles my mind a little that
we have a well-financed foundation (afaik, correct me if I am wrong...)
yet can't save 'all' the things in a smart manner (saving all the VM
snapshots doesn't mean saving hundreds/thousands of gigabytes when you
are using de-duping cinder/glance... backends). Expire those VMs after
a week if that helps, but it feels like we shouldn't be so conservative
about developers' need to have access to all the VMs that the gate
used/created... It's not like developers are trying to 'harm' OpenStack
by investigating root issues that raw access to the VM images can help
resolve (in fact, quite the contrary).

> It doesn't work in all cases, but some have found it helpful. Keep in
> mind you will be asked to demonstrate you have tried all other avenues
> before this one is exercised.
>
> Thanks,
> Anita.

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev