On Aug 28, 2014, at 2:15 PM, Sean Dague <s...@dague.net> wrote:

> On 08/28/2014 01:48 PM, Doug Hellmann wrote:
>>
>> On Aug 28, 2014, at 1:17 PM, Sean Dague <s...@dague.net> wrote:
>>
>>> On 08/28/2014 12:48 PM, Doug Hellmann wrote:
>>>>
>>>> On Aug 27, 2014, at 5:56 PM, Sean Dague <s...@dague.net> wrote:
>>>>
>>>>> On 08/27/2014 05:27 PM, Doug Hellmann wrote:
>>>>>>
>>>>>> On Aug 27, 2014, at 2:54 PM, Sean Dague <s...@dague.net> wrote:
>>>>>>
>>>>>>> Note: thread intentionally broken, this is really a different topic.
>>>>>>>
>>>>>>> On 08/27/2014 02:30 PM, Doug Hellmann wrote:
>>>>>>>>
>>>>>>>> On Aug 27, 2014, at 1:30 PM, Chris Dent <chd...@redhat.com> wrote:
>>>>>>>>
>>>>>>>>> On Wed, 27 Aug 2014, Doug Hellmann wrote:
>>>>>>>>>
>>>>>>>>>> I have found it immensely helpful, for example, to have a written set of the steps involved in creating a new library, from importing the git repo all the way through to making it available to other projects. Without those instructions, it would have been much harder to split up the work. The team would have had to train each other by word of mouth, and we would have had constant issues with inconsistent approaches triggering different failures. The time we spent building and verifying the instructions has paid off to the extent that we even had one developer not on the core team handle a graduation for us.
>>>>>>>>>
>>>>>>>>> +many more for the relatively simple act of just writing stuff down
>>>>>>>>
>>>>>>>> "Write it down." is my theme for Kilo.
>>>>>>>
>>>>>>> I definitely get the sentiment. "Write it down" is also hard when you are talking about things that change around quite a bit. OpenStack as a whole sees 250-500 changes a week, so the interaction pattern moves around enough that it's really easy to have *very* stale information written down. Stale information is sometimes even more dangerous than no information, as it takes people down very wrong paths.
>>>>>>>
>>>>>>> I think we break down on communication when we get into a conversation of "I want to learn gate debugging" because I don't quite know what that means, or where the starting point of understanding is. So those intentions are well meaning, but tend to stall. The reality was that there was no road map for those of us who dove in; it's just understanding how OpenStack holds together as a whole and where some of the high-risk parts are. And a lot of that comes with days of staring at code and logs until patterns emerge.
>>>>>>>
>>>>>>> Maybe if we can get smaller, more targeted questions, we can help folks better? I'm personally a big fan of answering the targeted questions, because then I also know that the time spent exposing that information was directly useful.
>>>>>>>
>>>>>>> I'm more than happy to mentor folks. But I just end up finding the "I want to learn" at the generic level something that's hard to grasp onto or turn into action. I'd love to hear more ideas from folks about ways we might do that better.
>>>>>>
>>>>>> You and a few others have developed an expertise in this important skill. I am so far away from that level of expertise that I don't know the questions to ask.
>>>>>> More often than not I start with the console log, find something that looks significant, spend an hour or so tracking it down, and then have someone tell me that it is a red herring and the issue is really some other thing that they figured out very quickly by looking at a file I never got to.
>>>>>>
>>>>>> I guess what I'm looking for is some help with the patterns. What made you think to look in one log file versus another? Some of these jobs save a zillion little files; which ones are actually useful? What tools are you using to correlate log entries across all of those files? Are you doing it by hand? Is logstash useful for that, or is it more useful for finding multiple occurrences of the same issue?
>>>>>>
>>>>>> I realize there's not a way to write a how-to that will live forever. Maybe one way to deal with that is to write up the research done on bugs soon after they are solved, and publish that to the mailing list. Even the retrospective view is useful because we can all learn from it without having to live through it. The mailing list is a fairly ephemeral medium, and something very old in the archives is understood to have a good chance of being out of date, so we don't have to keep adding disclaimers.
>>>>>
>>>>> Sure. Matt's actually working up a blog post describing the thing he nailed earlier in the week.
>>>>
>>>> Yes, I appreciate that both of you are responding to my questions. :-)
>>>>
>>>> I have some more specific questions/comments below. Please take all of this in the spirit of trying to make this process easier by pointing out where I've found it hard, and not just me complaining. I'd like to work on fixing any of these things that can be fixed, by writing or reviewing patches early in Kilo.
>>>>
>>>>> Here is my off-the-cuff set of guidelines:
>>>>>
>>>>> #1 - is it a test failure or a setup failure
>>>>>
>>>>> This should be pretty easy to figure out. Test failures come at the end of the console log and say that tests failed (after you see a bunch of passing tempest tests).
>>>>>
>>>>> Always start at *the end* of files and work backwards.
>>>>
>>>> That's interesting, because in my case I saw a lot of failures after the initial "real" problem. So I usually read the logs like C compiler output: assume the first error is real, and the others might have been caused by that one. Do you work from the bottom up to a point where you don't see any more errors, instead of reading top down?
>>>
>>> Bottom up to get to problems, then figure out if it's in a subprocess, so the problem could have existed for a while. That being said, not all tools do useful things like actually error when they fail (I'm looking at you, yum...), so there are always edge cases here.
>>>
>>>>> #2 - if it's a test failure, what API call was unsuccessful?
>>>>>
>>>>> Start with looking at the API logs for the service at the top level, and see if there is a simple traceback at the right timestamp. If not, figure out what that API call was calling out to, and again look at the simple cases, assuming failures will create ERRORs or TRACEs (though they often don't).
>>>>
>>>> In my case, a neutron call failed. Most of the other services seem to have a *-api.log file, but neutron doesn't.
>>>> It took a little while to find the API-related messages in screen-q-svc.txt (I'm glad I've been around long enough to know it used to be called "quantum"). I get that screen-n-*.txt would collide with nova. Is it necessary to abbreviate those filenames at all?
>>>
>>> Yeh... service naming could definitely be better, especially with neutron. There are implications for long names in screen, but maybe we just get over it, as we already have too many tabs to fit on one page in the console anymore anyway.
>>>
>>>>> Hints on the service log order you should go after are in the footer of every log page - http://logs.openstack.org/76/79776/15/gate/gate-tempest-dsvm-full/700ee7e/logs/ (it's included as an Apache footer) for some services. It's been there for about 18 months; I think people are fully blind to it at this point.
>>>>
>>>> Where would I go to edit that footer to add information about the neutron log files? Is that Apache footer defined in an infra repo?
>>>
>>> Note the following at the end of the footer output:
>>>
>>> About this Help
>>>
>>> This help file is part of the openstack-infra/config project, and can be found at modules/openstack_project/files/logs/help/tempest_logs.html. The file can be updated via the standard OpenStack Gerrit Review process.
>>
>> /me smacks forehead
>
> :)
>
> Also note an early version of this base email is at the top level for all runs (i.e. http://logs.openstack.org/76/79776/15/gate/gate-tempest-dsvm-full/700ee7e/).
>
> It's been there about 18 months. People look right past it. Which is part of why I'm skeptical that just writing things down is the solution. Because a bunch of it has been written down. But until people are in a mode of pulling the information in, pushing it out doesn't help.
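(As a rough illustration of step #2 above - finding the failed API call's ERROR/TRACE lines at the right timestamp - a minimal sketch. The file names, timestamp format, and window are assumptions for illustration, not anything the gate provides:

    # Sketch: scan downloaded service logs for ERROR/TRACE lines near the
    # failure time. Adjust the timestamp format, window, and glob to match
    # what the job actually produced.
    import datetime
    import glob

    FAIL_AT = datetime.datetime(2014, 8, 27, 18, 32, 44)  # from the failed test
    WINDOW = datetime.timedelta(seconds=30)

    for path in glob.glob('logs/screen-*.txt'):
        with open(path, errors='replace') as f:
            for line in f:
                if ' ERROR ' not in line and ' TRACE ' not in line:
                    continue
                try:
                    ts = datetime.datetime.strptime(line[:19], '%Y-%m-%d %H:%M:%S')
                except ValueError:
                    continue  # continuation line without a timestamp
                if abs(ts - FAIL_AT) <= WINDOW:
                    print(path, line.rstrip())

)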
Fair enough.

>>>> Another specific issue I've seen is a message that says something to the effect of "the setup for this job failed, check the appropriate log". I found 2 files with "setup" in the name, but the failure was actually logged in a different file (devstacklog.txt). Is the job definition too far "removed" from the scripts to know what the real filename is? Is it running scripts that log to multiple files during the setup phase, and so it doesn't know which to refer me to? Or maybe I overlooked a message about when logging to a specific file started.
>>>
>>> Part of the issue here is that devstack-gate runs a lot of different gate_hooks. So that's about as specific as we can get unless you can figure out how to introspect that info in bash... which I couldn't.
>>
>> Are all of the hooks logging to the same file? If not, why not? Would it make sense to change that so the error messages could be more specific?
>
> They are not; output direction is actually typically a function of the hook script and not devstack-gate.
>
> Some of this is because the tools, when run locally, need to be able to natively support logging. Some of this is because processing logs into elastic search requires that we understand the log format (a generic gate_hook log wouldn't work well there). Some of it is historical.

OK, that makes sense.

> I did spend a bunch of time cleaning up the grenade summary log so that in the console you get some basic idea of what's going on, and what part you failed in. Definitely could be better. Taking some of those summary lessons into devstack wouldn't hurt either.

I don't think I've hit a grenade issue, so I haven't seen that.

> So patches here are definitely accepted. Which is very much not a blow off, but in cleaning d-g up over the last 6 months "the setup for this job failed, check the appropriate log" was about as good as we could figure out. Previously the script just died, and people usually blamed an error message about uploading artifacts in the jenkins output for the failure. So if you can figure out a better UX given the constraints we're in, definitely appreciated.

Yep, I'm asking if I'm even thinking in the right directions, and that sounds like a "yes" rather than a blow off.

I'll look at the job definitions and see if I can come up with a way to parameterize them or automate the step of figuring out which file is meant for each phase.

>>>>> If nothing jumps out at ERROR or TRACE, go back to DEBUG level and figure out what's happening at the time of failure, especially keeping an eye out for areas where other workers are doing interesting things at the same time, possibly indicating state corruption in OpenStack as a race.
>>>>>
>>>>> #3 - if it's a console failure, start at the end and work backwards
>>>>>
>>>>> devstack and grenade run under set -o errexit so that they will critically exit if a command fails. They will typically dump some debug when they do that. So the failing command won't be the last line in the file, but it will be close. The word 'error' typically isn't useful at all in shell, because lots of things say error when they aren't errors; we mask their exit codes if their failure is generally irrelevant.
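(A minimal sketch of the "start at the end and work backwards" idea in #3, for a locally downloaded devstack log. The file name is an example; the only format assumption is bash's xtrace '+ ' prefix on echoed commands:

    # Sketch: show the last commands bash echoed under xtrace ('+ ' prefix)
    # and the tail of a devstack log, which is usually where the errexit
    # bailout lands. The file name here is an example.
    from collections import deque

    last_lines = deque(maxlen=40)
    last_commands = deque(maxlen=5)

    with open('logs/devstacklog.txt', errors='replace') as f:
        for line in f:
            last_lines.append(line.rstrip())
            # xtrace echoes each command with a '+ ' prefix; in gate logs it
            # may sit after a timestamp, so look near the start of the line
            if '+ ' in line[:40]:
                last_commands.append(line.rstrip())

    print('last commands echoed before exit:')
    print('\n'.join(last_commands))
    print('\ntail of the log:')
    print('\n'.join(last_lines))

)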
>>>>>
>>>>> #4 - general principle: the closer to root cause the better
>>>>>
>>>>> If we think of exposure of bugs as layers, we probably end up with something like this:
>>>>>
>>>>> - Console log
>>>>> - Test Name + Failure
>>>>> - Failure inside an API service
>>>>> - Failure inside a worker process
>>>>> - Actual failure figured out in OpenStack code path
>>>>> - Failure in something below OpenStack (kernel, libvirt)
>>>>>
>>>>> This is why signatures that are just test names aren't all that useful much of the time (and why we try not to add those to ER), as that's going to be hitting an API, but the why of things is very much still undiscovered.
>>>>>
>>>>> #5 - if it's an infrastructure-level setup bug (failing to download or install something), figure out if there are other similar events at the same time (i.e. it's a network issue, which we can't fix) vs. a structural issue.
>>>>>
>>>>> I find Elastic Search good for step 5, but realistically for all the other steps it's manual log sifting. I open lots of tabs in Chrome, and search by timestamp.
>>>>
>>>> This feels like something we could improve on. If we had a tool to download the logs and interleave the messages using their timestamps, would that make it easier? We could probably make the job log everything to a single file, but I can see where sometimes only having part of the data to look at would be more useful.
>>>
>>> Maybe. I find the ability to change the filtering level dynamically to be pretty important. We actually did some of this once when we used syslog. Personally I found it a ton harder to get to the bottom of things.
>>>
>>> A gate run has 25+ services running, and it's a rare issue that requires combining interactions between more than 4 of them to get to a solution. So I expect you'd exchange context jumping for tons of irrelevancy. That's a personal opinion based on personal workflow, and why I never spent time on it. Instead I built os-loganalyze, which does the filtering and coloring dynamically on the server side, as it was a zero-install solution that provided the additional benefit of being able to link to a timestamp in a log for sharing purposes.
>>
>> Sure, that makes sense.
>>
>>>>> A big part of the experience also just comes from a manual Bayesian filter. Certain scary-looking things in the console log aren't, but you don't know that unless you look at setup logs enough (either in the gate or in your own devstacks) to realize that. Sanitizing the output of that part of the process is pretty intractable... because shell (though I've put some serious effort into it over the last 6 months).
>>>>
>>>> Maybe our scripts can emit messages to explain the scary stuff? "This is going to report a problem, but you can ignore it unless X happens."?
>>>
>>> Maybe, like I said it's a lot better than it used to be. But very few people are putting in effort here, and I'm not convinced it's really solvable in bash.
>>
>> OK, well, if the answers to these questions are "yes" then I should have time to help, which is why I'm exploring options.
>
> Yeh, the issue is that you'd need a couple hundred different messages like that, and realistically I think they'd lead to more confusion rather than less.
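(For what it's worth, a minimal sketch of the interleave-by-timestamp idea mentioned above. It assumes the service logs are already downloaded and that log entries start with a "YYYY-MM-DD HH:MM:SS" style timestamp; continuation lines such as tracebacks stay attached to the entry they follow:

    # Sketch: merge already-downloaded service logs into one stream ordered
    # by timestamp. File names and timestamp format are assumptions; lines
    # without a leading timestamp are kept with the preceding entry.
    import glob
    import heapq
    import re

    TS = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')

    def entries(path):
        current = None
        with open(path, errors='replace') as f:
            for line in f:
                if TS.match(line):
                    if current:
                        yield current
                    current = (line[:23], path, [line.rstrip()])
                elif current:
                    current[2].append(line.rstrip())
        if current:
            yield current

    streams = [entries(p) for p in glob.glob('logs/screen-*.txt')]
    # each file is already in time order, so a heap merge keeps the output sorted
    for ts, path, lines in heapq.merge(*streams):
        for line in lines:
            print(path, line)

)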
>
> Honestly, I did a huge amount of selective filtering out of xtrace logs in the last six months and was able to drop the size of the devstack logs by over 50%, getting rid of some of the more confusing trace bits. But it's something that you make progress on 1% at a time.
>
> At some point we do need to say "you have to understand OpenStack and the test run process ^this much^ to be able to ride", because cleaning up every small thing isn't really possible.
>
> Now, providing a better flow explaining the parts here might be good. We do it during Infra bootcamps, and people find it helpful. But again, that's a mostly pull model, because the people showing up did so specifically to learn, so they are much more receptive to the information at hand.
>
>>>>> Sanitizing the OpenStack logs to be crisp about actual things going wrong, vs. not, shouldn't be intractable, but it feels like it sometimes. Which is why all operators run at DEBUG level. The thing that makes it hard for developers to see the issues here is the same thing that makes it *really* hard for operators to figure out failures. It's also why I tried (though executed it poorly, sorry about that) getting log cleanups rolling this cycle.
>>>>
>>>> I would like to have the TC back an official cross-project effort to clean up the logs for Kilo, and get all of the integrated projects to commit to working on it as a priority.
>>>>
>>>> Doug
>>>>
>>>>> -Sean

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev