On Aug 28, 2014, at 2:15 PM, Sean Dague <s...@dague.net> wrote:

> On 08/28/2014 01:48 PM, Doug Hellmann wrote:
>>
>> On Aug 28, 2014, at 1:17 PM, Sean Dague <s...@dague.net> wrote:
>>
>>> On 08/28/2014 12:48 PM, Doug Hellmann wrote:
>>>>
>>>> On Aug 27, 2014, at 5:56 PM, Sean Dague <s...@dague.net> wrote:
>>>>
>>>>> On 08/27/2014 05:27 PM, Doug Hellmann wrote:
>>>>>>
>>>>>> On Aug 27, 2014, at 2:54 PM, Sean Dague <s...@dague.net> wrote:
>>>>>>
>>>>>>> Note: thread intentionally broken, this is really a different topic.
>>>>>>>
>>>>>>> On 08/27/2014 02:30 PM, Doug Hellmann wrote:
>>>>>>>>
>>>>>>>> On Aug 27, 2014, at 1:30 PM, Chris Dent <chd...@redhat.com> wrote:
>>>>>>>>
>>>>>>>>> On Wed, 27 Aug 2014, Doug Hellmann wrote:
>>>>>>>>>
>>>>>>>>>> I have found it immensely helpful, for example, to have a written set of the steps involved in creating a new library, from importing the git repo all the way through to making it available to other projects. Without those instructions, it would have been much harder to split up the work. The team would have had to train each other by word of mouth, and we would have had constant issues with inconsistent approaches triggering different failures. The time we spent building and verifying the instructions has paid off to the extent that we even had one developer not on the core team handle a graduation for us.
>>>>>>>>>
>>>>>>>>> +many more for the relatively simple act of just writing stuff down
>>>>>>>>
>>>>>>>> "Write it down." is my theme for Kilo.
>>>>>>>
>>>>>>> I definitely get the sentiment. "Write it down" is also hard when you are talking about things that change around quite a bit. OpenStack as a whole sees 250-500 changes a week, so the interaction pattern moves around enough that it's really easy to have *very* stale information written down. Stale information is sometimes even more dangerous than no information, as it takes people down very wrong paths.
>>>>>>>
>>>>>>> I think we break down on communication when we get into a conversation of "I want to learn gate debugging" because I don't quite know what that means, or where the starting point of understanding is. So those intentions are well meaning, but tend to stall. The reality was that there was no road map for those of us who dove in; it's just understanding how OpenStack holds together as a whole and where some of the high-risk parts are. And a lot of that comes with days of staring at code and logs until patterns emerge.
>>>>>>>
>>>>>>> Maybe if we can get smaller, more targeted questions, we can help folks better? I'm personally a big fan of answering the targeted questions, because then I also know that the time spent exposing that information was directly useful.
>>>>>>>
>>>>>>> I'm more than happy to mentor folks. But I just end up finding the "I want to learn" at the generic level something that's hard to grasp onto or turn into action. I'd love to hear more ideas from folks about ways we might do that better.
>>>>>>
>>>>>> You and a few others have developed an expertise in this important skill. I am so far away from that level of expertise that I don't know the questions to ask.
>>>>>> More often than not I start with the console log, find something that looks significant, spend an hour or so tracking it down, and then have someone tell me that it is a red herring and the issue is really some other thing that they figured out very quickly by looking at a file I never got to.
>>>>>>
>>>>>> I guess what I'm looking for is some help with the patterns. What made you think to look in one log file versus another? Some of these jobs save a zillion little files; which ones are actually useful? What tools are you using to correlate log entries across all of those files? Are you doing it by hand? Is logstash useful for that, or is it more useful for finding multiple occurrences of the same issue?
>>>>>>
>>>>>> I realize there's not a way to write a how-to that will live forever. Maybe one way to deal with that is to write up the research done on bugs soon after they are solved, and publish that to the mailing list. Even the retrospective view is useful because we can all learn from it without having to live through it. The mailing list is a fairly ephemeral medium, and something very old in the archives is understood to have a good chance of being out of date, so we don't have to keep adding disclaimers.
>>>>>
>>>>> Sure. Matt's actually working up a blog post describing the thing he nailed earlier in the week.
>>>>
>>>> Yes, I appreciate that both of you are responding to my questions. :-)
>>>>
>>>> I have some more specific questions/comments below. Please take all of this in the spirit of trying to make this process easier by pointing out where I've found it hard, and not just me complaining. I'd like to work on fixing any of these things that can be fixed, by writing or reviewing patches early in Kilo.
>>>>
>>>>> Here is my off-the-cuff set of guidelines:
>>>>>
>>>>> #1 - is it a test failure or a setup failure
>>>>>
>>>>> This should be pretty easy to figure out. Test failures come at the end of the console log and say that tests failed (after you see a bunch of passing tempest tests).
>>>>>
>>>>> Always start at *the end* of files and work backwards.
>>>>
>>>> That's interesting, because in my case I saw a lot of failures after the initial "real" problem. So I usually read the logs like C compiler output: assume the first error is real, and the others might have been caused by that one. Do you work from the bottom up to a point where you don't see any more errors, instead of reading top down?
>>>
>>> Bottom up to get to problems, then figure out if it's in a subprocess, so the problem could have existed for a while. That being said, not all tools do useful things like actually error when they fail (I'm looking at you, yum...), so there are always edge cases here.
>>>
>>>>> #2 - if it's a test failure, what API call was unsuccessful?
>>>>>
>>>>> Start with looking at the API logs for the service at the top level, and see if there is a simple traceback at the right timestamp. If not, figure out what that API call was calling out to, and again look at the simple cases, assuming failures will create ERRORs or TRACEs (though they often don't).
>>>>
>>>> In my case, a neutron call failed. Most of the other services seem to have a *-api.log file, but neutron doesn't.
>>>> It took a little while to find the API-related messages in screen-q-svc.txt (I'm glad I've been around long enough to know it used to be called "quantum"). I get that screen-n-*.txt would collide with nova. Is it necessary to abbreviate those filenames at all?
>>>
>>> Yeh... service naming could definitely be better, especially with neutron. There are implications for long names in screen, but maybe we just get over it, as we already have too many tabs to fit on one page in the console anymore anyway.
>>>
>>>>> Hints on the service log order you should go after are in the footer of every log page - http://logs.openstack.org/76/79776/15/gate/gate-tempest-dsvm-full/700ee7e/logs/ (it's included as an Apache footer) for some services. It's been there for about 18 months; I think people are fully blind to it at this point.
>>>>
>>>> Where would I go to edit that footer to add information about the neutron log files? Is that Apache footer defined in an infra repo?
>>>
>>> Note the following at the end of the footer output:
>>>
>>> About this Help
>>>
>>> This help file is part of the openstack-infra/config project, and can be found at modules/openstack_project/files/logs/help/tempest_logs.html. The file can be updated via the standard OpenStack Gerrit Review process.
>>
>> /me smacks forehead
>
> :)
>
> Also note an early version of this base email is at the top level for all runs (i.e. http://logs.openstack.org/76/79776/15/gate/gate-tempest-dsvm-full/700ee7e/).
>
> It's been there about 18 months. People look right past it. Which is part of why I'm skeptical that just writing things down is the solution. Because a bunch of it has been written down. But until people are in a mode of pulling the information in, pushing it out doesn't help.
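(As a rough illustration of step #2 above - finding the failed API call's ERROR/TRACE lines at the right timestamp - a minimal sketch. The file names, timestamp format, and window are assumptions for illustration, not anything the gate provides:

    # Sketch: scan downloaded service logs for ERROR/TRACE lines near the
    # failure time. Adjust the timestamp format, window, and glob to match
    # what the job actually produced.
    import datetime
    import glob

    FAIL_AT = datetime.datetime(2014, 8, 27, 18, 32, 44)  # from the failed test
    WINDOW = datetime.timedelta(seconds=30)

    for path in glob.glob('logs/screen-*.txt'):
        with open(path, errors='replace') as f:
            for line in f:
                if ' ERROR ' not in line and ' TRACE ' not in line:
                    continue
                try:
                    ts = datetime.datetime.strptime(line[:19], '%Y-%m-%d %H:%M:%S')
                except ValueError:
                    continue  # continuation line without a timestamp
                if abs(ts - FAIL_AT) <= WINDOW:
                    print(path, line.rstrip())

)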
Fair enough.

>>>> Another specific issue I've seen is a message that says something to the effect of "the setup for this job failed, check the appropriate log". I found 2 files with "setup" in the name, but the failure was actually logged in a different file (devstacklog.txt). Is the job definition too far "removed" from the scripts to know what the real filename is? Is it running scripts that log to multiple files during the setup phase, and so it doesn't know which to refer me to? Or maybe I overlooked a message about when logging to a specific file started.
>>>
>>> Part of the issue here is that devstack-gate runs a lot of different gate_hooks. So that's about as specific as we can get unless you can figure out how to introspect that info in bash... which I couldn't.
>>
>> Are all of the hooks logging to the same file? If not, why not? Would it make sense to change that so the error messages could be more specific?
>
> They are not; output direction is actually typically a function of the hook script and not devstack-gate.
>
> Some of this is because the tools, when run locally, need to be able to natively support logging. Some of this is because processing logs into elastic search requires that we understand the log format (a generic gate_hook log wouldn't work well there). Some of it is historical.

OK, that makes sense.

> I did spend a bunch of time cleaning up the grenade summary log so that in the console you get some basic idea of what's going on, and what part you failed in. Definitely could be better. Taking some of those summary lessons into devstack wouldn't hurt either.

I don't think I've hit a grenade issue, so I haven't seen that.

> So patches here are definitely accepted. Which is very much not a blow off, but in cleaning d-g up over the last 6 months "the setup for this job failed, check the appropriate log" was about as good as we could figure out. Previously the script just died, and people usually blamed an error message about uploading artifacts in the jenkins output for the failure. So if you can figure out a better UX given the constraints we're in, definitely appreciated.

Yep, I'm asking if I'm even thinking in the right directions, and that sounds like a "yes" rather than a blow off.

I'll look at the job definitions and see if I can come up with a way to parameterize them or automate the step of figuring out which file is meant for each phase.

>>>>> If nothing jumps out at ERROR or TRACE, go back to DEBUG level and figure out what's happening at the time of failure, especially keeping an eye out for areas where other workers are doing interesting things at the same time, possibly indicating state corruption in OpenStack as a race.
>>>>>
>>>>> #3 - if it's a console failure, start at the end and work backwards
>>>>>
>>>>> devstack and grenade run under set -o errexit so that they will critically exit if a command fails. They will typically dump some debug when they do that. So the failing command won't be the last line in the file, but it will be close. The word 'error' typically isn't useful at all in shell, because lots of things say error when they aren't errors; we mask their exit codes if their failure is generally irrelevant.
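(A minimal sketch of the "start at the end and work backwards" idea in #3, for a locally downloaded devstack log. The file name is an example; the only format assumption is bash's xtrace '+ ' prefix on echoed commands:

    # Sketch: show the last commands bash echoed under xtrace ('+ ' prefix)
    # and the tail of a devstack log, which is usually where the errexit
    # bailout lands. The file name here is an example.
    from collections import deque

    last_lines = deque(maxlen=40)
    last_commands = deque(maxlen=5)

    with open('logs/devstacklog.txt', errors='replace') as f:
        for line in f:
            last_lines.append(line.rstrip())
            # xtrace echoes each command with a '+ ' prefix; in gate logs it
            # may sit after a timestamp, so look near the start of the line
            if '+ ' in line[:40]:
                last_commands.append(line.rstrip())

    print('last commands echoed before exit:')
    print('\n'.join(last_commands))
    print('\ntail of the log:')
    print('\n'.join(last_lines))

)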
>>>>>
>>>>> #4 - general principle: the closer to root cause the better
>>>>>
>>>>> If we think of exposure of bugs as layers, we probably end up with something like this:
>>>>>
>>>>> - Console log
>>>>> - Test Name + Failure
>>>>> - Failure inside an API service
>>>>> - Failure inside a worker process
>>>>> - Actual failure figured out in OpenStack code path
>>>>> - Failure in something below OpenStack (kernel, libvirt)
>>>>>
>>>>> This is why signatures that are just test names aren't all that useful much of the time (and why we try not to add those to ER), as that's going to be hitting an API, but the why of things is very much still undiscovered.
>>>>>
>>>>> #5 - if it's an infrastructure-level setup bug (failing to download or install something), figure out if there are other similar events at the same time (i.e. it's a network issue, which we can't fix) vs. a structural issue.
>>>>>
>>>>> I find Elastic Search good for step 5, but realistically for all the other steps it's manual log sifting. I open lots of tabs in Chrome, and search by timestamp.
>>>>
>>>> This feels like something we could improve on. If we had a tool to download the logs and interleave the messages using their timestamps, would that make it easier? We could probably make the job log everything to a single file, but I can see where sometimes only having part of the data to look at would be more useful.
>>>
>>> Maybe. I find the ability to change the filtering level dynamically to be pretty important. We actually did some of this once when we used syslog. Personally I found it a ton harder to get to the bottom of things.
>>>
>>> A gate run has 25+ services running, and it's a rare issue that requires combining interactions between more than 4 of them to get to a solution. So I expect you'd exchange context jumping for tons of irrelevancy. That's a personal opinion based on personal workflow, and why I never spent time on it. Instead I built os-loganalyze, which does the filtering and coloring dynamically on the server side, as it was a zero-install solution that provided the additional benefit of being able to link to a timestamp in a log for sharing purposes.
>>
>> Sure, that makes sense.
>>
>>>>> A big part of the experience also just comes from a manual Bayesian filter. Certain scary-looking things in the console log aren't, but you don't know that unless you look at setup logs enough (either in the gate or in your own devstacks) to realize that. Sanitizing the output of that part of the process is pretty intractable... because shell (though I've put some serious effort into it over the last 6 months).
>>>>
>>>> Maybe our scripts can emit messages to explain the scary stuff? "This is going to report a problem, but you can ignore it unless X happens."?
>>>
>>> Maybe, like I said it's a lot better than it used to be. But very few people are putting in effort here, and I'm not convinced it's really solvable in bash.
>>
>> OK, well, if the answers to these questions are "yes" then I should have time to help, which is why I'm exploring options.
>
> Yeh, the issue is that you'd need a couple hundred different messages like that, and realistically I think they'd lead to more confusion rather than less.
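(For what it's worth, a minimal sketch of the interleave-by-timestamp idea mentioned above. It assumes the service logs are already downloaded and that log entries start with a "YYYY-MM-DD HH:MM:SS" style timestamp; continuation lines such as tracebacks stay attached to the entry they follow:

    # Sketch: merge already-downloaded service logs into one stream ordered
    # by timestamp. File names and timestamp format are assumptions; lines
    # without a leading timestamp are kept with the preceding entry.
    import glob
    import heapq
    import re

    TS = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')

    def entries(path):
        current = None
        with open(path, errors='replace') as f:
            for line in f:
                if TS.match(line):
                    if current:
                        yield current
                    current = (line[:23], path, [line.rstrip()])
                elif current:
                    current[2].append(line.rstrip())
        if current:
            yield current

    streams = [entries(p) for p in glob.glob('logs/screen-*.txt')]
    # each file is already in time order, so a heap merge keeps the output sorted
    for ts, path, lines in heapq.merge(*streams):
        for line in lines:
            print(path, line)

)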
>
> Honestly, I did a huge amount of selective filtering out of xtrace logs in the last six months and was able to drop the size of the devstack logs by over 50%, getting rid of some of the more confusing trace bits. But it's something that you make progress on 1% at a time.
>
> At some point we do need to say "you have to understand OpenStack and the test run process ^this much^ to be able to ride", because cleaning up every small thing isn't really possible.
>
> Now, providing a better flow explaining the parts here might be good. We do it during Infra bootcamps, and people find it helpful. But again, that's a mostly pull model, because the people showing up did so specifically to learn, so they are much more receptive to the information at hand.
>
>>>>> Sanitizing the OpenStack logs to be crisp about actual things going wrong, vs. not, shouldn't be intractable, but it feels like it sometimes. Which is why all operators run at DEBUG level. The thing that makes it hard for developers to see the issues here is the same thing that makes it *really* hard for operators to figure out failures. It's also why I tried (though executed it poorly, sorry about that) getting log cleanups rolling this cycle.
>>>>
>>>> I would like to have the TC back an official cross-project effort to clean up the logs for Kilo, and get all of the integrated projects to commit to working on it as a priority.
>>>>
>>>> Doug
>>>>
>>>>> -Sean

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev