On 07/03/2014 04:34 PM, Kevin Benton wrote:
> Yes, I can propose a spec for that. It probably won't be until Monday.
> Is that okay?
>
Sure, that's fine. Thanks Kevin, I look forward to your spec once it is
up. Enjoy tomorrow. :D

Thanks Kevin,
Anita.
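P.S. To make the periodic-check idea quoted below a bit more concrete, here is a
rough Python sketch of the kind of per-CI statistic such runs would enable. It is
not part of any existing tool, and the record format and field names are made up
for illustration:

    # Hypothetical sketch: estimate each CI system's false-failure rate from
    # periodic runs against the head of master, which should always pass.
    # The run records (ci_name/result) are an assumed format, not an
    # existing schema.
    from collections import defaultdict

    def master_head_stats(runs):
        """runs: iterable of dicts with 'ci_name' and 'result' keys."""
        counts = defaultdict(lambda: {"pass": 0, "fail": 0})
        for run in runs:
            outcome = "pass" if run["result"] == "SUCCESS" else "fail"
            counts[run["ci_name"]][outcome] += 1
        stats = {}
        for ci, c in counts.items():
            total = c["pass"] + c["fail"]
            # A failure against master head is presumed to be the CI system's
            # problem, since master is expected to be passing.
            stats[ci] = {"runs": total,
                         "false_failure_rate": c["fail"] / float(total)}
        return stats

    sample = [
        {"ci_name": "vendor-x-ci", "result": "SUCCESS"},
        {"ci_name": "vendor-x-ci", "result": "SUCCESS"},
        {"ci_name": "vendor-x-ci", "result": "FAILURE"},
        {"ci_name": "vendor-y-ci", "result": "SUCCESS"},
    ]
    print(master_head_stats(sample))

The value of anchoring on the head of master is that a failure there can
reasonably be attributed to the CI system itself rather than to the patch under
test, which is exactly the "take the CI system's word for it" problem Kevin
mentions below.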
>
> On Thu, Jul 3, 2014 at 11:42 AM, Anita Kuno <[email protected]> wrote:
>
>> On 07/03/2014 02:33 PM, Kevin Benton wrote:
>>> Maybe we can require periodic checks against the head of the master
>>> branch (which should always pass) and build statistics based on the
>>> results of that.
>> I like this suggestion. I really like this suggestion.
>>
>> Hmmmm, what to do with a good suggestion? I wonder if we could capture
>> it in an infra-spec and work on it from there.
>>
>> Would you feel comfortable offering a draft as an infra-spec and then
>> perhaps we can discuss the design through the spec?
>>
>> What do you think?
>>
>> Thanks Kevin,
>> Anita.
>>
>>> Otherwise it seems like we have to take a CI system's word for it
>>> that a particular patch indeed broke that system.
>>>
>>> --
>>> Kevin Benton
>>>
>>> On Thu, Jul 3, 2014 at 11:07 AM, Anita Kuno <[email protected]> wrote:
>>>
>>>> On 07/03/2014 01:27 PM, Kevin Benton wrote:
>>>>>> This allows the viewer to see categories of reviews based upon their
>>>>>> divergence from OpenStack's Jenkins results. I think evaluating
>>>>>> divergence from Jenkins might be a metric worth consideration.
>>>>>
>>>>> I think the only thing this really reflects, though, is how much the
>>>>> third party CI system is mirroring Jenkins.
>>>>> A system that frequently diverges may be functioning perfectly fine and
>>>>> just has a vastly different code path that it is integration testing, so
>>>>> it is legitimately detecting failures the OpenStack CI cannot.
>>>> Great.
>>>>
>>>> How do we measure the degree to which it is legitimately detecting
>>>> failures?
>>>>
>>>> Thanks Kevin,
>>>> Anita.
>>>>>
>>>>> --
>>>>> Kevin Benton
>>>>>
>>>>> On Thu, Jul 3, 2014 at 6:49 AM, Anita Kuno <[email protected]> wrote:
>>>>>
>>>>>> On 07/03/2014 07:12 AM, Salvatore Orlando wrote:
>>>>>>> Apologies for quoting again the top post of the thread.
>>>>>>>
>>>>>>> Comments inline (mostly thinking aloud)
>>>>>>> Salvatore
>>>>>>>
>>>>>>> On 30 June 2014 22:22, Jay Pipes <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Stackers,
>>>>>>>>
>>>>>>>> Some recent ML threads [1] and a hot IRC meeting today [2] brought up
>>>>>>>> some legitimate questions around how a newly-proposed Stackalytics
>>>>>>>> report page for Neutron External CI systems [3] represented the
>>>>>>>> results of an external CI system as "successful" or not.
>>>>>>>>
>>>>>>>> First, I want to say that Ilya and all those involved in the
>>>>>>>> Stackalytics program simply want to provide the most accurate
>>>>>>>> information to developers in a format that is easily consumed. While
>>>>>>>> there need to be some changes in how data is shown (and the wording
>>>>>>>> of things like "Tests Succeeded"), I hope that the community knows
>>>>>>>> there isn't any ill intent on the part of Mirantis or anyone who
>>>>>>>> works on Stackalytics. OK, so let's keep the conversation civil --
>>>>>>>> we're all working towards the same goals of transparency and
>>>>>>>> accuracy. :)
>>>>>>>>
>>>>>>>> Alright, now, Anita and Kurt Taylor were asking a very poignant
>>>>>>>> question:
>>>>>>>>
>>>>>>>> "But what does CI tested really mean? Just running tests? Or tested
>>>>>>>> to pass some level of requirements?"
>>>>>>>>
>>>>>>>> In this nascent world of external CI systems, we have a set of issues
>>>>>>>> that we need to resolve:
>>>>>>>>
>>>>>>>> 1) All of the CI systems are different.
>>>>>>>>
>>>>>>>> Some run Bash scripts. Some run Jenkins slaves and devstack-gate
>>>>>>>> scripts. Others run custom Python code that spawns VMs and publishes
>>>>>>>> logs to some public domain.
>>>>>>>>
>>>>>>>> As a community, we need to decide whether it is worth putting in the
>>>>>>>> effort to create a single, unified, installable and runnable CI
>>>>>>>> system, so that we can legitimately say "all of the external systems
>>>>>>>> are identical, with the exception of the driver code for vendor X
>>>>>>>> being substituted in the Neutron codebase."
>>>>>>>
>>>>>>> I think such a system already exists, and it's documented here:
>>>>>>> http://ci.openstack.org/
>>>>>>> Still, understanding it is quite a learning curve, and running it is
>>>>>>> not exactly straightforward. But I guess that's pretty much
>>>>>>> understandable given the complexity of the system, isn't it?
>>>>>>>
>>>>>>>> If the goal of the external CI systems is to produce reliable,
>>>>>>>> consistent results, I feel the answer to the above is "yes", but I'm
>>>>>>>> interested to hear what others think. Frankly, in the world of
>>>>>>>> benchmarks, it would be unthinkable to say "go ahead and everyone run
>>>>>>>> your own benchmark suite", because you would get wildly different
>>>>>>>> results. A similar problem has emerged here.
>>>>>>>
>>>>>>> I don't think the particular infrastructure, which might range from an
>>>>>>> openstack-ci clone to a 100-line bash script, would have an impact on
>>>>>>> the "reliability" of the quality assessment regarding a particular
>>>>>>> driver or plugin. This is determined, in my opinion, by the quantity
>>>>>>> and nature of tests one runs on a specific driver. In Neutron, for
>>>>>>> instance, there is a wide range of choices - from a few test cases in
>>>>>>> tempest.api.network to the full smoketest job. As long as there is no
>>>>>>> minimal standard here, it would be difficult to assess the quality of
>>>>>>> the evaluation from a CI system, unless we explicitly take coverage
>>>>>>> into account in the evaluation.
>>>>>>>
>>>>>>> On the other hand, different CI infrastructures will have different
>>>>>>> levels in terms of % of patches tested and % of infrastructure
>>>>>>> failures. I think it might not be a terrible idea to use these
>>>>>>> parameters to evaluate how good a CI is from an infra standpoint.
>>>>>>> However, there are still open questions. For instance, a CI might have
>>>>>>> a low patch % score because it only needs to test patches affecting a
>>>>>>> given driver.
>>>>>>>
>>>>>>>> 2) There is no mediation or verification that the external CI system
>>>>>>>> is actually testing anything at all.
>>>>>>>>
>>>>>>>> As a community, we need to decide whether the current system of
>>>>>>>> self-policing should continue. If it should, then language on reports
>>>>>>>> like [3] should be very clear that any numbers derived from such
>>>>>>>> systems should be taken with a grain of salt. Use of the word
>>>>>>>> "Success" should be avoided, as it has connotations (in English, at
>>>>>>>> least) that the result has been verified, which is simply not the
>>>>>>>> case as long as no verification or mediation occurs for any external
>>>>>>>> CI system.
>>>>>>>>
>>>>>>>> 3) There is no clear indication of what tests are being run, and
>>>>>>>> therefore there is no clear indication of what "success" is.
>>>>>>>>
>>>>>>>> I think we can all agree that a test has three possible outcomes:
>>>>>>>> pass, fail, and skip. The results of a test suite run are therefore
>>>>>>>> nothing more than the aggregation of which tests passed, which
>>>>>>>> failed, and which were skipped.
>>>>>>>>
>>>>>>>> As a community, we must document, for each project, the expected set
>>>>>>>> of tests that must be run for each patch merged into the project's
>>>>>>>> source tree. This documentation should be discoverable so that
>>>>>>>> reports like [3] can be crystal-clear on what the data shown actually
>>>>>>>> means. The report is simply displaying the data it receives from
>>>>>>>> Gerrit. The community needs to be proactive in saying "this is what
>>>>>>>> is expected to be tested." This alone would allow the report to give
>>>>>>>> information such as "External CI system ABC performed the expected
>>>>>>>> tests. X tests passed. Y tests failed. Z tests were skipped."
>>>>>>>> Likewise, it would also make it possible for the report to give
>>>>>>>> information such as "External CI system DEF did not perform the
>>>>>>>> expected tests.", which is excellent information in and of itself.
>>>>>>>
>>>>>>> Agreed. In Neutron we have enforced CIs but not yet agreed on what's
>>>>>>> the minimum set of tests we expect them to run. I reckon this will be
>>>>>>> fixed soon.
>>>>>>>
>>>>>>> I'll try to look at what "SUCCESS" is from a naive standpoint: a CI
>>>>>>> says "SUCCESS" if the test suite it ran passed; then one should have
>>>>>>> means to understand whether a CI might blatantly lie or tell "half
>>>>>>> truths". For instance, saying it passes tempest.api.network while
>>>>>>> tempest.scenario.test_network_basic_ops has not been executed is a
>>>>>>> half truth, in my opinion.
>>>>>>> Stackalytics can help here, I think. One could create "CI classes"
>>>>>>> according to how close they are to the level of the upstream gate, and
>>>>>>> then parse the posted results to classify CIs. Now, before cursing me,
>>>>>>> I totally understand that this won't be easy at all to implement!
>>>>>>> Furthermore, I don't know how this should be reflected in gerrit.
>>>>>>>
>>>>>>>> ===
>>>>>>>>
>>>>>>>> In thinking about the likely answers to the above questions, I
>>>>>>>> believe it would be prudent to change the Stackalytics report in
>>>>>>>> question [3] in the following ways:
>>>>>>>>
>>>>>>>> a. Change the "Success %" column header to "% Reported +1 Votes"
>>>>>>>> b. Change the phrase "Green cell - tests ran successfully, red cell -
>>>>>>>> tests failed" to "Green cell - System voted +1, red cell - System
>>>>>>>> voted -1"
>>>>>>>
>>>>>>> That makes sense to me.
>>>>>>>
>>>>>>>> and then, when we have more and better data (for example, # tests
>>>>>>>> passed, failed, skipped, etc.), we can provide more detailed
>>>>>>>> information than just "reported +1" or not.
>>>>>>>
>>>>>>> I think it should not be too hard to start adding minimal measures
>>>>>>> such as "% of voted patches".
>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> -jay
>>>>>>>>
>>>>>>>> [1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
>>>>>>>> [2] http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
>>>>>>>> [3] http://stackalytics.com/report/ci/neutron/7
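The per-patch summary Jay describes above maps naturally to a small amount of
code. A rough sketch, assuming the test results are already available as a simple
mapping of test name to outcome and that the project documents its expected test
set; the test names below are only examples, and parsing a real subunit stream or
log artifact is left out:

    # Rough sketch of the summary Jay describes: check what a CI system ran
    # against the documented expected set and aggregate pass/fail/skip.
    # EXPECTED_TESTS and the results mapping are assumed inputs for
    # illustration, not an existing interface.
    EXPECTED_TESTS = {
        "tempest.api.network.test_networks",
        "tempest.scenario.test_network_basic_ops",
    }

    def summarize(ci_name, results):
        """results: dict mapping test name -> 'pass', 'fail' or 'skip'."""
        missing = EXPECTED_TESTS - set(results)
        if missing:
            return ("External CI system %s did not perform the expected tests "
                    "(missing: %s)." % (ci_name, ", ".join(sorted(missing))))
        counts = {"pass": 0, "fail": 0, "skip": 0}
        for outcome in results.values():
            counts[outcome] += 1
        return ("External CI system %s performed the expected tests. "
                "%d tests passed. %d tests failed. %d tests were skipped."
                % (ci_name, counts["pass"], counts["fail"], counts["skip"]))

    print(summarize("ABC", {
        "tempest.api.network.test_networks": "pass",
        "tempest.scenario.test_network_basic_ops": "skip",
    }))

Whatever the real implementation looks like, the point stands that the expected
test set has to be documented and discoverable before any report can say
something stronger than "voted +1".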
>>>>>>
>>>>>> Thanks for sharing your thoughts, Salvatore.
>>>>>>
>>>>>> Some additional things to look at:
>>>>>>
>>>>>> Sean Dague has created a tool in stackforge, gerrit-dash-creator:
>>>>>> http://git.openstack.org/cgit/stackforge/gerrit-dash-creator/tree/README.rst
>>>>>> which has the ability to make interesting queries on gerrit results.
>>>>>> One such example can be found here:
>>>>>> http://paste.openstack.org/show/85416/
>>>>>> (Note: when this url was created there was a bug in the syntax, so it
>>>>>> works in Chrome but not Firefox. Sean tells me the Firefox bug has been
>>>>>> addressed, though this url hasn't been updated to the new syntax yet.)
>>>>>>
>>>>>> This allows the viewer to see categories of reviews based upon their
>>>>>> divergence from OpenStack's Jenkins results. I think evaluating
>>>>>> divergence from Jenkins might be a metric worth consideration.
>>>>>>
>>>>>> Also a gui representation worth looking at is Mikal Still's gui for
>>>>>> Neutron ci health:
>>>>>> http://www.rcbops.com/gerrit/reports/neutron-cireport.html
>>>>>> and Nova ci health:
>>>>>> http://www.rcbops.com/gerrit/reports/nova-cireport.html
>>>>>>
>>>>>> I don't know the details of how the graphs are calculated in these
>>>>>> pages, but being able to view passed/failed/missed and compare them to
>>>>>> Jenkins is an interesting approach, and I feel it has some merit.
>>>>>>
>>>>>> Thanks, I think we are getting some good information out in this
>>>>>> thread, and I look forward to hearing more thoughts.
>>>>>>
>>>>>> Thank you,
>>>>>> Anita.
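To make the divergence-from-Jenkins idea a little more concrete, here is a rough
sketch that compares a third-party CI account's votes with Jenkins' votes on the
same patch sets. It assumes the votes have already been pulled out of Gerrit into
a simple list of records; the field names below are illustrative, not Gerrit's
actual schema:

    # Illustrative divergence metric: for each patch set where both Jenkins
    # and a given third-party CI voted, count agreements and disagreements.
    # The vote records (change/patchset/account/vote) are assumed to have
    # been extracted from Gerrit beforehand; this is not Gerrit's schema.
    from collections import defaultdict

    def divergence(votes, ci_account, jenkins_account="jenkins"):
        by_patchset = defaultdict(dict)
        for v in votes:
            by_patchset[(v["change"], v["patchset"])][v["account"]] = v["vote"]
        agree = disagree = 0
        for accounts in by_patchset.values():
            if ci_account in accounts and jenkins_account in accounts:
                if accounts[ci_account] == accounts[jenkins_account]:
                    agree += 1
                else:
                    disagree += 1
        compared = agree + disagree
        return {"compared": compared,
                "divergence": disagree / float(compared) if compared else None}

    # One agreement and one disagreement -> divergence of 0.5.
    sample = [
        {"change": 1, "patchset": 1, "account": "jenkins", "vote": 1},
        {"change": 1, "patchset": 1, "account": "vendor-ci", "vote": 1},
        {"change": 2, "patchset": 3, "account": "jenkins", "vote": 1},
        {"change": 2, "patchset": 3, "account": "vendor-ci", "vote": -1},
    ]
    print(divergence(sample, "vendor-ci"))

As Kevin points out earlier in the thread, a high divergence number on its own
does not prove a CI system is wrong; it only flags systems whose results differ
from Jenkins and deserve a closer look, which is where something like the
periodic runs against master head would help.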

_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
