Yes, I can propose a spec for that. It probably won't be until Monday. Is that okay?
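Purely as an illustration of the idea (not the spec itself), something along these lines is what a periodic master-head check could look like from a third-party CI's side. The run-ci-job.sh script, repository URL, results file, and four-hour interval below are placeholders, not proposed values:

#!/usr/bin/env python
# Rough sketch only: periodically run a third-party CI job against the tip of
# master (which should always pass) and keep simple pass-rate statistics.
# run_job() wraps a hypothetical run-ci-job.sh; swap in whatever the CI
# system actually executes.

import json
import subprocess
import time
from datetime import datetime

REPO = "https://git.openstack.org/openstack/neutron"  # assumed target project
RESULTS_FILE = "master-head-results.json"             # local statistics store
INTERVAL = 4 * 3600                                    # assumed: every four hours


def master_head():
    """Return the SHA currently at the tip of master."""
    out = subprocess.check_output(["git", "ls-remote", REPO, "refs/heads/master"])
    return out.split()[0].decode()


def run_job(commit):
    """Placeholder: run the CI system's usual job against `commit`."""
    return subprocess.call(["./run-ci-job.sh", commit]) == 0


def record(commit, passed):
    """Append one result and print the running master-head pass rate."""
    try:
        with open(RESULTS_FILE) as f:
            results = json.load(f)
    except (IOError, ValueError):
        results = []
    results.append({"commit": commit,
                    "passed": passed,
                    "timestamp": datetime.utcnow().isoformat()})
    with open(RESULTS_FILE, "w") as f:
        json.dump(results, f, indent=2)
    passes = sum(1 for r in results if r["passed"])
    print("master-head pass rate: %d/%d (%.1f%%)"
          % (passes, len(results), 100.0 * passes / len(results)))


if __name__ == "__main__":
    while True:
        head = master_head()
        record(head, run_job(head))
        time.sleep(INTERVAL)

The number that matters is the pass rate against a branch that should always pass: a system that regularly fails master HEAD is probably misconfigured, regardless of how often it disagrees with Jenkins on individual patches.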
On Thu, Jul 3, 2014 at 11:42 AM, Anita Kuno <ante...@anteaya.info> wrote:

> On 07/03/2014 02:33 PM, Kevin Benton wrote:
>> Maybe we can require periodic checks against the head of the master branch (which should always pass) and build statistics based on the results of that.
>
> I like this suggestion. I really like this suggestion.
>
> Hmmmm, what to do with a good suggestion? I wonder if we could capture it in an infra-spec and work on it from there.
>
> Would you feel comfortable offering a draft as an infra-spec and then perhaps we can discuss the design through the spec?
>
> What do you think?
>
> Thanks Kevin,
> Anita.
>
>> Otherwise it seems like we have to take a CI system's word for it that a particular patch indeed broke that system.
>>
>> --
>> Kevin Benton
>>
>> On Thu, Jul 3, 2014 at 11:07 AM, Anita Kuno <ante...@anteaya.info> wrote:
>>> On 07/03/2014 01:27 PM, Kevin Benton wrote:
>>>>> This allows the viewer to see categories of reviews based upon their divergence from OpenStack's Jenkins results. I think evaluating divergence from Jenkins might be a metric worth consideration.
>>>>
>>>> I think the only thing this really reflects, though, is how much the third-party CI system is mirroring Jenkins. A system that frequently diverges may be functioning perfectly fine and just has a vastly different code path that it is integration testing, so it is legitimately detecting failures that the OpenStack CI cannot.
>>>
>>> Great.
>>>
>>> How do we measure the degree to which it is legitimately detecting failures?
>>>
>>> Thanks Kevin,
>>> Anita.
>>>
>>>> --
>>>> Kevin Benton
>>>>
>>>> On Thu, Jul 3, 2014 at 6:49 AM, Anita Kuno <ante...@anteaya.info> wrote:
>>>>> On 07/03/2014 07:12 AM, Salvatore Orlando wrote:
>>>>>> Apologies for quoting the top post of the thread again.
>>>>>>
>>>>>> Comments inline (mostly thinking aloud)
>>>>>> Salvatore
>>>>>>
>>>>>> On 30 June 2014 22:22, Jay Pipes <jaypi...@gmail.com> wrote:
>>>>>>> Hi Stackers,
>>>>>>>
>>>>>>> Some recent ML threads [1] and a hot IRC meeting today [2] brought up some legitimate questions around how a newly-proposed Stackalytics report page for Neutron External CI systems [3] represented the results of an external CI system as "successful" or not.
>>>>>>>
>>>>>>> First, I want to say that Ilya and all those involved in the Stackalytics program simply want to provide the most accurate information to developers in a format that is easily consumed. While there need to be some changes in how data is shown (and the wording of things like "Tests Succeeded"), I hope that the community knows there isn't any ill intent on the part of Mirantis or anyone who works on Stackalytics. OK, so let's keep the conversation civil -- we're all working towards the same goals of transparency and accuracy. :)
>>>>>>>
>>>>>>> Alright, now, Anita and Kurt Taylor were asking a very poignant question:
>>>>>>>
>>>>>>> "But what does CI tested really mean? just running tests? or tested to pass some level of requirements?"
>>>>>>>
>>>>>>> In this nascent world of external CI systems, we have a set of issues that we need to resolve:
>>>>>>>
>>>>>>> 1) All of the CI systems are different.
>>>>>>>
>>>>>>> Some run Bash scripts. Some run Jenkins slaves and devstack-gate scripts. Others run custom Python code that spawns VMs and publishes logs to some public domain.
>>>>>>>
>>>>>>> As a community, we need to decide whether it is worth putting in the effort to create a single, unified, installable and runnable CI system, so that we can legitimately say "all of the external systems are identical, with the exception of the driver code for vendor X being substituted in the Neutron codebase."
>>>>>>
>>>>>> I think such a system already exists, and it's documented here: http://ci.openstack.org/
>>>>>> Still, understanding it is quite a learning curve, and running it is not exactly straightforward. But I guess that's pretty much understandable given the complexity of the system, isn't it?
>>>>>>
>>>>>>> If the goal of the external CI systems is to produce reliable, consistent results, I feel the answer to the above is "yes", but I'm interested to hear what others think. Frankly, in the world of benchmarks, it would be unthinkable to say "go ahead and everyone run your own benchmark suite", because you would get wildly different results. A similar problem has emerged here.
>>>>>>
>>>>>> I don't think the particular infrastructure, which might range from an openstack-ci clone to a 100-line bash script, would have an impact on the "reliability" of the quality assessment regarding a particular driver or plugin. This is determined, in my opinion, by the quantity and nature of the tests one runs on a specific driver. In Neutron, for instance, there is a wide range of choices - from a few test cases in tempest.api.network to the full smoketest job. As long as there is no minimal standard here, it will be difficult to assess the quality of the evaluation from a CI system, unless we explicitly take coverage into account in the evaluation.
>>>>>>
>>>>>> On the other hand, different CI infrastructures will have different levels in terms of % of patches tested and % of infrastructure failures. I think it might not be a terrible idea to use these parameters to evaluate how good a CI is from an infra standpoint. However, there are still open questions. For instance, a CI might have a low patch % score because it only needs to test patches affecting a given driver.
>>>>>>
>>>>>>> 2) There is no mediation or verification that the external CI system is actually testing anything at all
>>>>>>>
>>>>>>> As a community, we need to decide whether the current system of self-policing should continue. If it should, then language on reports like [3] should be very clear that any numbers derived from such systems should be taken with a grain of salt. Use of the word "Success" should be avoided, as it has connotations (in English, at least) that the result has been verified, which is simply not the case as long as no verification or mediation occurs for any external CI system.
>>>>>>>
>>>>>>> 3) There is no clear indication of what tests are being run, and therefore there is no clear indication of what "success" is
>>>>>>>
>>>>>>> I think we can all agree that a test has three possible outcomes: pass, fail, and skip. The result of a test suite run is therefore nothing more than the aggregation of which tests passed, which failed, and which were skipped.
>>>>>>>
>>>>>>> As a community, we must document, for each project, the expected set of tests that must be run for each patch merged into the project's source tree. This documentation should be discoverable so that reports like [3] can be crystal-clear on what the data shown actually means. The report is simply displaying the data it receives from Gerrit. The community needs to be proactive in saying "this is what is expected to be tested." This alone would allow the report to give information such as "External CI system ABC performed the expected tests. X tests passed. Y tests failed. Z tests were skipped." Likewise, it would also make it possible for the report to give information such as "External CI system DEF did not perform the expected tests.", which is excellent information in and of itself.
>>>>>>
>>>>>> Agreed. In Neutron we have enforced CIs but not yet agreed on the minimum set of tests we expect them to run. I reckon this will be fixed soon.
>>>>>>
>>>>>> I'll try to look at what "SUCCESS" is from a naive standpoint: a CI says "SUCCESS" if the test suite it ran passed; then one should have the means to understand whether a CI might blatantly lie or tell "half truths". For instance, saying it passes tempest.api.network while tempest.scenario.test_network_basic_ops has not been executed is a half truth, in my opinion.
>>>>>> Stackalytics can help here, I think. One could create "CI classes" according to how close they are to the level of the upstream gate, and then parse the results posted to classify CIs. Now, before cursing me, I totally understand that this won't be easy at all to implement! Furthermore, I don't know how this should be reflected in gerrit.
>>>>>>
>>>>>>> ===
>>>>>>>
>>>>>>> In thinking about the likely answers to the above questions, I believe it would be prudent to change the Stackalytics report in question [3] in the following ways:
>>>>>>>
>>>>>>> a. Change the "Success %" column header to "% Reported +1 Votes"
>>>>>>> b. Change the phrase "Green cell - tests ran successfully, red cell - tests failed" to "Green cell - System voted +1, red cell - System voted -1"
>>>>>>
>>>>>> That makes sense to me.
>>>>>>
>>>>>>> and then, when we have more and better data (for example, # tests passed, failed, skipped, etc), we can provide more detailed information than just "reported +1" or not.
>>>>>>
>>>>>> I think it should not be too hard to start adding minimal measures such as "% of voted patches".
>>>>>>
>>>>>>> Thoughts?
>>>>>>> Best,
>>>>>>> -jay
>>>>>>>
>>>>>>> [1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
>>>>>>> [2] http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
>>>>>>> [3] http://stackalytics.com/report/ci/neutron/7
>>>>>
>>>>> Thanks for sharing your thoughts, Salvatore.
>>>>>
>>>>> Some additional things to look at:
>>>>>
>>>>> Sean Dague has created a tool in stackforge, gerrit-dash-creator:
>>>>> http://git.openstack.org/cgit/stackforge/gerrit-dash-creator/tree/README.rst
>>>>> which has the ability to make interesting queries on gerrit results. One such example can be found here: http://paste.openstack.org/show/85416/
>>>>> (Note: when this url was created there was a bug in the syntax, so it works in Chrome but not Firefox; Sean tells me the Firefox bug has been addressed, though this url hasn't been updated to the new syntax yet.)
>>>>>
>>>>> This allows the viewer to see categories of reviews based upon their divergence from OpenStack's Jenkins results. I think evaluating divergence from Jenkins might be a metric worth consideration.
>>>>>
>>>>> Also worth looking at is Mikal Still's GUI representation of Neutron CI health:
>>>>> http://www.rcbops.com/gerrit/reports/neutron-cireport.html
>>>>> and Nova CI health:
>>>>> http://www.rcbops.com/gerrit/reports/nova-cireport.html
>>>>>
>>>>> I don't know the details of how the graphs are calculated in these pages, but being able to view passed/failed/missed and compare them to Jenkins is an interesting approach and I feel it has some merit.
>>>>>
>>>>> Thanks, I think we are getting some good information out in this thread, and I look forward to hearing more thoughts.
>>>>>
>>>>> Thank you,
>>>>> Anita.

--
Kevin Benton
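For reference, here is a rough sketch of how the "divergence from Jenkins" comparison mentioned above might be computed once per-patchset verdicts have been pulled out of Gerrit (for example via "gerrit query --format=JSON --comments"). The record layout, the vendor-x-ci account name, and the sample data are illustrative assumptions, not how the dashboards or the rcbops reports actually compute their numbers:

# Rough sketch only: measure how often a third-party CI agrees with Jenkins
# on the same patchset.  Records are assumed to have already been extracted
# from Gerrit; the extraction itself is out of scope here.

from collections import defaultdict


def divergence(records, ci_account, jenkins_account="jenkins"):
    """Compare a CI account's verdicts against Jenkins', patchset by patchset.

    Each record is (change_number, patchset_number, account, verdict),
    where verdict is True for a "tests passed" vote and False for a failure.
    """
    verdicts = defaultdict(dict)  # (change, patchset) -> {account: verdict}
    for change, patchset, account, verdict in records:
        verdicts[(change, patchset)][account] = verdict

    agree = disagree = missed = 0
    for results in verdicts.values():
        if jenkins_account not in results:
            continue            # nothing to compare against
        if ci_account not in results:
            missed += 1         # Jenkins voted, the third-party CI did not
        elif results[ci_account] == results[jenkins_account]:
            agree += 1
        else:
            disagree += 1
    return {"agree": agree, "disagree": disagree, "missed": missed}


if __name__ == "__main__":
    # Purely illustrative data.
    sample = [
        (1001, 2, "jenkins", True),
        (1001, 2, "vendor-x-ci", True),
        (1002, 1, "jenkins", True),
        (1002, 1, "vendor-x-ci", False),  # divergence: could be a real catch
        (1003, 1, "jenkins", False),      # the third-party CI never voted
    ]
    print(divergence(sample, "vendor-x-ci"))
    # e.g. {'agree': 1, 'disagree': 1, 'missed': 1}

As discussed above, a high disagree count is not by itself evidence that a third-party CI is broken; combined with periodic master-head statistics it becomes easier to tell legitimate catches from noise.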
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev