On Jul 3, 2014 8:57 AM, "Anita Kuno" <[email protected]> wrote:
>
> On 07/03/2014 06:22 AM, Sullivan, Jon Paul wrote:
> >> -----Original Message-----
> >> From: Anita Kuno [mailto:[email protected]]
> >> Sent: 01 July 2014 14:42
> >> To: [email protected]
> >> Subject: Re: [openstack-dev] [third-party-ci][neutron] What is "Success" exactly?
> >>
> >> On 06/30/2014 09:13 PM, Jay Pipes wrote:
> >>> On 06/30/2014 07:08 PM, Anita Kuno wrote:
> >>>> On 06/30/2014 04:22 PM, Jay Pipes wrote:
> >>>>> Hi Stackers,
> >>>>>
> >>>>> Some recent ML threads [1] and a hot IRC meeting today [2] brought up some legitimate questions around how a newly-proposed Stackalytics report page for Neutron External CI systems [3] represented the results of an external CI system as "successful" or not.
> >>>>>
> >>>>> First, I want to say that Ilya and all those involved in the Stackalytics program simply want to provide the most accurate information to developers in a format that is easily consumed. While there need to be some changes in how data is shown (and the wording of things like "Tests Succeeded"), I hope that the community knows there isn't any ill intent on the part of Mirantis or anyone who works on Stackalytics. OK, so let's keep the conversation civil -- we're all working towards the same goals of transparency and accuracy. :)
> >>>>>
> >>>>> Alright, now, Anita and Kurt Taylor were asking a very poignant question:
> >>>>>
> >>>>> "But what does CI tested really mean? just running tests? or tested to pass some level of requirements?"
> >>>>>
> >>>>> In this nascent world of external CI systems, we have a set of issues that we need to resolve:
> >>>>>
> >>>>> 1) All of the CI systems are different.
> >>>>>
> >>>>> Some run Bash scripts. Some run Jenkins slaves and devstack-gate scripts. Others run custom Python code that spawns VMs and publishes logs to some public domain.
> >>>>>
> >>>>> As a community, we need to decide whether it is worth putting in the effort to create a single, unified, installable and runnable CI system, so that we can legitimately say "all of the external systems are identical, with the exception of the driver code for vendor X being substituted in the Neutron codebase."
> >>>>>
> >>>>> If the goal of the external CI systems is to produce reliable, consistent results, I feel the answer to the above is "yes", but I'm interested to hear what others think. Frankly, in the world of benchmarks, it would be unthinkable to say "go ahead and everyone run your own benchmark suite", because you would get wildly different results. A similar problem has emerged here.
> >>>>>
> >>>>> 2) There is no mediation or verification that the external CI system is actually testing anything at all.
> >>>>>
> >>>>> As a community, we need to decide whether the current system of self-policing should continue. If it should, then language on reports like [3] should be very clear that any numbers derived from such systems should be taken with a grain of salt. Use of the word "Success" should be avoided, as it has connotations (in English, at least) that the result has been verified, which is simply not the case as long as no verification or mediation occurs for any external CI system.
> >>>>>
> >>>>> 3) There is no clear indication of what tests are being run, and therefore there is no clear indication of what "success" is.
> >>>>>
> >>>>> I think we can all agree that a test has three possible outcomes: pass, fail, and skip. The results of a test suite run are therefore nothing more than the aggregation of which tests passed, which failed, and which were skipped.
> >>>>>
> >>>>> As a community, we must document, for each project, the expected set of tests that must be run for each patch merged into the project's source tree. This documentation should be discoverable so that reports like [3] can be crystal-clear on what the data shown actually means. The report is simply displaying the data it receives from Gerrit. The community needs to be proactive in saying "this is what is expected to be tested." This alone would allow the report to give information such as "External CI system ABC performed the expected tests. X tests passed. Y tests failed. Z tests were skipped." Likewise, it would also make it possible for the report to give information such as "External CI system DEF did not perform the expected tests.", which is excellent information in and of itself.
> >>>>>
> >>>>> ===
> >>>>>
> >>>>> In thinking about the likely answers to the above questions, I believe it would be prudent to change the Stackalytics report in question [3] in the following ways:
> >>>>>
> >>>>> a. Change the "Success %" column header to "% Reported +1 Votes"
> >>>>> b. Change the phrase "Green cell - tests ran successfully, red cell - tests failed" to "Green cell - System voted +1, red cell - System voted -1"
> >>>>>
> >>>>> and then, when we have more and better data (for example, # tests passed, failed, skipped, etc.), we can provide more detailed information than just "reported +1" or not.
> >>>>>
> >>>>> Thoughts?
> >>>>>
> >>>>> Best,
> >>>>> -jay
> >>>>>
> >>>>> [1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
> >>>>> [2] http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
> >>>>> [3] http://stackalytics.com/report/ci/neutron/7
> >>>>
> >>>> Hi Jay:
> >>>>
> >>>> Thanks for starting this thread. You raise some interesting questions.
> >>>>
> >>>> The question I had identified as needing definition is "what algorithm do we use to assess fitness of a third party ci system".
> >>>>
> >>>> http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2014-06-30.log
> >>>>
> >>>> timestamp 2014-06-30T19:23:40
> >>>>
> >>>> This is the question that is top of mind for me.
> >>>
> >>> Right, my email above is written to say "unless there is a) uniformity of the external CI system, b) agreement on mediation or verification of said systems, and c) agreement on what tests shall be expected to pass and be skipped for each project, then no such algorithm is really possible."
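
Chiming in inline here: to make (c) a bit more concrete, below is a rough sketch of what a per-project "expected tests" manifest plus a pass/fail/skip summary could look like. The manifest format, project key, and test names are all made up for illustration; this is not anything the projects define today.

# Rough sketch only: the manifest, project key, and test names below are
# invented for illustration, not anything the projects publish today.
from collections import Counter

# Hypothetical per-project manifest of the tests every CI run is expected to report.
EXPECTED = {
    "neutron": {
        "tempest.api.network.test_networks",  # hypothetical test IDs
        "tempest.api.network.test_ports",
    },
}

def summarize(project, results):
    """Summarize a CI run; results maps test ID -> 'pass' | 'fail' | 'skip'."""
    expected = EXPECTED[project]
    missing = expected - results.keys()
    counts = Counter(status for test, status in results.items() if test in expected)
    return {
        "ran_expected_tests": not missing,
        "passed": counts["pass"],
        "failed": counts["fail"],
        "skipped": counts["skip"],
        "missing": sorted(missing),
    }

# A report could then say "External CI system ABC performed the expected
# tests: 1 passed, 0 failed, 1 skipped" instead of just showing a +1.
print(summarize("neutron", {
    "tempest.api.network.test_networks": "pass",
    "tempest.api.network.test_ports": "skip",
}))

With something like that published per project, a report could state "performed the expected tests: X passed, Y failed, Z skipped" (or flag the missing ones) rather than collapsing everything into a +1 or -1.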
> >>>
> >>> Now, if the community is willing to agree to a), b), and c), then certainly there is the ability to determine the fitness of a CI system -- at least in regards to its output (test results and the voting on the Gerrit system).
> >>>
> >>> Barring agreement on any or all of those three things, I recommended changing the language on the report due to the inability to have any consistently-applied algorithm to determine fitness.
> >>>
> >>> Best,
> >>> -jay
> >
> > +1 to all of your points above, Jay. Well-written, thank you.
> >
> >> I've been mulling this over and looking at how I assess feedback I get from different human reviewers, since I don't know the basis of how they arrive at their decisions unless they tell me and/or I have experience with their criteria for how they review my patches.
> >>
> >> I get different value from different human reviewers based upon my experience of them reviewing my patches, my experience of them reviewing other people's patches, my experience reviewing their code, and my discussions with them in channel, on the mailing list and in person, as well as my experience reading or becoming aware of other decisions they make.
> >>
> >> It would be really valuable for me personally to have a page in gerrit for each third party ci account, where I could sign in and leave comments or vote +/-1 or 0 as a way of giving feedback to the maintainers of that system. Others could do the same and I could read their feedback. For instance, yesterday someone linked me to logs that I had to download before I could read them. I hadn't been made aware this account had been doing this, but this developer was aware. Currently we have no system for a developer, in the course of their normal workflow, to leave a comment and/or vote on a third party ci system to give those maintainers feedback about how they are doing at providing consumable artifacts from their system.
> >>
> >> It also would remove the perception that I'm just a big meany, since developers could comment for themselves, directly on the account, how they feel about having to download tarballs or sign into other systems to trigger a recheck. The community of developers would say how fit a system is or isn't, since they are the individuals having to dig through logs and evaluate "did this build fail because the code needs adjustment" or not, and they can reflect their findings in a comment and vote on the system.
> >>
> >> The other thing I really value about gerrit is that votes can change; systems can improve, given motivation and accurate feedback for making changes.
> >>
> >> I have no idea how hard this would be to create, but I think having direct feedback from developers on systems would help both the developers and the maintainers of ci systems.
> >>
> >> There are a number of people working really hard to do a good job in this area.
> >> This sort of structure would also provide support and encouragement to those people providing leadership in this space: people asking good questions, helping other system maintainers, starting discussions, offering patches to infra (and reviewing infra patches) in accordance with the goals of the third party meeting [0], and making other hard-to-measure decisions that provide value for the community. I'd really like a way we all can demonstrate the extent to which we value these contributions.
> >>
> >> So far, those are my thoughts.
> >>
> >> Thanks,
> >> Anita.
> >
> > +1 - this sounds like a really good idea.
> >
> > How is feedback on the OpenStack check/gate retrieved and moderated? Can that provide a model for doing what you suggest here?
>
> Hi Jon Paul: (Is it Jon Paul or Jon?)
>
> The OpenStack check/gate pipelines are assessed using a system we call elastic recheck: http://status.openstack.org/elastic-recheck/
>
> We use logstash to index log output and elasticsearch to compose queries that evaluate the incidence of a given error message (for example). Sample query: http://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries/1097592.yaml
>
> The elastic-recheck repo is here: http://git.openstack.org/cgit/openstack-infra/elastic-recheck/
>
> The GUI is available at logstash.openstack.org.
>
> All the queries are written manually as yaml files and named with a corresponding bug number: http://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries
>
> Here is some documentation about elastic-recheck and how to write queries: http://docs.openstack.org/infra/elastic-recheck/readme.html
>
> Joe Gordon has actually created some great graphs (where are those hosted again, Joe?) to be able to evaluate failure rates in the pipelines (check and gate) based on test groups (tempest, unit tests).

http://jogo.github.io/gate

The data comes from graphite.openstack.org and is hosted off site because the data requires some interpretation and should be viewed with a grain of salt.

> Clark Boylan and Sean Dague did, and do, the majority of the heavy lifting setting up and maintaining elastic-recheck (with lots of help from others, thank you!), so perhaps they could offer their opinion on whether this is a reasonable choice for evaluating third party ci systems.
>
> Thanks Jon Paul, this is a good question,
> Anita.
>
> >> [0] https://wiki.openstack.org/wiki/Meetings/ThirdParty#Goals_for_Third_Party_meetings
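
To make the elastic-recheck part above a bit more concrete, here is a rough sketch (not the actual elastic-recheck code) of how one of those yaml queries can be counted against the indexed logs. The endpoint, index pattern, and error message are invented for illustration; the real query files, named after bug numbers like queries/1097592.yaml linked above, live in the repo Anita pointed to.

# Rough sketch only: the host, index pattern, and error message are made up.
import json
import urllib.request

import yaml  # PyYAML

# A query file in the style Anita describes (cf. queries/1097592.yaml).
QUERY_YAML = """
query: >
  message:"Timed out waiting for thing to become ACTIVE"
  AND tags:"console"
"""

def count_hits(es_url, lucene_query):
    """Ask elasticsearch how many indexed log lines match a Lucene-style query string."""
    body = json.dumps(
        {"query": {"query_string": {"query": lucene_query}}}
    ).encode("utf-8")
    req = urllib.request.Request(
        es_url + "/_count",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["count"]

query = yaml.safe_load(QUERY_YAML)["query"]
# Hypothetical endpoint; substitute whatever elasticsearch cluster holds the
# logstash indexes you want to query.
print(count_hits("http://logstash.example.org:9200/logstash-*", query))

The point is just that each query boils down to a query string in a small yaml file, which is what keeps them easy to review and add.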
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
