Yep agreed with that. Count me in.

On Sun., 23 Sep. 2018, 00:33 Benedict Elliott Smith, <bened...@apache.org> wrote:
Thanks Kurt. I think the goal would be to get JIRA into a state where it can hold all the information we want, and for it to be easy to get all the information correct when filing.

My feeling is that it would be easiest to do this with a small group, so we can make rapid progress on an initial proposal, then bring that to the community for final tweaking / approval (or, perhaps, rejection – but I hope it won’t be a contentious topic). I don’t think it should be a huge job to come up with a proposal – though we might need to organise a community effort to clean up the JIRA history!

It would be great if we could get a few more volunteers from other companies/backgrounds to participate.

On 22 Sep 2018, at 11:54, kurt greaves <k...@instaclustr.com> wrote:

I'm interested. Better defining the components and labels we use in our docs would be a good start and LHF. I'd prefer if we kept all the information within JIRA through the use of fields/labels though, and generated reports off those tags. Keeping all the information in one place is much better in my experience. Not applicable for CI obviously, but ideally we can generate testing reports directly from the testing systems.

I don't see this as a huge amount of work, so I think the overall risk is pretty small, especially considering it can easily be done in a way that doesn't affect anyone until we get consensus on methodology.

On Sat, 22 Sep 2018 at 03:44, Scott Andreas <sc...@paradoxica.net> wrote:

Josh, thanks for reading and sharing feedback. Agreed with starting simple and measuring inputs that are high-signal; that’s a good place to begin.

To the challenge of building consensus, point taken + agreed. Perhaps the distinction is between producing something that’s “useful” vs. something that’s “authoritative” for decision-making purposes. My motivation is to work toward something “useful” (as measured by the value contributors find). I’d be happy to start putting some of these together as part of an experiment – and agreed on evaluating “value relative to cost” after we see how things play out.

To Benedict’s point on JIRA, agreed that plotting a value from messy input wouldn’t produce useful output. Some questions a small working group might take on toward better categorization might look like:

–––
– Revisiting the list of components: e.g., “Core” captures a lot right now.
– Revisiting which fields should be required when filing a ticket – and if there are any that should be removed from the form.
– Reviewing active labels: understanding what people have been trying to capture, and how they could be organized + documented better.
– Documenting “priority”: e.g., a common standard we can point to, even if we’re pretty good now.
– Considering adding a “severity” field to capture the distinction between priority and severity.
–––

If there’s appetite for spending a little time on this, I’d put effort toward it if others are interested; is anyone?

Otherwise, I’m equally fine with an experiment to measure basics via the current structure, as Josh mentioned, too.

– Scott
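As a rough illustration of the kind of report Kurt mentions generating off JIRA fields and labels (and of the label/component review in Scott's list), something like the sketch below could be run against JIRA's standard REST search endpoint. It is illustrative only: the one-year JQL window is an arbitrary example, and the endpoint/paging details would need to be confirmed against the ASF JIRA instance.

# Sketch: audit how labels and components are actually used on CASSANDRA tickets.
# Assumes the standard JIRA REST search endpoint on issues.apache.org; the JQL
# window below is an arbitrary example.
from collections import Counter

import requests

SEARCH_URL = "https://issues.apache.org/jira/rest/api/2/search"
JQL = "project = CASSANDRA AND created >= -365d"

def fetch_issues(jql, page_size=100):
    start = 0
    while True:
        resp = requests.get(SEARCH_URL, params={
            "jql": jql,
            "fields": "labels,components",
            "startAt": start,
            "maxResults": page_size,
        })
        resp.raise_for_status()
        issues = resp.json().get("issues", [])
        if not issues:
            return
        yield from issues
        start += len(issues)

label_counts = Counter()
component_counts = Counter()
for issue in fetch_issues(JQL):
    fields = issue["fields"]
    label_counts.update(fields.get("labels") or [])
    component_counts.update(c["name"] for c in fields.get("components") or [])

print("Most-used labels:", label_counts.most_common(20))
print("Most-used components:", component_counts.most_common(20))

A report along these lines would make it easier to see which labels are actually in active use before deciding which to keep, document, or retire.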
On September 20, 2018 at 8:22:55 AM, Benedict Elliott Smith (bened...@apache.org) wrote:

I think it would be great to start getting some high quality info out of JIRA, but I think we need to clean up and standardise how we use it to facilitate this.

Take the Component field as an example. This is the current list of options:

4.0
Auth
Build
Compaction
Configuration
Core
CQL
Distributed Metadata
Documentation and Website
Hints
Libraries
Lifecycle
Local Write-Read Paths
Materialized Views
Metrics
Observability
Packaging
Repair
SASI
Secondary Indexes
Streaming and Messaging
Stress
Testing
Tools

In some cases there’s duplication (Metrics + Observability, Coordination (= “Storage Proxy, Hints, Batchlog, Counters…”) + Hints, Local Write-Read Paths + Core).
In others, there’s a lack of granularity (Streaming + Messaging, Core, Coordination, Distributed Metadata).
In others, there’s a lack of clarity (Core, Lifecycle, Coordination).
Others are probably missing entirely (Transient Replication, …?).

Labels are also used fairly haphazardly, and there’s no clear definition of “priority”.

Perhaps we should form a working group to suggest a methodology for filling out JIRA, standardise the necessary components, labels, etc., and put together a wiki page with step-by-step instructions on how to do it?

On 20 Sep 2018, at 15:29, Joshua McKenzie <jmcken...@apache.org> wrote:

I've spent a good bit of time thinking about the above and bounced off both different ways to measure quality and progress as well as trying to influence community behavior on this topic. My advice: start small and simple (KISS, YAGNI, all that). Get metrics for pass/fail on utest/dtest/flakiness over time, perhaps also aggregate bug count by component over time. After spending a predetermined time doing that (a couple months?) as an experiment, we retrospect as a project and see if these efforts are adding value commensurate with the time investment required to perform the measurement and analysis.

There are a lot of really good ideas in that linked wiki article / this email thread. The biggest challenge, and risk of failure, is in translating good ideas into action and selling project participants on the value of changing their behavior. The latter is where we've fallen short over the years; building consensus (especially regarding process /shudder) is Very Hard.

Also - thanks for spearheading this discussion, Scott. It's one we come back to with some regularity, so there's real pain and opportunity here for the project imo.

On Wed, Sep 19, 2018 at 9:32 PM Scott Andreas <sc...@paradoxica.net> wrote:

Hi everyone,

Now that many teams have begun testing and validating Apache Cassandra 4.0, it’s useful to think about what “progress” looks like. While metrics alone may not tell us what “done” means, they do help us answer the question, “are we getting better or worse — and how quickly?”
A friend described to me a few attributes of metrics he considered useful, suggesting that good metrics are actionable, visible, predictive, and consequent:

– Actionable: We know what to do based on them – where to invest, what to fix, what’s fine, etc.
– Visible: Everyone who has a stake in a metric has full visibility into it and participates in its definition.
– Predictive: Good metrics enable forecasting of outcomes – e.g., “consistent performance test results against build abc predict an x% reduction in 99%ile read latency for this workload in prod”.
– Consequent: We take actions based on them (e.g., not shipping if tests are failing).

Here are some notes in Confluence toward metrics that may be useful to track beginning in this phase of the development + release cycle. I’m interested in your thoughts on these. They’re also copied inline for easier reading in your mail client.

Link:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=93324430

Cheers,

– Scott

––––––

Measuring Release Quality:

[This document is a draft + sketch of ideas. It is located in the “discussion” section of this wiki to indicate that it is an active draft – not a document that has been voted on, achieved consensus, or is in any way official.]

Introduction:

This document outlines a series of metrics that may be useful toward measuring release quality, and quantifying progress during the testing / validation phase of the Apache Cassandra 4.0 release cycle.

The goal of this document is to think through what we should consider measuring to quantify our progress testing and validating Apache Cassandra 4.0. This document explicitly does not discuss release criteria – though metrics may be a useful input to a discussion on that topic.

Metric: Build / Test Health (produced via CI, recorded in Confluence):

Bread-and-butter metrics intended to capture baseline build health and flakiness in the test suite, presented as a time series to understand how they’ve changed from build to build and release to release:

Metrics:

– Pass / fail metrics for unit tests
– Pass / fail metrics for dtests
– Flakiness stats for unit and dtests
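A minimal sketch of how the pass/fail and flakiness numbers above might be aggregated, assuming CI publishes standard JUnit-style XML reports collected into one directory per run. The results/<run-id>/*.xml layout is an assumption for illustration, not the real CI layout.

# Sketch: aggregate pass/fail and flakiness from JUnit-style XML reports.
# Assumes a directory layout of results/<run-id>/*.xml; adjust to the real CI layout.
import glob
import xml.etree.ElementTree as ET
from collections import defaultdict

def outcomes_by_test(results_root="results"):
    runs = defaultdict(list)  # test name -> list of "pass"/"fail" across runs
    for report in glob.glob(f"{results_root}/*/*.xml"):
        for case in ET.parse(report).getroot().iter("testcase"):
            if case.find("skipped") is not None:
                continue  # ignore skipped tests entirely
            name = f'{case.get("classname")}.{case.get("name")}'
            failed = case.find("failure") is not None or case.find("error") is not None
            runs[name].append("fail" if failed else "pass")
    return runs

runs = outcomes_by_test()
failing = {t for t, o in runs.items() if all(r == "fail" for r in o)}
flaky = {t for t, o in runs.items() if len(set(o)) > 1}
print(f"{len(runs)} tests, {len(failing)} consistently failing, {len(flaky)} flaky")

Run over a rolling window of builds, the same counts become the time series described above.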
Metric: “Found Bug” Count by Methodology (sourced via JQL, reported in Confluence):

These are intended to help us understand the efficacy of each methodology being applied. We might consider annotating bugs found in JIRA with the methodology that produced them. This could be consumed as input in a JQL query and reported on the Confluence dev wiki.

As we reach a Pareto-optimal level of investment in a methodology, we’d expect to see its found-bug rate taper. As we achieve higher quality across the board, we’d expect to see a tapering in found-bug counts across all methodologies. In the event that one or two approaches are outliers, this could indicate the utility of doubling down on a particular form of testing.

We might consider reporting “Found By” counts for methodologies such as:

– Property-based / fuzz testing
– Replay testing
– Upgrade / Diff testing
– Performance testing
– Shadow traffic
– Unit/dtest coverage of new areas
– Source audit

Metric: “Found Bug” Count by Subsystem/Component (sourced via JQL, reported in Confluence):

Similar to “found by,” but “found where.” These metrics help us understand which components or subsystems of the database we’re finding issues in. In the event that a particular area stands out as “hot,” we’ll have the quantitative feedback we need to support investment there. Tracking these counts over time – and their first derivative, the rate – also helps us make statements regarding progress in various subsystems. Though we can’t prove a negative (“no bugs have been found, therefore there are no bugs”), we gain confidence as their rate decreases, normalized to the effort we’re putting in.

We might consider reporting “Found In” counts for components as enumerated in JIRA, such as:

– Auth
– Build
– Compaction
– Compression
– Core
– CQL
– Distributed Metadata
– …and so on.

Metric: “Found Bug” Count by Severity (sourced via JQL, reported in Confluence):

Similar to “found by/where,” but “how bad?” These metrics help us understand the severity of the issues we encounter. As build quality improves, we would expect to see decreases in the severity of issues identified. A high rate of critical issues identified late in the release cycle would be cause for concern, though it may be expected at an earlier time.

These could roughly be sourced from the “Priority” field in JIRA:

– Trivial
– Minor
– Major
– Critical
– Blocker

While “priority” doesn’t map directly to “severity,” it may be a useful proxy. Alternately, we could introduce a label intended to represent severity if we’d like to make that clear.
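To make the JQL sourcing concrete, here is a sketch of how the three “found bug” counts above could be pulled as simple totals. The methodology labels, component names, and date window are placeholders; the real taxonomy is exactly what the proposed working group would need to agree on first.

# Sketch: "found bug" counts by methodology label, component, and priority,
# pulled as plain JQL totals. Assumes the standard JIRA REST search endpoint;
# the labels, components, and date window are illustrative placeholders.
import requests

SEARCH_URL = "https://issues.apache.org/jira/rest/api/2/search"

def count(jql):
    # maxResults=0 asks JIRA for the total only, without returning issues
    resp = requests.get(SEARCH_URL, params={"jql": jql, "maxResults": 0})
    resp.raise_for_status()
    return resp.json()["total"]

scope = 'project = CASSANDRA AND type = Bug AND created >= "2018-09-01"'

# "Found by": assumes methodology labels like these have been agreed and applied
for label in ["fuzz-testing", "replay-testing", "diff-testing", "perf-testing"]:
    print("found by", label, "=", count(scope + " AND labels = " + label))

# "Found where": components as enumerated in JIRA
for component in ["Auth", "Compaction", "CQL", "Distributed Metadata"]:
    print("found in", component, "=", count(scope + ' AND component = "' + component + '"'))

# "How bad": priority as a rough proxy for severity
for priority in ["Trivial", "Minor", "Major", "Critical", "Blocker"]:
    print(priority, "=", count(scope + " AND priority = " + priority))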
Metric: Performance Tests

Performance tests tell us “how fast” (and “how expensive”). There are many metrics we could capture here, and a variety of workloads they could be sourced from.

I’ll refrain from proposing a particular methodology or reporting structure since many have thought about this. From a reporting perspective, I’m inspired by Mozilla’s “arewefastyet.com”, used to report the performance of their JavaScript engine relative to Chrome’s: https://arewefastyet.com/win10/overview

Having this sort of feedback on a build-by-build basis would help us catch regressions, quantify improvements, and provide a baseline against 3.0 and 3.x.

Metric: Code Coverage (/ other static analysis techniques)

It may also be useful to publish metrics from CI on code coverage by package/class/method/branch. These might not be useful metrics for “quality” (the relationship between code coverage and quality is tenuous). However, it would be useful to quantify the trend over time between releases, and to source a “to-do” list for important but poorly-covered areas of the project.

Others:

There are more things we could measure. We won’t want to drown ourselves in metrics (or the work required to gather them) – but there are likely more not described here that could be useful to consider.

Convergence Across Metrics:

The thesis of this document is that improvements in each of these areas are correlated with increases in quality. Improvements across all areas are correlated with an increase in overall release quality. Tracking metrics like these provides the quantitative foundation for assessing progress, setting goals, and defining criteria. In that sense, they’re not an end – but a beginning.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org