On Fri, 8 Apr 2016 00:02:08 +0000 (UTC) Felipe Sateler <fsate...@debian.org> wrote:
> On Wed, 06 Apr 2016 17:16:18 +0100, Neil Williams wrote:
> 
> > On Wed, 6 Apr 2016 15:27:48 +0000 (UTC)
> > Felipe Sateler <fsate...@debian.org> wrote:
> > 
> >> On Wed, 06 Apr 2016 00:18:10 +0200, Ondřej Surý wrote:
> >> > - other indicators
> >> 
> >> - Is maintained by the QA group (for longer than X time?)
> >> - Is orphaned (for longer than X time?)
> >> - Is RFA (for longer than X time? Or maybe it should auto-move to
> >>   orphaned)
> >> 
> >> Essentially, if nobody steps up to maintain the packages, then they
> >> should go.
> >> 
> >> - Maintainer does not respond to bug reports in a timely manner
> >>   (eg, 1.5 months, calculated per package).
> >> 
> >> I think that maintainer responsiveness should be the key metric,
> >> not up-to-dateness (ie, the maintainer may be holding back for
> >> good reasons, but those reasons should be explained).
> > 
> > That could lead to a lot of ping messages in bug reports which
> > might not be that useful. It could also lead to maintainers closing
> > bugs which may have previously been left open as wontfix or
> > wishlist. The severity of the bug may need to be considered.
> 
> As always, the devil is in the details. I agree that severity should
> be considered. But I'm mostly thinking about new bugs. Bug reports
> without any maintainer response at all are way more common than they
> should be.

New bugs would be the worst to measure here - you'd have to take into
account the variable delay between bugs which need an urgent fix and
bugs which - although filed at higher severity - are actually not at
all urgent. Severity does not map to urgency.

Again, enforcing something like this just ends up with ping messages
in bug reports. There is little worse than a maintainer response of
"it's on the TODO list" - better not to respond at all. At least then
someone else doesn't think there's already some work in progress.

> > How do we assess responsiveness on those packages which have 0
> > bugs?
> 
> Good question. But let's not make the perfect be the enemy of the
> good.

If this is to be entirely automated, "good enough" will cause a lot of
disagreements, arguments and discussion - for packages which currently
take up very little time - and that just discredits the algorithm.

> > So more than just responsiveness, it needs to take account of the
> > number and severity of the bugs to which there has not been a
> > response.
> 
> The number should be in relation to the total number of bugs the
> package has. A package with a single bug report that is unanswered
> should have a bad score; a package with several bug reports
> unanswered out of hundreds, a better one.

Nonsense. That is a massive generalisation. The severity of the bug
does not necessarily map to a timeframe within which a response is
expected. You are mixing up different issues and this kind of
assessment cannot be automated with any degree of reliability.

> > There may also need to be some protection from the implications of
> > severity-ping-pong. Overall, I think this is an unreliable metric
> > and should not be used.
> 
> The failure mode, as you describe it, is to be too lax, and will be
> easily trickable. I somewhat agree. But I don't find it an argument
> against it. After all, the idea is to help discover places that need
> attention, not make debian fit on a single cd again ;)

The attention will just focus on the algorithm rather than the package.
That then risks undermining the relevance of the other work done by
the algorithm and very soon the "score" becomes irrelevant, ignored
and a complete waste of time.

> >> This should also help detect teams that have effectively become
> >> empty.
> > 
> > That is not the same as low quality packages.
> 
> Unmaintained packages are likely to become low quality as time
> passes. No need to wait for that to happen.

No - not necessarily. The problem is that we're talking about
"likely", "probably" and other indeterminates. The algorithm is going
to have to *guess*, and that indicates that the metric itself is
utterly broken with regard to automation.

> > Packages with NMUs not resolved by the maintainer is a much better
> > metric. The bugs are closed, so responsiveness would not be
> > counted, but the package is still low quality.
> 
> I'm not sure I get this. Do you mean maintainer uploads that discard
> the NMU part? That should be a red flag as well.

No. An NMU only touches the very smallest part of the package. I've
done repeat NMUs where I'm patching an existing patch of an existing
patch because the underlying code needs refactoring. This *is* a case
where long-term lack of maintenance is indicative of low quality, but
the bug is closed and the NMU has been acknowledged and incorporated.
The wider problem within the package has not been addressed because
all the maintainer has done is include the minimal change without
spending any real time on the code. So the bugs list is closed but the
package is now lower quality - subjectively - than before the bug was
filed.

Not all bug fixes raise the quality of the package. Often, the bug fix
simply moves the problem to a later point in the codebase because of
the poor quality of the original code. Maybe this later point is a
less commonly used code path - so nobody files another bug and the
problem becomes invisible again. Waiting.

This is what I mean by an unreliable algorithm. One metric shows this
as indicative of low quality; another will show it as high quality
(because the bug is fixed). There is no metric to identify whether the
NMU is a minimal fix of a broken package which merely hides the low
quality of the package, or a minimal fix which raises the overall
quality of the package. That is not a metric, that is a subjective
call - likely based on comments within the relevant bug reports and a
review of the code itself. Static code analysis tools exist but are
very variable in their false negative, false positive and overly
pedantic results.

> > The full list of identified packages will need some form of marker
> > because then tracker could indicate this in the same way as it does
> > for "your package depends on a package which needs a new
> > maintainer" for orphaned packages. (Maybe the first step for this
> > process *is* to forcibly orphan the package?)
> > 
> > The individual metrics need to be aggregated to a score but fine
> > tuning that score algorithm is more work than most people want to
> > do on packages which are already uninteresting.
> > 
> > What has happened in the past is that a BSP close to a release has
> > had a reason to look at a particular set of packages and removed
> > the whole lot in one operation.
> 
> This is a slightly different topic than the reason I answered in the
> first place: changes that affect a wider number of packages usually
> take forever and involve large amounts of NMUs.

An automated process will blast through a complex chain like this just
by iteration.
If the algorithm is using heuristics (guesswork) or unreliable
metrics, then packages will be removed in error. If that is a removal
from unstable, getting the package back is a *lot* more work for
everyone than not running the algorithm in the first place.

> > It's a scatter-gun approach but getting agreement on the algorithm
> > could take forever.
> 
> Unfortunately, this is true.

... and therefore some of the metrics previously mentioned *must* be
abandoned, or the consequences will just mean more arguments and
aggravation when packages are removed in error.

> > There needs to be something which makes these uninteresting
> > packages relevant to something important - beyond them simply
> > being low quality.
> 
> I'm not sure what you mean by 'need'. In my ideal world, this process
> would be largely automatic, so that human effort can be better spent
> in more productive areas (like fixing bugs). In other words, low
> quality and unmaintained packages should cease to be a burden on
> others.

This is a long, long way from an ideal world. I've gone round these
loops several times over the years - usually prompted by one of the
BSPs described above. This is not suitable for full automation because
the available data cannot be sufficiently accurate or deterministic.
Quality is subjective, and automating subjective tests is a bad idea.

Even using an automated check to flag packages risks the same problem
of the algorithm becoming untrustworthy and the flags being ignored.
Some risks can be identified and separate flags used for those, but
the idea that a script could properly identify low quality packages
without guesswork and without large numbers of false positives is just
unrealistic.

The closest we have is lintian - which has full access to the source
code of the package and the package metadata. I have massive respect
for the lintian developers because it is a hugely complex task.
Lintian does use some heuristics, but it declares whether the result
is probable or certain. Trying to write another script to produce an
aggregate score which mixes metrics based on objective and subjective
criteria is just going to make the score itself unreliable.
Concentrate on flags which can be determined solely through objective
metrics - an unacknowledged NMU as the most recent upload, for
example. A rough sketch of that check follows below.

-- 
Neil Williams
=============
http://www.linux.codehelp.co.uk/
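A minimal, untested sketch of the kind of objective flag meant above,
assuming only the usual NMU version conventions (a '+nmuN' suffix for
native packages, an extra '.N' on the Debian revision otherwise) and a
comparison of the most recent changelog author against the
Maintainer/Uploaders fields. The function names and the example data
are invented for illustration; this is not an existing tool.

#!/usr/bin/env python3
# Sketch: flag a package whose most recent upload looks like an NMU that
# the maintainer has not yet followed up with an upload of their own.
# Only objective inputs are used: the version string and the author of
# the latest changelog entry, compared against the Maintainer/Uploaders
# fields. All names below are hypothetical.

import re

# Debian NMU version conventions:
#   native package:      1.2    -> 1.2+nmu1
#   non-native package:  1.2-3  -> 1.2-3.1
NATIVE_NMU = re.compile(r"\+nmu\d+$")
NON_NATIVE_NMU = re.compile(r"-[^-]*\.\d+$")


def looks_like_nmu(version):
    """True if the version string follows the usual NMU conventions."""
    return bool(NATIVE_NMU.search(version) or NON_NATIVE_NMU.search(version))


def unacknowledged_nmu(version, last_uploader, maintainer, uploaders=()):
    """Flag when the latest upload is an NMU made by someone who is not
    the maintainer or one of the uploaders (i.e. not yet acknowledged)."""
    if not looks_like_nmu(version):
        return False
    return last_uploader not in [maintainer] + list(uploaders)


if __name__ == "__main__":
    # Hypothetical example: latest upload 1.2-3.1 was made by a non-maintainer.
    print(unacknowledged_nmu(
        "1.2-3.1",
        "Some Contributor <nmu@example.org>",
        "Package Maintainer <maint@example.org>"))    # prints: True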