How far should we trust ChangeLog attribution dates?

Eric S. Raymond Thu, 21 Dec 2017 13:20:47 -0800

The 'changelogs' command of reposurgeon attempts to fill in author
attributions (author and authorship date) by mining Changelogs.  It
makes use of the fact that changesets often contain a change band for
a Changelog, and that the change band normally contains, or can be
traced back to, exactly one attribution line.


This is a recent feature - I developed it while doing the lift of
GNUPLOT from CVS in early September, a trivial and tiny affair
compared to *this* conversion. The code is pretty adaptable and pulls
sense out of a larger percentage of attribution situations than you
might expect, but it has some significant limitations.

It can reliably pull email addresses for authors - the error (false
attribution) rate on this is pretty low.  Authorship dates are a
different matter.

One limitation is that dates are normally given only as YMD; we don't
get time of day, though as Joseph Myers pointed out we will from some
older GCC commits.  Since there's no right way to complete a YMD, we
complete with the committer's time of day and timezone.

    # Rationale: If the author is the committer and
    # the attribution date is the same day as the
    # commit, this is probably within seconds or minutes
    # of right.  On a different day with the same
    # committer this gets the timezone right. If
    # committer is different and day is different we're
    # making up data out of thin air, but it at least
    # is no worse than any other random choice and
    # has the following good properties: (1) no
    # later than the committer date, and (2) has same
    # mm:ss, making it possible to match the
    # committer date in a display even if we don't
    # know whether it's local or UTC.
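As a rough illustration of the completion rule described in that
comment (this is a sketch, not reposurgeon's actual code; the function
name complete_ymd and the YYYY-MM-DD input format are assumptions made
for the example), the Changelog day is kept and everything finer than
a day is borrowed from the committer timestamp:

```python
from datetime import datetime, timedelta, timezone


def complete_ymd(ymd, committer_date):
    """Complete a Changelog YYYY-MM-DD with the committer's clock.

    Hypothetical helper: keeps the year/month/day from the Changelog
    and takes time of day and UTC offset from the committer timestamp.
    """
    day = datetime.strptime(ymd, "%Y-%m-%d").date()
    # timetz() preserves the committer's tzinfo, so the result carries
    # the committer's timezone as well as the committer's time of day.
    return datetime.combine(day, committer_date.timetz())


if __name__ == "__main__":
    pst = timezone(timedelta(hours=-8))
    committed = datetime(2017, 12, 21, 13, 20, 47, tzinfo=pst)
    print(complete_ymd("2017-12-19", committed).isoformat())
```

When the Changelog day equals the commit day, the result is simply
the committer timestamp, which matches the "probably within seconds
or minutes of right" case in the rationale.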

Another problem, as Joseph has pointed out, is that the YMD we get is no
better than the manually-entered information in the Changelog; humans
sometimes garble these, for example by swapping month with day.

Joseph's reaction when hearing about this feature was to suggest ignoring
the authorship date and simply filling in the committer date.  I could
write a policy switch to do that easily enough.

However, another effect of having done GNUPLOT recently is that I
audited for dates that were obviously bogus (it was small enough for
that to be practical, about as old as GCC but only 10K commits) and
only found a tiny number - 2 or 3 - of bogons.  What I did, therefore,
is trust the dates but include in the conversion commentary a description
of the limitations and uncertainty in the data.

We really have three possible policies here, and the choice among them
is partly a philosophical one:

(i) Don't trust the authorship dates.  Clobber them with commit dates.
Has the advantage that all dates are of consistent quality, and the
disadvantage that if the author and committer aren't the same person
the quality is "always wrong, and by an unbounded amount of time".

(ii) Always trust the authorship dates.  Most will be right to within
24 hours.  A few percent will be garbage. (This is what reposurgeon
did last week.)

(iii) Trust after filtering for obvious bogons like (a) author date in
the future of the commit date, (b) impossible month or year number. If
a date fails the filter, use the committer YMD instead. (This is what
reposurgeon does now.)
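
A minimal sketch of what a policy (iii) filter could look like,
assuming YYYY-MM-DD input; the name pick_author_ymd is made up for
illustration and this is not the code reposurgeon actually runs:

```python
from datetime import date, datetime, timedelta, timezone


def pick_author_ymd(changelog_ymd, committer_date):
    """Return the Changelog date if plausible, else the committer YMD.

    Hypothetical helper illustrating policy (iii).
    """
    fallback = committer_date.date()
    try:
        year, month, day = (int(part) for part in changelog_ymd.split("-"))
        # date() raises ValueError for an impossible month or day.
        candidate = date(year, month, day)
    except ValueError:
        return fallback
    # An authorship date later than the commit date is an obvious bogon.
    if candidate > fallback:
        return fallback
    return candidate


if __name__ == "__main__":
    pst = timezone(timedelta(hours=-8))
    committed = datetime(2017, 12, 21, 13, 20, 47, tzinfo=pst)
    for ymd in ("2017-12-19", "2017-13-02", "2018-01-05"):
        print(ymd, "->", pick_author_ymd(ymd, committed))
```

Note that a month/day swap which still yields a valid earlier date
would pass a filter like this one; it only catches the obvious cases.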

I think (iii) wins here, if the figure of merit is "lowest cumulative
authorship-time error over the whole history".

But this is a decision for the customer(s) to make.
-- 
                <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
