On Wed, 8 Jan 2020, Eric S. Raymond wrote:

> They use your feedback to find places where their comment-processing
> scripts could be improved; we've used it learn what additional
> oddities in ChangeLogs we need to be able to handle automatically.

I've compared the authors in the two conversions to identify commits for manual review, looking at cases where the conversions assign different human identities to the author, not just different email addresses or name variants. ChangeLog parsing is the most subjective part of doing a conversion, so commits where different heuristics produce different results are the ones most worth reviewing by hand.

About 1600 of those commits had no changes to ChangeLog files but a ChangeLog entry in the commit message; I reviewed those mostly automatically to make sure I agreed with Maxim's author extraction, with only limited manual checks on those that looked suspect. The rest involved reviewing around 3000 commits manually, and I've now completed that review. Some of those remain subjective even after review (for example, where the commit involved one person backporting another person's patch).

For example, in the set of around 1200 commits that changed both ChangeLog and non-ChangeLog files and did not look like backports, I arrived at around 400 author improvements from this review (not all of them the same authors as in Maxim's conversion), while for around 800 commits I concluded the reposurgeon author was preferable. (The typical case where reposurgeon does better is where successive commits add new ChangeLog entries under an existing ChangeLog header. The typical case where I added fixes was where a commit made nonsubstantive changes under an existing header as well as adding new entries; that is hard to distinguish automatically from a multi-author commit, so reposurgeon conservatively treats it as one.)

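
To make the comparison step concrete, here is a minimal sketch of how commits with differing human identities might be flagged. It is not the script used for the GCC conversion. The function names, the alias table and the sample authors are all hypothetical, and it assumes each conversion can be reduced to a mapping from SVN revision number to a (name, email) pair.

```python
from typing import Dict, List, Tuple

Author = Tuple[str, str]  # (name, email)


def canonical_identity(author: Author, aliases: Dict[str, str]) -> str:
    """Return one key per human, folding known email and name variants.

    `aliases` is a hypothetical hand-maintained table mapping a lowercased
    email address or name variant to a canonical name.
    """
    name, email = author
    email_key = email.strip().lower()
    name_key = " ".join(name.lower().split())
    # Known aliases take priority; otherwise fall back to the normalised name.
    return aliases.get(email_key) or aliases.get(name_key) or name_key


def commits_needing_review(
    conv_a: Dict[int, Author],
    conv_b: Dict[int, Author],
    aliases: Dict[str, str],
) -> List[int]:
    """List revisions where the two conversions disagree on the human author."""
    flagged = []
    for rev in sorted(conv_a.keys() & conv_b.keys()):
        id_a = canonical_identity(conv_a[rev], aliases)
        id_b = canonical_identity(conv_b[rev], aliases)
        if id_a != id_b:
            flagged.append(rev)
    return flagged


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    aliases = {"jane@example.org": "jane doe", "jdoe@example.com": "jane doe"}
    conv_a = {1: ("Jane Doe", "jane@example.org"), 2: ("Jane Doe", "jane@example.org")}
    conv_b = {1: ("J. Doe", "jdoe@example.com"), 2: ("John Roe", "john@example.org")}
    print(commits_needing_review(conv_a, conv_b, aliases))
```

Revision 1 differs only in name and email variant, so it is not flagged; revision 2 names a different person in each conversion and would go on the manual review list.
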
In the case of ChangeLog-only commits, where reposurgeon assumes they are likely to be fixing typos or similar and so does not extract an attribution from ChangeLog files in such commits, manual review identified many cases (especially in the earlier parts of the history) where the ChangeLog was committed separately from the substantive parts of the patch and so a better attribution could be assigned to those substantive commits.

I consider the reposurgeon-based conversion machinery to be in essentially its final state now; I don't have any further authors to review, Richard doesn't have any further Bugzilla-based commit summaries to review and we don't know of any relevant reposurgeon bugs or missing features.

I'm running a conversion now to verify both the current state of the fixups and the Makefile integration of the conversion and subsequent automated validation, and will make that converted repository available for final checks if this succeeds. Compared to the previous converted repository, this one has many author fixups, a fix for a bug in the author fixups where they broke commit dates, and reposurgeon improvements to avoid producing unidiomatic empty git commits in the converted repository for things such as branch and tag creation.

This converted repository uses the ref rearrangements along the lines proposed by Richard (so dead branches and vendor branches are available but not fetched by default); the objects from the existing git mirror will also be included in the repository (so existing gitweb links to such objects in list archives continue to work, for example, as long as they aren't links to objects that were made unreachable at some point in the mirror's history), but again under ref names that are not fetched by default.

As noted on overseers, once Saturday's DATESTAMP update has run at 00:16 UTC on Saturday, I intend to add a README.MOVED_TO_GIT file on SVN trunk and change the SVN hooks to make SVN read-only, then disable gccadmin's cron jobs that build snapshots and update online documentation until they are ready to run with the git repository. Once the existing git mirror has picked up the last changes I'll make that read-only and disable that cron job as well, and start the conversion process with a view to having the converted repository in place this weekend (it could either be made writable as soon as I think it's ready, or left read-only until people have had time to do any final checks on Monday). Before then, I'll work on hooks, documentation and maintainer-scripts updates.

As well as having objects from the existing git mirror available under refs that are not fetched by default, that mirror will remain available read-only at git://gcc.gnu.org/git/gcc-old.git (which already exists, currently a symlink to the mirror).

-- 
Joseph S. Myers
jos...@codesourcery.com