I've made test conversions of the GCC repository with reposurgeon
available (a gcc.gnu.org / sourceware.org account is required to
access these git+ssh repositories, but it doesn't need to be one in
the gcc group or to have shell access).  More information about the
repositories, conversion choices made and known issues is given below.
As noted there, I'm running another conversion now with fixes for some
of those issues; the remaining listed issues, not fixed in that
conversion, are being actively worked on.

git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-1a.git
git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-1b.git

The two repositories have exactly the same objects (thus, exactly the
same commit graph).  The only difference is that the 1a conversion has
branches and tags named the same as in SVN (or as similarly as
possible; tags in branches/st/tags/ in SVN become tags in
refs/tags/st/ in git), whereas the 1b conversion has refs rearranged
as suggested by Richard (meaning most are not fetched by default, so
you may wish to clone with --mirror to inspect them more closely).  We
can of course do a different rearrangement if desired.
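
To make the 1a naming rule concrete, here is a minimal Python sketch
of the mapping from SVN paths to git ref names.  It is only an
illustration, not what reposurgeon actually runs: the function name is
hypothetical, the use of refs/heads/ for branches is my assumption
(the usual git convention), and trunk is deliberately not handled.

```python
def svn_path_to_git_ref(svn_path):
    """Map an SVN branch or tag path to a 1a-style git ref name.

    Hypothetical helper for illustration only.
    """
    path = svn_path.strip("/")
    # Special case described above: tags kept under branches/st/tags/.
    prefix = "branches/st/tags/"
    if path.startswith(prefix):
        return "refs/tags/st/" + path[len(prefix):]
    if path.startswith("tags/"):
        return "refs/tags/" + path[len("tags/"):]
    # Assumption: ordinary branches land under refs/heads/.
    if path.startswith("branches/"):
        return "refs/heads/" + path[len("branches/"):]
    raise ValueError("not a branch or tag path: %r" % svn_path)


if __name__ == "__main__":
    for example in ("/branches/libstdcxx_so_7-2-branch",
                    "/branches/st/tags/example-tag"):
        print(example, "->", svn_path_to_git_ref(example))
```

The tag name "example-tag" is made up for the demonstration.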

The repositories also include refs/deleted refs for each commit that
deleted a tag or branch in SVN (to be precise, the ref points to a
commit deleting all the tag or branch contents, so preserving the
original commit message for the deletion; its parent is thus the final
state of the tag or branch before deletion).  We may or may not want
these in the final conversion, but it seems useful to have them at
this point for verification purposes (in particular, I intend to
implement a check that the final state of each tag or branch before
deletion is correct, as a further check that the conversion machinery
is working correctly).

The repositories don't include refs for the version of history from
the old git-svn mirror, but I have a script to add them (in
refs/git-old/ and refs/git-svn-old/) for the benefit of people wishing
to interpret old commit hashes after the conversion and to make things
more convenient for people wishing to rebase active git-only branches
onto the new version of the history.  The script is independent of
reposurgeon; it's just a single "git fetch" command (which should be
followed by "git gc --aggressive").

The repositories include all the non-deleted branches and tags in the
SVN repository (and, outside refs/deleted/, that is the exact set of
branches and tags present).  For this purpose, the file
branches/st/README in the SVN repository is considered to have its own
branch.  reposurgeon generates a "root" branch for commits to paths
not part of any branch; this is not included in these repositories
(it has been deleted at the git level) because I don't believe it
contains anything plausibly relevant in git.  The commits to
branches/st/README were moved to their own branch, as noted; all other
commits that end up in "root" are either commits wrongly creating a
branch or tag at top level rather than /branches or /tags, commits
deleting such branches or tags created in the wrong place, or changes
to the SVN /hooks directory.

As far as I know, all issues affecting commit tree contents have been
fixed, as have some previously noted issues with some merge commits
having too many parents, and incorrect attributions seen in an earlier
conversion of Richard's.  Tree contents are verified correct at every
non-deleted branch tip and tag (I intend to do such validation for
deleted branches and tags as well, but haven't yet implemented it).
For comparisons, the following methodology applies:

1. Empty directories are removed from the SVN checkout, because git
doesn't store empty directories.

2. .cvsignore files are excluded from the comparison, since
reposurgeon doesn't include them (but if people want them in the git
history, they could easily be included).

3. .gitignore files are excluded from the comparison, since
reposurgeon generates one based on svn:ignore properties (or SVN
defaults, in the absence of such properties) where the repository
doesn't have one checked in (where there *is* a .gitignore file
checked into SVN, it's preferred over the auto-generated one).

4. Cases of SVN keyword expansion are excluded manually (only two
branches have files with SVN keyword expansion enabled).
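
As a rough illustration of the comparison rules above, the following
Python sketch compares an SVN checkout with a git checkout of the same
branch tip or tag.  It is not the actual verification script; the
function names are hypothetical, and it assumes both trees are already
checked out on disk.  Empty directories drop out naturally because
only files are listed.  Keyword-expansion cases are not handled here
(they are excluded manually, as described), nor are symlinks or file
modes.

```python
import os

# Files left out of the comparison, per the methodology above.
EXCLUDED_NAMES = {".cvsignore", ".gitignore"}


def comparable_files(root, skip_dirs=(".git", ".svn")):
    """Return the set of relative file paths that take part in the check."""
    result = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into version-control metadata directories.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            if name in EXCLUDED_NAMES:
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            result.add(rel.replace(os.sep, "/"))
    return result


def _read(path):
    with open(path, "rb") as handle:
        return handle.read()


def tree_differences(svn_root, git_root):
    """Return (only_in_svn, only_in_git, differing) lists of paths."""
    svn_files = comparable_files(svn_root)
    git_files = comparable_files(git_root)
    only_svn = sorted(svn_files - git_files)
    only_git = sorted(git_files - svn_files)
    differing = sorted(
        p for p in svn_files & git_files
        if _read(os.path.join(svn_root, p)) != _read(os.path.join(git_root, p))
    )
    return only_svn, only_git, differing
```

Three empty lists from tree_differences() mean the two trees match
under these rules.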

Every branch with SVN ancestry based on the first commit of /trunk has
first-parent ancestry in git going back to that commit, as expected.
This includes libstdcxx_so_7-2-branch, which was created via creating
the directory for the branch and then copying only the libstdc++-v3
subdirectory from trunk rather than directly copying the whole of
/trunk to /branches/libstdcxx_so_7-2-branch; reposurgeon has detected
that case automatically and created an appropriate parent link to the
relevant trunk commit.
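
The first-parent property can be checked mechanically.  Here is a
minimal Python sketch, assuming the commit graph has already been
loaded into a dictionary mapping each commit to its ordered list of
parents (for instance from the output of "git rev-list --parents").
The function names and the toy graph are hypothetical, not part of the
conversion tooling.

```python
def first_parent_chain(parents, tip):
    """Yield tip, then each first-parent ancestor in turn."""
    seen = set()
    commit = tip
    while commit is not None and commit not in seen:
        seen.add(commit)
        yield commit
        plist = parents.get(commit, [])
        # Follow only the first parent; merged-in history is ignored.
        commit = plist[0] if plist else None


def reaches_by_first_parent(parents, tip, root):
    """True if root is on the first-parent chain starting at tip."""
    return any(commit == root for commit in first_parent_chain(parents, tip))


if __name__ == "__main__":
    # Toy graph: "trunk1" stands in for the first commit of /trunk.
    graph = {
        "trunk1": [],
        "trunk2": ["trunk1"],
        "branch1": ["trunk2"],
        "merge1": ["branch1", "trunk2"],
    }
    print(reaches_by_first_parent(graph, "merge1", "trunk1"))
```

A branch tip that reaches the first trunk commit only through a second
or later parent would fail this check, which is the distinction that
matters here.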

Some parts of Richard's commit message improvements are present, but
most aren't because of an issue accessing Bugzilla (which also
affected some of the improvements not involving accessing Bugzilla, as
the script terminated early).  This should be fixed for my next
conversion run.  As discussed, Richard's improvements only add new
summary lines, with the original commit message following them.


Known issues (all either already fixed or understood and currently
being worked on):

1. Some cherry-picks are showing up as merges (this is the only issue
I could find in my checks, manual and automated, that affects the
commit graph; I couldn't find any issues affecting tree contents,
first-parent ancestry or the set of refs present).  Being worked on by
Julien "_FrnchFrgg_" RIVAUD.

2. Branch creation or recreation commits have attribution taken from
some ChangeLog file in the branch when it should come from the SVN
committer.  Being worked on by Eric S. Raymond.

3. There are still some merge commits with too many parents, although
the cases Richard found have all been fixed.  All those parents in the
cases I found are genuinely ancestors of the merge commit in question,
so the redundant ones are essentially a cosmetic issue: they won't
affect anything in default "git log" output other than the "Merge:"
line, for example, as "git log" orders by commit timestamp by
default.  Being worked on by Julien "_FrnchFrgg_" RIVAUD.

4. Only files called ChangeLog are used to extract attributions, not
ChangeLog.<branch> (fixed for my next conversion run, currently
running at the git-fast-import stage).

5. Most of Richard's commit message improvements aren't present (fixed
for my next conversion run, currently running at the git-fast-import
stage).
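
To illustrate what "redundant" means in issue 3 above, here is a small
Python sketch.  It assumes a parent counts as redundant when it is
already an ancestor of one of the merge's other parents; that is my
reading for the purpose of the example, not necessarily the exact rule
the fix will implement.  The graph is a dictionary from commit to
ordered parent list, and all names are hypothetical.

```python
def ancestors(parents, commit):
    """Return the set of all strict ancestors of commit."""
    seen = set()
    stack = list(parents.get(commit, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def redundant_parents(parents, merge):
    """List parents of merge already reachable from another parent."""
    plist = parents.get(merge, [])
    redundant = []
    for candidate in plist:
        others = [p for p in plist if p != candidate]
        if any(candidate in ancestors(parents, other) for other in others):
            redundant.append(candidate)
    return redundant


if __name__ == "__main__":
    graph = {
        "a": [],
        "b": ["a"],
        "c": ["a"],
        "m": ["b", "c", "a"],  # "a" is already reachable via "b" and "c"
    }
    print(redundant_parents(graph, "m"))  # prints ['a']
```

Dropping such a parent leaves the set of ancestors of the merge commit
unchanged, which is why the issue is cosmetic.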


Points for consideration:

1. Do we want some kind of rearrangement of refs as in the 1b
repository or not?

2. Should the final converted repository contain refs/deleted/ refs or
not?

3. Where an attribution comes from an author map rather than a
ChangeLog file, do we wish to use the existing author map or do people
prefer using names from that map but with @gcc.gnu.org addresses (and
@gnu.org for usernames that only committed in the gcc2 period)?
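
For point 3, the alternative address scheme could be sketched in
Python as below.  This only illustrates the proposal: the function
name, the username and the full name are made up, and whether a
username committed only in the gcc2 period is passed in as a flag
rather than derived from the history.

```python
def proposed_attribution(username, full_name, gcc2_only=False):
    """Build a 'Name <address>' string under the alternative scheme."""
    # @gnu.org for gcc2-only usernames, @gcc.gnu.org otherwise.
    domain = "gnu.org" if gcc2_only else "gcc.gnu.org"
    return "%s <%s@%s>" % (full_name, username, domain)


if __name__ == "__main__":
    print(proposed_attribution("jdoe", "J. Doe"))
    print(proposed_attribution("jdoe", "J. Doe", gcc2_only=True))
```
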

-- 
Joseph S. Myers
jos...@codesourcery.com