I've made test conversions of the GCC repository with reposurgeon available (a gcc.gnu.org / sourceware.org account is required to access these git+ssh repositories; it doesn't need to be one in the gcc group or to have shell access). More information about the repositories, conversion choices made and known issues is given below. As noted there, I'm running another conversion now with fixes for some of those issues, and the remaining listed issues not fixed in that conversion are being actively worked on.

git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-1a.git
git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-1b.git

The two repositories have exactly the same objects (thus, exactly the same commit graph). The only difference is that the 1a conversion has branches and tags named the same as in SVN (or as similarly as possible; tags in branches/st/tags/ in SVN become tags in refs/tags/st/ in git), whereas the 1b conversion has refs rearranged as suggested by Richard (meaning most are not fetched by default, so you may wish to clone with --mirror to inspect them more closely). We can of course do a different rearrangement if desired.

The repositories also include refs/deleted refs for each commit that deleted a tag or branch in SVN (to be precise, the ref points to a commit deleting all the tag or branch contents, so preserving the original commit message for the deletion; its parent is thus the final state of the tag or branch before deletion). We may or may not want these in the final conversion, but it seems useful to have them at this point for verification purposes (in particular, I intend to implement a check that the final state of each tag or branch before deletion is correct, as a further check that the conversion machinery is working correctly).

The repositories don't include refs for the version of history from the old git-svn mirror, but I have a script to add them (in refs/git-old/ and refs/git-svn-old/) for the benefit of people wishing to interpret old commit hashes after the conversion and to make things more convenient for people wishing to rebase active git-only branches onto the new version of the history. The script is independent of reposurgeon; it's just a single "git fetch" command (which should be followed by "git gc --aggressive").

The repositories include all the non-deleted branches and tags in the SVN repository (and, outside refs/deleted/, that is the exact set of branches and tags present).
For this purpose, the file branches/st/README in the SVN repository is considered to have its own branch. reposurgeon generates a "root" branch for commits to paths not part of any branch; this is not included in these repositories (has been deleted at the git level) because I don't believe it contains anything plausibly relevant in git. The commits to branches/st/README were moved to their own branch, as noted; all other commits that end up in "root" are either commits wrongly creating a branch or tag at top level rather than /branches or /tags, commits deleting such branches or tags created in the wrong place, or changes to the SVN /hooks directory.

As far as I know, all issues affecting commit tree contents have been fixed, as have some previously noted issues with some merge commits having too many parents, and incorrect attributions seen in an earlier conversion of Richard's. Tree contents are verified correct at every non-deleted branch tip and tag (I intend to do such validation for deleted branches and tags as well, but haven't yet implemented it).

For comparisons, the following methodology applies: empty directories are removed from the SVN checkout, because git doesn't store empty directories; .cvsignore files are excluded from the comparison, since reposurgeon doesn't include them (but if people want them in the git history, they could easily be included); .gitignore files are excluded from the comparison, since reposurgeon generates one based on svn:ignore properties (or SVN defaults, in the absence of such properties) where the repository doesn't have one checked in (where there *is* a .gitignore file checked into SVN, it's preferred over the auto-generated one); cases of SVN keyword expansion are excluded manually (only two branches have files with SVN keyword expansion enabled).

Every branch with SVN ancestry based on the first commit of /trunk has first-parent ancestry in git going back to that commit, as expected.
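
The tree comparison methodology described above (ignoring empty directories, .cvsignore files and .gitignore files) could be sketched roughly as follows. This is only an illustration and not the script actually used for the verification: the function names, the IGNORED_DIRS set (.svn and .git checkout metadata) and the choice of Python are all assumptions made for the example. It does not handle the manual exclusion of files with SVN keyword expansion, and it does not treat symlinks specially.

```python
import filecmp
import os

# File names skipped on both sides, per the methodology above.
IGNORED_FILES = {".cvsignore", ".gitignore"}
# Assumption: checkout metadata directories that should never be compared.
IGNORED_DIRS = {".svn", ".git"}


def list_files(root):
    """Return the set of relative file paths under root, minus ignored names.

    Only files are collected, so empty directories never take part in the
    comparison, which matches git not storing empty directories.
    """
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune metadata directories in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d not in IGNORED_DIRS]
        for name in filenames:
            if name in IGNORED_FILES:
                continue
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def compare_trees(svn_root, git_root):
    """Return (only_in_svn, only_in_git, differing) as sorted lists."""
    svn_files = list_files(svn_root)
    git_files = list_files(git_root)
    differing = sorted(
        rel
        for rel in svn_files & git_files
        # shallow=False forces a byte-by-byte content comparison.
        if not filecmp.cmp(
            os.path.join(svn_root, rel),
            os.path.join(git_root, rel),
            shallow=False,
        )
    )
    only_in_svn = sorted(svn_files - git_files)
    only_in_git = sorted(git_files - svn_files)
    return only_in_svn, only_in_git, differing
```

Given an SVN checkout of a branch and a git checkout of the corresponding ref (the directory names here are hypothetical), compare_trees("svn-checkout", "git-checkout") returning three empty lists would mean the two trees match under these rules.
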
This first-parent ancestry holds even for libstdcxx_so_7-2-branch, which was created by creating the directory for the branch and then copying only the libstdc++-v3 subdirectory from trunk, rather than by directly copying the whole of /trunk to /branches/libstdcxx_so_7-2-branch; reposurgeon has detected that case automatically and created an appropriate parent link to the relevant trunk commit.

Some parts of Richard's commit message improvements are present, but most aren't because of an issue accessing Bugzilla (which also affected some of the improvements not involving accessing Bugzilla, as the script terminated early). This should be fixed for my next conversion run. As discussed, Richard's improvements only add new summary lines, with the original commit message following them.

Known issues (all either already fixed or understood and currently being worked on):

1. Some cherry-picks are showing up as merges (this is the only issue I could find in my checks, manual and automated, that affects the commit graph; I couldn't find any issues affecting tree contents, first-parent ancestry or the set of refs present). Being worked on by Julien "_FrnchFrgg_" RIVAUD.

2. Branch creation or recreation commits have attribution taken from some ChangeLog file in the branch when it should come from the SVN committer. Being worked on by Eric S. Raymond.

3. There are still some merge commits with too many parents, although the cases Richard found have all been fixed (and all those parents in the cases I found are genuinely ancestors of the merge commit in question, so it's essentially a cosmetic issue that some of them are redundant; it won't affect anything in default "git log" output other than the "Merge:" line, for example, as "git log" orders by commit timestamp by default). Being worked on by Julien "_FrnchFrgg_" RIVAUD.

4. Only files called ChangeLog are used to extract attributions, not ChangeLog.<branch> (fixed for my next conversion run, currently running at the git-fast-import stage).

5. Most of Richard's commit message improvements aren't present (fixed for my next conversion run, currently running at the git-fast-import stage).

Points for consideration:

1. Do we want some kind of rearrangement of refs as in the 1b repository or not?

2. Should the final converted repository contain refs/deleted/ refs or not?

3. Where an attribution comes from an author map rather than a ChangeLog file, do we wish to use the existing author map or do people prefer using names from that map but with @gcc.gnu.org addresses (and @gnu.org for usernames that only committed in the gcc2 period)?

--
Joseph S. Myers
jos...@codesourcery.com