Hello, 

For the past 8 months I've been working on the process and tooling to convert
the Kuali Student Subversion repository into Git and to support CI on pull
requests, with automatic merging to trunk once the builds are green and the
appropriate sign-off has been provided.

The Kuali Student project is being restructured, so my work on CI was halted,
but I was able to get the repository converted and placed on GitHub:
https://github.com/kuali-student/archived-from-svn (contains revisions r1
through r77740 in 20,631 branches and 95,297 Git commits).

This work will not be officially supported by the Kuali Foundation going
forward, but I'm personally interested in seeing other projects use it to
convert from Subversion to Git (and in finding any edge cases it might not be
handling correctly right now).

Since our conversion was successful, I wanted to alert users on both the Git
and JGit mailing lists to the Java-based conversion program that I wrote to
perform it.

The code and some cursory instructions on how to use it are located here:
https://github.com/kuali-student/git-repository-tools
The current version can be downloaded from Maven Central:
http://search.maven.org/#artifactdetails|org.kuali.student.repository|git-importer|0.0.4|jar

It is intended for larger repositories like Kuali Student, with around 100,000
or more revisions, many product streams and at times questionable branch
naming strategies. Instead of requiring you to enumerate the branches you
want, this conversion program extracts everything.

You can specify a per-repository branch detection strategy to handle
non-standard cases so that the history makes more sense when you look back
through it, but it shouldn't be needed to get an accurate repository.

Key features:

. Full repository conversion by parsing Subversion version 2 dump streams and 
writing into a bare Git repository.
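
To give a feel for what reading a version 2 dump stream involves, here is a
minimal sketch. It is not the git-importer's parser, and the class and method
names are made up for illustration. It relies only on the general shape of the
dump format: each record is a block of "Name: value" headers terminated by a
blank line, optionally followed by Content-length bytes of property and file
data, which this sketch simply skips.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the git-importer's actual parser.
class DumpHeaderReader {

    // Reads bytes up to the next '\n'; returns null at end of stream.
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                return buf.toString("UTF-8");
            }
            buf.write(b);
        }
        return buf.size() == 0 ? null : buf.toString("UTF-8");
    }

    // Returns the header block of every record in the stream, in order.
    static List<Map<String, String>> readRecords(InputStream in) throws IOException {
        List<Map<String, String>> records = new ArrayList<>();
        String line;
        while ((line = readLine(in)) != null) {
            if (line.isEmpty()) {
                continue; // blank separator between records
            }
            Map<String, String> headers = new LinkedHashMap<>();
            while (line != null && !line.isEmpty()) {
                int colon = line.indexOf(": ");
                if (colon > 0) {
                    headers.put(line.substring(0, colon), line.substring(colon + 2));
                }
                line = readLine(in);
            }
            // Skip the property and text bytes by count, so content that
            // happens to contain blank lines cannot confuse the reader.
            long remaining = Long.parseLong(headers.getOrDefault("Content-length", "0"));
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    if (in.read() == -1) {
                        break; // truncated stream
                    }
                    skipped = 1;
                }
                remaining -= skipped;
            }
            records.add(headers);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String dump = "SVN-fs-dump-format-version: 2\n\n"
                + "Revision-number: 1\nProp-content-length: 10\nContent-length: 10\n\n"
                + "PROPS-END\n\n"
                + "Node-path: trunk/a.txt\nNode-action: add\n"
                + "Text-content-length: 3\nContent-length: 3\n\nhi\n\n\n";
        for (Map<String, String> record : readRecords(
                new ByteArrayInputStream(dump.getBytes(StandardCharsets.UTF_8)))) {
            System.out.println(record);
        }
    }
}
```

A real converter would of course read the property and text bytes instead of
skipping them, and would wrap the input in a bzip2 decompressing stream when
working from compressed dumps.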

. Automatic branch detection 
o We figure out how to split the full path to a blob into a branch part and a
file part. The branch part becomes the name of the branch and the file part
becomes the path within the Tree object of the commit.
o We track copy-from information, similarly (I think) to how git-svn does, so
that for each Subversion revision we have a list of all of the Git branches and
their heads at that point in time.
o We convert everything, but if the branch naming is non-standard you can end
up with branches containing subdirectories that really should have been
separate branches themselves.
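
To make the branch part / file part split concrete, here is a minimal sketch
that assumes only the conventional trunk, branches/<name> and tags/<name>
layout. The importer's actual detection rules are not described in this
message, so the class name and the heuristic are illustrative assumptions
only.

```java
import java.util.Arrays;

// Hypothetical standard-layout heuristic; not the importer's actual rules.
class BranchPathSplitter {

    // Returns {branchPart, filePart}, or null when no branch can be detected.
    static String[] split(String svnPath) {
        String[] parts = svnPath.split("/");
        for (int i = 0; i < parts.length; i++) {
            int branchEnd; // index one past the last segment of the branch part
            if (parts[i].equals("trunk")) {
                branchEnd = i + 1;
            } else if (parts[i].equals("branches") || parts[i].equals("tags")) {
                branchEnd = i + 2; // the next segment names the branch or tag
            } else {
                continue;
            }
            if (branchEnd > parts.length) {
                return null; // e.g. the bare "branches" directory itself
            }
            String branch = String.join("/", Arrays.copyOfRange(parts, 0, branchEnd));
            String file = String.join("/", Arrays.copyOfRange(parts, branchEnd, parts.length));
            return new String[] { branch, file };
        }
        return null;
    }

    public static void main(String[] args) {
        String[] paths = {
            "enrollment/ks-core/trunk/pom.xml",
            "enrollment/branches/2.0/src/Main.java",
            "branches/team/feature/a.txt"
        };
        for (String path : paths) {
            System.out.println(path + " -> " + Arrays.toString(split(path)));
        }
    }
}
```

The last path in main shows the caveat about non-standard naming: with a
layout like branches/team/feature, this heuristic treats "branches/team" as
the branch and "feature" as an ordinary subdirectory, which is the situation a
custom per-repository strategy is meant to handle.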

. Plugin mechanism to define custom per-repository branch detection (before
falling back on the standard mechanisms)
o See how the student-plugin
(https://github.com/kuali-student/git-repository-tools/tree/master/git-importer-student-plugin)
was set up with its own custom branch detection logic (that scheme went
through lots of conversion iterations).
o Also look at how unit testing can be done on the repository-specific branch
scenarios.

. Fusion instead of submodules for svn:externals
o The fusion-maven-plugin was created, and its fuse mojo essentially does a
multi-subtree merge to turn the aggregate branch (the one where the
svn:externals property was set) into a commit whose tree contains actual
materialized subdirectories, each holding the tree of the corresponding module
branch.
o The git-importer leaves fusion-maven-plugin.dat files in the root of the
commit trees where svn:externals had been set in svn, so that this fusion
process can be applied at a later point in time.
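
The idea of fusing module trees into the aggregate tree can be sketched as
follows. This is an illustration only: trees are modelled as plain maps from
path to blob id, the class name is made up, and neither the fuse mojo's real
implementation nor the format of the fusion-maven-plugin.dat files is
reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: a tree is a map from file path to blob id.
class TreeFusion {

    // Places each module tree under its own subdirectory of the aggregate tree.
    // moduleTrees maps the subdirectory name to that module branch's tree.
    static Map<String, String> fuse(Map<String, String> aggregateTree,
                                    Map<String, Map<String, String>> moduleTrees) {
        Map<String, String> fused = new TreeMap<>(aggregateTree);
        for (Map.Entry<String, Map<String, String>> module : moduleTrees.entrySet()) {
            String directory = module.getKey();
            for (Map.Entry<String, String> file : module.getValue().entrySet()) {
                fused.put(directory + "/" + file.getKey(), file.getValue());
            }
        }
        return fused;
    }

    public static void main(String[] args) {
        Map<String, String> aggregate = new LinkedHashMap<>();
        aggregate.put("pom.xml", "blob-1");

        Map<String, String> core = new LinkedHashMap<>();
        core.put("pom.xml", "blob-2");
        core.put("src/A.java", "blob-3");

        Map<String, Map<String, String>> modules = new LinkedHashMap<>();
        modules.put("ks-core", core);

        // The fused tree contains the aggregate's own files plus ks-core/...
        System.out.println(fuse(aggregate, modules));
    }
}
```

Compared with submodules, the result is a single self-contained tree, so a
checkout of the fused commit needs no extra step to fetch the modules.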

. Fairly fast
o Creating the Subversion mirror and dump files can take some time.
o The KS svn repository mirror was 8.8 GB, but that turned into about 20 GB of
bzip2-compressed Subversion version 2 dump streams.
o Running against the existing dump files, the importer took about 12 hours to
perform the full conversion on a low-end Core 2 Duo (3 GHz) writing to a
RAID-0 7200 RPM disk drive.

. Accurate
o I compared our key release tags and development branches by doing a
Subversion export of the equivalent of each Git branch (based on the path and
revision in the comment), adding it into Git and then doing a git diff to make
sure there were no differences. When I found differences I tracked them down,
added unit tests to reproduce them and fixed them.


Additional Cleanup Programs:

The git-repository-tools repository also includes the cleanup programs we
used:
https://github.com/kuali-student/git-repository-tools/tree/master/git-repo-cleaner

Our initial converted repo was 2.8 GB (the final one was 1.3 GB), so we looked
at splitting the graph based on date and then using git grafts to give
developers the full history. The splitting program finds the split point and
writes a grafts file for later use.

But we found out that a certain kind of database file was taking up a lot of 
space (over 50% of the converted repository) so instead of splitting we did 
three cleanup operations:
1. Remove the content of all .mpx database files (SQL files built a database,
which was then dumped out as CSV data stored in files ending in .mpx; those are
what we removed).
2. Remove two big files, one of them > 100 MB, which was blocking the GitHub
upload.
3. Rewrite all of the commits rewritten in steps 1 and 2 to update the
fusion-maven-plugin.dat files generated by the importer to use the latest
commit ids (so that fusion would work).
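
The three operations can be sketched roughly as below. All names are
hypothetical and this is not the git-repo-cleaner code. In particular, the
sketch assumes that commit ids appear in the data files as plain 40-character
hex strings, which is an assumption about a format this message does not
describe.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the cleanup decisions; not the git-repo-cleaner code.
class RepositoryCleanupSketch {

    enum Action { KEEP, EMPTY_CONTENT, REMOVE }

    // Assumption: commit ids are written as 40-character lowercase hex strings.
    private static final Pattern COMMIT_ID = Pattern.compile("\\b[0-9a-f]{40}\\b");

    // Steps 1 and 2: decide what to do with a file while rewriting a tree.
    static Action decide(String path, Set<String> bigFilePaths) {
        if (bigFilePaths.contains(path)) {
            return Action.REMOVE;        // step 2: drop the named big files
        }
        if (path.endsWith(".mpx")) {
            return Action.EMPTY_CONTENT; // step 1: keep the file, drop its content
        }
        return Action.KEEP;
    }

    // Step 3: replace every rewritten commit id with its new id in one pass.
    static String remapCommitIds(String text, Map<String, String> oldToNew) {
        Matcher matcher = COMMIT_ID.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String id = matcher.group();
            String replacement = oldToNew.getOrDefault(id, id);
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String oldId = String.join("", Collections.nCopies(40, "a"));
        String newId = String.join("", Collections.nCopies(40, "b"));
        Map<String, String> oldToNew = new HashMap<>();
        oldToNew.put(oldId, newId);

        System.out.println(decide("data/example.mpx", Collections.<String>emptySet()));
        System.out.println(remapCommitIds("module commit " + oldId, oldToNew));
    }
}
```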

Step 3 was interesting because the fusion-maven-plugin.dat files essentially
contained extra parentage information, so we needed to sort the list of 95,297
commits in such a way that both the real and the fusion parents of a commit
were emitted before the commit itself. I was able to use EWAH compressed
bitmaps for this purpose (I used the EWAH bitmaps directly but was inspired by
the JGit packfile bitmap implementation).
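
The ordering requirement in step 3 amounts to a topological sort over the
union of real and fusion parents. The sketch below shows that idea with an
ordinary hash set and an explicit stack rather than EWAH bitmaps, so it is a
simplified stand-in for the approach described above, not the actual
implementation. The class name and the map-based inputs are assumptions made
for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; uses plain sets instead of EWAH compressed bitmaps.
class ParentsFirstOrder {

    // realParents has one entry per commit (roots map to an empty list).
    // fusionParents holds the extra parents recorded for fused commits.
    // The combined graph is assumed to be acyclic.
    static List<String> sort(Map<String, List<String>> realParents,
                             Map<String, List<String>> fusionParents) {
        Set<String> emitted = new LinkedHashSet<>();
        for (String start : realParents.keySet()) {
            // Explicit stack so very long histories cannot overflow the call stack.
            Deque<String> stack = new ArrayDeque<>();
            stack.push(start);
            while (!stack.isEmpty()) {
                String commit = stack.peek();
                if (emitted.contains(commit)) {
                    stack.pop();
                    continue;
                }
                List<String> parents = new ArrayList<>(
                        realParents.getOrDefault(commit, Collections.<String>emptyList()));
                parents.addAll(
                        fusionParents.getOrDefault(commit, Collections.<String>emptyList()));
                boolean ready = true;
                for (String parent : parents) {
                    if (!emitted.contains(parent)) {
                        stack.push(parent); // handle the parent before this commit
                        ready = false;
                    }
                }
                if (ready) {
                    stack.pop();
                    emitted.add(commit);
                }
            }
        }
        return new ArrayList<>(emitted);
    }

    public static void main(String[] args) {
        Map<String, List<String>> real = new LinkedHashMap<>();
        real.put("C", Arrays.asList("B"));
        real.put("B", Arrays.asList("A"));
        real.put("A", Collections.<String>emptyList());
        real.put("M", Collections.<String>emptyList());

        Map<String, List<String>> fusion = new LinkedHashMap<>();
        fusion.put("B", Arrays.asList("M")); // B's data file refers to module commit M

        System.out.println(sort(real, fusion));
    }
}
```

With an ordering like this, a rewrite pass can always look up the new id of
every parent, real or fusion, before it rewrites the commit that refers to
them.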

The AbstractRepositoryCleaner class can be extended to support other use
cases:
https://github.com/kuali-student/git-repository-tools/blob/master/git-repo-cleaner/src/main/java/org/kuali/student/git/cleaner/AbstractRepositoryCleaner.java

It loads all of the commits in the repository in a parents-first ordering and
then provides hooks to do different things with them. It takes care of
updating the branch and tag references to point at the rewritten commits.

All of this code is licensed under the Educational Community License, Version
2.0 (an add-on to the Apache License, Version 2.0).

Hopefully it will be useful to others, 

Regards, 

Michael

--
Michael O'Cleirigh
Java Developer 
Enterprise Applications and Solutions Integration (EASI)
University of Toronto

