There's been enough interest here in the technical questions I've been 
raising recently that a bit of a backgrounder seems in order.

Back in 2010 I noticed that git's fast-import stream format opened up some 
possibilities its designers probably hadn't anticipated. Their original 
motivation was to make it easy to write exporters from other 
version-control systems.  Their offer was this: here's a flat-file format 
that can serialize the entire state of a git repository.  If you can dump 
your repository state from random version-control system X in this format, 
we can reanimate the history in git.
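
For anyone who hasn't looked at one, a minimal stream reads roughly like 
this - an illustrative fragment with made-up names and content, where each 
"data" keyword is followed by the exact byte count of the payload that 
comes after it:

    blob
    mark :1
    data 13
    hello, world

    commit refs/heads/master
    mark :2
    committer J. Random Hacker <jrh@example.com> 1536000000 +0000
    data 22
    Initial revision of X
    M 100644 :1 hello.txt

Marks like :1 let later stream elements refer back to earlier ones, which 
is what lets a flat file carry an entire DAG of commits.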

This was a very clever idea, and lots of people proceeded to write 
exporters on this model.  A few VCS implementers, noticing that this 
implied a vast one-way traffic of user attention away from them and 
towards git, wrote importers that could consume fast-export streams from 
git back *to* random version-control system X.  I noticed that the effect 
was to turn git stream dumps into a de-facto exchange standard for 
version-control histories.

One of my quirks is that I like thinking about version-control systems and 
the tools around them.  I had been noticing for years that most repository 
conversion tools are pretty bad in a specific way.  They tend to try to 
over-automate the process, producing brute-force conversions full of crufty 
artifacts and minor defects around the places where the data models of 
source and target systems don't quite match.  There were, at the time, no 
tools fit for a human to fix these problems.

Reposurgeon implements - in Python - a domain-specific language for 
describing surgical operations on repositories.  It can be run in batch mode 
or as an interactive structure editor for doing forensics on damaged 
repository metadata.  It works by calling a front end to get a stream dump 
of the repository you want to edit, deserializing the dump into an 
attributed graph, supporting a full repertoire of operations on that graph, 
and then writing the result out as a stream dump fed to an importer for the 
target system.
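
Since this is golang-nuts, here's a deliberately tiny Go sketch of that 
deserialization step.  It is not reposurgeon's actual code - the type and 
field names are made up - and it only picks the commit skeleton out of a 
stream to build a graph keyed by mark:

    // Hypothetical sketch, not reposurgeon's real parser: pull the commit
    // skeleton out of a fast-import stream and build a graph keyed by mark.
    // Caveat: a real parser must consume the exact byte count after each
    // "data" keyword; this line-oriented toy assumes message bodies never
    // contain lines that look like stream commands.
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // Commit is one node in the attributed graph.
    type Commit struct {
        Mark      string   // ":N" mark assigned by the exporter
        Branch    string   // ref the commit was made on
        Committer string   // raw committer attribution line
        Parents   []string // marks of parents, from "from" and "merge"
    }

    func main() {
        graph := make(map[string]*Commit)
        var cur *Commit
        in := bufio.NewScanner(os.Stdin)
        for in.Scan() {
            line := in.Text()
            switch {
            case line == "blob":
                cur = nil // blob marks don't belong to a commit
            case strings.HasPrefix(line, "commit "):
                cur = &Commit{Branch: strings.TrimPrefix(line, "commit ")}
            case cur != nil && strings.HasPrefix(line, "mark "):
                cur.Mark = strings.TrimPrefix(line, "mark ")
                graph[cur.Mark] = cur
            case cur != nil && strings.HasPrefix(line, "committer "):
                cur.Committer = strings.TrimPrefix(line, "committer ")
            case cur != nil && strings.HasPrefix(line, "from "):
                cur.Parents = append(cur.Parents, strings.TrimPrefix(line, "from "))
            case cur != nil && strings.HasPrefix(line, "merge "):
                cur.Parents = append(cur.Parents, strings.TrimPrefix(line, "merge "))
            }
        }
        fmt.Printf("%d commits in graph\n", len(graph))
    }

The real thing, of course, has to carry blob contents, fileops, tags, and 
resets as well, and round-trip all of it faithfully.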

The target system can be the same as the source.  Or a different one.  
There are front-end/back-end pairs to support RCS, CVS, Subversion, git, 
mercurial, darcs, monotone, bitkeeper, bzr, and src. Not all of these 
combinations are well-tested, but moves from CVS, Subversion, git, bzr, and 
mercurial to git or mercurial are pretty solid.

My conjecture that a human-driven tool with good exploratory capabilities 
would produce higher-quality history translations than fully automated 
converters rapidly proved correct - rather dramatically so, in fact. 
Reposurgeon has been the key tool for a great many high-complexity, 
high-risk history translations.  Probably the most consequential single 
success was moving the history of GNU Emacs from bzr to git, cleaning up a 
lot of ancient cruft from RCS and CVS along the way.

Below the size of GCC, Python gave me reasonable turnaround times. This is 
important if you need to do a lot of exploration to find a good conversion 
recipe, which you always do with these large old histories.  But there was 
an adverse-selection effect.  The average size of the histories people 
wanted to convert kept increasing. 
Eventually I designed a semi-custom PC optimized for this workload, with 
high CPU-to-memory bandwidth and beefy primary caches so it could handle 
large working sets - graph-theory problems gigabytes wide.  
Its fans call it the Great Beast, and three years after it was built you 
still can't spec a machine that performs better from COTS parts.

(That may change soon. The guy who actually put together the Beast for me 
is contemplating an upgrade based on the Cascade Lake processors due out 
from Intel in Q4.  His plan is to build one for me and another for Linus 
Torvalds. The clever fellow has a line of sponsors a block long waiting to 
be in the build video.)

Then came GCC.  The GCC repository is over 259K commits.  It brings the 
Great Beast to its knees: test conversions take a minimum of nine hours, 
which is intolerable when you need to iterate.  I concluded that Python 
just doesn't cut it at this scale.  
I then shopped for a new language pretty carefully before choosing Go.  
Compiled Lisp was a contender, and I even briefly considered OCaml.  Go won 
in part because the semantic distance from Python is not all that large, 
even considering the whole late-binding issue.
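
To illustrate what I mean about late binding, here's a made-up example - 
nothing from the actual translation.  Python will happily loop over a 
mixed bag of objects and call a method on each one, trusting that it's 
there; the natural Go rendering pins that duck-typed call site down with 
an explicit interface:

    // Hypothetical illustration, not reposurgeon code: the Go idiom that
    // replaces a Python duck-typed loop like "for e in events: e.emit(fp)".
    package main

    import (
        "fmt"
        "io"
        "os"
    )

    // Event is the contract that Python leaves implicit.
    type Event interface {
        Emit(w io.Writer)
    }

    type Blob struct{ Content string }
    type Tag struct{ Name string }

    func (b Blob) Emit(w io.Writer) { fmt.Fprintf(w, "blob of %d bytes\n", len(b.Content)) }
    func (t Tag) Emit(w io.Writer)  { fmt.Fprintf(w, "tag %s\n", t.Name) }

    func main() {
        events := []Event{Blob{Content: "hello"}, Tag{Name: "release-1.0"}}
        for _, e := range events {
            e.Emit(os.Stdout) // checked at compile time, not looked up by name at runtime
        }
    }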

Python reposurgeon is about 14 KLOC.  In six days I've translated 
about 11% of it, building unit tests as I go. (There's already a very 
strong set of functional and end-to-end tests. There has to be; I wouldn't 
dare modify it otherwise. To say it's become algorithmically dense enough 
to be tricky to change would be to wallow in understatement...)

However, this is only my second Go project.  My first was a finger 
exercise, a re-implementation of David A. Wheeler's sloccount tool.  So I'm 
in a weird place where I'm translating rather advanced Python into rather 
advanced Go (successfully, as far as I can tell before the end-to-end 
tests) but I still have newbie questions.

I translated an auxiliary tool called repocutter (it slices and dices 
Subversion dump streams) as a warm-up.  This leads me to believe I can 
expect about a 40x speedup.  That's worth playing for, even before I do 
things like exploiting concurrency for faster searches.
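
Nothing exotic in mind for the concurrency part - just fanning a search 
predicate out over slices of the commit list.  A rough sketch, with 
made-up types rather than reposurgeon's:

    // Rough, hypothetical sketch of a parallel search over a commit list;
    // not reposurgeon code.  Result order is not deterministic.
    package main

    import (
        "fmt"
        "runtime"
        "strings"
        "sync"
    )

    type Commit struct {
        Mark    string
        Comment string
    }

    // parallelSelect returns the marks of commits whose comment contains
    // text, splitting the scan across one goroutine per CPU.
    func parallelSelect(commits []Commit, text string) []string {
        n := runtime.NumCPU()
        chunk := (len(commits) + n - 1) / n
        if chunk == 0 {
            return nil
        }
        var (
            mu   sync.Mutex
            wg   sync.WaitGroup
            hits []string
        )
        for lo := 0; lo < len(commits); lo += chunk {
            hi := lo + chunk
            if hi > len(commits) {
                hi = len(commits)
            }
            wg.Add(1)
            go func(span []Commit) {
                defer wg.Done()
                var local []string
                for _, c := range span {
                    if strings.Contains(c.Comment, text) {
                        local = append(local, c.Mark)
                    }
                }
                mu.Lock()
                hits = append(hits, local...)
                mu.Unlock()
            }(commits[lo:hi])
        }
        wg.Wait()
        return hits
    }

    func main() {
        commits := []Commit{{":1", "fix typo"}, {":2", "add feature"}, {":3", "fix build"}}
        fmt.Println(parallelSelect(commits, "fix"))
    }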



