In early January of this year I finally got reposurgeon's interpretation of Subversion dump streams to a quality level at which I consider that code finished. This was, of course, essential for the conversion of the GCC repository.
I've been able to let that code cool off for three months now. No new bug reports; the only known issue is an O(n**2) slowdown in one stage that we're still tracing, because most repositories never trigger it.

Only now am I in a position to state unequivocally that getting Subversion dump interpretation right was (a) possible for a third party, and (b) extremely difficult. It took me from 2012 to 2020 to get this done. That's eight years, and it required a significant fraction of my programming time during those eight years. This problem was *hard*.

I'm reporting this here because of a fact I suspect Subversion developers might find interesting: I had to spend most of my Subversion-related development time dealing with the consequences of one single point feature of dump streams. Almost everything else about them was relatively trivial.

Before I point at the one big blocking feature, however, I want to mention its runner-up, the main reason I said "almost". Tracking executability in a history was a little tricky because resolving the set of properties active at a node is a little tricky. You have to keep pathwise journals, which is easier to describe than to do; there's a toy sketch of that bookkeeping below.

This pales, however, in comparison to the rather nightmarish complexity required to interpret directory-copy operations correctly and completely. *That's* what took me eight years to get right.

In the process, I discovered a new way of classifying data structures that is going to be a major theme in my next book, "The Programmer's Way: A Guide to Right Mindset". It classifies them according to a property I have named "semantic locality". I don't need to give a full explanation of the general concept to convey how it applies here.

A directory copy is the only operation in a stream dump that requires lookback arbitrarily far back in the history. (It might look like interpreting an ordinary file change does too, but remember that my goal is to map changesets to a different representation of changesets. My interpreter neither needs nor wants to *compose* changesets.)

In my new terminology, I say that dump streams would have good semantic locality - the context needed to interpret any individual part is always nearby and/or cheap to compute - except that this one operation gives them extremely poor semantic locality. The reason is that the individual file copies making up a directory copy are not specified explicitly as source and destination (revision, path) pairs, though of course an individual file-copy operation exists as well. Instead, the set of content copies to be done is implicit and has to be computed by reconstructing what paths were visible at the source revision and wildcarding over them; the second sketch below shows what that entails.

Getting that exactly right turns out to be a real rolling bitch - a rabbit hole of subtle bugs and nigh-endless edge-case gotchas. Even when you have it right, it's expensive in all directions - it blows up your code complexity, your working-set size, and your time to completion.

I eventually solved the time-to-completion problem pretty well by throwing massive concurrency at building the required intermediate data structures. But before I could do that I had to spend a solid year of effort changing implementation languages from Python to Go. The fact that I was backed into doing that tells you what a showstopper that performance problem was; it blocked the GCC conversion for about 14 months.

Moving to Go also whacked the working-set problem pretty hard; the change from runtime-polymorphic to statically-typed objects pared away a lot of memory overhead. The code-complexity cost, however, is irreducible.
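To make the property bookkeeping concrete, here is a minimal sketch in Go of the pathwise-journal idea. It is not reposurgeon's actual code; the propJournal type and its methods are hypothetical, and a real interpreter also has to merge property deltas and follow copies. It only shows the shape of the problem: to know whether svn:executable is in force at a node, you look up the most recent property set recorded for that path at or before the node's revision.

	// Minimal sketch of per-path property journaling (not reposurgeon's
	// actual data structures). Each path accumulates a journal of
	// (revision, properties) entries; resolving the properties active at
	// a node means finding the latest entry at or before that revision.
	package main

	import (
		"fmt"
		"sort"
	)

	type propEntry struct {
		revision int
		props    map[string]string // e.g. "svn:executable" -> "*"
	}

	type propJournal struct {
		byPath map[string][]propEntry // entries kept in revision order
	}

	func newPropJournal() *propJournal {
		return &propJournal{byPath: make(map[string][]propEntry)}
	}

	// record notes the full property set attached to path at revision.
	func (j *propJournal) record(path string, revision int, props map[string]string) {
		j.byPath[path] = append(j.byPath[path], propEntry{revision, props})
	}

	// activeAt returns the property set in force for path at revision,
	// i.e. the most recent recorded set at or before that revision.
	func (j *propJournal) activeAt(path string, revision int) map[string]string {
		entries := j.byPath[path]
		// Find the first journal entry *after* the revision of interest.
		i := sort.Search(len(entries), func(k int) bool {
			return entries[k].revision > revision
		})
		if i == 0 {
			return nil // nothing recorded for this path yet
		}
		return entries[i-1].props
	}

	func (j *propJournal) executableAt(path string, revision int) bool {
		_, ok := j.activeAt(path, revision)["svn:executable"]
		return ok
	}

	func main() {
		j := newPropJournal()
		j.record("trunk/build.sh", 3, map[string]string{"svn:executable": "*"})
		j.record("trunk/build.sh", 9, map[string]string{}) // property removed
		fmt.Println(j.executableAt("trunk/build.sh", 5)) // true
		fmt.Println(j.executableAt("trunk/build.sh", 9)) // false
	}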
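And here is a second toy sketch, again in Go and again not reposurgeon's implementation (the nodeAction type and the linear replay are illustrative simplifications), of what expanding a directory copy entails. Notice that visibleUnder has to walk everything up to the source revision: that whole-history lookback is exactly the poor semantic locality described above.

	// Illustrative sketch (not reposurgeon's implementation) of why a
	// directory copy forces arbitrary lookback: expanding it into
	// explicit per-file copies requires reconstructing every path that
	// was visible under the source directory at the source revision.
	package main

	import (
		"fmt"
		"sort"
		"strings"
	)

	// nodeAction is a drastically simplified dump-stream node: an add or
	// delete of a single path in some revision.
	type nodeAction struct {
		revision int
		path     string
		action   string // "add" or "delete"
	}

	// visibleUnder replays the history up to and including srcRev and
	// returns the paths alive under dir at that point. This replay over
	// all earlier revisions is the expensive part.
	func visibleUnder(history []nodeAction, dir string, srcRev int) []string {
		alive := make(map[string]bool)
		for _, n := range history {
			if n.revision > srcRev {
				break // history is assumed sorted by revision
			}
			switch n.action {
			case "add":
				alive[n.path] = true
			case "delete":
				delete(alive, n.path)
			}
		}
		var out []string
		prefix := dir + "/"
		for p := range alive {
			if strings.HasPrefix(p, prefix) {
				out = append(out, p)
			}
		}
		sort.Strings(out)
		return out
	}

	// expandDirCopy turns one implicit directory copy into the explicit
	// (source, destination) file copies it stands for.
	func expandDirCopy(history []nodeAction, srcDir string, srcRev int, dstDir string) [][2]string {
		var copies [][2]string
		for _, p := range visibleUnder(history, srcDir, srcRev) {
			dst := dstDir + strings.TrimPrefix(p, srcDir)
			copies = append(copies, [2]string{p, dst})
		}
		return copies
	}

	func main() {
		history := []nodeAction{
			{1, "trunk/README", "add"},
			{2, "trunk/src/main.c", "add"},
			{3, "trunk/src/main.c", "delete"},
			{4, "trunk/src/util.c", "add"},
		}
		// "Copy trunk@4 to branches/stable" expands to explicit file copies.
		for _, c := range expandDirCopy(history, "trunk", 4, "branches/stable") {
			fmt.Printf("copy %s -> %s\n", c[0], c[1])
		}
	}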
The part of reposurgeon required to cope with directory copies is never going to be less than daunting to maintain.

Tempting though it would be to rant that this feature was a mistake, I'm not quite prepared to do that. Making all file copies explicit in the dump stream would of course have made the streams larger. It's easy to say under 2020's cost gradients that this wouldn't be a problem, but disk storage and fast I/O were quite significantly more expensive 20-odd years ago, possibly so much more so that the choice was then justified. It did end up costing a lot, though, in unexpected ways.

-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

You [should] not examine legislation in the light of the benefits it will convey if properly administered, but in the light of the wrongs it would do and the harm it would cause if improperly administered.
    -- Lyndon Johnson, former President of the U.S.