In early January of this year I finally got reposurgeon's interpretation of Subversion dump streams to a quality level at which I consider that code finished. This was, of course, essential for the conversion of the GCC repository.
I've been able to let that code cool off for three months now. No new bug reports; the only known issue is an O(n**2) slowdown in one stage that we're still tracing, because most repositories never trigger it.

Only now am I in a position to state unequivocally that getting Subversion dump interpretation right was (a) possible for a third party, and (b) extremely difficult. It took me from 2012 to 2020 to get this done. That's eight years, and it required a significant fraction of my programming time during those eight years. This problem was *hard*.

I'm reporting this here because of a fact I suspect Subversion developers might find interesting: I had to spend most of my Subversion-related development time dealing with the consequences of one single point feature of dump streams. Almost everything else about them was relatively trivial.

Before I point at the one big blocking feature, however, I want to mention its runner-up, the main reason I said "almost". Tracking executability in a history was a little tricky because resolving the set of properties active at a node is a little tricky. You have to keep pathwise journals, which is easier to describe than to do; there's a toy sketch of that bookkeeping below.

This pales, however, in comparison to the rather nightmarish complexity required to interpret directory-copy operations correctly and completely. *That's* what took me eight years to get right.

In the process, I discovered a new way of classifying data structures that is going to be a major theme in my next book, "The Programmer's Way: A Guide to Right Mindset". It classifies them according to a property I have named "semantic locality". I don't need to give a full explanation of the general concept to convey how it applies here.

A directory copy is the only operation in a stream dump that requires lookback arbitrarily far back in the history. (It might look like interpreting an ordinary file change does too, but remember that my goal is to map changesets to a different representation of changesets. My interpreter neither needs nor wants to *compose* changesets.)

In my new terminology, I say that dump streams would have good semantic locality - the context needed to interpret any individual part is always nearby and/or cheap to compute - except that this one operation gives them extremely poor semantic locality. The reason is that the individual file copies making up a directory copy are not specified explicitly as source and destination (revision, path) pairs, though of course an individual file-copy operation exists as well. Instead, the set of content copies to be done is implicit and has to be computed by reconstructing what paths were visible at the source revision and wildcarding over them; the second sketch below shows what that entails.

Getting that exactly right turns out to be a real rolling bitch - a rabbit hole of subtle bugs and nigh-endless edge-case gotchas. Even when you have it right, it's expensive in all directions - it blows up your code complexity, your working-set size, and your time to completion.

I eventually solved the time-to-completion problem pretty well by throwing massive concurrency at building the required intermediate data structures. But before I could do that I had to spend a solid year of effort changing implementation languages from Python to Go. The fact that I was backed into doing that tells you what a showstopper that performance problem was; it blocked the GCC conversion for about 14 months.

Moving to Go also whacked the working-set problem pretty hard; the change from runtime-polymorphic to statically-typed objects pared away a lot of memory overhead. The code-complexity cost, however, is irreducible.
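To make the property bookkeeping concrete, here is a minimal sketch in Go of the pathwise-journal idea. It is not reposurgeon's actual code; the propJournal type and its methods are hypothetical, and a real interpreter also has to merge property deltas and follow copies. It only shows the shape of the problem: to know whether svn:executable is in force at a node, you look up the most recent property set recorded for that path at or before the node's revision.

	// Minimal sketch of per-path property journaling (not reposurgeon's
	// actual data structures). Each path accumulates a journal of
	// (revision, properties) entries; resolving the properties active at
	// a node means finding the latest entry at or before that revision.
	package main

	import (
		"fmt"
		"sort"
	)

	type propEntry struct {
		revision int
		props    map[string]string // e.g. "svn:executable" -> "*"
	}

	type propJournal struct {
		byPath map[string][]propEntry // entries kept in revision order
	}

	func newPropJournal() *propJournal {
		return &propJournal{byPath: make(map[string][]propEntry)}
	}

	// record notes the full property set attached to path at revision.
	func (j *propJournal) record(path string, revision int, props map[string]string) {
		j.byPath[path] = append(j.byPath[path], propEntry{revision, props})
	}

	// activeAt returns the property set in force for path at revision,
	// i.e. the most recent recorded set at or before that revision.
	func (j *propJournal) activeAt(path string, revision int) map[string]string {
		entries := j.byPath[path]
		// Find the first journal entry *after* the revision of interest.
		i := sort.Search(len(entries), func(k int) bool {
			return entries[k].revision > revision
		})
		if i == 0 {
			return nil // nothing recorded for this path yet
		}
		return entries[i-1].props
	}

	func (j *propJournal) executableAt(path string, revision int) bool {
		_, ok := j.activeAt(path, revision)["svn:executable"]
		return ok
	}

	func main() {
		j := newPropJournal()
		j.record("trunk/build.sh", 3, map[string]string{"svn:executable": "*"})
		j.record("trunk/build.sh", 9, map[string]string{}) // property removed
		fmt.Println(j.executableAt("trunk/build.sh", 5)) // true
		fmt.Println(j.executableAt("trunk/build.sh", 9)) // false
	}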
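And here is a second toy sketch, again in Go and again not reposurgeon's implementation (the nodeAction type and the linear replay are illustrative simplifications), of what expanding a directory copy entails. Notice that visibleUnder has to walk everything up to the source revision: that whole-history lookback is exactly the poor semantic locality described above.

	// Illustrative sketch (not reposurgeon's implementation) of why a
	// directory copy forces arbitrary lookback: expanding it into
	// explicit per-file copies requires reconstructing every path that
	// was visible under the source directory at the source revision.
	package main

	import (
		"fmt"
		"sort"
		"strings"
	)

	// nodeAction is a drastically simplified dump-stream node: an add or
	// delete of a single path in some revision.
	type nodeAction struct {
		revision int
		path     string
		action   string // "add" or "delete"
	}

	// visibleUnder replays the history up to and including srcRev and
	// returns the paths alive under dir at that point. This replay over
	// all earlier revisions is the expensive part.
	func visibleUnder(history []nodeAction, dir string, srcRev int) []string {
		alive := make(map[string]bool)
		for _, n := range history {
			if n.revision > srcRev {
				break // history is assumed sorted by revision
			}
			switch n.action {
			case "add":
				alive[n.path] = true
			case "delete":
				delete(alive, n.path)
			}
		}
		var out []string
		prefix := dir + "/"
		for p := range alive {
			if strings.HasPrefix(p, prefix) {
				out = append(out, p)
			}
		}
		sort.Strings(out)
		return out
	}

	// expandDirCopy turns one implicit directory copy into the explicit
	// (source, destination) file copies it stands for.
	func expandDirCopy(history []nodeAction, srcDir string, srcRev int, dstDir string) [][2]string {
		var copies [][2]string
		for _, p := range visibleUnder(history, srcDir, srcRev) {
			dst := dstDir + strings.TrimPrefix(p, srcDir)
			copies = append(copies, [2]string{p, dst})
		}
		return copies
	}

	func main() {
		history := []nodeAction{
			{1, "trunk/README", "add"},
			{2, "trunk/src/main.c", "add"},
			{3, "trunk/src/main.c", "delete"},
			{4, "trunk/src/util.c", "add"},
		}
		// "Copy trunk@4 to branches/stable" expands to explicit file copies.
		for _, c := range expandDirCopy(history, "trunk", 4, "branches/stable") {
			fmt.Printf("copy %s -> %s\n", c[0], c[1])
		}
	}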
The part of reposurgeon required to cope with directory copies is never going to be less than daunting to maintain.

Tempting though it would be to rant that this feature was a mistake, I'm not quite prepared to do that. Making all file copies explicit in the dump stream would of course have made the streams larger. It's easy to say under 2020's cost gradients that this wouldn't be a problem, but disk storage and fast I/O were quite significantly more expensive 20-odd years ago, possibly so much more so that the choice was then justified. It did end up costing a lot, though, in unexpected ways.

-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

You [should] not examine legislation in the light of the benefits it will convey if properly administered, but in the light of the wrongs it would do and the harm it would cause if improperly administered.
    -- Lyndon Johnson, former President of the U.S.