That light at the end of the tunnel turned out to be an oncoming train. Until recently I thought the conversion was nearly finished. I had verified clean conversions across trunk and all branches, except for one screwed-up branch that management agreed we could discard.
I had some minor issues left with execute-permission propagation and how to interpret mid-branch deletes; I solved the former and was working on the latter. I expected to converge on a final result well before the end of the year, probably in August or September.

Then, as I reported here, my most recent test conversion produced incorrect content on trunk. That's very bad, because the sheer size of the GCC repository makes bug forensics extremely slow. Just loading the SVN dump file for examination in reposurgeon takes 4.5 hours; full conversions are back up to 9 hours now. The repository is growing about as fast as my ability to find speed optimizations.

Then it got worse. I backed up to a commit that I remembered as producing a clean conversion, and it didn't. This can only mean that the reposurgeon changes I've been making to handle weird branch-copy cases have been fighting each other.

For those of you late to the party: interpreting the operation sequences in Subversion dump files is simple, and produces results that are easy to verify - except near branch-copy operations. The way those interact with each other and with other operations is extremely murky. There is *a* correct semantics, defined by what the Subversion code does. But if any of the Subversion devs ever fully understood it, they no longer do. The dump format was never documented by them. It is only partly documented now because I reverse-engineered it, and the document I wrote still has questions in it that the Subversion devs can't answer. (There's an illustrative dump fragment below.)

It's not unusual for me to trip over a novel branch-copy-related weirdness while converting a repo. Normally the way I handle this is by performing a bisection procedure to pin down the bad commit. Then I:

(1) Truncate the dump to the shortest leading segment that reproduces the problem.

(2) Perform a strip operation that replaces all content blobs with unique small cookies identifying their source commits. Verify that it still reproduces.

(3) Perform a topological reduce that drops out all uninteresting commits - that is, pure content changes not adjacent to any branch copies or property changes. Verify that it still reproduces.

(4) Manually remove irrelevant branches with reposurgeon. Verify that it still reproduces.

(A sketch of this reduction pipeline appears below.)

At this point I normally have a fairly small test repository - never, previously, more than 200 or so commits - that reproduces the issue. I watch conversions at increasing debug levels until I figure out what is going on. Then I fix it, and the reduced dump becomes a new regression test. In this way I make monotonic progress toward a dumpfile analyzer that ever more closely emulates what the Subversion code is doing. It's nothing like easy, and it gets harder as the edge cases I'm probing become more recondite. But until now it worked.

The size of the GCC repository defeats this strategy. By back-of-the-envelope calculation, a single full bisection would take a minimum of 18 days; realistically, it would probably be closer to a month.

That means that, under present assumptions, it's game over and we've lost. The GCC repo is just too large and weird. My tools need to get a lot faster - more than an order of magnitude faster - before digging out of the bad situation the conversion is now in will be practical. Hardware improvements won't do that. Nobody knows how to build a machine that can crank a single process enough faster than 1.3GHz. And the problem doesn't parallelize.
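For concreteness, here is what a branch creation typically looks like in a Subversion dump stream. This fragment is illustrative - the path and revision number are made up, not taken from the GCC repo - but the field names are the standard dump-format node headers:

    Node-path: branches/my-feature
    Node-kind: dir
    Node-action: add
    Node-copyfrom-rev: 41523
    Node-copyfrom-path: trunk

A single directory-copy node like this implicitly populates the whole branch with trunk's contents as of r41523. The trouble starts when other operations, in the same revision or later ones - deletes, replaces, copies of individual files from other revisions - land on paths underneath it; what the combination means is exactly the murky part.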
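And here, roughly, is what the reduction pipeline in steps (1) through (3) looks like in practice. Treat this as a sketch from memory rather than a tested recipe: select, strip, and reduce are real repocutter subcommands, but the exact invocation syntax is illustrative and the revision cutoff is made up; check repocutter(1) before copying any of it:

    # (1) Truncate to the shortest leading segment that reproduces.
    repocutter -r 1-4500 select <gcc.svn >truncated.svn

    # (2) Replace all content blobs with unique small cookies.
    repocutter strip <truncated.svn >stripped.svn

    # (3) Topologically reduce: drop pure content changes not
    # adjacent to branch copies or property changes.
    repocutter reduce <stripped.svn >reduced.svn

Step (4), pruning irrelevant branches, is done interactively in reposurgeon, and after every stage the conversion gets re-run to verify that the bad behavior still reproduces. It is exactly this verify-at-every-stage loop that a 9-hour conversion time makes impractical.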
There is a software change that might do it. I have been thinking about translating reposurgeon from Python to Go. Preliminary experiments with a Go version of repocutter show that it has a 40x speed advantage over the Python version. I don't think I'll get quite that much speedup on reposurgeon, but I'm pretty optimistic about getting enough to make debugging the GCC conversion tractable. At the full 40x, 9-hour test runs would collapse to about 13 minutes; even at half that, they would still drop to about 27.

The problem with this plan is that a full move to Go will be very difficult. *Very* difficult. As in, work time in an unknown and possibly large number of months. GCC management will have to make a decision about how patient it is willing to be. I am at this point not sure it wouldn't be better to convert your existing tree state and go from there, keeping the Subversion history around for archival purposes.

-- Eric S. Raymond <http://www.catb.org/~esr/>