Julian Foad <julianf...@apache.org>:
> Thanks for your detailed reply, Eric. I can accept your argument of
> value of v2 dump format for its simplicity for these purposes.

*heaves vast sigh of relief* Thank you.

A few users with trouble because svnrdump doesn't dump version 2 would have been annoying but comparatively minor. Version 2 potentially getting dropped in the future because Subversion's devs don't grasp its use cases was a much bigger deal.

> I am very well aware you wrote much of the spec doc and worked a lot
> with it, especially figuring out the semantics; and that's why,
> seeing its "v3" section, I was puzzled when you wrote simply "it's
> not documented". I even noticed the language in that section seemed
> to match your style but didn't have time to go digging with "svn
> blame". Thanks for explaining what's missing.

Currently the format doc has the following issues:

* Version 3 diff format and compression (if any) are not described.

* On a Node record which has both a copyfrom source and a property section, it is possible that the copy source node itself has a property section. How these are to be combined is unspecified.

* As of December 2011 there was a minor bug: Adding a file with history twice _in two different revisions_ succeeds silently. I don't know if this was ever fixed or is perhaps intended behavior.

> I haven't tried writing an importer.

Heads up that if you ever do you'll find it's all pretty easy except for one part that's so rebarbative it took me most of a decade to get the interpretation right in all cases.

The horrible part is the implicit wildcarding in directory copies. A directory copy is a single node in the stream; the copies of the files under it are never listed explicitly. That means your importer has to keep manifests for every revision in the history in RAM, in case they become visible as copy source revisions in a later target revision.

Not only is this a tricky data-management problem, it's massively memory-intensive. When I did the GCC conversion it required a specially superpowered EC2 instance with 512GB of RAM *just to get the history loaded*.

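To make the wildcarding concrete, here is a minimal Python sketch. It is not reposurgeon's code and it does no real dump parsing; the manifest layout (a dict from path to blob id per revision) and the function name copy_directory are assumptions made up for illustration.

```python
# Hypothetical sketch: not reposurgeon's code, and no real dump parsing.
# A manifest maps each path visible at a revision to a blob id.

def copy_directory(manifests, target, src_rev, src_dir, dst_dir):
    """Expand one directory-copy node into the per-file copies it implies."""
    src_prefix = src_dir.rstrip("/") + "/"
    dst_prefix = dst_dir.rstrip("/") + "/"
    for path, blob in manifests[src_rev].items():
        if path.startswith(src_prefix):
            target[dst_prefix + path[len(src_prefix):]] = blob
    return target

# Every past manifest must stay reachable, because a later revision
# may name any earlier revision as its copy source.
manifests = {
    1: {"trunk/README": "blob-a", "trunk/src/main.c": "blob-b"},
    2: {"trunk/README": "blob-a", "trunk/src/main.c": "blob-c"},
}

# Revision 3 copies trunk@1 to tags/1.0: a single node in the stream,
# with no mention of the files underneath it.
manifests[3] = copy_directory(manifests, dict(manifests[2]), 1, "trunk", "tags/1.0")
```

Note that the tag picks up the revision 1 content of main.c, not the revision 2 content. You can only get that right if the revision 1 manifest is still around when revision 3 arrives. The plain dict copy per revision here is exactly the naive storage that blows up on a large history.
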
The GCC load was with the blobs still on disk and a specially tuned copy-on-write store for the manifests! (If you try to journal the manifests to disk as well as leaving the blobs there you can end up with running times measured in months. Yes, I know of two real cases where this happened.) Eventually I had to move my code from Python to Go just to cut the size of the working set enough to keep it usable on huge repositories.

It also means that as simple a task as starting from a node and computing the location of its most recent ancestor, the node that last mutated that path, is surprisingly difficult. There was a point at which my stream interpreter was stuck in a flawed state for nearly *two years* because this is so much more gnarly than it looks. I didn't get all the edge cases nailed down until last January - there is an especially nasty one when a tag is deleted and later recreated with the same name.

Now you know why every importer before reposurgeon failed and they're all moldering abandoned or in a seriously flawed state. There are a couple of other moderately hard parts, but the project-killer is interpreting directory copies correctly.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>