Julian Foad <julianf...@apache.org>:
> Thanks for your detailed reply, Eric. I can accept your argument of
> value of v2 dump format for its simplicity for these purposes.

*heaves vast sigh of relief*

Thank you.  A few users with trouble because svnrdump doesn't dump
version 2 would have been annoying but comparatively minor.  Version 2
potentially getting dropped in the future because Subversion's devs
don't grasp its use cases was a much bigger deal.

> I am very well aware you wrote much of the spec doc and worked a lot
> with it, especially figuring out the semantics; and that's why,
> seeing its "v3" section, I was puzzled when you wrote simply "it's
> not documented". I even noticed the language in that section seemed
> to match your style but didn't have time to go digging with "svn
> blame". Thanks for explaining what's missing.

Currently the format doc has the following issues:

* Version 3 diff format and compression (if any) are not described.

* On a Node record which has both a copyfrom source and
  a property section, it is possible that the copy source node itself
  has a property section.  How these are to be combined is unspecified.

* As of December 2011 there was a minor bug: Adding a file with history
  twice _in two different revisions_ succeeds silently.  I don't know
  if this was ever fixed or is perhaps intended behavior.
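To make the second issue concrete, here is a hypothetical v2 Node record of the kind in question (paths and the property are made up for illustration). The node both copies from revision 12 and carries its own property section; if the source node at revision 12 also has properties, the spec doesn't say whether the two sets merge or the new section replaces the old one:

```
Node-path: branches/rel/widgets
Node-kind: dir
Node-action: add
Node-copyfrom-rev: 12
Node-copyfrom-path: trunk/widgets
Prop-content-length: 30
Content-length: 30

K 7
svn:foo
V 3
bar
PROPS-END
```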

> I haven't tried writing an importer.

Heads up that if you ever do, you'll find it's all pretty easy except
for one part that's so rebarbative it took me most of a decade to
get the interpretation right in all cases.

The horrible part is the implicit wildcarding in directory copies.
Because the stream never spells out each individual file copy, your
importer has to keep a manifest for every revision in the history
in RAM, in case some later revision names it as a copy source.
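A minimal sketch of why this is so, in Python. A directory-copy node names only the source directory and revision; the importer has to expand that into every file that existed under the directory at that revision, which is only possible if the manifest for that revision is still around. The names here (`expand_dir_copy`, the manifest shape) are illustrative, not reposurgeon's actual internals:

```python
def expand_dir_copy(manifests, src_rev, src_dir, dst_dir):
    """Expand one implicit directory copy into explicit file copies.

    manifests maps revision number -> {path: blob id} for that revision.
    """
    src_manifest = manifests[src_rev]
    prefix = src_dir.rstrip("/") + "/"
    copies = {}
    for path, blob in src_manifest.items():
        if path.startswith(prefix):
            dst_path = dst_dir.rstrip("/") + "/" + path[len(prefix):]
            copies[dst_path] = blob
    return copies

# Every revision's manifest must stay reachable, because any later
# revision may copy from it:
manifests = {
    1: {"trunk/a.c": "blob1", "trunk/sub/b.c": "blob2"},
    2: {"trunk/a.c": "blob3", "trunk/sub/b.c": "blob2"},
}
print(expand_dir_copy(manifests, 1, "trunk", "branches/rel"))
# {'branches/rel/a.c': 'blob1', 'branches/rel/sub/b.c': 'blob2'}
```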

Not only is this a tricky data-management problem, it's massively
memory-intensive. When I did the GCC conversion it required a
specially superpowered EC2 instance with 512 gigabytes of RAM *just to
get the history loaded*.  This was with the blobs still on disk
and a specially tuned copy-on-write store for the manifests!

(If you try to journal the manifests to disk as well as leaving the
blobs there you can end up with running times measured in months.
Yes, I know of two real cases where this happened.)
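The copy-on-write trick can be sketched like this, under the assumption that most revisions touch only a few paths: each revision records just its own changes plus a link to its parent, so unchanged entries are shared rather than duplicated per revision. The class and field names are my illustration, not reposurgeon's actual store:

```python
class CowManifest:
    """Copy-on-write manifest: stores only this revision's deltas."""

    def __init__(self, parent=None):
        self.parent = parent
        self.entries = {}     # path -> blob id, changed in this revision
        self.deleted = set()  # paths removed in this revision

    def child(self):
        """Start the next revision's manifest, sharing this one's data."""
        return CowManifest(parent=self)

    def get(self, path):
        """Look up a path, walking back through parent revisions."""
        m = self
        while m is not None:
            if path in m.deleted:
                return None
            if path in m.entries:
                return m.entries[path]
            m = m.parent
        return None

r1 = CowManifest()
r1.entries["trunk/a.c"] = "blob1"
r2 = r1.child()
r2.entries["trunk/b.c"] = "blob2"   # r2 shares trunk/a.c with r1
print(r2.get("trunk/a.c"), r2.get("trunk/b.c"))
# blob1 blob2
```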

Eventually I had to move my code from Python to Go just to cut the
size of the working set enough to keep it usable on huge repositories.

It also means that as simple a task as starting from a node and
computing the location of its most recent ancestor, the node that last
mutated that path, is surprisingly difficult.  There was a point at
which my stream interpreter was stuck in a flawed state for nearly *two
years* because this is so much more gnarly than it looks.  
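To show the shape of that ancestor computation (with invented data structures, not my real ones): you have to walk backwards through revisions, and whenever the path arrived via a directory copy, translate it to the source path and jump to the source revision before continuing:

```python
def last_mutator(changes, copies, path, rev):
    """Find (path, revision) of the node that last mutated `path`.

    changes: {rev: set of paths directly touched in that revision}
    copies:  {rev: [(dst_dir, src_dir, src_rev), ...]} directory copies
    """
    while rev > 0:
        if path in changes.get(rev, ()):
            return (path, rev)
        # If the path came in via a directory copy at this revision,
        # follow it back to the copy source.
        for dst, src, src_rev in copies.get(rev, ()):
            if path == dst or path.startswith(dst + "/"):
                path = src + path[len(dst):]
                rev = src_rev
                break
        else:
            rev -= 1
    return None

changes = {1: {"trunk/a.c"}, 3: {"branches/rel"}}
copies = {3: [("branches/rel", "trunk", 2)]}
print(last_mutator(changes, copies, "branches/rel/a.c", 4))
# ('trunk/a.c', 1)
```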

I didn't get all the edge cases nailed down until last January - there
is an especially nasty one when a tag is deleted and later recreated
with the same name.

Now you know why every importer before reposurgeon failed and they're
all moldering abandoned or in a seriously flawed state.  There are a
couple of other moderately hard parts, but the project-killer is
interpreting directory copies correctly.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

