Julian Foad <julianf...@apache.org>:
> Eric S. Raymond wrote:
> > Reposurgeon can't handle the Version 3 format with deltas, and there
> > is no realistic possibility that this will change because the format
> > is not documented anywhere.
> 
> Isn't format 3 documented in the section called "Version 3 format" in 
> dump-load-format.txt?
> http://svn.apache.org/viewvc/subversion/trunk/notes/dump-load-format.txt?revision=1884689&view=markup#l503

It appears to have slipped your mind that the person who wrote most of
that documentation was *me*. And I do still update it occasionally,
most recently about a month ago.  If it held the information I needed
I'd already know it and not be bothering this list.

Since a dev as senior as you has forgotten this, I have to assume the
rest of the list has also forgotten or never knew how
dump-load-format.txt came to exist.  So, a reminder: have patience as
I describe it for the list because this bears directly on why "just
use Version 3" is insufficient and the (possibly unintended)
implication that Version 2 might be retired someday is deeply
disturbing.

I had a very specific motivation for documenting the dump format -
reposurgeon consumed and generated Subversion dumps, and I know it's a
fragile and dangerous thing when the assumptions of that kind of
reader code are only documented in the code itself.  It is much better
practice to write a ground-truth document about the format (or the
parts one uses, anyway) and then have that be the authority for the
code.  For another example of the practice, see

https://gpsd.gitlab.io/gpsd/AIVDM.html

You have dump-load-format.txt because, having written it, I thought it
was silly for it not to be flying with the Subversion distribution, so
I combined it with some historical notes on the old version 1 format,
and voila.


But. The Version 2 documentation I wrote for Subversion is incomplete,
because there were details I could neither find in pre-existing
documentation nor easily discover at the time I wrote the bulk of it
in 2012. And once I made reposurgeon able to read and emit version 2
dumps, digging deep enough to find out what version 3 was doing never
made it far enough up my priority list that I actually did it.

Notably: dump-load-format.txt does not describe the delta format.  I
have since seen hints in the SVN Book that version 3 uses some kind of
binary delta compression.  But the SVN Book does not describe either
of these details; it's not even clear enough for me to be sure I'm
not hallucinating the "binary" part.

> Format 3 makes such a huge difference to data transfer size in
> typical cases, as far as I recall, that it is hard to justify using
> format 2 for anything.

Oh, *hell* no it isn't.

Have you ever written an importer-from-Subversion?  Other than
svnadmin load, I mean; all you have to verify about that is that it
round-trips streams, which tends to avoid the problems I'm about to
describe.

If you had ever tried writing other stream analysis tools (I've done
this twice), you would know that they're a very different use case
from transport or archiving and have different tradeoffs. The
bulkiness of the Version 2 stream files is a good trade for its easy
parseability and eyeball-friendliness.
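
To make the parseability point concrete, here is a minimal sketch in
Python of a reader for a Version 2 stream.  It is an illustration only,
not reposurgeon's actual reader.  It assumes what dump-load-format.txt
describes: each record is a block of "Key: value" header lines, a blank
line, and then a body whose size in bytes is given by Content-length.
The names read_records and SAMPLE are hypothetical, and the sample
stream is hand-made for the example.

```python
import io


def read_records(stream):
    """Yield (headers, body) pairs from a Version 2 dump stream.

    `stream` is a binary file object.  Headers come back as a dict of
    str; the body is the raw bytes counted by Content-length (empty if
    the record has no such header).
    """
    while True:
        line = stream.readline()
        # Skip the blank lines that separate records.
        while line == b"\n":
            line = stream.readline()
        if not line:
            return  # end of stream
        headers = {}
        # Header block: "Key: value" lines ended by a blank line.
        while line not in (b"\n", b""):
            key, _, value = line.rstrip(b"\n").partition(b": ")
            headers[key.decode()] = value.decode()
            line = stream.readline()
        length = int(headers.get("Content-length", "0"))
        yield headers, stream.read(length)


# Hand-made sample: format header, UUID, two revisions, one added file.
SAMPLE = (
    b"SVN-fs-dump-format-version: 2\n\n"
    b"UUID: 00000000-0000-0000-0000-000000000000\n\n"
    b"Revision-number: 0\n"
    b"Prop-content-length: 10\n"
    b"Content-length: 10\n\n"
    b"PROPS-END\n\n"
    b"Revision-number: 1\n"
    b"Prop-content-length: 10\n"
    b"Content-length: 10\n\n"
    b"PROPS-END\n\n"
    b"Node-path: hello.txt\n"
    b"Node-kind: file\n"
    b"Node-action: add\n"
    b"Prop-content-length: 10\n"
    b"Text-content-length: 6\n"
    b"Content-length: 16\n\n"
    b"PROPS-END\n"
    b"hello\n\n\n"
)

if __name__ == "__main__":
    for headers, body in read_records(io.BytesIO(SAMPLE)):
        print(headers, len(body))
```

In a node record the file text is the last Text-content-length bytes of
the body, stored verbatim, so both a script like this and a text editor
can get at it without knowing anything about a delta encoding.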

I have a whole bunch of Subversion dumps in my regression-test suite
for reposurgeon/repocutter, some collected in the wild and some
hand-crafted.  It would be *bad* if I couldn't sic a text editor on
any of those to read or modify it. Very bad.  Certainly a huge pain in
the ass for me, plausibly a crash landing of I-can-no-longer-support-this
severity.  That worst case would leave a lot of users stuck; even if you
don't care about inconveniencing me, please don't risk it for them.

Plain text blobs for every revision may be fat but it's a super-stable
and discoverable place in the design space, what economists call a
Schelling point; in practice, great future-proofing.  Deltas and
compression are *not* future-proofing.  Once you start playing that
game, the temptation to iterate on it by improving the
delta/compression pieces can't be resisted, and indeed shouldn't be -
as long as there's still Version 2 dump support for people who want to
evade problems like ... uh, what kind of compression are they using
and how do I unpack it?  How do I interpret this diff format?


So please do not *ever* think of Version 2 as in any way obsolete or
dispensable.  It's got at least one important use case - reposurgeon,
the only tool in the world that can do really lossless conversions to/from
other VCSes, needs it. You'd probably be able to hear the screaming from
the direction of my house if you actually dropped it.

But more generally, it's more future-proof and discoverable than any
space optimization of it you could invent.  That in itself is good
enough reason to fully support it, including in svnrdump.  It's a
promise to your users: "This is *understandable*.  No matter what
wacky things we get up to to optimize the transport/archiving case,
you're not screwed."

> > Should I file an issue about this?
> 
> You can certainly file an issue if there isn't one.

I will do so.

> If I were to have a say I would recommend anyone should rather work
> on adding v3 to reposurgeon and addressing any documentation that
> may be lacking.

Adding v3 to the documentation should certainly be done.  If I could
write it I would have already, but I'm more than willing to ask picky
questions of anyone who writes a draft.

As for adding v3 to reposurgeon, it could be *a* solution but it's not
the *right* solution.  It's not the path that delivers the best
guarantee to all your users of "you won't be messed over in ten years
by unintended side effects of optimizations that seemed like good
ideas at the time".
-- 
                <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

