On 13. 12. 24 13:24, Nikola Dipanov wrote:
On Sat, Dec 7, 2024 at 11:14 PM Johan Corveleyn <jcor...@gmail.com> wrote:

    On Fri, Dec 6, 2024 at 12:27 PM Nikola Dipanov
    <ndipa...@hudson-trading.com> wrote:
    >
    > Hi all,
    >
    > A quick summary first: we’ve found that for certain use cases -
    increasing SVN_DELTA_WINDOW_SIZE  yields very significant storage
    savings on the size of the repo (~10x). We have a POC patch to
    make it configurable (currently only via fsfs.conf and using
    libsvn_ra_svn and libsvn_ra_local) and would be interested in
    working with the community to see if the changes could be
    improved, with the ultimate goal to have them accepted into the
    mainline.
    >
    > I have found some previous discussions on this from a good while
    back: https://svn.haxx.se/users/archive-2008-02/0547.shtml. but
    not much else. If there’s any additional information on this
    problem that I may have missed - please feel free to educate me!
    >
    > Otherwise if there are no glaring issues that anyone can see in
    supporting this - I’d be happy to look into cleaning up my POC
    patch and post it for detailed review.
    >
    > Obviously, there are trade-offs in doing this: A repo would most
    likely need to use a specified window size throughout its
    lifetime, and would not be beneficial for every use-case. This is
    why we propose to keep it as a config option. Some more details of
    our use-case and some rough numbers that illustrate the benefits
    follow.
    >
    > Our use case is that we commonly have very large files that see
    small changes over time, and a large xdelta window size would
    benefit such use-cases greatly. For us - the growth pace of our
    repositories is quite staggering due to this amplification, and
    we’d be more than happy to trade memory usage (especially, but
    also processing time to an extent) to be able to keep this in
    check. We’ve not done extensive measurements on how this impacts
    runtime yet.
    >
    > To demonstrate the effect - we generate a random and fairly
    large (~1.3 G, ~30 million lines) XML file, commit it, then make
    random changes to it (from 300 lines to ~1% of lines), committed
    those, and then looked at the file size generated by the commit in
    repo/db/revs/.  We then repeat this, but with configuring a 100x
    larger window size (10240000). The most dramatic results are for
    small changes (this is somewhat intuitive) where the size of the
    revision with changes is ~40x smaller (10k vs 430k for a 1.3Gig
    file) when using a larger window size. For different patterns of
    changes the difference is not that dramatic but still large
    (5-10x). I am happy to share more details if people are interested.
    >
    > We believe we’d see a lot of benefit from this option, as would
    others in the community, and are very much committed to Subversion
    in the long run, so would love to hear what people think about
    something like this.

    That sounds quite interesting. However, I'm a bit worried about
    compatibility.

    The thread you metioned has one reply [1] which points to another
    thread about a possible backward compatibility issue [2].

    In that thread, Daniel Berlin wrote:
    > I also think we should up our default window size, but due to some
    > silliness in how this is currently implemented, changing the
    window size
    > is backwards incompatible!
    >
    > (This is because the code currently relies on knowing whether
    there are
    > more windows waiting to consume by comparing the window length
    against
    > the default window size, instead of by seeing if there is an EOF
    :( )

    This is confirmed by Branko Čibej in another reply [3].

    It's not clear to me whether those limitations still hold, but
    assuming they do there are some possible issues:

Hi Johan - thanks for a quick response first of all!

I believe the limitation mentioned in the email still holds. I believe this is referring to [1]. However in our testing iirc, clients built with a hardcoded, larger window size had no issues cloning a repo with a smaller (aka default) window size. I will confirm what is actually going on here.

    - fsfs.conf settings can be changed at any time during the
    lifetime of the repository. So the first 1000 revisions may have a
    different window size than later ones. The code reading those
    revisions cannot handle that (if I interpret the above threads
    correctly).

At least in our current POC - the assumption is that this would not change throughout the lifetime of the repository. Is this a problem in practice? If an older client attempted to do operations on a repo with a version higher AND a non-default window size it would fail to do so.

    - A vanilla SVN server (or a client accessing the repository with
    file://) should at least be able to read the repository, even it
    was constructed with a different window size. Here too I fear this
    won't work (even if the window size would be a "immutable setting
    at creation time").


Yes this would be potentially a problem but only for the repositories that do have a modified, non-default window size value. However let me experiment a bit more with this and confirm soon.

    It all hinges on whether or not the current FSFS code can read
    data compressed with different window sizes (different from its
    own setting). Maybe things have changed since 2008 (there were
    several new versions of the fsfs format since then), maybe they
    haven't.


From fsfs POV - the POC we have currently proposes to store the window size (if modified, or defaulted to DELTA_WINDOW_SIZE) on svn_fs_t struct as a new member (and the code is changed to use that and thread it where necessary) so *I think* this would work but the caveat that it has to be constant for the lifetime of the repository still holds.

    Perhaps you can perform some experiments to test for the above
    issues, as a form of further exploration.


I will revert back in a few days after digging some more into compatibility questions, especially with code built without the proposed changes! Thanks again for looking into this!


We don't encode the window size in the delta data, and the code relies on knowing the window size as a pre-defined constant. Changing the window size would make existing repositories unreadable, it doesn't matter if this is a compile-time or run-time constant. You'd have to encode the window size in the delta data, changing the format and making it backward-incompatible. That's OK for a major release, as long as the old format can still be read.

But decoding deltas is the least of your problems. The delta combiner also relies on knowing the window size, and assumes that all deltas it combines have the same window size. Removing that limitation is a lot of work. One way to simplify it might be to ensure that every file's history uses a single window size, on all its branches. I don't know how complex that would be, I've not looked at the code in ages and there are probably many, many assumptions in the implementation.

And then, of course, there's the client-server protocol, which also uses deltas. The clients also assume a fixed, predefined window size. So you're looking at either extending the protocol or making sure that no matter what's on disk, data is sent to the client with the current window size. That's easier than it looks, since the server never (IIRC) sends on-disk data to the client but always re-encodes it.

In short -- it's not a 5-minute hack...


-- Brane

Reply via email to