On 13. 12. 24 13:24, Nikola Dipanov wrote:
On Sat, Dec 7, 2024 at 11:14 PM Johan Corveleyn <jcor...@gmail.com> wrote:
On Fri, Dec 6, 2024 at 12:27 PM Nikola Dipanov
<ndipa...@hudson-trading.com> wrote:
>
> Hi all,
>
> A quick summary first: we’ve found that for certain use cases -
increasing SVN_DELTA_WINDOW_SIZE yields very significant storage
savings on the size of the repo (~10x). We have a POC patch to
make it configurable (currently only via fsfs.conf and using
libsvn_ra_svn and libsvn_ra_local) and would be interested in
working with the community to see if the changes could be
improved, with the ultimate goal to have them accepted into the
mainline.
>
> I have found some previous discussions on this from a good while
back: https://svn.haxx.se/users/archive-2008-02/0547.shtml but
not much else. If there’s any additional information on this
problem that I may have missed - please feel free to educate me!
>
> Otherwise if there are no glaring issues that anyone can see in
supporting this - I’d be happy to look into cleaning up my POC
patch and post it for detailed review.
>
> Obviously, there are trade-offs in doing this: A repo would most
likely need to use a specified window size throughout its
lifetime, and would not be beneficial for every use-case. This is
why we propose to keep it as a config option. Some more details of
our use-case and some rough numbers that illustrate the benefits
follow.
>
> Our use case is that we commonly have very large files that see
small changes over time, and a large xdelta window size would
benefit such use-cases greatly. For us - the growth pace of our
repositories is quite staggering due to this amplification, and
we’d be more than happy to trade memory usage (especially, but
also processing time to an extent) to be able to keep this in
check. We’ve not done extensive measurements on how this impacts
runtime yet.
>
> To demonstrate the effect - we generated a random and fairly
large (~1.3 G, ~30 million lines) XML file, committed it, then made
random changes to it (from 300 lines to ~1% of lines), committed
those, and then looked at the size of the revision file generated by
the commit in repo/db/revs/. We then repeated this with a 100x
larger window size (10240000). The most dramatic results are for
small changes (this is somewhat intuitive) where the size of the
revision with changes is ~40x smaller (10k vs 430k for a 1.3Gig
file) when using a larger window size. For different patterns of
changes the difference is not that dramatic but still large
(5-10x). I am happy to share more details if people are interested.
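A back-of-the-envelope check of the numbers above, as a minimal sketch. It assumes "~1.3 G" means 1.3e9 bytes and "430k" means 430,000 bytes, and it takes the default window size to be 102400 (implied by "100x larger ... (10240000)"). The function name window_count is made up for this illustration. It only counts how many delta windows a file of that size is split into; whether per-window overhead is what actually dominates the revision size is a guess, not something measured here.

```python
# Rough window-count arithmetic for the experiment described above.
# All constants are assumptions read off the mail, not measured values.
DEFAULT_WINDOW = 102_400       # implied default (10240000 / 100)
LARGE_WINDOW = 10_240_000      # the "100x larger" window from the mail
FILE_SIZE = 1_300_000_000      # assumption: "~1.3 G" taken as 1.3e9 bytes


def window_count(file_size: int, window_size: int) -> int:
    """Number of fixed-size windows needed to cover file_size bytes."""
    return -(-file_size // window_size)  # ceiling division on integers


if __name__ == "__main__":
    small = window_count(FILE_SIZE, DEFAULT_WINDOW)
    large = window_count(FILE_SIZE, LARGE_WINDOW)
    print("windows at default size:", small)
    print("windows at 100x size:   ", large)
    # Assumption: "430k" means 430,000 bytes for the small-change revision.
    print("bytes per window at default size:", 430_000 / small)
```

Under these assumptions the default window size gives roughly 12,700 windows and the 100x size gives roughly 130, so the 430k revision works out to about 34 bytes per window. That would be consistent with a small fixed cost per window being most of the revision for tiny changes, but it is only a plausibility check.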
>
> We believe we’d see a lot of benefit from this option, as would
others in the community, and are very much committed to Subversion
in the long run, so would love to hear what people think about
something like this.
That sounds quite interesting. However, I'm a bit worried about
compatibility.
The thread you mentioned has one reply [1] which points to another
thread about a possible backward compatibility issue [2].
In that thread, Daniel Berlin wrote:
> I also think we should up our default window size, but due to some
> silliness in how this is currently implemented, changing the window size
> is backwards incompatible!
>
> (This is because the code currently relies on knowing whether there are
> more windows waiting to consume by comparing the window length against
> the default window size, instead of by seeing if there is an EOF :( )
This is confirmed by Branko Čibej in another reply [3].
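To make the quoted limitation concrete, here is a toy model of the two ways a reader can decide that a stream of windows has ended. It is not the Subversion code: windows are plain byte chunks here, and the function names (split_windows, read_until_short_window, read_until_eof) are invented for this sketch. It only shows why a reader that compares each window's length against its own compiled-in size misreads data written with a different size.

```python
# Toy model only: windows are raw byte chunks, not real delta windows.

def split_windows(data: bytes, window_size: int) -> list[bytes]:
    """Writer side: cut data into consecutive windows of window_size bytes."""
    return [data[i:i + window_size] for i in range(0, len(data), window_size)]


def read_until_short_window(windows: list[bytes], expected_size: int) -> bytes:
    """Reader that treats the first window shorter than expected_size as the
    last one (the strategy described in the quote)."""
    out = []
    for window in windows:
        out.append(window)
        if len(window) < expected_size:
            break  # "short window means no more windows"
    return b"".join(out)


def read_until_eof(windows: list[bytes]) -> bytes:
    """Reader that keeps going until the stream itself is exhausted."""
    return b"".join(windows)


if __name__ == "__main__":
    data = b"abcdefghij"
    written = split_windows(data, 4)            # writer used window size 4
    print(read_until_short_window(written, 4))  # reader agrees on the size
    print(read_until_short_window(written, 8))  # reader expects larger windows
    print(read_until_eof(written))              # size-independent reader
```

In this model, the reader that expects 8-byte windows stops after the first 4-byte window and returns truncated data, while the EOF-based reader does not depend on the writer's window size at all.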
It's not clear to me whether those limitations still hold, but
assuming they do there are some possible issues:
Hi Johan - thanks for a quick response first of all!
I believe the limitation mentioned in the email still holds. I believe
this is referring to [1]. However in our testing iirc, clients built
with a hardcoded, larger window size had no issues cloning a repo with
a smaller (aka default) window size. I will confirm what is actually
going on here.
- fsfs.conf settings can be changed at any time during the
lifetime of the repository. So the first 1000 revisions may have a
different window size than later ones. The code reading those
revisions cannot handle that (if I interpret the above threads
correctly).
At least in our current POC - the assumption is that this would not
change throughout the lifetime of the repository. Is this a problem in
practice? If an older client attempted to do operations on a repo with
a version higher AND a non-default window size it would fail to do so.
- A vanilla SVN server (or a client accessing the repository with
file://) should at least be able to read the repository, even if it
was constructed with a different window size. Here too I fear this
won't work (even if the window size would be an "immutable setting
at creation time").
Yes this would be potentially a problem but only for the repositories
that do have a modified, non-default window size value. However let me
experiment a bit more with this and confirm soon.
It all hinges on whether or not the current FSFS code can read
data compressed with different window sizes (different from its
own setting). Maybe things have changed since 2008 (there were
several new versions of the fsfs format since then), maybe they
haven't.
From fsfs POV - the POC we have currently proposes to store the window
size (if modified, or defaulted to DELTA_WINDOW_SIZE) on svn_fs_t
struct as a new member (and the code is changed to use that and thread
it where necessary) so *I think* this would work but the caveat that
it has to be constant for the lifetime of the repository still holds.
Perhaps you can perform some experiments to test for the above
issues, as a form of further exploration.
I will report back in a few days after digging some more into
compatibility questions, especially with code built without the
proposed changes! Thanks again for looking into this!
We don't encode the window size in the delta data, and the code relies
on knowing the window size as a pre-defined constant. Changing the
window size would make existing repositories unreadable; it doesn't
matter whether this is a compile-time or run-time constant. You'd have to
encode the window size in the delta data, changing the format and making
it backward-incompatible. That's OK for a major release, as long as the
old format can still be read.
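As an illustration of what "encode the window size in the delta data" could look like, here is a minimal sketch of a self-describing stream. The layout is entirely hypothetical (the MAGIC marker, the header fields and the encode/decode names are made up for this example) and is not the svndiff format. The point is only that once the size travels with the data, the reader no longer needs a pre-defined constant.

```python
import struct

# Hypothetical toy container, NOT the svndiff format.
MAGIC = b"TOY1"                    # made-up format marker
HEADER = struct.Struct(">4sI")     # marker + window size (big-endian uint32)
LENGTH = struct.Struct(">I")       # per-window length prefix


def encode(data: bytes, window_size: int) -> bytes:
    """Write a header carrying window_size, then length-prefixed windows."""
    parts = [HEADER.pack(MAGIC, window_size)]
    for start in range(0, len(data), window_size):
        chunk = data[start:start + window_size]
        parts.append(LENGTH.pack(len(chunk)))
        parts.append(chunk)
    return b"".join(parts)


def decode(stream: bytes) -> tuple[int, bytes]:
    """Read the window size from the stream instead of assuming a constant."""
    magic, window_size = HEADER.unpack_from(stream, 0)
    if magic != MAGIC:
        raise ValueError("unknown stream format")
    pos = HEADER.size
    out = []
    while pos < len(stream):               # stop at end of stream, not on size
        (length,) = LENGTH.unpack_from(stream, pos)
        pos += LENGTH.size
        if length > window_size:
            raise ValueError("window larger than the declared window size")
        out.append(stream[pos:pos + length])
        pos += length
    return window_size, b"".join(out)


if __name__ == "__main__":
    for size in (4, 8):
        declared, payload = decode(encode(b"abcdefghij", size))
        print(declared, payload)
```

A reader for the old format would not recognise such a header, which is the backward-incompatibility mentioned above.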
But decoding deltas is the least of your problems. The delta combiner
also relies on knowing the window size, and assumes that all deltas it
combines have the same window size. Removing that limitation is a lot of
work. One way to simplify it might be to ensure that every file's
history uses a single window size, on all its branches. I don't know how
complex that would be; I've not looked at the code in ages and there are
probably many, many assumptions in the implementation.
And then, of course, there's the client-server protocol, which also uses
deltas. The clients also assume a fixed, predefined window size. So
you're looking at either extending the protocol or making sure that no
matter what's on disk, data is sent to the client with the current
window size. That's easier than it looks, since the server never (IIRC)
sends on-disk data to the client but always re-encodes it.
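The re-encoding idea can be sketched with the same kind of toy model: whatever window size was used on disk, the sender regroups the data into windows of the size the client expects. This is a simplification. Real delta windows carry copy/insert instructions, so re-encoding means recomputing the delta rather than re-chunking bytes. The name rewindow and the sizes used below are invented for this sketch.

```python
from typing import Iterable, Iterator

# Toy model only: windows are raw byte chunks, not real delta windows.

def rewindow(windows: Iterable[bytes], wire_window_size: int) -> Iterator[bytes]:
    """Regroup windows of any size into windows of wire_window_size bytes.

    Only the final window may be shorter than wire_window_size.
    """
    buffer = bytearray()
    for window in windows:
        buffer.extend(window)
        while len(buffer) >= wire_window_size:
            yield bytes(buffer[:wire_window_size])
            del buffer[:wire_window_size]
    if buffer:
        yield bytes(buffer)  # trailing partial window


if __name__ == "__main__":
    on_disk = [b"abcdefgh", b"ij"]          # stored with window size 8
    print(list(rewindow(on_disk, 4)))       # sent with window size 4
```

With this shape, the client-facing window size stays fixed regardless of the on-disk setting, at the cost of the server doing the regrouping work on every transfer.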
In short -- it's not a 5-minute hack...
-- Brane