On Sat, Dec 7, 2024 at 11:14 PM Johan Corveleyn <jcor...@gmail.com> wrote:
> On Fri, Dec 6, 2024 at 12:27 PM Nikola Dipanov < > ndipa...@hudson-trading.com> wrote: > > > > Hi all, > > > > A quick summary first: we’ve found that for certain use cases - > increasing SVN_DELTA_WINDOW_SIZE yields very significant storage savings > on the size of the repo (~10x). We have a POC patch to make it configurable > (currently only via fsfs.conf and using libsvn_ra_svn and libsvn_ra_local) > and would be interested in working with the community to see if the changes > could be improved, with the ultimate goal to have them accepted into the > mainline. > > > > I have found some previous discussions on this from a good while back: > https://svn.haxx.se/users/archive-2008-02/0547.shtml. but not much else. > If there’s any additional information on this problem that I may have > missed - please feel free to educate me! > > > > Otherwise if there are no glaring issues that anyone can see in > supporting this - I’d be happy to look into cleaning up my POC patch and > post it for detailed review. > > > > Obviously, there are trade-offs in doing this: A repo would most likely > need to use a specified window size throughout its lifetime, and would not > be beneficial for every use-case. This is why we propose to keep it as a > config option. Some more details of our use-case and some rough numbers > that illustrate the benefits follow. > > > > Our use case is that we commonly have very large files that see small > changes over time, and a large xdelta window size would benefit such > use-cases greatly. For us - the growth pace of our repositories is quite > staggering due to this amplification, and we’d be more than happy to trade > memory usage (especially, but also processing time to an extent) to be able > to keep this in check. We’ve not done extensive measurements on how this > impacts runtime yet. > > > > To demonstrate the effect - we generate a random and fairly large (~1.3 > G, ~30 million lines) XML file, commit it, then make random changes to it > (from 300 lines to ~1% of lines), committed those, and then looked at the > file size generated by the commit in repo/db/revs/. We then repeat this, > but with configuring a 100x larger window size (10240000). The most > dramatic results are for small changes (this is somewhat intuitive) where > the size of the revision with changes is ~40x smaller (10k vs 430k for a > 1.3Gig file) when using a larger window size. For different patterns of > changes the difference is not that dramatic but still large (5-10x). I am > happy to share more details if people are interested. > > > > We believe we’d see a lot of benefit from this option, as would others > in the community, and are very much committed to Subversion in the long > run, so would love to hear what people think about something like this. > > That sounds quite interesting. However, I'm a bit worried about > compatibility. > > The thread you metioned has one reply [1] which points to another thread > about a possible backward compatibility issue [2]. > > In that thread, Daniel Berlin wrote: > > I also think we should up our default window size, but due to some > > silliness in how this is currently implemented, changing the window size > > is backwards incompatible! > > > > (This is because the code currently relies on knowing whether there are > > more windows waiting to consume by comparing the window length against > > the default window size, instead of by seeing if there is an EOF :( ) > > This is confirmed by Branko Čibej in another reply [3]. > > It's not clear to me whether those limitations still hold, but assuming > they do there are some possible issues: > Hi Johan - thanks for a quick response first of all! I believe the limitation mentioned in the email still holds. I believe this is referring to [1]. However in our testing iirc, clients built with a hardcoded, larger window size had no issues cloning a repo with a smaller (aka default) window size. I will confirm what is actually going on here. - fsfs.conf settings can be changed at any time during the lifetime of the > repository. So the first 1000 revisions may have a different window size > than later ones. The code reading those revisions cannot handle that (if I > interpret the above threads correctly). > > At least in our current POC - the assumption is that this would not change throughout the lifetime of the repository. Is this a problem in practice? If an older client attempted to do operations on a repo with a version higher AND a non-default window size it would fail to do so. - A vanilla SVN server (or a client accessing the repository with file://) > should at least be able to read the repository, even it was constructed > with a different window size. Here too I fear this won't work (even if the > window size would be a "immutable setting at creation time"). > > Yes this would be potentially a problem but only for the repositories that do have a modified, non-default window size value. However let me experiment a bit more with this and confirm soon. > It all hinges on whether or not the current FSFS code can read data > compressed with different window sizes (different from its own setting). > Maybe things have changed since 2008 (there were several new versions of > the fsfs format since then), maybe they haven't. > > >From fsfs POV - the POC we have currently proposes to store the window size (if modified, or defaulted to DELTA_WINDOW_SIZE) on svn_fs_t struct as a new member (and the code is changed to use that and thread it where necessary) so *I think* this would work but the caveat that it has to be constant for the lifetime of the repository still holds. > Perhaps you can perform some experiments to test for the above issues, as > a form of further exploration. > > I will revert back in a few days after digging some more into compatibility questions, especially with code built without the proposed changes! Thanks again for looking into this! Best, Nikola [1] https://github.com/apache/subversion/blob/b2ae6efaeeda78383961ae13151574a6df58734c/subversion/libsvn_delta/text_delta.c#L369