Re: A two-part vision for Subversion and large binary objects.

Evgeny Kotkov Fri, 27 Aug 2021 06:56:00 -0700

Karl Fogel <kfo...@red-bean.com> writes:

> 1) Make pristine text-base files optional.  See issue #525 for
> details.  In summary: currently, every large file uses twice the
> storage on the client side, and yet for most of these files
> there's little benefit.  They're usually not plaintext, so 'svn
> diff' against the pristine base is pointless (unless you have some
> specialized diff tool for the particular binary format, but that's
> rare), and 'svn commit' likewise just sends up the whole working
> file.  The only thing a local base gets you is local 'svn revert',
> which can be nice, but many of us would happily give it up for
> large files to avoid the 2x local storage cost.


A proof-of-concept implementation of one possible approach for making
the text-bases optional is available on the branch:

  https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand

While still rough around the edges, it passes the whole test suite in
my environment and seems to work quite well in practice.

A few notes on the current state:

- The implementation includes a bump of the working copy format.

- Although this would most certainly have to change in the future, for now
  any working copy of the new format is considered to always have optional
  text-bases (for easier testing, etc.).

- There is no UI and no configuration yet, just the first cut of the core part.

The core idea is that we maintain the following invariant: only the modified
files have their pristine text-base files available on the disk.

- To avoid having to access the text-base, the "is the file modified?" check
  is performed by calculating the checksum of a file and comparing that to
  what's recorded in the working copy.

- A text-base of the unmodified file is the file itself, appropriately
  detranslated.

- To get into the appropriate state at the beginning of the operation, we walk
  through the current text-base info in the db and check if the corresponding
  working files are modified.  The missing text-bases are fetched using the
  svn_ra layer.  Each operation also includes a final step that removes
  the text-bases that are no longer required from disk.

- The operations that don't need to access the text-bases (such as "svn ls"
  or the updated "svn st") do not perform this walk and do not synchronize
  the text-base state.
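
To illustrate the invariant described above, here is a minimal sketch in
Python rather than the branch's actual C code.  All names in it
(sync_text_bases, the "recorded" mapping, the fetch callback) are
hypothetical; fetch stands in for a download through the svn_ra layer, and
the sketch ignores detranslation (keywords, EOL style) by hashing the
working file as-is:

```python
import hashlib
import os
from typing import Callable, Dict, List, Tuple


def sha1_of(path: str) -> str:
    """Return the hex SHA-1 of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_text_bases(recorded: Dict[str, str], wc_root: str, pristine_dir: str,
                    fetch: Callable[[str], bytes]) -> Tuple[List[str], List[str]]:
    """Keep text-bases on disk only for modified files (hypothetical helper).

    recorded maps a relative path to the checksum recorded for its base.
    Returns the checksums that were fetched and the ones that were removed.
    """
    os.makedirs(pristine_dir, exist_ok=True)

    # Step 1: a file is "modified" if its checksum differs from the record.
    # Assumption of this sketch: a missing working file counts as modified.
    needed = set()
    for relpath, checksum in recorded.items():
        working = os.path.join(wc_root, relpath)
        if not os.path.isfile(working) or sha1_of(working) != checksum:
            needed.add(checksum)

    # Step 2: fetch the text-bases that are needed but not on disk yet.
    fetched = []
    for checksum in sorted(needed):
        base = os.path.join(pristine_dir, checksum)
        if not os.path.exists(base):
            with open(base, "wb") as f:
                f.write(fetch(checksum))
            fetched.append(checksum)

    # Step 3: remove the text-bases that no modified file refers to.
    removed = []
    for checksum in sorted(set(recorded.values()) - needed):
        base = os.path.join(pristine_dir, checksum)
        if os.path.exists(base):
            os.remove(base)
            removed.append(checksum)

    return fetched, removed
```

The sketch collects the set of needed checksums first, so that two files
with identical contents can share one text-base without one of them
causing its removal.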

The pros and cons of the approach could probably be summarized as follows:

Pros:

 1. A working copy without modifications does not store the text-bases,
    thus avoiding the 2x local storage cost that could be significant
    in case of large files.  A working copy with modifications only stores
    the text-bases for the modified files.

 2. The fetched text-bases for modified files are reused across operations.

    For example, calling "svn diff" twice is only going to fetch the required
    text-bases once.  In case of a third-party client that makes such calls
    N times under the assumption that they are "local", the fetch is only
    going to happen once, rather than N times.

 3. With a generic approach like this, we might have fewer issues with
    introducing any kind of new working copy functionality that depends on
    the text-bases.
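
The reuse described in point 2 can be pictured with a small Python sketch.
This is only a toy model under stated assumptions: TextBaseCache and its
in-memory dictionary are hypothetical stand-ins for the on-disk text-bases,
and the fetch callback stands in for a download from the server:

```python
from typing import Callable, Dict


class TextBaseCache:
    """Hypothetical stand-in for text-bases kept on disk between operations."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._stored: Dict[str, bytes] = {}
        self.fetch_count = 0

    def get(self, checksum: str) -> bytes:
        # Fetch only when the text-base is not already available locally.
        if checksum not in self._stored:
            self.fetch_count += 1
            self._stored[checksum] = self._fetch(checksum)
        return self._stored[checksum]


cache = TextBaseCache(lambda checksum: b"contents for " + checksum.encode())
for _ in range(5):  # e.g. five diff-like calls against the same modified file
    cache.get("abc123")
print(cache.fetch_count)  # prints 1
```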

Cons:

 1. The worst-case scenario for this approach (where every single file in the
    working copy is modified) would work by fetching all of the text-bases.

    And while that is going to take just as much storage as it currently does,
    there’s the added cost of the fetch itself.

 2. The missing text-bases will be fetched before the operation. Fetching very
    large text-bases may take a significant amount of time and may have an
    impact on the user experience, although that might be alleviated by
    providing a UI of some kind.

 3. Even without the fetch, the "synchronize the text-base state" part has
    some performance penalty.  This penalty will only occur in the working
    copies with optional text-bases and only if the operation synchronizes
    the text-base state.

    The added cost is going to be of the order of two "svn st" calls for
    the target of the operation.

Overall, I tend to think that this approach should work reasonably well
despite these cons, and that it is going to solve the issue in those cases
where the 2x local storage requirement is a problem.

So, how does that sound in general?

If we were to try to get this to a production-ready state, it would probably
make sense to also:

 1. Complete the work on ^/subversion/branches/multi-wc-format so that the
    client would work with both the new and old working copy formats, for
    a seamless user experience and better compatibility.

 2. For the new working copy format, incorporate a switch to a different
    checksum type without known collisions instead of SHA-1.

 3. Fix the minor issues written down as TODOs in the code.


Thanks,
Evgeny Kotkov
