Karl Fogel <kfo...@red-bean.com> writes:

> 1) Make pristine text-base files optional.  See issue #525 for
>    details.  In summary: currently, every large file uses twice the
>    storage on the client side, and yet for most of these files
>    there's little benefit.  They're usually not plaintext, so 'svn
>    diff' against the pristine base is pointless (unless you have some
>    specialized diff tool for the particular binary format, but that's
>    rare), and 'svn commit' likewise just sends up the whole working
>    file.  The only thing a local base gets you is local 'svn revert',
>    which can be nice, but many of us would happily give it up for
>    large files to avoid the 2x local storage cost.

A proof-of-concept implementation of one possible approach for making
the text-bases optional is available on the branch:

  https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand

While still rough around the edges, it passes the whole test suite in
my environment and seems to work quite well in practice.

A few notes on the current state:

- The implementation includes a bump of the working copy format.

- Although this would most certainly have to change in the future, for
  now any working copy of the new format is considered to always have
  optional text-bases (for easier testing, etc.).

- There is no UI and no configuration yet, just the first cut of the
  core part.

The core idea is that we maintain the following invariant: only the
modified files have their pristine text-base files available on the
disk.

- To avoid having to access the text-base, the "is the file modified?"
  check is performed by calculating the checksum of a file and
  comparing that to what's recorded in the working copy.

- A text-base of the unmodified file is the file itself, appropriately
  detranslated.

- To get into the appropriate state at the beginning of the operation,
  we walk through the current text-base info in the db and check if
  the corresponding working files are modified.  The missing text-bases
  are fetched using the svn_ra layer.  The operations also include a
  final step during which the no longer required text-bases are removed
  from disk.

- The operations that don't need to access the text-bases (such as
  "svn ls" or the updated "svn st") do not perform this walk and do not
  synchronize the text-base state.

The pros and cons of the approach could probably be summarized as
follows:

Pros:

1. A working copy without modifications does not store the text-bases,
   thus avoiding the 2x local storage cost that could be significant in
   case of large files.  A working copy with modifications only stores
   the text-bases for the modified files.

2. The fetched text-bases for modified files are reused across
   operations.  For example, calling "svn diff" twice is only going to
   fetch the required text-bases once.  In case of a third-party client
   that makes such calls N times under the assumption that they are
   "local", the fetch is only going to happen once, rather than N
   times.

3. With a generic approach like this, we might have fewer issues with
   introducing any kind of new working copy functionality that depends
   on the text-bases.

Cons:

1. The worst-case scenario for this approach (where every single file
   in the working copy is modified) would work by fetching all of the
   text-bases.  And while that is going to take just as much storage as
   it currently does, there's the added cost of the fetch itself.

2. The missing text-bases will be fetched before the operation.
   Fetching very large text-bases may take a significant amount of time
   and may have an impact on the user experience, although that might
   be alleviated by providing a UI of some kind.

3. Even without the fetch, the "synchronize the text-base state" part
   has some performance penalty.  This penalty will only occur in the
   working copies with optional text-bases and only if the operation
   synchronizes the text-base state.  The added cost is going to be of
   the order of two "svn st" calls for the target of the operation.

Overall, I tend to think that this approach should work reasonably well
despite these cons, and that it is going to solve the issue in those
cases where the 2x local storage requirement is a problem.

So, how does that sound in general?

If we were to try to get this to a production-ready state, it would
probably make sense to also:

1. Complete the work on ^/subversion/branches/multi-wc-format so that
   the client would work with both the new and old working copy
   formats, for a seamless user experience and better compatibility.

2. For the new working copy format, incorporate a switch to a different
   checksum type without known collisions instead of SHA-1.

3. Fix the minor issues written down as TODOs in the code.


Thanks,
Evgeny Kotkov
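
For illustration only, the invariant and the "synchronize the text-base
state" step described above could be sketched roughly as follows.  This
is not the code on the branch: the Entry record, the in-memory "files"
dict and all function names are hypothetical, the fetch and the removal
are only planned rather than performed, and translation (keywords, EOL
style) is ignored for brevity.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for one file's text-base info in the wc db."""
    path: str
    recorded_sha1: str   # checksum of the pristine text recorded in the db
    has_pristine: bool   # whether the text-base is currently on disk


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def is_modified(entry: Entry, files: dict) -> bool:
    # Compare the working file's checksum with the recorded one, so the
    # text-base itself never has to be read.
    return sha1_hex(files[entry.path]) != entry.recorded_sha1


def pristines_to_fetch(entries, files):
    # Start of an operation: modified files lacking a text-base need a fetch.
    return [e.path for e in entries
            if is_modified(e, files) and not e.has_pristine]


def pristines_to_remove(entries, files):
    # End of an operation: unmodified files no longer need their text-base.
    return [e.path for e in entries
            if not is_modified(e, files) and e.has_pristine]


# Working file contents, keyed by path (assumed sample data).
files = {"a.bin": b"edited", "b.bin": b"same", "c.bin": b"same too"}
entries = [
    Entry("a.bin", sha1_hex(b"original"), has_pristine=False),  # modified, no base
    Entry("b.bin", sha1_hex(b"same"), has_pristine=True),       # unmodified, stale base
    Entry("c.bin", sha1_hex(b"same too"), has_pristine=False),  # clean, nothing to do
]
print(pristines_to_fetch(entries, files))   # ['a.bin']
print(pristines_to_remove(entries, files))  # ['b.bin']
```

In this sketch a clean working copy yields two empty lists, which
corresponds to the case where no text-bases are stored at all.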