This is an investigation into changing the "pristines-on-demand" approach to
follow the principle that each operation fetches only the pristines it really
needs.
I have begun a "user guide" (notes/i525/i525-user-guide.md), with the aim of
explaining the principle of operation of the current approach, along with its
expectations and limitations.

Note well that the current approach is based on a *different* principle from
"each operation only fetches the pristines it really needs". As a reminder,
the present design is based on a fetching paradigm that is up-front and
pessimistic: before any operation that *might* need pristines, it ensures it
has fetched sufficient (but perhaps more than necessary) pristines. After
that fetching phase (see 'svn_client__textbase_sync'), it then runs the
original operation code path, assured that the operation will run correctly
in its existing form, without needing to be modified to support fetching via
a deep (point-of-use) callback.

Online vs offline operations
----------------------------

I want to draw a distinction, which may or may not help here, between
operations that were already "online" (required contacting the repository)
and those that were previously "offline" (local only).

The previously "online" operations include "update" of course, along with
"switch" and "checkout --force" (both being sisters of update), "merge", and
the forms of "diff" that compare base to repository.

Any online operation is going to connect to the repository anyway, in its
normal (previous) operation. When the current design deems that such an
operation needs to hydrate the pristines before it starts, this "need" is
more of a "uses in its current implementation". In principle we could change
its implementation to move the fetching of pristines down the call stack to
the point where it actually needs them, and so ensure optimal fetching, in
the sense of fetching only those pristines it really needs, and only when it
really needs them. This change would cause an increase in network traffic
whenever a needed pristine is missing; but only an increase.
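To make the contrast concrete, here is a minimal sketch of the two fetching
paradigms. This is purely illustrative Python, not actual Subversion code;
all names (`upfront_sync`, `point_of_use_get`, and so on) are hypothetical.

```python
def upfront_sync(possibly_needed, have, fetch):
    """Current paradigm (cf. svn_client__textbase_sync): before the
    operation starts, fetch every pristine it *might* need, which may be
    a superset of what it actually reads."""
    for path in possibly_needed:
        if path not in have:
            have[path] = fetch(path)   # all network traffic happens up front

def point_of_use_get(path, have, fetch):
    """Proposed paradigm: a deep callback, passed down the call stack,
    fetches a pristine only at the moment some step actually reads it."""
    if path not in have:
        # Network round trip (and possibly an authentication prompt)
        # at an arbitrary point in the middle of the operation.
        have[path] = fetch(path)
    return have[path]
```

For example, an operation that might touch three files but actually reads
only one would fetch two (possibly unneeded) pristines under the up-front
paradigm, versus at most one under point-of-use fetching.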
Because these operations are already online, it would not cause any
substantial qualitative difference to the user experience or to the
high-level client software's need to handle repository connection and
authentication.

Now contrast this with the previously "offline" operations. If we change a
previously "offline" operation (local diff, revert, etc.) to fetch only the
pristines it actually needs, by pushing the fetch callbacks down the call
stack to the point of use, that would lead to a qualitatively different user
experience and high-level client software usage pattern. (Previously
discussed. In short: the callback, and its need for authentication, which may
require user input, may come at any point after the operation has started,
where for example a GUI tool may be in the middle of displaying a series of
file diffs.) I do not know how much of an issue that might be, but some
people have expressed concern.

Perhaps a useful compromise could be:

  - for the "online" operations only, fetch at the point of use (optimal:
    only fetching the pristines they actually need); and

  - retain the pessimistic up-front sync paradigm for the "offline"
    operations (so avoiding the callback awkwardness for them).

That is just for consideration, not a strong recommendation.

Now let us take a look at "update" in particular, because it came up as a
problem in a primary use case that prompted me to file issue #4892.

Why and how does "update" currently require pristines?
------------------------------------------------------

Note that update involves TWO pristines for each file: the old one, which
corresponds to the old base revision before the update, and the new one,
which corresponds to the new base revision after the update.

Update currently uses pristines in two distinct ways:

  - [deltas] The update code reports the needed update in terms of a delta
    against the (old) base revision, on the assumption that the client has a
    pristine copy of the base revision.
    The repository duly sends such a delta. The WC layer then attempts to
    apply the delta it receives, and at that point attempts to open and read
    the old pristine, in order to apply the delta to create the new pristine.

  - [restore] The update code also looks for files that are missing on disk
    (if the 'restore_files' option is passed, which it usually is), and
    restores them by reading and translating their pristines. It restores
    files on the reporting side (in svn_wc_crawl_revisions5), before
    reporting the state of each file.

What would it take to modify "update" to fetch at point of use?
---------------------------------------------------------------

For the Deltas:
---------------

The relevant sub-case is a file with local modifications. (For an unmodified
file, we can reconstruct the pristine on the fly.) If the working file has
local modifications, then after the base is updated there is a 3-way merge to
update the working file, which needs to read both the old pristine and the
new pristine.

Possible approach:

  - If the working file is *unmodified* and the pristine is missing: on the
    reporting side, report that the current version is empty (whatever the
    appropriate incantation is for that), to request the server to send the
    whole new file (a.k.a. a delta against empty). The receiver (apply-delta)
    will then not need to read the old pristine, and will store the result as
    the new pristine, as usual. No 3-way merge is needed to update the
    working file; instead, translate the new pristine.

  - If the working file is *locally modified* and the pristine is missing: on
    the reporting side, first fetch its current (old) pristine. Then
    everything proceeds as before: report the current (old) base revision,
    thereby asking the server to send a delta against that pristine. That
    (old) pristine will be available for use in the 3-way merge.
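The reporting-side decision for the deltas case could be sketched as follows.
This is a hedged illustration in Python, not real svn_wc/svn_ra code; the
function and parameter names are all hypothetical.

```python
def report_file(path, is_modified, pristine_available, fetch_pristine):
    """Decide what the reporter tells the server to send a delta against,
    for one file, under the proposed point-of-use scheme."""
    if pristine_available(path):
        # Normal case: old pristine present; report the old base revision
        # and apply the server's delta against it, as today.
        return "report-base"
    if not is_modified(path):
        # Missing pristine, unmodified file: report "empty" so the server
        # sends the whole new text (a delta against empty).  The receiver
        # never needs the old pristine; it stores the result as the new
        # pristine and translates it into the working file (no 3-way merge).
        return "report-empty"
    # Missing pristine, locally modified file: fetch the old pristine now,
    # then report the old base revision as usual.  The old pristine is then
    # available for the 3-way merge after the base is updated.
    fetch_pristine(path)
    return "report-base"
```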
For the Restores:
-----------------

We would need to do this:

  - If a file needs to be restored and its pristine is missing, first fetch
    it via callback.

  - Don't leave it in the pristine store afterwards, because by definition
    this is a case where the file is unmodified. We might implement this most
    simply as the sequence: fetch the pristine, then translate it into the
    working file, then clean up the pristine later. Or we might want to
    optimise it into a single pass, streaming straight from the repository
    through the translation into the working file, so that there is no time
    when disk space is needed for both the pristine copy and the working copy
    simultaneously.

  - To be checked: a file that ends up being updated later in the update
    operation may be being restored unnecessarily at this step. If that is
    the case, perhaps we can optimise by eliminating the restore. But that
    seems to be an orthogonal optimisation, not dependent on i525.

Conclusions:
------------

It is certainly possible that we could modify "update" and the other "online"
operations, at least, and the previously "offline" operations too if we want,
to make them fetch pristines at point of use in this way.

Such modifications are not trivial. There is the need to run additional RA
requests in between the existing ones, perhaps needing an additional RA
session to be established in parallel, or taking care with inserting RA
requests into an existing session. There is the boilerplate version-bumping
(revving) of the APIs to pass callbacks down to the points of use. There is
probably more.