This is an investigation into changing the "pristines-on-demand" approach to
follow the principle that each operation fetches only the pristines it really
needs.
I have begun a "user guide" (notes/i525/i525-user-guide.md), with the aim of
explaining the principle of operation of the current approach, along with its
expectations and limitations.

Note well that the current approach is based on a *different* principle from
"each operation only fetches the pristines it really needs". As a reminder,
the present design is based on a fetching paradigm that is up-front and
pessimistic: before any operation that *might* need pristines, it ensures it
has fetched sufficient (but perhaps more than necessary) pristines. After
that fetching phase (see 'svn_client__textbase_sync'), it then runs the
original operation code path, assured that the operation will run correctly
in its existing form, without needing to be modified to support fetching via
a deep (point-of-use) callback.

Online vs offline operations
----------------------------

I want to draw a distinction, which may or may not help here, between
operations that were already "online" (required contacting the repository)
and those that were previously "offline" (local only).

The previously "online" operations include "update" of course, along with
"switch" and "checkout --force" (both being sisters of update), "merge", and
the forms of "diff" that compare base to repository.

Any online operation is going to connect to the repository anyway, in its
normal (previous) operation. When the current design deems that such an
operation needs to hydrate the pristines before it starts, this "need" is
more of a "uses in its current implementation". In principle we could change
its implementation to move the fetching of pristines down the call stack to
the point where it actually needs them, and so ensure optimal fetching, in
the sense of fetching only those pristines it really needs, and only when it
really needs them. This change would cause an increase in network traffic
whenever a needed pristine is missing; but only an increase.
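To make the contrast concrete, here is a minimal sketch of the two fetching
paradigms. This is purely illustrative Python, not actual Subversion code;
all names (`upfront_sync`, `point_of_use_get`, and so on) are hypothetical.

```python
def upfront_sync(possibly_needed, have, fetch):
    """Current paradigm (cf. svn_client__textbase_sync): before the
    operation starts, fetch every pristine it *might* need, which may be
    a superset of what it actually reads."""
    for path in possibly_needed:
        if path not in have:
            have[path] = fetch(path)   # all network traffic happens up front

def point_of_use_get(path, have, fetch):
    """Proposed paradigm: a deep callback, passed down the call stack,
    fetches a pristine only at the moment some step actually reads it."""
    if path not in have:
        # Network round trip (and possibly an authentication prompt)
        # at an arbitrary point in the middle of the operation.
        have[path] = fetch(path)
    return have[path]
```

For example, an operation that might touch three files but actually reads
only one would fetch two (possibly unneeded) pristines under the up-front
paradigm, versus at most one under point-of-use fetching.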
Because these operations are already online, it would not cause any
substantial qualitative difference to the user experience or to the
high-level client software's need to handle repository connection and
authentication.

Now contrast this with the previously "offline" operations. If we change a
previously "offline" operation (local diff, revert, etc.) to fetch only the
pristines it actually needs, by pushing the fetch callbacks down the call
stack to the point of use, that would lead to a qualitatively different user
experience and high-level client software usage pattern. (Previously
discussed. In short: the callback, and its need for authentication, which may
require user input, may come at any point after the operation has started,
where for example a GUI tool may be in the middle of displaying a series of
file diffs.) I do not know how much of an issue that might be, but some
people have expressed concern.

Perhaps a useful compromise could be:

  - for the "online" operations only, fetch at the point of use (optimal:
    only fetching the pristines they actually need); and

  - retain the pessimistic up-front sync paradigm for the "offline"
    operations (so avoiding the callback awkwardness for them).

That is just for consideration, not a strong recommendation.

Now let us take a look at "update" in particular, because it came up as a
problem in a primary use case that prompted me to file issue #4892.

Why and how does "update" currently require pristines?
------------------------------------------------------

Note that update involves TWO pristines for each file: the old one, which
corresponds to the old base revision before the update, and the new one,
which corresponds to the new base revision after the update.

Update currently uses pristines in two distinct ways:

  - [deltas] The update code reports the needed update in terms of a delta
    against the (old) base revision, on the assumption that the client has a
    pristine copy of the base revision.
    The repository duly sends such a delta. The WC layer then attempts to
    apply the delta it receives, and at that point attempts to open and read
    the old pristine, in order to apply the delta to create the new pristine.

  - [restore] The update code also looks for files that are missing on disk
    (if the 'restore_files' option is passed, which it usually is), and
    restores them by reading and translating their pristines. It restores
    files on the reporting side (in svn_wc_crawl_revisions5), before
    reporting the state of each file.

What would it take to modify "update" to fetch at point of use?
---------------------------------------------------------------

For the Deltas:
---------------

The relevant sub-case is a file with local modifications. (For an unmodified
file, we can reconstruct the pristine on the fly.) If the working file has
local modifications, then after the base is updated there is a 3-way merge to
update the working file, which needs to read both the old pristine and the
new pristine.

Possible approach:

  - If the working file is *unmodified* and the pristine is missing: on the
    reporting side, report that the current version is empty (whatever the
    appropriate incantation is for that), to request the server to send the
    whole new file (a.k.a. a delta against empty). The receiver (apply-delta)
    will then not need to read the old pristine, and will store the result as
    the new pristine, as usual. No 3-way merge is needed to update the
    working file; instead, translate the new pristine.

  - If the working file is *locally modified* and the pristine is missing: on
    the reporting side, first fetch its current (old) pristine. Then
    everything proceeds as before: report the current (old) base revision,
    thereby asking the server to send a delta against that pristine. That
    (old) pristine will be available for use in the 3-way merge.
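The reporting-side decision for the deltas case could be sketched as follows.
This is a hedged illustration in Python, not real svn_wc/svn_ra code; the
function and parameter names are all hypothetical.

```python
def report_file(path, is_modified, pristine_available, fetch_pristine):
    """Decide what the reporter tells the server to send a delta against,
    for one file, under the proposed point-of-use scheme."""
    if pristine_available(path):
        # Normal case: old pristine present; report the old base revision
        # and apply the server's delta against it, as today.
        return "report-base"
    if not is_modified(path):
        # Missing pristine, unmodified file: report "empty" so the server
        # sends the whole new text (a delta against empty).  The receiver
        # never needs the old pristine; it stores the result as the new
        # pristine and translates it into the working file (no 3-way merge).
        return "report-empty"
    # Missing pristine, locally modified file: fetch the old pristine now,
    # then report the old base revision as usual.  The old pristine is then
    # available for the 3-way merge after the base is updated.
    fetch_pristine(path)
    return "report-base"
```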
For the Restores:
-----------------

We would need to do this:

  - If a file needs to be restored and its pristine is missing, first fetch
    it via callback.

  - Don't leave it in the pristine store afterwards, because by definition
    this is a case where the file is unmodified. We might implement this most
    simply as the sequence: fetch the pristine, then translate it into the
    working file, then clean up the pristine later. Or we might want to
    optimise it into a single pass, streaming straight from the repository
    through the translation into the working file, so that there is no time
    when disk space is needed for both the pristine copy and the working copy
    simultaneously.

  - To be checked: a file that ends up being updated later in the update
    operation may be being restored unnecessarily at this step. If that is
    the case, perhaps we can optimise by eliminating the restore. But that
    seems to be an orthogonal optimisation, not dependent on i525.

Conclusions:
------------

It is certainly possible that we could modify "update" and the other "online"
operations, at least, and the previously "offline" operations too if we want,
to make them fetch pristines at point of use in this way.

Such modifications are not trivial. There is the need to run additional RA
requests in between the existing ones, perhaps needing an additional RA
session to be established in parallel, or taking care with inserting RA
requests into an existing session. There is the boilerplate version-bumping
(revving) of the APIs to pass callbacks down to the points of use. There is
probably more.